
Strictly Proper Scoring Rules, Prediction, and Estimation

Tilmann GNEITING and Adrian E. RAFTERY

Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ≠ F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile estimation, and propose the intuitively appealing interval score as a utility function in interval estimation that addresses width as well as coverage.

KEY WORDS: Bayes factor; Bregman divergence; Brier score; Coherent; Continuous ranked probability score; Cross-validation; Entropy; Kernel score; Loss function; Minimum contrast estimation; Negative definite function; Prediction interval; Predictive distribution; Quantile forecast; Scoring rule; Skill score; Strictly proper; Utility function.

1. INTRODUCTION

One major purpose of statistical analysis is to make forecasts for the future and provide suitable measures of the uncertainty associated with them. Consequently, forecasts should be probabilistic in nature, taking the form of probability distributions over future quantities or events (Dawid 1984). Indeed, over the past two decades, probabilistic forecasting has become routine in such applications as weather and climate prediction (Palmer 2002; Gneiting and Raftery 2005), computational finance (Duffie and Pan 1997), and macroeconomic forecasting (Garratt, Lee, Pesaran, and Shin 2003; Granger 2006). In the statistical literature, advances in Markov chain Monte Carlo methodology (see, e.g., Besag, Green, Higdon, and Mengersen 1995) have led to explosive growth in the use of predictive distributions, mostly in the form of Monte Carlo samples from posterior predictive distributions of quantities of interest. In earlier work (Gneiting, Raftery, Balabdaoui, and Westveld 2003; Gneiting, Balabdaoui, and Raftery 2006), we contended that the goal of probabilistic forecasting is to maximize the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional

Tilmann Gneiting is Associate Professor of Statistics (E-mail: tilmann@stat.washington.edu) and Adrian E. Raftery is Blumstein-Jordan Professor of Statistics and Sociology (E-mail: raftery@u.washington.edu), Department of Statistics, University of Washington, Seattle, WA 98195. This work was supported by the DoD Multidisciplinary University Research Initiative (MURI) program administered by the Office of Naval Research under grant N00014-01-10745 and by the National Science Foundation under award 0134264. Part of Tilmann Gneiting's work was performed on sabbatical leave at the Soil Physics Group, Universität Bayreuth, 95440 Bayreuth, Germany. The authors thank Mark Albright, Veronica J. Berrocal, William M. Briggs, Andreas Buja, Ignacio Cascos, Claudia Czado, A. Philip Dawid, Werner Ehm, Thomas Gerds, Eric P. Grimit, Susanne Gschlößl, Eliezer Gurarie, Mark S. Handcock, Leonhard Held, Peter J. Huber, Nicholas A. Johnson, Ian T. Jolliffe, Hans Kuensch, Christian Lantuéjoul, Clifford F. Mass, Debashis Mondal, David B. Stephenson, Werner Stuetzle, Gabor J. Székely, Olivier Talagrand, Jon A. Wellner, Lawrence J. Wilson, Robert L. Winkler, and two anonymous reviewers for providing comments, preprints, references, and data.

forecasts and the observations, and is a joint property of the forecasts and the events or values that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only.

Scoring rules provide summary measures for the evaluation of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. In terms of elicitation, the role of scoring rules is to encourage the assessor to make careful assessments and to be honest (Garthwaite, Kadane, and O'Hagan 2005). In terms of evaluation, scoring rules measure the quality of the probabilistic forecasts, reward probability assessors for forecasting jobs, and rank competing forecast procedures. Meteorologists refer to this broad task as forecast verification, and much of the underlying methodology has been developed by atmospheric scientists (Jolliffe and Stephenson 2003). In a Bayesian context, scores are frequently referred to as utilities, emphasizing the Bayesian principle of maximizing the expected utility of a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster wishes to maximize. Specifically, if the forecaster quotes the predictive distribution P and the event x materializes, then his or her reward is S(P, x). The function S(P, ·) takes values in the real line R or in the extended real line R̄ = [−∞, ∞], and we write S(P, Q) for the expected value of S(P, ·) under Q. Suppose, then, that the forecaster's best judgment is the distributional forecast Q. The forecaster has no incentive to predict any P ≠ Q, and is encouraged to quote his or her true belief, P = Q, if S(Q, Q) ≥ S(P, Q) with equality if and only if P = Q. A scoring rule with this property is said to be strictly proper. If S(Q, Q) ≥ S(P, Q) for all P and Q, then the scoring rule is said to be proper. Propriety is essential in scientific and operational

© 2007 American Statistical Association
Journal of the American Statistical Association
March 2007, Vol. 102, No. 477, Review Article
DOI 10.1198/016214506000001437


forecast evaluation, and we present a case study that provides a striking example of the potential issues that result from the use of intuitively appealing but improper scoring rules.

In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to a scientific problem. To fix the idea, suppose that we wish to fit a parametric model P_θ based on a sample X_1, …, X_n. To estimate θ, we might measure the goodness of fit by the mean score

$$ S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, X_i), $$

where S is a strictly proper scoring rule. If θ_0 denotes the true parameter value, then asymptotic arguments indicate that argmax_θ S_n(θ) → θ_0 as n → ∞. This suggests a general approach to estimation: Choose a strictly proper scoring rule that is tailored to the problem at hand, and use θ̂_n = argmax_θ S_n(θ) as the optimum score estimator based on the scoring rule. Pfanzagl (1969) and Birgé and Massart (1993) studied this approach under the heading of minimum contrast estimation. Maximum likelihood estimation forms a special case of optimum score estimation, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule.
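The following sketch is ours, not the authors', and all names in it are illustrative. It fits a Gaussian model by maximizing the mean logarithmic score, in which case the optimum score estimator coincides with the maximum likelihood estimator:

```python
# Minimal sketch of optimum score estimation: fit a Gaussian model by
# maximizing the mean logarithmic score, which reproduces maximum likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # sample X_1, ..., X_n

def mean_log_score(theta, x):
    """Mean logarithmic score S_n(theta) for the N(mu, sigma^2) model."""
    mu, log_sigma = theta
    return np.mean(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Optimum score estimator: theta_n = argmax_theta S_n(theta)
res = minimize(lambda th: -mean_log_score(th, x), x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to (2.0, 1.5), the ML estimates
```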

This article reviews and develops the theory of proper scoring rules on general probability spaces, proposes and discusses examples thereof, and presents case studies. The remainder of the article is organized as follows. In Section 2 we state a fundamental characterization theorem, review the links between proper scoring rules, information measures, entropy functions, and Bregman divergences, and introduce skill scores. In Section 3 we turn to scoring rules for categorical variables. We prove a rigorous version of the representation of Savage (1971) and relate to a more recent characterization of Schervish (1989) that applies to probability forecasts of a dichotomous event. Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in Section 4, where we discuss the spherical, pseudospherical, logarithmic, and quadratic scores. The continuous ranked probability score, which lately has attracted much attention, enjoys appealing properties and might serve as a standard score in evaluating probabilistic forecasts of real-valued variables. It forms a special case of a novel and very general type of scoring rule, the energy score. In Section 5 we introduce an even more general construction, giving rise to kernel scores based on negative definite functions and inequalities of Hoeffding type, with side results on expectation inequalities and positive definite functions. In Section 6 we study scoring rules for quantile and interval forecasts. We show that the class of proper scoring rules for quantile forecasts is larger than conjectured by Cervera and Muñoz (1996), and discuss the interval score, a scoring rule for prediction intervals that is proper and has intuitive appeal. In Section 7 we relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation. In Section 8 we present a case study on the use of scoring rules in the evaluation of probabilistic weather forecasts. In Section 9 we turn to optimum score estimation. We discuss point, quantile, and interval estimation, and propose using the interval score

as a utility function that addresses width as well as coverage. We close the article with a discussion of avenues for future work in Section 10. Scoring rules show a superficial analogy to statistical depth functions, which we hint at in an Appendix.

2. CHARACTERIZATIONS OF PROPER SCORING RULES

In this section we introduce notation, provide characterizations of proper scoring rules, and relate them to convex functions, information measures, and Bregman divergences. The discussion here is more technical than that in the remainder of the article, and readers with more applied interests might skip ahead to Section 2.3, in which we discuss skill scores, without significant loss of continuity.

2.1 Proper Scoring Rules and Convex Functions

We consider probabilistic forecasts on a general sample space Ω. Let A be a σ-algebra of subsets of Ω, and let P be a convex class of probability measures on (Ω, A). A function defined on Ω and taking values in the extended real line R̄ = [−∞, ∞] is P-quasi-integrable if it is measurable with respect to A and is quasi-integrable with respect to all P ∈ P (Bauer 2001, p. 64). A probabilistic forecast is any probability measure P ∈ P. A scoring rule is any extended real-valued function S : P × Ω → R̄ such that S(P, ·) is P-quasi-integrable for all P ∈ P. Thus, if the forecast is P and ω materializes, the forecaster's reward is S(P, ω). We permit algebraic operations on the extended real line and deal with the respective integrals and expectations as described in section 2.1 of Mattner (1997) and section 3.1 of Grünwald and Dawid (2004). The scoring rules used in practice are mostly real-valued, but there are exceptions, such as the logarithmic rule (Good 1952), that allow for infinite scores.

We write

$$ S(P, Q) = \int S(P, \omega) \, dQ(\omega) $$

for the expected score under Q when the probabilistic forecast is P. The scoring rule S is proper relative to P if

$$ S(Q, Q) \ge S(P, Q) \qquad \text{for all } P, Q \in \mathcal{P}. \qquad (1) $$

It is strictly proper relative to P if (1) holds with equality if and only if P = Q, thereby encouraging honest quotes by the forecaster. If S is a proper scoring rule, c > 0 is a constant, and h is a P-integrable function, then

$$ S^*(P, \omega) = c \, S(P, \omega) + h(\omega) \qquad (2) $$

is also a proper scoring rule. Similarly, if S is strictly proper, then S* is strictly proper as well. Following Dawid (1998), we say that S and S* are equivalent, and strongly equivalent if c = 1. The term proper was apparently coined by Winkler and Murphy (1968, p. 754), whereas the general idea dates back at least to Brier (1950) and Good (1952, p. 112). In a parametric context and with respect to estimators, Lehmann and Casella (1998, p. 157) refer to the defining property in (1) as risk unbiasedness.
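As a quick numerical illustration of the defining inequality (1), the sketch below (ours; the grid search and all names are illustrative) takes the logarithmic rule S(p, i) = log p_i on a three-outcome space and confirms that no forecast P beats quoting the true distribution Q in expectation:

```python
# Numerical illustration of propriety (1): the expected logarithmic score
# S(P, Q) = sum_i q_i log p_i is maximized over P by the truth Q itself.
import numpy as np

rng = np.random.default_rng(1)
q = np.array([0.5, 0.3, 0.2])  # the forecaster's true belief Q

def expected_log_score(p, q):
    return float(np.sum(q * np.log(p)))

best = max(
    (rng.dirichlet(np.ones(3)) for _ in range(10000)),
    key=lambda p: expected_log_score(p, q),
)
print(best)                      # approximately (0.5, 0.3, 0.2)
print(expected_log_score(q, q))  # no random candidate P exceeds this value
```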

A function G : P → R is convex if

$$ G((1 - \lambda) P_0 + \lambda P_1) \le (1 - \lambda) G(P_0) + \lambda G(P_1) \qquad \text{for all } \lambda \in (0, 1),\ P_0, P_1 \in \mathcal{P}. \qquad (3) $$


It is strictly convex if (3) holds with equality if and only if P_0 = P_1. A function G*(P, ·) : Ω → R is a subtangent of G at the point P ∈ P if it is integrable with respect to P, quasi-integrable with respect to all Q ∈ P, and

$$ G(Q) \ge G(P) + \int G^*(P, \omega) \, d(Q - P)(\omega) \qquad (4) $$

for all Q ∈ P. The following characterization theorem is more general and considerably simpler than previous results of McCarthy (1956) and Hendrickson and Buehler (1971).

Definition 1. A scoring rule S : P × Ω → R̄ is regular relative to the class P if S(P, Q) is real-valued for all P, Q ∈ P, except possibly that S(P, Q) = −∞ if P ≠ Q.

Theorem 1. A regular scoring rule S : P × Ω → R̄ is proper relative to the class P if and only if there exists a convex, real-valued function G on P such that

$$ S(P, \omega) = G(P) - \int G^*(P, \omega) \, dP(\omega) + G^*(P, \omega) \qquad (5) $$

for P ∈ P and ω ∈ Ω, where G*(P, ·) : Ω → R is a subtangent of G at the point P ∈ P. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Proof. If the scoring rule S is of the stated form, then the subtangent inequality (4) implies the defining inequality (1), that is, propriety. Conversely, suppose that S is a regular proper scoring rule. Define G : P → R by G(P) = S(P, P) = sup_{Q ∈ P} S(Q, P), which is the pointwise supremum over a class of convex functions and thus is convex on P. Furthermore, the subtangent inequality (4) holds with G*(P, ω) = S(P, ω). This implies the representation (5) and proves the claim for propriety. By an argument of Hendrickson and Buehler (1971), strict inequality in (1) is equivalent to no subtangent of G at P being a subtangent of G at Q, for P, Q ∈ P and P ≠ Q, which is equivalent to G being strictly convex on P.

Expressed slightly differently, a regular scoring rule S is proper relative to the class P if and only if the expected score function G(P) = S(P, P) is convex and S(P, ·) is a subtangent of G at the point P, for all P ∈ P.

2.2 Information Measures, Bregman Divergences, and Decision Theory

Suppose that the scoring rule S is proper relative to the class P. Following Grünwald and Dawid (2004) and Buja, Stuetzle, and Shen (2005), we call the expected score function

$$ G(P) = \sup_{Q \in \mathcal{P}} S(Q, P), \qquad P \in \mathcal{P}, \qquad (6) $$

the information measure or generalized entropy function associated with the scoring rule S. This is the maximally achievable utility; the term entropy function is used as well. If S is regular and proper, then we call

$$ d(P, Q) = S(Q, Q) - S(P, Q), \qquad P, Q \in \mathcal{P}, \qquad (7) $$

the associated divergence function. Note the order of the arguments, which differs from previous practice in that the true distribution Q is preceded by an alternative probabilistic forecast P. The divergence function is nonnegative, and if S is strictly proper, then d(P, Q) is strictly positive unless P = Q. If the sample space is finite and the entropy function is sufficiently smooth, then the divergence function becomes the Bregman divergence (Bregman 1967) associated with the convex function G. Bregman divergences play major roles in optimization and have recently attracted the attention of the machine learning community (Collins, Schapire, and Singer 2002). The term Bregman distance is also used, even though d(P, Q) is not necessarily the same as d(Q, P).

An interesting problem is to find conditions under which a divergence function d is a score divergence, in the sense that it admits the representation (7) for a proper scoring rule S, and to describe principled ways of finding such a scoring rule. The landmark work by Savage (1971) provides a necessary condition on a symmetric divergence function d to be a score divergence: If P and Q are concentrated on the same two mutually exclusive events and identified with the respective probabilities p, q ∈ [0, 1], then d(P, Q) reduces to a linear function of (p − q)². Dawid (1998) noted that if d is a score divergence, then d(P, Q) − d(P′, Q) is an affine function of Q for all P, P′ ∈ P, and proved a partial converse.

Friedman (1983) and Nau (1985) studied a looser type of relationship between proper scoring rules and distance measures on classes of probability distributions. They restricted attention to metrics (i.e., distance measures that are symmetric and satisfy the triangle inequality) and called a scoring rule S effective with respect to a metric d if

$$ S(P_1, Q) \ge S(P_2, Q) \iff d(P_1, Q) \le d(P_2, Q). $$

Nau (1985) called a metric co-effective if there is a proper scoring rule that is effective with respect to it. His proposition 1 implies that the l_1, l_∞, and Hellinger distances on spaces of absolutely continuous probability measures are not co-effective.

Sections 3–5 provide numerous examples of proper scoring rules on general sample spaces, along with the associated entropy and divergence functions. For example, the logarithmic score is linked to Shannon entropy and Kullback–Leibler divergence. Dawid (1998, 2006), Grünwald and Dawid (2004), and Buja et al. (2005) have given further examples of proper scoring rules, entropy, and divergence functions, and have elaborated on the connections to the Bregman divergence.

Proper scoring rules occur naturally in statistical decision problems (Dawid 1998). Given an outcome space and an action space, let U(ω, a) be the utility for outcome ω and action a, and let P be a convex class of probability measures on the outcome space. Let a_P denote the Bayes act for P ∈ P. Then the scoring rule

$$ S(P, \omega) = U(\omega, a_P) $$

is proper relative to the class P. Indeed,

$$ S(Q, Q) = \int U(\omega, a_Q) \, dQ(\omega) \ge \int U(\omega, a_P) \, dQ(\omega) = S(P, Q), $$

by the fact that the optimal Bayesian decision maximizes expected utility. Dawid (2006) has given details and discussed the generality of the construction.
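A minimal sketch of this construction, under an assumed cost-loss decision problem that is not taken from the article (protective action at cost c, a unit loss if the event strikes unprotected): the induced scoring rule is proper, though not strictly proper, since any forecast on the same side of c as the truth attains the maximum expected score.

```python
# Hedged sketch of the decision-theoretic construction: S(P, omega) =
# U(omega, a_P), with a_P the Bayes act, is a proper scoring rule.
# The utilities below are a standard textbook choice, not from the paper.
import numpy as np

c = 0.3  # cost-loss ratio

def utility(omega, act):
    """U(omega, a): 'protect' always costs c; 'skip' loses 1 if omega = 1."""
    if act == "protect":
        return -c
    return -1.0 if omega == 1 else 0.0

def bayes_act(p):
    """Bayes act for forecast probability p of the event omega = 1."""
    return "protect" if p > c else "skip"

def score(p, omega):
    return utility(omega, bayes_act(p))

# Propriety check: with true probability q, the expected score is maximized
# by any p on q's side of the threshold c (proper, not strictly proper).
q = 0.45
exp_score = lambda p: q * score(p, 1) + (1 - q) * score(p, 0)
print(max(np.linspace(0, 1, 101), key=exp_score))  # some p > c, as expected
```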


2.3 Skill Scores

In practice, scores are aggregated, and competing forecast procedures are ranked by the average score

$$ S_n = \frac{1}{n} \sum_{i=1}^{n} S(P_i, x_i) $$

over a fixed set of forecast situations. We give examples of this in case studies in Sections 6 and 8. Recommendations for choosing a scoring rule have been given by Winkler (1994, 1996), by Buja et al. (2005), and throughout this article.

Scores for competing forecast procedures are directly comparable if they refer to exactly the same set of forecast situations. If scores for distinct sets of situations are compared, then considerable care must be exercised to separate the confounding effects of intrinsic predictability and predictive performance. For instance, there is substantial spatial and temporal variability in the predictability of weather and climate elements (Langland et al. 1999; Campbell and Diebold 2005). Thus a score that is superior for a given location or season might be inferior for another, or vice versa. To address this issue, atmospheric scientists have put forth skill scores of the form

$$ S_n^{\mathrm{skill}} = \frac{S_n^{\mathrm{fcst}} - S_n^{\mathrm{ref}}}{S_n^{\mathrm{opt}} - S_n^{\mathrm{ref}}}, \qquad (8) $$

where S_n^fcst is the forecaster's score, S_n^opt refers to a hypothetical ideal or optimal forecast, and S_n^ref is the score for a reference strategy (Murphy 1973; Potts 2003, p. 27; Briggs and Ruppert 2005; Wilks 2006, p. 259). Skill scores are standardized in that (8) takes the value 1 for an optimal forecast, which is typically understood as a point measure in the event or value that materializes, and the value 0 for the reference forecast. Negative values of a skill score indicate forecasts that are of lesser quality than the reference. The reference forecast is typically a climatological forecast, that is, an estimate of the marginal distribution of the predictand. For example, a climatological probabilistic forecast for maximum temperature on Independence Day in Seattle, Washington, might be a smoothed version of the local historic record of July 4 maximum temperatures. Climatological forecasts are independent of the forecast horizon; they are calibrated by construction, but often lack sharpness.
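A hedged illustration of (8), with the Brier score as the underlying rule, the sample climatology as the reference, and the point measure in the observed event as the optimal forecast; the data and forecast mechanism are synthetic and ours:

```python
# Illustrative Brier skill score in the form (8); made-up binary data.
import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.3, size=200)  # events that materialize
p_fcst = np.clip(0.3 + 0.4 * y + rng.normal(0, 0.15, 200), 0.01, 0.99)

def mean_brier(p, y):
    """Case-averaged, positively oriented Brier score -(p - y)^2."""
    return float(np.mean(-(p - y) ** 2))

s_fcst = mean_brier(p_fcst, y)
s_ref = mean_brier(np.full_like(p_fcst, y.mean()), y)  # climatology
s_opt = mean_brier(y.astype(float), y)                 # perfect forecast: 0
skill = (s_fcst - s_ref) / (s_opt - s_ref)
print(skill)  # 1 = optimal, 0 = no better than climatology, < 0 = worse
```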

Unfortunately, skill scores of the form (8) are generally improper, even if the underlying scoring rule S is proper. Murphy (1973) studied hedging strategies in the case of the Brier skill score for probability forecasts of a dichotomous event. He showed that the Brier skill score is asymptotically proper, in the sense that the benefits of hedging become negligible as the number of independent forecasts grows. Similar arguments may apply to skill scores based on other proper scoring rules. Mason's (2004) claim of the propriety of the Brier skill score rests on unjustified approximations and generally is incorrect.

3. SCORING RULES FOR CATEGORICAL VARIABLES

We now review the representations of Savage (1971) and Schervish (1989) that characterize scoring rules for probabilistic forecasts of categorical and binary variables, and give examples of proper scoring rules.

3.1 Savage Representation

We consider probabilistic forecasts of a categorical variable. Thus the sample space Ω = {1, …, m} consists of a finite number m of mutually exclusive events, and a probabilistic forecast is a probability vector (p_1, …, p_m). Using the notation of Section 2, we consider the convex class P = P_m, where

$$ \mathcal{P}_m = \{ \mathbf{p} = (p_1, \ldots, p_m) : p_1, \ldots, p_m \ge 0, \ p_1 + \cdots + p_m = 1 \}. $$

A scoring rule S can then be identified with a collection of m functions

$$ S(\cdot, i) : \mathcal{P}_m \to \bar{\mathbb{R}}, \qquad i = 1, \ldots, m. $$

In other words, if the forecaster quotes the probability vector p and the event i materializes, then his or her reward is S(p, i). Theorem 2 is a special case of Theorem 1 and provides a rigorous version of the Savage (1971) representation of proper scoring rules on finite sample spaces. Our contributions lie in the notion of regularity, the rigorous treatment, and the introduction of appropriate tools for convex analysis (Rockafellar 1970, sects. 23–25). Specifically, let G : P_m → R be a convex function. A vector G′(p) = (G′_1(p), …, G′_m(p)) is a subgradient of G at the point p ∈ P_m if

$$ G(\mathbf{q}) \ge G(\mathbf{p}) + \langle G'(\mathbf{p}), \mathbf{q} - \mathbf{p} \rangle \qquad (9) $$

for all q ∈ P_m, where ⟨·, ·⟩ denotes the standard scalar product. If G is differentiable at an interior point p ∈ P_m, then G′(p) is unique and equals the gradient of G at p. We assume that the components of G′(p) are real-valued, except that we permit G′_i(p) = −∞ if p_i = 0.

Definition 2. A scoring rule S for categorical forecasts is regular if S(·, i) is real-valued for i = 1, …, m, except possibly that S(p, i) = −∞ if p_i = 0.

Regular scoring rules assign finite scores, except that a forecast might receive a score of −∞ if an event claimed to be impossible is realized. The logarithmic scoring rule (Good 1952) provides a prominent example of this.

Theorem 2 (McCarthy; Savage). A regular scoring rule S for categorical forecasts is proper if and only if

$$ S(\mathbf{p}, i) = G(\mathbf{p}) - \langle G'(\mathbf{p}), \mathbf{p} \rangle + G'_i(\mathbf{p}) \qquad \text{for } i = 1, \ldots, m, \qquad (10) $$

where G : P_m → R is a convex function and G′(p) is a subgradient of G at the point p, for all p ∈ P_m. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Phrased slightly differently, a regular scoring rule S is proper if and only if the expected score function G(p) = S(p, p) is convex on P_m and the vector with components S(p, i), for i = 1, …, m, is a subgradient of G at the point p, for all p ∈ P_m. In view of these results, every bounded convex function G on P_m generates a regular proper scoring rule. This function G becomes the expected score function, information measure, or entropy function (6) associated with the score. The divergence function (7) is the respective Bregman distance.
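The following sketch (ours; all names illustrative) builds a scoring rule from a smooth convex G via the representation (10) and checks propriety numerically; with G(p) = Σ_j p_j² − 1 it reproduces the Brier score of Example 1 below:

```python
# Sketch of the Savage representation (10): a proper scoring rule generated
# from a smooth convex G on the probability simplex.
import numpy as np

def G(p):
    return np.sum(p ** 2) - 1.0

def grad_G(p):
    return 2.0 * p

def score(p, i):
    """S(p, i) = G(p) - <G'(p), p> + G'_i(p), the representation (10)."""
    return G(p) - grad_G(p) @ p + grad_G(p)[i]

rng = np.random.default_rng(3)
q = np.array([0.2, 0.5, 0.3])
exp_score = lambda p: sum(q[i] * score(p, i) for i in range(3))
# Propriety: S(q, q) >= S(p, q) for random candidate forecasts p.
assert all(exp_score(q) >= exp_score(rng.dirichlet(np.ones(3)))
           for _ in range(1000))
```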

We now give a number of examples. The scoring rules in Examples 1–3 are strictly proper. The score in Example 4 is proper, but not strictly proper.


Example 1 (Quadratic or Brier score). If G(p) = Σ_{j=1}^m p_j² − 1, then (10) yields the quadratic score or Brier score,

$$ S(\mathbf{p}, i) = -\sum_{j=1}^{m} (\delta_{ij} - p_j)^2 = 2 p_i - \sum_{j=1}^{m} p_j^2 - 1, $$

where δ_ij = 1 if i = j, and δ_ij = 0 otherwise. The associated Bregman divergence is the squared Euclidean distance, d(p, q) = Σ_{j=1}^m (p_j − q_j)². This well-known scoring rule was proposed by Brier (1950); Selten (1998) gave an axiomatic characterization.

Example 2 (Spherical score). Let α > 1, and consider the generalized entropy function G(p) = (Σ_{j=1}^m p_j^α)^{1/α}. This corresponds to the pseudospherical score,

$$ S(\mathbf{p}, i) = \frac{p_i^{\alpha - 1}}{\left( \sum_{j=1}^{m} p_j^\alpha \right)^{(\alpha - 1)/\alpha}}, $$

which reduces to the traditional spherical score when α = 2. The associated Bregman divergence is

$$ d(\mathbf{p}, \mathbf{q}) = \left( \sum_{j=1}^{m} q_j^\alpha \right)^{1/\alpha} - \frac{\sum_{j=1}^{m} p_j q_j^{\alpha - 1}}{\left( \sum_{j=1}^{m} q_j^\alpha \right)^{(\alpha - 1)/\alpha}}. $$

Example 3 (Logarithmic score). Negative Shannon entropy, G(p) = Σ_{j=1}^m p_j log p_j, corresponds to the logarithmic score, S(p, i) = log p_i. The associated Bregman distance is the Kullback–Leibler divergence, d(p, q) = Σ_{j=1}^m q_j log(q_j/p_j). [Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule dates back at least to Good (1952). Information-theoretic perspectives and interpretations in terms of gambling returns have been given by Roulston and Smith (2002) and Daley and Vere-Jones (2004). Despite its popularity, the logarithmic score has been criticized for its unboundedness, with Selten (1998, p. 51) arguing that it entails value judgments that are unacceptable. Feuerverger and Rahman (1992) noted a connection to Neyman–Pearson theory and an ensuing optimality property of the logarithmic score.

Example 4 (Zero–one score). The zero–one scoring rule rewards a probabilistic forecast if the mode of the predictive distribution materializes. In case of multiple modes, the reward is reduced proportionally; that is,

$$ S(\mathbf{p}, i) = \begin{cases} 1/|M(\mathbf{p})| & \text{if } i \text{ belongs to } M(\mathbf{p}), \\ 0 & \text{otherwise}, \end{cases} $$

where M(p) = {i : p_i = max_{j=1,…,m} p_j} denotes the set of modes of p. This is also known as the misclassification loss, and the meteorological literature uses the term success rate to denote case-averaged zero–one scores (see, e.g., Toth, Zhu, and Marchok 2001). The associated expected score or generalized entropy function (6) is G(p) = max_{j=1,…,m} p_j, and the divergence function (7) becomes

$$ d(\mathbf{p}, \mathbf{q}) = \max_{j=1,\ldots,m} q_j - \frac{\sum_{j \in M(\mathbf{p})} q_j}{|M(\mathbf{p})|}. $$

This does not define a Bregman divergence, because the entropy function is neither differentiable nor strictly convex.

The scoring rules in the foregoing examples are symmetric, in the sense that

$$ S((p_1, \ldots, p_m), i) = S((p_{\pi_1}, \ldots, p_{\pi_m}), \pi_i) \qquad (11) $$

for all p ∈ P_m, for all permutations π on m elements, and for all events i = 1, …, m. Winkler (1994, 1996) argued that symmetric rules do not always appropriately reward forecasting skill and called for asymmetric ones, particularly in situations in which skill scores traditionally have been used. Asymmetric proper scoring rules can be generated by applying Theorem 2 to convex functions G that are not invariant under coordinate permutation.

3.2 Schervish Representation

The classical case of a probability forecast for a dichotomous event suggests further discussion. We follow Dawid (1986) in considering the sample space Ω = {1, 0}. A probabilistic forecast is a quoted probability p ∈ [0, 1] for the event to occur. A scoring rule S can be identified with a pair of functions S(·, 1) : [0, 1] → R̄ and S(·, 0) : [0, 1] → R̄. Thus S(p, 1) is the forecaster's reward if he or she quotes p and the event materializes, and S(p, 0) is the reward if he or she quotes p and the event does not materialize. Note the subtle change from the previous section, where we used the convex class P_2 = {(p_1, p_2) ∈ R² : p_1 ∈ [0, 1], p_2 = 1 − p_1} in place of the unit interval P = [0, 1] to represent probability measures on binary sample spaces.

A scoring rule for binary variables is regular if S(·, 1) and S(·, 0) are real-valued, except possibly that S(0, 1) = −∞ or S(1, 0) = −∞. A variant of Theorem 2 shows that every regular proper scoring rule is of the form

$$ S(p, 1) = G(p) + (1 - p) G'(p), \qquad S(p, 0) = G(p) - p \, G'(p), \qquad (12) $$

where G : [0, 1] → R is a convex function and G′(p) is a subgradient of G at the point p ∈ [0, 1], in the sense that

$$ G(q) \ge G(p) + G'(p)(q - p) $$

for all q ∈ [0, 1]. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex. The subgradient G′(p) is real-valued, except that we permit G′(0) = −∞ and G′(1) = ∞. The function G is the expected score function G(p) = pS(p, 1) + (1 − p)S(p, 0), and if G is differentiable at an interior point p ∈ (0, 1), then G′(p) is unique and equals the derivative of G at p. Related but slightly less general results were given by Shuford, Albert, and Massengill (1966). Figure 1 provides a geometric interpretation.

The Savage representation (12) implies various interesting properties of regular proper scoring rules. For instance, we conclude from theorem 24.2 of Rockafellar (1970) that

$$ S(p, 1) = \lim_{q \to 1} G(q) - \int_p^1 \left( G'(q) - G'(p) \right) dq \qquad (13) $$

for p ∈ (0, 1), and because G′(p) is increasing, S(p, 1) is increasing as well. Similarly, S(p, 0) is decreasing, as would be intuitively expected. The statements hold with proper, increasing, and decreasing replaced by strictly proper, strictly increasing, and strictly decreasing. Alternative proofs of these and other results have been given by Schervish (1989, the app.).


Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast p ∈ [0, 1], the expected score S(p, q) = qS(p, 1) + (1 − q)S(p, 0) equals the ordinate of the tangent to G at p [the solid line with slope G′(p)] when evaluated at q ∈ [0, 1]. In particular, the scores S(p, 0) = G(p) − pG′(p) and S(p, 1) = G(p) + (1 − p)G′(p) can be read off the tangent when evaluated at q = 0 and q = 1. The Bregman divergence d(p, q) = S(q, q) − S(p, q) equals the difference between G and its tangent at p when evaluated at q. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter we let 1{·} denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that S is a regular scoring rule. Then S is proper and such that S(0, 1) = lim_{p→0} S(p, 1) and S(0, 0) = lim_{p→0} S(p, 0), and both S(p, 1) and S(p, 0) are left continuous, if and only if there exists a nonnegative measure ν on (0, 1) such that

$$ S(p, 1) = S(1, 1) - \int (1 - c) \, 1\{p \le c\} \, \nu(dc), \qquad S(p, 0) = S(0, 0) - \int c \, 1\{p > c\} \, \nu(dc) \qquad (14) $$

for all p ∈ [0, 1]. The scoring rule is strictly proper if and only if ν assigns positive measure to every open interval.

Sketch of Proof. Suppose that S satisfies the assumptions of the theorem. To prove that S(p, 1) is of the form (14), consider the representation (13), identify the increasing function G′(p) with the left-continuous distribution function of a nonnegative measure ν on (0, 1), and apply the partial integration formula. The proof of the representation for S(p, 0) is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost–loss ratio c ∈ (0, 1) that reflects the relative costs of the two possible types of inferior decision. The measure ν(dc) in Schervish's representation (14) assigns relevance to distinct cost–loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted asymmetric zero–one scores,

$$ S_c(p, 1) = (1 - c) \, 1\{p > c\}, \qquad S_c(p, 0) = c \, 1\{p \le c\}, \qquad (15) $$

with a nonnegative mixing measure ν(dc). Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure ν(dc) for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero–one score. If the expected score function G is smooth, then ν(dc) has Lebesgue density G″(c) (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, G(p) = p log p + (1 − p) log(1 − p), and corresponds to the infinite measure with Lebesgue density (c(1 − c))⁻¹.
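As a numerical check of (14), done under our own naming and not taken from the article, the Brier score arises from the mixing density G″(c) = 2, a uniform measure as in Table 1 up to scaling:

```python
# Check the Schervish representation (14) for the Brier score: the mixture
# of cost-weighted zero-one scores with density G''(c) = 2 recovers
# S(p, 1) = -(1 - p)^2 and S(p, 0) = -p^2.
import numpy as np
from scipy.integrate import quad

density = lambda c: 2.0  # G''(c) for G(p) = -p(1 - p)

def S1(p):  # S(p, 1) = S(1, 1) - int (1 - c) 1{p <= c} nu(dc)
    return 0.0 - quad(lambda c: (1 - c) * density(c), p, 1)[0]

def S0(p):  # S(p, 0) = S(0, 0) - int c 1{p > c} nu(dc)
    return 0.0 - quad(lambda c: c * density(c), 0, p)[0]

for p in (0.1, 0.5, 0.8):
    assert np.isclose(S1(p), -(1 - p) ** 2)
    assert np.isclose(S0(p), -p ** 2)
```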

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Example 5 (Beta family). Let α, β > −1, and consider the two-parameter family

$$ S(p, 1) = -\int_p^1 c^{\alpha - 1} (1 - c)^\beta \, dc, $$


Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule | S(p, 1)                    | S(p, 0)                          | ν(dc)
Brier        | −(1 − p)²                  | −p²                              | uniform
Spherical    | p(1 − 2p + 2p²)^(−1/2)     | (1 − p)(1 − 2p + 2p²)^(−1/2)     | (1 − 2c + 2c²)^(−3/2)
Logarithmic  | log p                      | log(1 − p)                       | (c(1 − c))⁻¹
Zero–one     | (1 − c) 1{p > c}           | c 1{p ≤ c}                       | point measure in c

$$ S(p, 0) = -\int_0^p c^\alpha (1 - c)^{\beta - 1} \, dc, $$

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^(α−1)(1 − c)^(β−1). This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p, 1) = p − 1 and S(p, 0) = p + log(1 − p) being one such example (α = 1, β = 0).

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0, 1), then

$$ S^*(p, 1) = \frac{S(p, 1) - S(c, 1)}{T(c, p)}, \qquad S^*(p, 0) = \frac{S(p, 0) - S(c, 0)}{T(c, p)}, \qquad (16) $$

where T(c, p) = S(0, 0) − S(c, 0) if p ≤ c, and T(c, p) = S(1, 1) − S(c, 1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

$$ S^*(p, 1) = \frac{(1 - c)^2 - (1 - p)^2}{c^2 \, 1\{p \le c\} + (1 - c)^2 \, 1\{p > c\}}, \qquad S^*(p, 0) = \frac{c^2 - p^2}{c^2 \, 1\{p \le c\} + (1 - c)^2 \, 1\{p > c\}}, \qquad (17) $$

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.
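A minimal implementation of (17) follows; the default baseline c = 0.2 is an arbitrary choice of ours, not a value from the article:

```python
# Winkler's standardized score (17), the Brier special case of (16).
def winkler_score(p, outcome, c=0.2):
    denom = c ** 2 if p <= c else (1 - c) ** 2
    if outcome == 1:
        return ((1 - c) ** 2 - (1 - p) ** 2) / denom
    return (c ** 2 - p ** 2) / denom

# Standardization: the score is 0 at p = c for either outcome, and a
# certain, correct forecast earns 1.
print(winkler_score(0.2, 1), winkler_score(0.2, 0))  # both 0
print(winkler_score(1.0, 1), winkler_score(0.0, 0))  # both 1
```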

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let µ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to µ and have µ-density p such that

$$ \|p\|_\alpha = \left( \int p(\omega)^\alpha \, \mu(d\omega) \right)^{1/\alpha} $$

is finite. We identify a probabilistic forecast P ∈ L_α with its µ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of µ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω))/µ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score,

$$ \mathrm{QS}(p, \omega) = 2 p(\omega) - \|p\|_2^2, \qquad (18) $$

is strictly proper relative to the class L_2. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function d(p, q) = ‖p − q‖₂² is symmetric. Good (1971) proposed the pseudospherical score,

$$ \mathrm{PseudoS}(p, \omega) = \frac{p(\omega)^{\alpha - 1}}{\|p\|_\alpha^{\alpha - 1}}, $$

which reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that in a measure-theoretic framework is obsolete. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function G(p) = ‖p‖_α and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score,

$$ \mathrm{LogS}(p, \omega) = \log p(\omega), \qquad (19) $$

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L_1 of the probability measures dominated by µ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value should be taken into account."

Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score, LinS(p, ω) = p(ω), is not a proper scoring rule, despite its intuitive appeal. For instance, let ϕ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < (log 2)^(1/2), then

$$ \mathrm{LinS}(u, \varphi) = \frac{1}{(2\pi)^{1/2}} \, \frac{1}{2\varepsilon} \int_{-\varepsilon}^{\varepsilon} e^{-x^2/2} \, dx > \frac{1}{2 \pi^{1/2}} = \mathrm{LinS}(\varphi, \varphi), $$

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score and is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

$$ \mathrm{CRPS}(F, x) = -\int_{-\infty}^{\infty} \left( F(y) - 1\{y \ge x\} \right)^2 dy \qquad (20) $$

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

$$ \mathrm{CRPS}(F, x) = \frac{1}{2} \mathrm{E}_F |X - X'| - \mathrm{E}_F |X - x|, \qquad (21) $$

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean µ and variance σ², then it follows that

$$ \mathrm{CRPS}(\mathcal{N}(\mu, \sigma^2), x) = \sigma \left[ \frac{1}{\sqrt{\pi}} - 2 \varphi\!\left( \frac{x - \mu}{\sigma} \right) - \frac{x - \mu}{\sigma} \left( 2 \Phi\!\left( \frac{x - \mu}{\sigma} \right) - 1 \right) \right], $$

where ϕ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
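The following sketch (ours) evaluates the Gaussian closed form above and checks it against a Monte Carlo approximation to the kernel representation (21); positive orientation as in the text, whereas many software packages report −CRPS:

```python
# Closed-form CRPS for a Gaussian predictive distribution, with a Monte
# Carlo check based on independent copies X, X' as in (21).
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    z = (x - mu) / sigma
    return sigma * (1 / np.sqrt(np.pi) - 2 * norm.pdf(z)
                    - z * (2 * norm.cdf(z) - 1))

rng = np.random.default_rng(4)
mu, sigma, x = 0.0, 1.0, 0.7
X = rng.normal(mu, sigma, 100000)
Xp = rng.normal(mu, sigma, 100000)
mc = 0.5 * np.mean(np.abs(X - Xp)) - np.mean(np.abs(X - x))
print(crps_gaussian(mu, sigma, x), mc)  # agree up to Monte Carlo error
```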

The CRPS is proper relative to the class P and strictly proper relative to the subclass P_1 of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

$$ G(F) = -\int_{-\infty}^{\infty} F(y) (1 - F(y)) \, dy = -\frac{1}{2} \mathrm{E}_F |X - X'|, $$

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

$$ d(F, G) = \int_{-\infty}^{\infty} \left( F(y) - G(y) \right)^2 dy, $$

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

$$ \mathrm{CRPS}^*(F, x) = \mathrm{E}_F |X - x| - \frac{1}{2} \mathrm{E}_F |X - X'|, $$

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P ‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score

$$ \mathrm{ES}(P, \mathbf{x}) = \frac{1}{2} \mathrm{E}_P \|\mathbf{X} - \mathbf{X}'\|^\beta - \mathrm{E}_P \|\mathbf{X} - \mathbf{x}\|^\beta, \qquad (22) $$

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

$$ \mathrm{ES}(P, \mathbf{x}) = -\|\boldsymbol{\mu}_P - \mathbf{x}\|^2, \qquad (23) $$

where µ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P_2.
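A Monte Carlo sketch of (22) when the forecast P is given as an ensemble, that is, an equally weighted sample; the ensemble below is synthetic and the function name is ours:

```python
# Energy score (22) for the empirical measure of an ensemble in R^m.
import numpy as np

def energy_score(ens, x, beta=1.0):
    """ES(P, x) for the empirical measure P of the rows of `ens`."""
    d_xx = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=-1) ** beta
    d_xo = np.linalg.norm(ens - x, axis=-1) ** beta
    return 0.5 * d_xx.mean() - d_xo.mean()

rng = np.random.default_rng(5)
ens = rng.normal(size=(50, 3))  # 50 ensemble members in R^3
x = np.array([0.2, -0.1, 0.4])  # the materializing observation
print(energy_score(ens, x))
```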

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m, by defining

$$ \mathrm{ES}(P, \mathbf{x}) = -\frac{\beta \, 2^{\beta - 2} \, \Gamma\!\left( \frac{m + \beta}{2} \right)}{\pi^{m/2} \, \Gamma\!\left( 1 - \frac{\beta}{2} \right)} \int_{\mathbb{R}^m} \frac{|\varphi_P(\mathbf{y}) - e^{i \langle \mathbf{x}, \mathbf{y} \rangle}|^2}{\|\mathbf{y}\|^{m + \beta}} \, d\mathbf{y}, \qquad (24) $$

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector µ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

$$ S(P, \mathbf{x}) = -\log \det \Sigma_P - (\mathbf{x} - \boldsymbol{\mu}_P)' \Sigma_P^{-1} (\mathbf{x} - \boldsymbol{\mu}_P), \qquad (25) $$

which is linked to the generalized entropy function

$$ G(P) = -\log \det \Sigma_P - m $$

and to the divergence function

$$ d(P, Q) = \mathrm{tr}(\Sigma_P^{-1} \Sigma_Q) - \log \det(\Sigma_P^{-1} \Sigma_Q) + (\boldsymbol{\mu}_P - \boldsymbol{\mu}_Q)' \Sigma_P^{-1} (\boldsymbol{\mu}_P - \boldsymbol{\mu}_Q) - m. $$


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P_2 of the Borel probability measures P for which E_P ‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on µ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x_1, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = ∑_{i=1}^n (x_i − μ_i)² + ∑_{i=1}^n σ_i²,

where μ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − μ_P)² − σ_P²,   (26)

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P (that is, a deterministic forecast) rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − μ_P)/σ_P)² − log σ_P²   (27)

in the case in which m = 1 and the observations are real-valued.
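A quick numerical check of this impropriety, under illustrative assumptions of our own choosing (true belief P = N(0, 1) and quoted Gaussian forecasts N(0, s²) that differ only in the standard deviation s): the mean of (26) grows as s shrinks toward 0, whereas the mean of (27) peaks near the true value s = 1.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)          # draws from the true belief P = N(0, 1)
s_grid = np.linspace(0.05, 2.0, 40)       # candidate predictive standard deviations

pmcc = [np.mean(-(x**2) - s**2) for s in s_grid]               # score (26)
ds = [np.mean(-((x / s)**2) - np.log(s**2)) for s in s_grid]   # score (27)

print(s_grid[int(np.argmax(pmcc))])   # smallest s in the grid: (26) rewards overconfidence
print(s_grid[int(np.argmax(ds))])     # close to 1, the true standard deviation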

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and ∑_{i=1}^n ∑_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a_1, …, a_n ∈ R that sum to 0, and all x_1, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P g(X, X′) − E_P g(X, x)   (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q, respectively. We need to show that

−(1/2) E_Q g(Y, Y′) ≥ (1/2) E_P g(X, X′) − E_{P,Q} g(X, Y).   (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.

Example 7 (Quadratic or Brier score). Let Ω = {0, 1}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic, or Brier, score.

Example 8 (CRPS). If Ω = R and g(x, x′) = |x − x′| for x, x′ ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0, 2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ R^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus,

S(P, θ) = (1/2) E_P α(Θ, Θ′) − E_P α(Θ, θ)   (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.
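A sample-based evaluation of (30) is a direct transcription of the kernel form; the sketch below is ours, with the angular distance computed modulo 2π and a von Mises predictive sample standing in for P.

import numpy as np

def angular_distance(t1, t2):
    # Angular distance between points on the circle, angles in radians.
    d = np.abs(t1 - t2) % (2.0 * np.pi)
    return np.minimum(d, 2.0 * np.pi - d)

def crps_circular(sample, theta):
    # Plug-in estimate of (30): (1/2) E alpha(T, T') - E alpha(T, theta).
    obs = np.mean(angular_distance(sample, theta))
    pair = np.mean(angular_distance(sample[:, None], sample[None, :]))
    return 0.5 * pair - obs

rng = np.random.default_rng(2)
ens = rng.vonmises(mu=0.0, kappa=2.0, size=400)   # predictive sample on the circle
print(crps_circular(ens, 0.3))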

We turn to a far-reaching generalization of the energy score. For x = (x_1, …, x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (∑_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β,   x, x′ ∈ R^m,

is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2) E_P ‖X − X′‖_α^β − E_P ‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P ‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q} ‖X − Y‖_α^β is finite if and only if E_P ‖X‖_α^β and E_Q ‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P ‖X − X′‖_α^β is finite if and only if E_P ‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^(k) of all orders and (−1)^k η^(k)(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P ψ(‖X − X′‖_2²) − E_P ψ(‖X − x‖_2²)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖_2²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P ‖X‖_2^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2 E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0.   (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′).   (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2 E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0   (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β,   (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′),   x, x′ ∈ Ω,   (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and ∑_{i=1}^n ∑_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a_1, …, a_n ∈ R, and all x_1, …, x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β,   x, x′ ∈ R^m,   (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) equals the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and ∑_{i=1}^n ∑_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c_1, …, c_n ∈ C, and all x_1, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′)   (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′,y⟩} − 1, where x, x′ ∈ R^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x,y⟩}|²,   (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{R^m} |φ_P(y) − e^{i⟨x,y⟩}|² μ(dy).   (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function


to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).
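For a discrete predictive distribution, the equality of (37) and (38) can be verified numerically; the following sketch is ours, in the univariate case m = 1, with an equally weighted six-point measure P and both sides evaluated at a fixed y.

import numpy as np

rng = np.random.default_rng(3)
atoms = rng.standard_normal(6)           # support of a discrete P with equal weights
w = np.full(6, 1.0 / 6.0)
y, x = 1.3, 0.7                          # frequency y in (38); materialized value x

h = lambda a, b: np.exp(1j * (a - b) * y) - 1.0      # the kernel of Example 12
phi = np.sum(w * np.exp(1j * atoms * y))             # characteristic function of P at y

lhs = (np.sum(w * h(atoms, x)) + np.sum(w * h(x, atoms))
       - np.sum(np.outer(w, w) * h(atoms[:, None], atoms[None, :])))   # eq. (37)
rhs = -np.abs(phi - np.exp(1j * x * y)) ** 2                           # eq. (38)
print(lhs.real, rhs)   # agree up to rounding; the imaginary part of lhs vanishes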

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α_1, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r_1, …, r_k and x materializes, then he or she will be rewarded by the score S(r_1, …, r_k; x). We define

S(r_1, …, r_k; P) = ∫ S(r_1, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r_1, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q_1, …, q_k denote the true P-quantiles at levels α_1, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q_1, …, q_k; P) ≥ S(r_1, …, r_k; P)

for all real numbers r_1, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r, x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x)   (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q, P) − S(r, P) = ∫_{(r,q]} s(x) dP(x) + s(r) P(r) − α s(r)
   ≥ s(r)(P(q) − P(r)) + s(r) P(r) − α s(r) = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r, x) = (x − r)(1{x ≤ r} − α),   (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
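In sample form, the propriety of (41) is easy to see empirically: the mean score over draws from P is maximized near the true α-quantile. A minimal sketch of ours, with standard Gaussian data and α = 0.9:

import numpy as np

def quantile_score(r, x, alpha):
    # The scoring rule (41), positively oriented: (x - r)(1{x <= r} - alpha).
    return (x - r) * ((x <= r).astype(float) - alpha)

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)
r_grid = np.linspace(-3.0, 3.0, 601)
mean_scores = [np.mean(quantile_score(r, x, alpha=0.9)) for r in r_grid]
print(r_grid[int(np.argmax(mean_scores))])   # near 1.28, the Gaussian 0.9-quantile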

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

S(r_1, …, r_k; x) = ∑_{i=1}^k [α_i s_i(r_i) + (s_i(x) − s_i(r_i)) 1{x ≤ r_i}] + h(x)   (42)

is proper for predicting the quantiles at levels α_1, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α_1 = α/2, α_2 = 1 − α/2, s_1(x) = s_2(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score,

S_α^int(l, u; x) = (u − l) + (2/α)(l − x) 1{x < l} + (2/α)(x − u) 1{x > u}.   (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
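The interval score (43) translates directly into code. The sketch below, our own, scores a 90% central prediction interval for one observation inside and one outside the interval.

import numpy as np

def interval_score(l, u, x, alpha):
    # Negatively oriented interval score (43); smaller values are better.
    return ((u - l)
            + (2.0 / alpha) * (l - x) * (x < l)
            + (2.0 / alpha) * (x - u) * (x > u))

l, u = -1.645, 1.645                             # 90% interval of a standard Gaussian
print(interval_score(l, u, x=0.3, alpha=0.10))   # inside: just the width, 3.29
print(interval_score(l, u, x=2.5, alpha=0.10))   # outside: width plus (2/alpha)(x - u)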


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process (X_t : t ∈ Z) defined by

X_{t+1} = (1/2) X_t + (1/2) X_t ε_t + ε_t,   (44)

where the ε_t's are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2) X_t and variance (1 + (1/2) X_t)², thereby suggesting the prediction interval

I = [(1/2) X_t − c |1 + (1/2) X_t|, (1/2) X_t + c |1 + (1/2) X_t|],   (45)

where c = Φ^{−1}(.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = [F^{−1}(.025), F^{−1}(.975)],   (46)

where F denotes the unconditional stationary distribution function of X_t, and

K = [(1/2) X_t − γ(|1 + (1/2) X_t|), (1/2) X_t + γ(|1 + (1/2) X_t|)],   (47)

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path (X_t : t = 1, …, 100,001) from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage (%)   Average width   Average interval score
I (45)              95.01                    4.00            4.77
J (46)              95.08                    5.45            8.04
K (47)              94.98                    3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
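The experiment behind Table 2 is straightforward to re-create in miniature. The following sketch is ours, with a shorter sample path and only the interval I of (45), so the numbers match Table 2 only approximately.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 20_000
x = np.zeros(n + 1)
eps = rng.standard_normal(n)
for t in range(n):
    # The bilinear process (44): X_{t+1} = X_t/2 + X_t eps_t/2 + eps_t.
    x[t + 1] = 0.5 * x[t] + 0.5 * x[t] * eps[t] + eps[t]

c = norm.ppf(0.975)
sd = np.abs(1.0 + 0.5 * x[:-1])          # conditional standard deviation of X_{t+1}
lo, hi = 0.5 * x[:-1] - c * sd, 0.5 * x[:-1] + c * sd   # interval I of (45)
obs, alpha = x[1:], 0.05

coverage = np.mean((obs >= lo) & (obs <= hi))
score = np.mean((hi - lo)
                + (2.0 / alpha) * (lo - obs) * (obs < lo)
                + (2.0 / alpha) * (obs - hi) * (obs > hi))   # interval score (43)
print(coverage, np.mean(hi - lo), score)   # close to nominal coverage, as in Table 2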

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus, we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = ∫_0^1 S_α(F^{−1}(α), x) ν(dα)   (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on R, then the scoring rule

S(F, x) = ∫_{−∞}^{∞} S(F(y), 1{x ≤ y}) ν(dy)   (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on R^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

CRPS(F, x) = −∫_{R^m} (F(y) − 1{x ≤ y})² ν(dy).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important thresholds. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X_1, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models H_1 and H_2. So far in this article we have concentrated on the situation in which the forecasting rule is completely specified before any of the X_i's are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H_1 against H_2 is

B = P(X | H_1) / P(X | H_2),   (50)

where P(X | H_k) = ∏_{i=1}^n P(X_i | H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

log B = LogS(H_1, X) − LogS(H_2, X).   (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models; it also applies to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X | H_k) is the integrated likelihood,

P(X | H_k) = ∫ p(X | θ_k, H_k) p(θ_k | H_k) dθ_k,

where p(X | θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k | H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X | H_k) = ∏_{t=1}^n P(X_t | X^{t−1}, H_k),   (52)

where X^{t−1} = (X_1, …, X_{t−1}) if t > 1, X^0 is the empty set, and P(X_t | X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t | X^{t−1}, H_k) = ∫ p(X_t | θ_k, H_k) P(θ_k | X^{t−1}, H_k) dθ_k,

with P(θ_k | X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let S_k^B = log P(X | H_k) denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

S_k^B = ∑_{t=1}^n log P(X_t | X^{t−1}, H_k).   (53)

Dawid (1984) showed that S_k^B is asymptotically equivalent to the plug-in maximum likelihood prequential score,

S_k^D = ∑_{t=1}^n log P(X_t | X^{t−1}, θ̂_k^{t−1}),   (54)

where θ̂_k^{t−1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t−1}, in the sense that S_k^D / S_k^B → 1 as n → ∞. Initial terms for which θ̂_k^{t−1} is possibly undefined can be ignored. Dawid also showed that S_k^B is asymptotically equivalent to the Bayes information criterion (BIC) score,

S_k^BIC = ∑_{t=1}^n log P(X_t | X^{t−1}, θ̂_k^n) − (d_k / 2) log n,

where d_k = dim(θ_k), in the same sense, namely S_k^BIC / S_k^B → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S_k^{B∗} = ∑_{t=1}^n E_D [log p(X_t | X(D), H_k)],   (55)

where D is a random sample from {1, …, t − 1, t + 1, …, n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, …, n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version,

S_k^{D∗} = ∑_{t=1}^n E_D [log p(X_t | X(D), θ̂_k^{(D)}, H_k)],   (56)

where θ̂_k^{(D)} is the MLE of θ_k based on X(D). Terms for which the size of D is small and θ̂_k^{(D)} is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating S_k^{B∗} or S_k^{D∗} by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out,


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
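A Monte Carlo evaluation of (56) is short to write down. The sketch below is ours; the model X_i ~ N(θ, 1), with the MLE of θ being the mean of the retained data X(D), is an assumption made purely for illustration.

import numpy as np
from scipy.stats import norm

def random_fold_score(x, n_draws=200, rng=None):
    # Monte Carlo approximation of (56) for the model X_i ~ N(theta, 1).
    rng = rng or np.random.default_rng()
    n = len(x)
    total = 0.0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        terms = []
        for _ in range(n_draws):
            k = rng.integers(0, n)        # |D| uniform on {0, 1, ..., n - 1}
            if k == 0:
                continue                  # theta undefined; ignore, as in the text
            D = rng.choice(others, size=k, replace=False)
            theta_hat = np.mean(x[D])     # MLE based on the retained data X(D)
            terms.append(norm.logpdf(x[t], loc=theta_hat, scale=1.0))
        total += np.mean(terms)
    return total

rng = np.random.default_rng(7)
x = rng.normal(0.5, 1.0, size=30)
print(random_fold_score(x, rng=rng))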

We have seen in (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_k^B, S_k^D, S_k^BIC, S_k^{B∗}, and S_k^{D∗}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r_0 = 1.55.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say φ_{μ,rσ}; its mean μ is the ensemble mean forecast, and its standard deviation rσ is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score,

s(r) = (1/16,015) ∑_{i=1}^{16,015} S(φ_{μ_i, rσ_i}, x_i),   r > 0,   (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio r_0 = 1.55 of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5) and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

QS(p, x) = 2p(x) − ∫_{−∞}^{∞} p(y)² dy,

SphS(p, x) = p(x) / (∫_{−∞}^{∞} p(y)² dy)^{1/2},

LogS(p, x) = log p(x),

CRPS(p, x) = (1/2) E_p|X − X′| − E_p|X − x|,

LinS(p, x) = p(x),

and

PS(p, x) = ∫_{x−1}^{x+1} p(y) dy.

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed, so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                       Argmax_r s(r) in eq. (57)   Linear transformation plotted in Figure 3
Quadratic score (QS)        2.18                        40s + 6
Spherical score (SphS)      1.84                        108s − 22
Logarithmic score (LogS)    2.41                        s + 13
CRPS                        1.62                        10s + 8
Linear score (LinS)         0.5                         105s − 5
Probability score (PS)      0.2                         60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at r = 0.5 and r = 0.2, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.
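A synthetic version of the experiment in (57) conveys the same message. In the sketch below (ours; forecasts centered at 0 with unit ensemble spread and verifications drawn with the true ratio r0 = 1.55, values chosen to mimic the case study), the proper logarithmic score peaks near 1.55, whereas the improper linear score is maximized at the smallest r in the grid.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
sigma, r0 = 1.0, 1.55
x = rng.normal(0.0, r0 * sigma, size=50_000)    # verifying observations

r_grid = np.linspace(0.1, 5.0, 50)
log_s = [np.mean(norm.logpdf(x, scale=r * sigma)) for r in r_grid]   # LogS, proper
lin_s = [np.mean(norm.pdf(x, scale=r * sigma)) for r in r_grid]      # LinS, improper

print(r_grid[int(np.argmax(log_s))])   # near the true inflation factor 1.55
print(r_grid[int(np.argmax(lin_s))])   # 0.1, the smallest value in the grid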

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = .50 and α = .10. The associated lower and upper prediction bounds l_i and u_i are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean μ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in

their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_α(r) = (1/16,015) ∑_{i=1}^{16,015} S_α^int(l_i, u_i; x_i),   r > 0,   (58)

where S_α^int denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = .50 and α = .10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = .50 and α = .10, the score was optimized at r = 1.56 and r = 1.72.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. (a) Nominal and actual coverage and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed line) and the 90% central prediction interval (α = .10, solid line; score scaled by a factor of 6.0). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X_1, …, X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by

the mean score

S_n(θ) = (1/n) ∑_{i=1}^n S(P_θ, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ_0 denotes the true parameter value, then asymptotic arguments indicate that

arg max_θ S_n(θ) → θ_0 as n → ∞.   (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take θ̂_n = arg max_θ S_n(θ) as the respective optimum score estimator. The first four values of the Argmax column in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions, or fitting criteria, in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.
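As a concrete instance of optimum score estimation, the sketch below (ours) fits a Gaussian model by minimizing the mean negatively oriented CRPS, using the closed-form CRPS of a Gaussian distribution given by Gneiting et al. (2005); the optimizer and starting values are incidental choices.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    # Closed-form, negatively oriented CRPS of N(mu, sigma^2) at x.
    z = (x - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

rng = np.random.default_rng(9)
x = rng.normal(1.0, 2.0, size=400)

# Optimum score estimation: minimize the mean CRPS over (mu, log sigma).
objective = lambda p: np.mean(crps_gaussian(p[0], np.exp(p[1]), x))
res = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and standard deviation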

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I; θ) = cλ(I) − 1{θ ∈ I},   (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_α(I; θ) = λ(I) + (2/α) inf_{η ∈ I} |θ − η|   (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.


Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics–Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.


Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics,'" in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.



forecast evaluation, and we present a case study that provides a striking example of the potential issues that result from the use of intuitively appealing but improper scoring rules.

In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to a scientific problem. To fix the idea, suppose that we wish to fit a parametric model Pθ based on a sample X1, . . . , Xn. To estimate θ, we might measure the goodness of fit by the mean score

Sn(θ) = (1/n) ∑_{i=1}^n S(Pθ, Xi),

where S is a strictly proper scoring rule. If θ0 denotes the true parameter value, then asymptotic arguments indicate that arg max_θ Sn(θ) → θ0 as n → ∞. This suggests a general approach to estimation: Choose a strictly proper scoring rule that is tailored to the problem at hand, and use θ̂n = arg max_θ Sn(θ) as the optimum score estimator based on the scoring rule. Pfanzagl (1969) and Birgé and Massart (1993) studied this approach under the heading of minimum contrast estimation. Maximum likelihood estimation forms a special case of optimum score estimation, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule.
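As a concrete illustration of optimum score estimation, the following Python sketch (ours, not part of the original article) fits a Gaussian model by maximizing the mean CRPS, using the closed form for CRPS(N(μ, σ²), x) given in Section 4.2; the data, starting values, and optimizer are illustrative assumptions.

```python
# Minimal sketch of optimum score estimation: fit a Gaussian model P_theta
# by maximizing the mean CRPS S_n(theta) over a sample (assumptions: NumPy,
# SciPy, and synthetic data; not the authors' code).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    """Positively oriented CRPS of a N(mu, sigma^2) forecast at x (Sec. 4.2)."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def mean_score(params, sample):
    mu, log_sigma = params
    return np.mean(crps_gaussian(mu, np.exp(log_sigma), sample))

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)   # illustrative data

# theta_hat_n = arg max_theta S_n(theta), computed as a minimization.
res = minimize(lambda p: -mean_score(p, sample), x0=np.array([0.0, 0.0]),
               method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))   # approaches (10, 2) as n grows
```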

This article reviews and develops the theory of proper scoring rules on general probability spaces, proposes and discusses examples thereof, and presents case studies. The remainder of the article is organized as follows. In Section 2 we state a fundamental characterization theorem, review the links between proper scoring rules, information measures, entropy functions, and Bregman divergences, and introduce skill scores. In Section 3 we turn to scoring rules for categorical variables. We prove a rigorous version of the representation of Savage (1971) and relate it to a more recent characterization of Schervish (1989) that applies to probability forecasts of a dichotomous event. Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in Section 4, where we discuss the spherical, pseudospherical, logarithmic, and quadratic scores. The continuous ranked probability score, which lately has attracted much attention, enjoys appealing properties and might serve as a standard score in evaluating probabilistic forecasts of real-valued variables. It forms a special case of a novel and very general type of scoring rule, the energy score. In Section 5 we introduce an even more general construction, giving rise to kernel scores based on negative definite functions and inequalities of Hoeffding type, with side results on expectation inequalities and positive definite functions. In Section 6 we study scoring rules for quantile and interval forecasts. We show that the class of proper scoring rules for quantile forecasts is larger than conjectured by Cervera and Muñoz (1996), and discuss the interval score, a scoring rule for prediction intervals that is proper and has intuitive appeal. In Section 7 we relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation. In Section 8 we present a case study on the use of scoring rules in the evaluation of probabilistic weather forecasts. In Section 9 we turn to optimum score estimation. We discuss point, quantile, and interval estimation, and propose using the interval score as a utility function that addresses width as well as coverage. We close the article with a discussion of avenues for future work in Section 10. Scoring rules show a superficial analogy to statistical depth functions, which we hint at in an Appendix.

2. CHARACTERIZATIONS OF PROPER SCORING RULES

In this section we introduce notation, provide characterizations of proper scoring rules, and relate them to convex functions, information measures, and Bregman divergences. The discussion here is more technical than that in the remainder of the article, and readers with more applied interests might skip ahead to Section 2.3, in which we discuss skill scores, without significant loss of continuity.

2.1 Proper Scoring Rules and Convex Functions

We consider probabilistic forecasts on a general sample space Ω. Let A be a σ-algebra of subsets of Ω, and let P be a convex class of probability measures on (Ω, A). A function defined on Ω and taking values in the extended real line R̄ = [−∞, ∞] is P-quasi-integrable if it is measurable with respect to A and is quasi-integrable with respect to all P ∈ P (Bauer 2001, p. 64). A probabilistic forecast is any probability measure P ∈ P. A scoring rule is any extended real-valued function S : P × Ω → R̄ such that S(P, ·) is P-quasi-integrable for all P ∈ P. Thus, if the forecast is P and ω materializes, then the forecaster's reward is S(P, ω). We permit algebraic operations on the extended real line, and deal with the respective integrals and expectations as described in section 2.1 of Mattner (1997) and section 3.1 of Grünwald and Dawid (2004). The scoring rules used in practice are mostly real-valued, but there are exceptions, such as the logarithmic rule (Good 1952), that allow for infinite scores.

We write

S(P, Q) = ∫ S(P, ω) dQ(ω)

for the expected score under Q when the probabilistic forecast is P. The scoring rule S is proper relative to P if

S(Q, Q) ≥ S(P, Q)  for all P, Q ∈ P. (1)

It is strictly proper relative to P if (1) holds, with equality if and only if P = Q, thereby encouraging honest quotes by the forecaster. If S is a proper scoring rule, c > 0 is a constant, and h is a P-integrable function, then

S∗(P, ω) = cS(P, ω) + h(ω) (2)

is also a proper scoring rule. Similarly, if S is strictly proper, then S∗ is strictly proper as well. Following Dawid (1998), we say that S and S∗ are equivalent, and strongly equivalent if c = 1. The term proper was apparently coined by Winkler and Murphy (1968, p. 754), whereas the general idea dates back at least to Brier (1950) and Good (1952, p. 112). In a parametric context and with respect to estimators, Lehmann and Casella (1998, p. 157) refer to the defining property in (1) as risk unbiasedness.

A function G : P → R is convex if

G((1 − λ)P0 + λP1) ≤ (1 − λ)G(P0) + λG(P1)  for all λ ∈ (0, 1) and P0, P1 ∈ P. (3)


It is strictly convex if (3) holds, with equality if and only if P0 = P1. A function G∗(P, ·) : Ω → R is a subtangent of G at the point P ∈ P if it is integrable with respect to P, quasi-integrable with respect to all Q ∈ P, and

G(Q) ≥ G(P) + ∫ G∗(P, ω) d(Q − P)(ω) (4)

for all Q ∈ P. The following characterization theorem is more general and considerably simpler than previous results of McCarthy (1956) and Hendrickson and Buehler (1971).

Definition 1. A scoring rule S : P × Ω → R̄ is regular relative to the class P if S(P, Q) is real-valued for all P, Q ∈ P, except possibly that S(P, Q) = −∞ if P ≠ Q.

Theorem 1. A regular scoring rule S : P × Ω → R̄ is proper relative to the class P if and only if there exists a convex, real-valued function G on P such that

S(P, ω) = G(P) − ∫ G∗(P, ω) dP(ω) + G∗(P, ω) (5)

for P ∈ P and ω ∈ Ω, where G∗(P, ·) : Ω → R is a subtangent of G at the point P ∈ P. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Proof. If the scoring rule S is of the stated form, then the subtangent inequality (4) implies the defining inequality (1), that is, propriety. Conversely, suppose that S is a regular proper scoring rule. Define G : P → R by G(P) = S(P, P) = sup_{Q∈P} S(Q, P), which is the pointwise supremum over a class of convex functions and thus is convex on P. Furthermore, the subtangent inequality (4) holds with G∗(P, ω) = S(P, ω). This implies the representation (5) and proves the claim for propriety. By an argument of Hendrickson and Buehler (1971), strict inequality in (1) is equivalent to no subtangent of G at P being a subtangent of G at Q, for P, Q ∈ P and P ≠ Q, which is equivalent to G being strictly convex on P.

Expressed slightly differently, a regular scoring rule S is proper relative to the class P if and only if the expected score function G(P) = S(P, P) is convex and S(P, ω) is a subtangent of G at the point P for all P ∈ P.

2.2 Information Measures, Bregman Divergences, and Decision Theory

Suppose that the scoring rule S is proper relative to the class P. Following Grünwald and Dawid (2004) and Buja, Stuetzle, and Shen (2005), we call the expected score function

G(P) = sup_{Q∈P} S(Q, P),  P ∈ P, (6)

the information measure or generalized entropy function associated with the scoring rule S. This is the maximally achievable utility; the term entropy function is used as well. If S is regular and proper, then we call

d(P, Q) = S(Q, Q) − S(P, Q),  P, Q ∈ P, (7)

the associated divergence function. Note the order of the arguments, which differs from previous practice in that the true distribution Q is preceded by an alternative probabilistic forecast P. The divergence function is nonnegative, and if S is strictly proper, then d(P, Q) is strictly positive unless P = Q. If the sample space is finite and the entropy function is sufficiently smooth, then the divergence function becomes the Bregman divergence (Bregman 1967) associated with the convex function G. Bregman divergences play major roles in optimization and have recently attracted the attention of the machine learning community (Collins, Schapire, and Singer 2002). The term Bregman distance is also used, even though d(P, Q) is not necessarily the same as d(Q, P).

An interesting problem is to find conditions under which a divergence function d is a score divergence, in the sense that it admits the representation (7) for a proper scoring rule S, and to describe principled ways of finding such a scoring rule. The landmark work by Savage (1971) provides a necessary condition on a symmetric divergence function d to be a score divergence: If P and Q are concentrated on the same two mutually exclusive events and identified with the respective probabilities p, q ∈ [0, 1], then d(P, Q) reduces to a linear function of (p − q)². Dawid (1998) noted that if d is a score divergence, then d(P, Q) − d(P′, Q) is an affine function of Q for all P, P′ ∈ P, and proved a partial converse.

Friedman (1983) and Nau (1985) studied a looser type of relationship between proper scoring rules and distance measures on classes of probability distributions. They restricted attention to metrics (i.e., distance measures that are symmetric and satisfy the triangle inequality) and called a scoring rule S effective with respect to a metric d if

S(P1, Q) ≥ S(P2, Q) ⟺ d(P1, Q) ≤ d(P2, Q).

Nau (1985) called a metric co-effective if there is a proper scoring rule that is effective with respect to it. His proposition 1 implies that the ℓ1, ℓ∞, and Hellinger distances on spaces of absolutely continuous probability measures are not co-effective.

Sections 3–5 provide numerous examples of proper scoring rules on general sample spaces, along with the associated entropy and divergence functions. For example, the logarithmic score is linked to Shannon entropy and Kullback–Leibler divergence. Dawid (1998, 2006), Grünwald and Dawid (2004), and Buja et al. (2005) have given further examples of proper scoring rules, entropy, and divergence functions, and have elaborated on the connections to the Bregman divergence.

Proper scoring rules occur naturally in statistical decision problems (Dawid 1998). Given an outcome space and an action space, let U(ω, a) be the utility for outcome ω and action a, and let P be a convex class of probability measures on the outcome space. Let a_P denote the Bayes act for P ∈ P. Then the scoring rule

S(P, ω) = U(ω, a_P)

is proper relative to the class P. Indeed,

S(Q, Q) = ∫ U(ω, a_Q) dQ(ω) ≥ ∫ U(ω, a_P) dQ(ω) = S(P, Q),

by the fact that the optimal Bayesian decision maximizes expected utility. Dawid (2006) has given details and discussed the generality of the construction.


2.3 Skill Scores

In practice, scores are aggregated and competing forecast procedures are ranked by the average score

S̄n = (1/n) ∑_{i=1}^n S(Pi, xi),

over a fixed set of forecast situations. We give examples of this in case studies in Sections 6 and 8. Recommendations for choosing a scoring rule have been given by Winkler (1994, 1996), by Buja et al. (2005), and throughout this article.

Scores for competing forecast procedures are directly comparable if they refer to exactly the same set of forecast situations. If scores for distinct sets of situations are compared, then considerable care must be exercised to separate the confounding effects of intrinsic predictability and predictive performance. For instance, there is substantial spatial and temporal variability in the predictability of weather and climate elements (Langland et al. 1999; Campbell and Diebold 2005). Thus a score that is superior for a given location or season might be inferior for another, or vice versa. To address this issue, atmospheric scientists have put forth skill scores of the form

Sn^skill = (Sn^fcst − Sn^ref) / (Sn^opt − Sn^ref), (8)

where Sn^fcst is the forecaster's score, Sn^opt refers to a hypothetical ideal or optimal forecast, and Sn^ref is the score for a reference strategy (Murphy 1973; Potts 2003, p. 27; Briggs and Ruppert 2005; Wilks 2006, p. 259). Skill scores are standardized in that (8) takes the value 1 for an optimal forecast, which is typically understood as a point measure in the event or value that materializes, and the value 0 for the reference forecast. Negative values of a skill score indicate forecasts that are of lesser quality than the reference. The reference forecast is typically a climatological forecast, that is, an estimate of the marginal distribution of the predictand. For example, a climatological probabilistic forecast for maximum temperature on Independence Day in Seattle, Washington, might be a smoothed version of the local historic record of July 4 maximum temperatures. Climatological forecasts are independent of the forecast horizon; they are calibrated by construction, but often lack sharpness.

Unfortunately, skill scores of the form (8) are generally improper, even if the underlying scoring rule S is proper. Murphy (1973) studied hedging strategies in the case of the Brier skill score for probability forecasts of a dichotomous event. He showed that the Brier skill score is asymptotically proper, in the sense that the benefits of hedging become negligible as the number of independent forecasts grows. Similar arguments may apply to skill scores based on other proper scoring rules. Mason's (2004) claim of the propriety of the Brier skill score rests on unjustified approximations and generally is incorrect.

3. SCORING RULES FOR CATEGORICAL VARIABLES

We now review the representations of Savage (1971) and Schervish (1989) that characterize scoring rules for probabilistic forecasts of categorical and binary variables, and give examples of proper scoring rules.

3.1 Savage Representation

We consider probabilistic forecasts of a categorical variable. Thus, the sample space Ω = {1, . . . , m} consists of a finite number m of mutually exclusive events, and a probabilistic forecast is a probability vector (p1, . . . , pm). Using the notation of Section 2, we consider the convex class P = Pm, where

Pm = {p = (p1, . . . , pm) : p1, . . . , pm ≥ 0, p1 + · · · + pm = 1}.

A scoring rule S can then be identified with a collection of m functions

S(·, i) : Pm → R̄,  i = 1, . . . , m.

In other words, if the forecaster quotes the probability vector p and the event i materializes, then his or her reward is S(p, i). Theorem 2 is a special case of Theorem 1 and provides a rigorous version of the Savage (1971) representation of proper scoring rules on finite sample spaces. Our contributions lie in the notion of regularity, the rigorous treatment, and the introduction of appropriate tools for convex analysis (Rockafellar 1970, sects. 23–25). Specifically, let G : Pm → R be a convex function. A vector G′(p) = (G′_1(p), . . . , G′_m(p)) is a subgradient of G at the point p ∈ Pm if

G(q) ≥ G(p) + ⟨G′(p), q − p⟩ (9)

for all q ∈ Pm, where ⟨·, ·⟩ denotes the standard scalar product. If G is differentiable at an interior point p ∈ Pm, then G′(p) is unique and equals the gradient of G at p. We assume that the components of G′(p) are real-valued, except that we permit G′_i(p) = −∞ if p_i = 0.

Definition 2. A scoring rule S for categorical forecasts is regular if S(·, i) is real-valued for i = 1, . . . , m, except possibly that S(p, i) = −∞ if p_i = 0.

Regular scoring rules assign finite scores, except that a forecast might receive a score of −∞ if an event claimed to be impossible is realized. The logarithmic scoring rule (Good 1952) provides a prominent example of this.

Theorem 2 (McCarthy; Savage). A regular scoring rule S for categorical forecasts is proper if and only if

S(p, i) = G(p) − ⟨G′(p), p⟩ + G′_i(p)  for i = 1, . . . , m, (10)

where G : Pm → R is a convex function and G′(p) is a subgradient of G at the point p, for all p ∈ Pm. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Phrased slightly differently, a regular scoring rule S is proper if and only if the expected score function G(p) = S(p, p) is convex on Pm and the vector with components S(p, i), for i = 1, . . . , m, is a subgradient of G at the point p, for all p ∈ Pm. In view of these results, every bounded convex function G on Pm generates a regular proper scoring rule. This function G becomes the expected score function, information measure, or entropy function (6) associated with the score. The divergence function (7) is the respective Bregman distance.

We now give a number of examples. The scoring rules in Examples 1–3 are strictly proper. The score in Example 4 is proper, but not strictly proper.


Example 1 (Quadratic or Brier score). If G(p) = ∑_{j=1}^m p_j² − 1, then (10) yields the quadratic score or Brier score,

S(p, i) = −∑_{j=1}^m (δ_ij − p_j)² = 2p_i − ∑_{j=1}^m p_j² − 1,

where δ_ij = 1 if i = j, and δ_ij = 0 otherwise. The associated Bregman divergence is the squared Euclidean distance, d(p, q) = ∑_{j=1}^m (p_j − q_j)². This well-known scoring rule was proposed by Brier (1950); Selten (1998) gave an axiomatic characterization.
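For concreteness, here is a small Python snippet (ours, not part of the original article) evaluating the Brier score of Example 1; the forecast vector is an arbitrary illustration.

```python
# A tiny illustration of the Brier score from Example 1 for a categorical
# forecast p and an observed category i (a sketch, assuming NumPy).
import numpy as np

def brier_score(p, i):
    """S(p, i) = 2 p_i - sum_j p_j^2 - 1 (positively oriented, maximum 0)."""
    p = np.asarray(p, dtype=float)
    return 2 * p[i] - np.sum(p ** 2) - 1

p = np.array([0.7, 0.2, 0.1])
print(brier_score(p, 0))   # -0.14: a sharp forecast scores near 0
print(brier_score(p, 2))   # -1.34: penalized when an unlikely event occurs
```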

Example 2 (Spherical score). Let α > 1, and consider the generalized entropy function G(p) = (∑_{j=1}^m p_j^α)^{1/α}. This corresponds to the pseudospherical score,

S(p, i) = p_i^{α−1} / (∑_{j=1}^m p_j^α)^{(α−1)/α},

which reduces to the traditional spherical score when α = 2. The associated Bregman divergence is

d(p, q) = (∑_{j=1}^m q_j^α)^{1/α} − ∑_{j=1}^m q_j p_j^{α−1} / (∑_{j=1}^m p_j^α)^{(α−1)/α}.

Example 3 (Logarithmic score). Negative Shannon entropy, G(p) = ∑_{j=1}^m p_j log p_j, corresponds to the logarithmic score, S(p, i) = log p_i. The associated Bregman distance is the Kullback–Leibler divergence, d(p, q) = ∑_{j=1}^m q_j log(q_j/p_j). [Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule dates back at least to Good (1952). Information-theoretic perspectives and interpretations in terms of gambling returns have been given by Roulston and Smith (2002) and Daley and Vere-Jones (2004). Despite its popularity, the logarithmic score has been criticized for its unboundedness, with Selten (1998, p. 51) arguing that it entails value judgments that are unacceptable. Feuerverger and Rahman (1992) noted a connection to Neyman–Pearson theory and an ensuing optimality property of the logarithmic score.

Example 4 (Zero–one score). The zero–one scoring rule rewards a probabilistic forecast if the mode of the predictive distribution materializes. In case of multiple modes, the reward is reduced proportionally, that is,

S(p, i) = 1/|M(p)| if i belongs to M(p), and S(p, i) = 0 otherwise,

where M(p) = {i : p_i = max_{j=1,...,m} p_j} denotes the set of modes of p. This is also known as the misclassification loss, and the meteorological literature uses the term success rate to denote case-averaged zero–one scores (see, e.g., Toth, Zhu, and Marchok 2001). The associated expected score or generalized entropy function (6) is G(p) = max_{j=1,...,m} p_j, and the divergence function (7) becomes

d(p, q) = max_{j=1,...,m} q_j − (∑_{j∈M(p)} q_j) / |M(p)|.

This does not define a Bregman divergence, because the entropy function is neither differentiable nor strictly convex.

The scoring rules in the foregoing examples are symmetric, in the sense that

S((p1, . . . , pm), i) = S((p_{π(1)}, . . . , p_{π(m)}), π(i)) (11)

for all p ∈ Pm, for all permutations π on m elements, and for all events i = 1, . . . , m. Winkler (1994, 1996) argued that symmetric rules do not always appropriately reward forecasting skill, and called for asymmetric ones, particularly in situations in which skill scores traditionally have been used. Asymmetric proper scoring rules can be generated by applying Theorem 2 to convex functions G that are not invariant under coordinate permutation.

3.2 Schervish Representation

The classical case of a probability forecast for a dichotomous event suggests further discussion. We follow Dawid (1986) in considering the sample space Ω = {1, 0}. A probabilistic forecast is a quoted probability p ∈ [0, 1] for the event to occur. A scoring rule S can be identified with a pair of functions S(·, 1) : [0, 1] → R̄ and S(·, 0) : [0, 1] → R̄. Thus S(p, 1) is the forecaster's reward if he or she quotes p and the event materializes, and S(p, 0) is the reward if he or she quotes p and the event does not materialize. Note the subtle change from the previous section, where we used the convex class P2 = {(p1, p2) ∈ R² : p1 ∈ [0, 1], p2 = 1 − p1} in place of the unit interval P = [0, 1] to represent probability measures on binary sample spaces.

A scoring rule for binary variables is regular if S(·, 1) and S(·, 0) are real-valued, except possibly that S(0, 1) = −∞ or S(1, 0) = −∞. A variant of Theorem 2 shows that every regular proper scoring rule is of the form

S(p, 1) = G(p) + (1 − p)G′(p),
S(p, 0) = G(p) − pG′(p), (12)

where G : [0, 1] → R is a convex function and G′(p) is a subgradient of G at the point p ∈ [0, 1], in the sense that

G(q) ≥ G(p) + G′(p)(q − p)

for all q ∈ [0, 1]. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex. The subgradient G′(p) is real-valued, except that we permit G′(0) = −∞ and G′(1) = ∞. The function G is the expected score function, G(p) = pS(p, 1) + (1 − p)S(p, 0), and if G is differentiable at an interior point p ∈ (0, 1), then G′(p) is unique and equals the derivative of G at p. Related but slightly less general results were given by Shuford, Albert, and Massengil (1966). Figure 1 provides a geometric interpretation.

The Savage representation (12) implies various interesting properties of regular proper scoring rules. For instance, we conclude from theorem 24.2 of Rockafellar (1970) that

S(p, 1) = lim_{q→1} G(q) − ∫_p^1 (G′(q) − G′(p)) dq (13)

for p ∈ (0, 1), and because G′(p) is increasing, S(p, 1) is increasing as well. Similarly, S(p, 0) is decreasing, as would be intuitively expected. The statements hold with proper, increasing, and decreasing replaced by strictly proper, strictly increasing, and strictly decreasing. Alternative proofs of these and other results have been given by Schervish (1989, app.).
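The representation (12) is constructive: given a smooth convex G, it yields a proper scoring rule. The sketch below (ours, not from the article; the choice of G is illustrative) implements (12) and checks numerically that the expected score is maximized by the honest quote.

```python
# Build a binary scoring rule from a smooth convex G via the Savage
# representation (12), and verify propriety on a grid (a sketch; G and
# the grid are illustrative assumptions).
import numpy as np

def make_scoring_rule(G, dG):
    """Return S(p, 1) and S(p, 0) from a convex G with derivative dG, per (12)."""
    S1 = lambda p: G(p) + (1 - p) * dG(p)
    S0 = lambda p: G(p) - p * dG(p)
    return S1, S0

# Negative Shannon entropy yields the logarithmic score (Example 3).
G = lambda p: p * np.log(p) + (1 - p) * np.log(1 - p)
dG = lambda p: np.log(p / (1 - p))
S1, S0 = make_scoring_rule(G, dG)

# The expected score S(p, q) = q S(p,1) + (1-q) S(p,0) is maximized at p = q.
q = 0.3
grid = np.linspace(0.01, 0.99, 99)
expected = q * S1(grid) + (1 - q) * S0(grid)
print(grid[np.argmax(expected)])   # ~0.3
```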


Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast p ∈ [0, 1], the expected score S(p, q) = qS(p, 1) + (1 − q)S(p, 0) equals the ordinate of the tangent to G at p [the solid line with slope G′(p)] when evaluated at q ∈ [0, 1]. In particular, the scores S(p, 0) = G(p) − pG′(p) and S(p, 1) = G(p) + (1 − p)G′(p) can be read off the tangent when evaluated at q = 0 and q = 1. The Bregman divergence d(p, q) = S(q, q) − S(p, q) equals the difference between G and its tangent at p, when evaluated at q. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter, we let 1{·} denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that S is a regular scoring rule. Then S is proper and such that S(0, 1) = lim_{p→0} S(p, 1) and S(0, 0) = lim_{p→0} S(p, 0), and both S(p, 1) and S(p, 0) are left continuous, if and only if there exists a nonnegative measure ν on (0, 1) such that

S(p, 1) = S(1, 1) − ∫ (1 − c)1{p ≤ c} ν(dc),
S(p, 0) = S(0, 0) − ∫ c1{p > c} ν(dc), (14)

for all p ∈ [0, 1]. The scoring rule is strictly proper if and only if ν assigns positive measure to every open interval.

Sketch of Proof. Suppose that S satisfies the assumptions of the theorem. To prove that S(p, 1) is of the form (14), consider the representation (13), identify the increasing function G′(p) with the left-continuous distribution function of a nonnegative measure ν on (0, 1), and apply the partial integration formula. The proof of the representation for S(p, 0) is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost–loss ratio c ∈ (0, 1) that reflects the relative costs of the two possible types of inferior decision. The measure ν(dc) in Schervish's representation (14) assigns relevance to distinct cost–loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted asymmetric zero–one scores,

S_c(p, 1) = (1 − c)1{p > c},  S_c(p, 0) = c1{p ≤ c}, (15)

with a nonnegative mixing measure ν(dc). Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure ν(dc) for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero–one score. If the expected score function G is smooth, then ν(dc) has Lebesgue density G″(c) (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, G(p) = p log p + (1 − p) log(1 − p), and corresponds to the infinite measure with Lebesgue density (c(1 − c))^{−1}.

Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule | S(p, 1) | S(p, 0) | ν(dc)
Brier | −(1 − p)² | −p² | uniform
Spherical | p(1 − 2p + 2p²)^{−1/2} | (1 − p)(1 − 2p + 2p²)^{−1/2} | (1 − 2c + 2c²)^{−3/2}
Logarithmic | log p | log(1 − p) | (c(1 − c))^{−1}
Zero–one | (1 − c)1{p > c} | c1{p ≤ c} | point measure in c

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Example 5 (Beta family). Let α, β > −1, and consider the two-parameter family

S(p, 1) = −∫_p^1 c^{α−1}(1 − c)^β dc,
S(p, 0) = −∫_0^p c^α(1 − c)^{β−1} dc,

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^{α−1}(1 − c)^{β−1}. This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p, 1) = p − 1 and S(p, 0) = p + log(1 − p) being one such example (α = 0, β = 1).
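As a numerical sanity check (ours, not from the article), one can integrate the Schervish representation (14) for a given mixing density and compare with Table 1; here the density G″(c) = 2 recovers the Brier score exactly.

```python
# Numerical check of the Schervish representation (14): the Brier score
# arises from the uniform mixing measure with Lebesgue density G''(c) = 2
# (cf. Table 1, up to the scaling equivalence (2)). A sketch assuming SciPy.
import numpy as np
from scipy.integrate import quad

def schervish_scores(p, density):
    """S(p, 1) and S(p, 0) from (14), taking S(1, 1) = S(0, 0) = 0."""
    s1 = -quad(lambda c: (1 - c) * density(c), p, 1)[0]   # region where p <= c
    s0 = -quad(lambda c: c * density(c), 0, p)[0]         # region where p > c
    return s1, s0

for p in [0.1, 0.5, 0.9]:
    s1, s0 = schervish_scores(p, density=lambda c: 2.0)
    print(np.allclose([s1, s0], [-(1 - p) ** 2, -p ** 2]))   # True
```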

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0, 1), then

S∗(p, 1) = (S(p, 1) − S(c, 1)) / T(c, p),
S∗(p, 0) = (S(p, 0) − S(c, 0)) / T(c, p), (16)

where T(c, p) = S(0, 0) − S(c, 0) if p ≤ c, and T(c, p) = S(1, 1) − S(c, 1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events, and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

S∗(p, 1) = ((1 − c)² − (1 − p)²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),
S∗(p, 0) = (c² − p²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}), (17)

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.
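A quick implementation (ours, not the authors') of Winkler's standardized score (17); the baseline c = .2 is an arbitrary illustration.

```python
# Winkler's standardized score (17), the special case of (16) derived
# from the Brier score with baseline probability c (a sketch).
import numpy as np

def winkler_score(p, outcome, c=0.2):
    """S*(p, outcome) from (17); outcome is 1 or 0."""
    denom = c ** 2 if p <= c else (1 - c) ** 2
    if outcome == 1:
        return ((1 - c) ** 2 - (1 - p) ** 2) / denom
    return (c ** 2 - p ** 2) / denom

# Standardization: the expected score function G*(p) = p S*(p,1) + (1-p) S*(p,0)
# equals 0 at the baseline p = c and 1 at p = 0 and p = 1.
for p in [0.0, 0.2, 1.0]:
    print(p * winkler_score(p, 1) + (1 - p) * winkler_score(p, 0))   # 1, 0, 1
```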

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let μ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to μ and have μ-density p such that

‖p‖_α = (∫ p(ω)^α μ(dω))^{1/α}

is finite. We identify a probabilistic forecast P ∈ L_α with its μ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of μ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω))/μ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score,

QS(p, ω) = 2p(ω) − ‖p‖₂², (18)

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function, d(p, q) = ‖p − q‖₂², is symmetric. Good (1971) proposed the pseudospherical score,

PseudoS(p, ω) = p(ω)^{α−1} / ‖p‖_α^{α−1},

which reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that in a measure-theoretic framework is obsolete. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function, G(p) = ‖p‖_α, and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score,

LogS(p, ω) = log p(ω), (19)

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then, under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by μ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score, LinS(p, ω) = p(ω), is not a proper scoring rule, despite its intuitive appeal. For instance, let φ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < √(log 2), then

LinS(u, φ) = (1/(2π)^{1/2}) (1/(2ε)) ∫_{−ε}^{ε} e^{−x²/2} dx > 1/(2π^{1/2}) = LinS(φ, φ),

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

CRPS(F, x) = −∫_{−∞}^{∞} (F(y) − 1{y ≥ x})² dy, (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

CRPS(F, x) = (1/2)E_F|X − X′| − E_F|X − x|, (21)

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean μ and variance σ², then it follows that

CRPS(N(μ, σ²), x) = σ [1/√π − 2φ((x − μ)/σ) − ((x − μ)/σ)(2Φ((x − μ)/σ) − 1)],

where φ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics, in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
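For predictive distributions given as samples, the following sketch (ours, not the authors' code) evaluates the representation (21) after a single sort, using the order-statistic identity for E_F|X − X′|.

```python
# CRPS of an empirical predictive distribution via (21), in O(n log n):
# E|X - X'| = (2/n^2) * sum_i (2i - 1 - n) x_(i) over the order statistics.
import numpy as np

def crps_sample(sample, x):
    """CRPS(F_n, x) in positive orientation, F_n the empirical CDF of `sample`."""
    s = np.sort(sample)
    n = len(s)
    e_xx = 2.0 * np.sum((2 * np.arange(1, n + 1) - n - 1) * s) / n ** 2
    e_xy = np.mean(np.abs(s - x))
    return 0.5 * e_xx - e_xy

rng = np.random.default_rng(1)
print(crps_sample(rng.normal(size=10000), 0.3))   # close to the N(0,1) closed form
```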

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = −∫_{−∞}^{∞} F(y)(1 − F(y)) dy = −(1/2)E_F|X − X′|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F, G) = ∫_{−∞}^{∞} (F(y) − G(y))² dy,

is symmetric and of the Cramér–von Mises type.

is symmetric and of the Crameacuterndashvon Mises typeThe CRPS lately has attracted renewed interest in the at-

mospheric sciences community (Hersbach 2000 Candille andTalagrand 2005 Gneiting Raftery Westveld and Goldman2005 Grimit Gneiting Berrocal and Johnson 2006 Wilks2006 pp 302ndash303) It is typically used in negative orientationsay CRPSlowast(F x) = minusCRPS(F x) The representation (21) thencan be written as

CRPSlowast(F x) = EF|X minus x| minus 1

2EF|X minus Xprime|

which sheds new light on the score In negative orientation theCRPS can be reported in the same unit as the observations andit generalizes the absolute error to which it reduces if F is a de-terministic forecastmdashthat is a point measure Thus the CRPSprovides a direct way to compare deterministic and probabilis-tic forecasts

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score,

ES(P, x) = (1/2)E_P‖X − X′‖^β − E_P‖X − x‖^β, (22)

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

ES(P, x) = −‖μ_P − x‖², (23)

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.
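A minimal Monte Carlo sketch (ours, not from the article) of the energy score (22) for an ensemble forecast in R^m; the ensemble and observation below are illustrative.

```python
# Energy score (22) with P taken to be the empirical measure of an
# ensemble of k forecast vectors in R^m (a sketch, assuming NumPy).
import numpy as np

def energy_score(ensemble, x, beta=1.0):
    """ES(P, x) for the empirical measure of `ensemble` (k x m array)."""
    e_xy = np.mean(np.linalg.norm(ensemble - x, axis=1) ** beta)
    diffs = ensemble[:, None, :] - ensemble[None, :, :]       # k x k x m pairs
    e_xx = np.mean(np.linalg.norm(diffs, axis=2) ** beta)     # mean over all pairs
    return 0.5 * e_xx - e_xy

rng = np.random.default_rng(2)
ens = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=100)
print(energy_score(ens, x=np.array([0.5, -0.2])))
```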

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m, by defining

ES(P, x) = −(β 2^{β−2} Γ((m + β)/2)) / (π^{m/2} Γ(1 − β/2)) ∫_{R^m} (|φ_P(y) − e^{i⟨x,y⟩}|² / ‖y‖^{m+β}) dy, (24)

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P, x) = −log det Σ_P − (x − μ_P)′ Σ_P^{−1} (x − μ_P), (25)

which is linked to the generalized entropy function

G(P) = −log det Σ_P − m

and to the divergence function

d(P, Q) = tr(Σ_P^{−1}Σ_Q) − log det(Σ_P^{−1}Σ_Q) + (μ_P − μ_Q)′ Σ_P^{−1} (μ_P − μ_Q) − m.


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x1, . . . , xn. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = ∑_{i=1}^n (xi − μi)² + ∑_{i=1}^n σi²,

where μi and σi² denote the expected value and the variance of a replicate variable Xi, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − μ_P)² − σ_P², (26)

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: If the forecaster's true belief is P, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P, that is, a deterministic forecast, rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − μ_P)/σ_P)² − log σ_P² (27)

in the case in which m = 1 and the observations are real-valued.
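The impropriety of (26), and the contrast with (27), is easy to see numerically. In the sketch below (ours, not from the article), the forecaster's true belief is N(0, 1); a near-point measure at μ_P receives a higher average score than the honest quote under (26), but a far lower one under (27).

```python
# Numeric illustration: hedging pays under the PMCC score (26) but is
# punished under the rescaled score (27). Truth P = N(0, 1); a variance
# of 1e-12 stands in for the degenerate point measure (an assumption).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200000)          # observations drawn from the true P

def s26(mu, var):                    # S(P, x) = -(x - mu)^2 - var, averaged
    return np.mean(-(x - mu) ** 2 - var)

def s27(mu, var):                    # S(P, x) = -((x - mu)^2 / var) - log var
    return np.mean(-((x - mu) ** 2) / var - np.log(var))

print(s26(0.0, 1.0), s26(0.0, 1e-12))   # var -> 0 increases (26): improper
print(s27(0.0, 1.0), s27(0.0, 1e-12))   # var = 1 maximizes (27): proper
```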

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and ∑_{i=1}^n ∑_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a1, . . . , an ∈ R that sum to 0, and all x1, . . . , xn ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2)E_P g(X, X′) − E_P g(X, x) (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q. We need to show that

−(1/2)E_Q g(Y, Y′) ≥ (1/2)E_P g(X, X′) − E_{P,Q} g(X, Y). (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case, we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.
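For instance, the following sketch (ours, not from the article) evaluates (28) for a forecast with finitely many atoms and a user-supplied negative definite kernel; with g(x, x′) = |x − x′| it returns the CRPS (21) of the discrete forecast.

```python
# Kernel score (28) for a discrete forecast P = sum_k w_k * delta_{a_k},
# with a user-supplied negative definite kernel g (a sketch).
import numpy as np

def kernel_score(atoms, weights, x, g):
    """S(P, x) = (1/2) E_P g(X, X') - E_P g(X, x) for discrete P."""
    gxx = sum(w1 * w2 * g(a1, a2)
              for a1, w1 in zip(atoms, weights)
              for a2, w2 in zip(atoms, weights))
    gx = sum(w * g(a, x) for a, w in zip(atoms, weights))
    return 0.5 * gxx - gx

# With g(x, x') = |x - x'| this is the CRPS of the discrete forecast.
print(kernel_score([0.0, 1.0, 2.0], [0.2, 0.5, 0.3], 1.4,
                   g=lambda a, b: abs(a - b)))
```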

Example 7 (Quadratic or Brier score). Let Ω = {1, 0}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic or Brier score.

Example 9 (Energy score) If = Rm β isin (02) and

g(xxprime) = x minus xprimeβ for xxprime isin Rm where middot denotes the

Euclidean norm then (28) recovers the energy score (22)

Example 10 (CRPS for circular variables). We let Ω = S denote the circle, and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus

S(P, θ) = (1/2)E_P α(Θ, Θ′) − E_P α(Θ, θ) (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.

We turn to a far-reaching generalization of the energy score. For x = (x1, . . . , xm) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (∑_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β,  x, x′ ∈ R^m,

is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2)E_P‖X − X′‖_α^β − E_P‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q}‖X − Y‖_α^β is finite if and only if E_P‖X‖_α^β and E_Q‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P‖X − X′‖_α^β is finite if and only if E_P‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^(k) of all orders and (−1)^k η^(k)(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2)E_P ψ(‖X − X′‖₂²) − E_P ψ(‖X − x‖₂²)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0. (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: If g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′). (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0 (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β, (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′),  x, x′ ∈ Ω, (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and ∑_{i=1}^n ∑_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a1, . . . , an ∈ R, and all x1, . . . , xn ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β,  x, x′ ∈ R^m, (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) is the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and ∑_{i=1}^n ∑_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c1, . . . , cn ∈ C, and all x1, . . . , xn ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′) (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = ℝ^m and y ∈ ℝ^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′, y⟩} − 1, where x, x′ ∈ ℝ^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x, y⟩}|², (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ ℝ^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{ℝ^m} |φ_P(y) − e^{i⟨x, y⟩}|² μ(dy). (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on ℝ^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).
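To make (38) concrete, here is a small sketch (ours) for a Gaussian predictive distribution, whose characteristic function φ_P(y) = exp(i⟨μ, y⟩ − y′Σy/2) is available in closed form; the function name and the parameter choices are illustrative.

```python
import numpy as np

def cf_score(mu, Sigma, x, y):
    """Score (38): S(P, x) = -|phi_P(y) - exp(i<x, y>)|^2 for P = N(mu, Sigma)."""
    mu, Sigma, x, y = (np.asarray(v, dtype=float) for v in (mu, Sigma, x, y))
    phi_P = np.exp(1j * (mu @ y) - 0.5 * (y @ Sigma @ y))
    phi_x = np.exp(1j * (x @ y))
    return -np.abs(phi_P - phi_x) ** 2

mu, Sigma, y = np.zeros(2), np.eye(2), np.array([0.5, 0.3])
print(cf_score(mu, Sigma, mu, y))                    # x at the forecast center
print(cf_score(mu, Sigma, np.array([4.0, 4.0]), y))  # x far from the forecast
```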

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α₁, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r₁, …, r_k and x materializes, then he or she will be rewarded by the score S(r₁, …, r_k; x). We define

S(r₁, …, r_k; P) = ∫ S(r₁, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r₁, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on ℝ that have finite moments of all orders and whose distribution function is strictly increasing on ℝ. For P ∈ P, let q₁, …, q_k denote the true P-quantiles at levels α₁, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q₁, …, q_k; P) ≥ S(r₁, …, r_k; P)

for all real numbers r₁, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r, x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x) (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q, P) − S(r, P) = ∫_{(r,q)} s(x) dP(x) + s(r) P(r) − α s(r)
                  ≥ s(r) (P(q) − P(r)) + s(r) P(r) − α s(r)
                  = 0,

where the inequality holds because s is nondecreasing, so that s(x) ≥ s(r) for x ∈ (r, q), and the final equality uses P(q) = α. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r, x) = (x − r)(1{x ≤ r} − α), (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
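A minimal numerical illustration (ours) of the scoring rule (41): on a large sample, the mean score is maximized near the true quantile, as propriety requires.

```python
import numpy as np
from scipy.stats import norm

def quantile_score(r, x, alpha):
    """Scoring rule (41), positively oriented: S(r, x) = (x - r)(1{x <= r} - alpha)."""
    return (x - r) * ((x <= r) - alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)                 # observations from P = N(0, 1)
alpha = 0.9
grid = np.linspace(0.5, 2.0, 16)
mean_scores = [quantile_score(r, x, alpha).mean() for r in grid]
# the maximizer sits near the true 0.9-quantile, norm.ppf(0.9) = 1.2816...
print(grid[np.argmax(mean_scores)], norm.ppf(alpha))
```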

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

S(r₁, …, r_k; x) = Σ_{i=1}^k [α_i s_i(r_i) + (s_i(x) − s_i(r_i)) 1{x ≤ r_i}] + h(x) (42)

is proper for predicting the quantiles at levels α₁, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α₁ = α/2, α₂ = 1 − α/2, s₁(x) = s₂(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score

S_α^int(l, u; x) = (u − l) + (2/α)(l − x) 1{x < l} + (2/α)(x − u) 1{x > u}. (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
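A direct transcription of the interval score (43) into code (ours), with a toy comparison of a sharp covering interval, a needlessly wide one, and a missed one:

```python
import numpy as np

def interval_score(l, u, x, alpha):
    """Negatively oriented interval score (43); smaller is better."""
    l, u, x = np.broadcast_arrays(l, u, x)
    return ((u - l)
            + (2.0 / alpha) * (l - x) * (x < l)
            + (2.0 / alpha) * (x - u) * (x > u))

print(interval_score(-2.0, 2.0, 0.5, alpha=0.05))   # covers, narrow:  4.0
print(interval_score(-5.0, 5.0, 0.5, alpha=0.05))   # covers, wide:   10.0
print(interval_score(-2.0, 2.0, 2.5, alpha=0.05))   # misses by 0.5:  24.0
```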


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative, and possibly simpler, more general, and more rigorous, paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process (X_t, t ∈ ℤ) defined by

X_{t+1} = (1/2) X_t + (1/2) X_t ε_t + ε_t, (44)

where the ε_t are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2) X_t and variance (1 + (1/2) X_t)², thereby suggesting the prediction interval

I = [(1/2) X_t − c |1 + (1/2) X_t|, (1/2) X_t + c |1 + (1/2) X_t|], (45)

where c = Φ^{−1}(0.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = [F^{−1}(0.025), F^{−1}(0.975)], (46)

where F denotes the unconditional stationary distribution function of X_t, and

K = [(1/2) X_t − γ(|1 + (1/2) X_t|), (1/2) X_t + γ(|1 + (1/2) X_t|)], (47)

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path (X_t, t = 1, …, 100,001) from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              95.01%               4.00            4.77
J (46)              95.08%               5.45            8.04
K (47)              94.98%               3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
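A condensed sketch of the simulation behind Table 2, under our own implementation choices: the stationary distribution function F in (46) is approximated by the empirical distribution of the simulated path, and the interval K is omitted for brevity, so the numerical output will only roughly resemble the table.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
T = 100_000
eps = rng.standard_normal(T)
X = np.zeros(T + 1)
for t in range(T):                               # bilinear process (44)
    X[t + 1] = 0.5 * X[t] + 0.5 * X[t] * eps[t] + eps[t]

x_next = X[1:]                                   # values to be forecast
center = 0.5 * X[:-1]                            # conditional mean
scale = np.abs(1.0 + 0.5 * X[:-1])               # conditional standard deviation
c = norm.ppf(0.975)
q_lo, q_hi = np.quantile(X, [0.025, 0.975])      # empirical stand-in for F

def summarize(lo, hi, x, alpha=0.05):
    """Empirical coverage, average width, and mean interval score (43)."""
    cover = np.mean((x >= lo) & (x <= hi))
    score = np.mean((hi - lo)
                    + (2 / alpha) * (lo - x) * (x < lo)
                    + (2 / alpha) * (x - hi) * (x > hi))
    return cover, np.mean(hi - lo), score

print(summarize(center - c * scale, center + c * scale, x_next))  # interval I
print(summarize(q_lo, q_hi, x_next))                              # interval J
```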

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = ∫₀¹ S_α(F^{−1}(α), x) ν(dα) (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on ℝ, then the scoring rule

S(F, x) = ∫_{−∞}^{∞} S(F(y), 1{x ≤ y}) ν(dy) (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.
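To illustrate (49), the following sketch (ours) recovers the CRPS for a Gaussian predictive distribution by numerically integrating Brier scores over all real thresholds and checks the result against the standard Gaussian closed form (cf. Gneiting et al. 2005), written here in the paper's positive orientation.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def crps_via_brier(F, x):
    """CRPS as the threshold integral (49) of Brier scores, positively oriented."""
    integrand = lambda y: (F(y) - (x <= y)) ** 2
    left, _ = quad(integrand, -np.inf, x)    # split at the discontinuity y = x
    right, _ = quad(integrand, x, np.inf)
    return -(left + right)

def crps_gaussian(mu, sigma, x):
    """Closed form for a Gaussian predictive distribution, positively oriented."""
    z = (x - mu) / sigma
    return -sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                     - 1 / np.sqrt(np.pi))

mu, sigma, x = 0.0, 1.0, 0.7
print(crps_via_brier(lambda y: norm.cdf(y, mu, sigma), x))
print(crps_gaussian(mu, sigma, x))   # the two values agree
```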

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on ℝ^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

CRPS(F, x) = −∫_{ℝ^m} (F(y) − 1{x ≤ y})² ν(dy).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X₁, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models, H₁ and H₂. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H₁ against H₂ is

B = P(X|H₁) / P(X|H₂), (50)

where P(X|H_k) = Π_{i=1}^n P(X_i|H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

log B = LogS(H₁, X) − LogS(H₂, X). (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.
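In this fully specified case, relation (51) is immediate to compute; the following toy computation (the models and data are our own choices) makes it explicit.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=50)     # data actually generated under H1

log_score_H1 = norm.logpdf(x, 0.0, 1.0).sum()   # LogS(H1, X) for H1: N(0, 1)
log_score_H2 = norm.logpdf(x, 1.0, 1.0).sum()   # LogS(H2, X) for H2: N(1, 1)
log_B = log_score_H1 - log_score_H2             # relation (51)
print(log_B)                                    # positive: evidence favors H1
```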

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X|H_k) is the integrated likelihood,

P(X|H_k) = ∫ p(X|θ_k, H_k) p(θ_k|H_k) dθ_k,

where p(X|θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k|H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X|H_k) = Π_{t=1}^n P(X_t | X^{t−1}, H_k), (52)

where X^{t−1} = (X₁, …, X_{t−1}) if t > 1, X⁰ is the empty set, and P(X_t | X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t | X^{t−1}, H_k) = ∫ p(X_t | θ_k, H_k) P(θ_k | X^{t−1}, H_k) dθ_k,

with P(θ_k | X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let S_k^B = log P(X|H_k) denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

S_k^B = Σ_{t=1}^n log P(X_t | X^{t−1}, H_k). (53)

Dawid (1984) showed that S_k^B is asymptotically equivalent to the plug-in maximum likelihood prequential score

S_k^D = Σ_{t=1}^n log P(X_t | X^{t−1}, θ̂_k^{t−1}), (54)

where θ̂_k^{t−1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t−1}, in the sense that S_k^D / S_k^B → 1 as n → ∞. Initial terms for which θ̂_k^{t−1} is possibly undefined can be ignored. Dawid also showed that S_k^B is asymptotically equivalent to the Bayes information criterion (BIC) score

S_k^BIC = Σ_{t=1}^n log P(X_t | X^{t−1}, θ̂_k^n) − (d_k/2) log n,

where d_k = dim(θ_k), in the same sense, namely S_k^BIC / S_k^B → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.
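A minimal sketch (ours) of the plug-in prequential score (54), for the illustrative model H: X_t ~ N(θ, 1), whose MLE based on the past observations is the running sample mean; the first term, where the MLE is undefined, is ignored, as the text suggests.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
x = rng.normal(0.3, 1.0, size=500)

S_D = 0.0
for t in range(1, len(x)):            # skip t = 0, where the MLE is undefined
    theta_hat = x[:t].mean()          # MLE of theta from the past values X^(t-1)
    S_D += norm.logpdf(x[t], theta_hat, 1.0)
print(S_D)                            # plug-in prequential score (54)
```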

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S_k^{*B} = Σ_{t=1}^n E_D[log p(X_t | X(D), H_k)], (55)

where D is a random sample from {1, …, t − 1, t + 1, …, n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, …, n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

S_k^{*D} = Σ_{t=1}^n E_D[log p(X_t | X(D), θ̂_k^{(D)}, H_k)], (56)

where θ̂_k^{(D)} is the MLE of θ_k based on X(D). Terms for which the size of D is small and θ̂_k^{(D)} is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating S_k^{*B} or S_k^{*D} by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random, rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed hold-out sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out, 10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
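The following sketch (ours) evaluates the plug-in criterion (56) by Monte Carlo for the illustrative model X ~ N(θ, 1), drawing random fold sizes uniformly on {0, 1, …, n − 1} and skipping empty folds, for which the MLE is undefined.

```python
import numpy as np
from scipy.stats import norm

def random_fold_cv_score(x, n_draws=200, seed=0):
    """Monte Carlo estimate of (56) for X ~ N(theta, 1), where the MLE is the mean."""
    rng = np.random.default_rng(seed)
    n = len(x)
    total = 0.0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        terms = []
        for _ in range(n_draws):
            size = rng.integers(0, n)    # fold size uniform on {0, ..., n - 1}
            if size == 0:
                continue                 # MLE undefined: ignore the term
            D = rng.choice(others, size=size, replace=False)
            terms.append(norm.logpdf(x[t], x[D].mean(), 1.0))
        total += np.mean(terms)
    return total

rng = np.random.default_rng(5)
x = rng.normal(0.3, 1.0, size=30)
print(random_fold_cv_score(x))
```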

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_k^B, S_k^D, S_k^BIC, S_k^{*B}, and S_k^{*D}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r₀ = 1.55.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say φ_{μ,rσ}; its mean μ is the ensemble mean forecast, and its standard deviation rσ is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score

s(r) = (1/16,015) Σ_{i=1}^{16,015} S(φ_{μ_i, rσ_i}, x_i), r > 0, (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio r₀ = 1.55 of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5) and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

QS(p, x) = 2p(x) − ∫_{−∞}^{∞} p(y)² dy,

SphS(p, x) = p(x) / (∫_{−∞}^{∞} p(y)² dy)^{1/2},

LogS(p, x) = log p(x),

CRPS(p, x) = (1/2) E_p|X − X′| − E_p|X − x|,

LinS(p, x) = p(x),

and

PS(p, x) = ∫_{x−1}^{x+1} p(y) dy.
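For a Gaussian predictive density, all six scores have closed forms, with ∫ p(y)² dy = 1/(2σ√π). The following sketch (ours, with toy synthetic data mimicking an ensemble that is underdispersed by the factor r₀ = 1.55, in place of the actual verification database) evaluates the mean score (57) at a few inflation factors; the proper CRPS peaks near the true dispersion, whereas the improper linear score is largest for the sharpest forecast.

```python
import numpy as np
from scipy.stats import norm

def density_scores(mu, sigma, x):
    """The six scores of this section for a Gaussian predictive density
    N(mu, sigma^2); all positively oriented."""
    p_x = norm.pdf(x, mu, sigma)
    l2 = 1.0 / (2.0 * sigma * np.sqrt(np.pi))       # integral of p(y)^2 dy
    z = (x - mu) / sigma
    crps = -sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                     - 1 / np.sqrt(np.pi))
    return {"QS": 2 * p_x - l2,
            "SphS": p_x / np.sqrt(l2),
            "LogS": np.log(p_x),
            "CRPS": crps,
            "LinS": p_x,
            "PS": norm.cdf(x + 1, mu, sigma) - norm.cdf(x - 1, mu, sigma)}

rng = np.random.default_rng(8)
n = 4000
mu, sigma = rng.normal(size=n), np.full(n, 2.0)   # ensemble means and spreads
x = rng.normal(mu, 1.55 * sigma)                  # truth: underdispersed ensemble

for r in (1.0, 1.55, 2.5):
    s = density_scores(mu, r * sigma, x)
    print(r, s["CRPS"].mean(), s["LinS"].mean())
```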

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57), plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      Argmax_r s(r) in eq. (57)   Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                        40s + 6
Spherical score (SphS)     1.84                        108s − 22
Logarithmic score (LogS)   2.41                        s + 13
CRPS                       1.62                        10s + 8
Linear score (LinS)        0.5                         105s − 5
Probability score (PS)     0.2                         60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at r = 0.5 and r = 0.2, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but it involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes, and the logarithmic score penalizes the resulting predictions unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = 0.50 and α = 0.10. The associated lower and upper prediction bounds l_i and u_i are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean μ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_α(r) = (1/16,015) Σ_{i=1}^{16,015} S_α^int(l_i, u_i; x_i), r > 0, (58)

where S_α^int denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = 0.50 and α = 0.10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = 0.50 and α = 0.10, the score was optimized at r = 1.56 and r = 1.72.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = 0.50, dashed line) and the 90% central prediction interval (α = 0.10, solid line; score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules are also of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X₁, …, X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by the mean score

S_n(θ) = (1/n) Σ_{i=1}^n S(P_θ, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ₀ denotes the true parameter value, then asymptotic arguments indicate that

arg max_θ S_n(θ) → θ₀ as n → ∞. (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule, tailored to the problem at hand, and take θ̂_n = arg max_θ S_n(θ) as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.
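A minimal sketch (ours) of optimum score estimation based on the CRPS, fitting a Gaussian model by maximizing the mean score S_n(θ); the optimizer and the log-σ parameterization, which keeps the scale positive, are our own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_mean_crps(params, x):
    """Negative mean CRPS score S_n(theta) for the Gaussian model N(mu, sigma^2)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    z = (x - mu) / sigma
    crps = -sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                     - 1 / np.sqrt(np.pi))
    return -np.mean(crps)            # minimizing the negative maximizes the score

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=2000)
res = minimize(neg_mean_crps, x0=np.zeros(2), args=(x,))
print(res.x[0], np.exp(res.x[1]))    # optimum score estimates, near (1, 2)
```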

The appeal of optimum score estimation lies in the potential adaptation of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I, θ) = cλ(I) − 1{θ ∈ I}, (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_α(I, θ) = λ(I) + (2/α) inf_{η∈I} |θ − η| (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.
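The loss (61) is straightforward to compute, as in the following toy check (ours): a degenerate interval that shrinks to a point saves width but pays through the distance penalty whenever it misses, so the paradoxical estimator no longer looks attractive.

```python
import numpy as np

def interval_loss(lo, hi, theta, alpha):
    """Loss (61): interval length plus a distance-based penalty for noncoverage."""
    dist = np.maximum(np.maximum(lo - theta, theta - hi), 0.0)  # inf over eta in I of |theta - eta|
    return (hi - lo) + (2.0 / alpha) * dist

print(interval_loss(-1.96, 1.96, 0.5, alpha=0.05))   # covers: loss equals the width
print(interval_loss(0.0, 0.0, 0.5, alpha=0.05))      # collapsed interval: 40 * 0.5 = 20
```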

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on ℝ^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ ℝ^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ ℝ^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
——— (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
——— (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
——— (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics—Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
——— (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
——— (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
——— (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
——— (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
——— (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
——— (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
——— (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
——— (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
——— (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.

Page 3: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster

Gneiting and Raftery Proper Scoring Rules 361

It is strictly convex if (3) holds with equality if and only if P0 =P1 A function Glowast(P middot) rarr R is a subtangent of G at thepoint P isinP if it is integrable with respect to P quasi-integrablewith respect to all Q isinP and

G(Q) ge G(P) +int

Glowast(Pω)d(Q minus P)(ω) (4)

for all Q isin P The following characterization theorem is moregeneral and considerably simpler than previous results of Mc-Carthy (1956) and Hendrickson and Buehler (1971)

Definition 1 A scoring rule S P times rarr R is regular rel-ative to the class P if S(PQ) is real-valued for all PQ isin P except possibly that S(PQ) = minusinfin if P = Q

Theorem 1. A regular scoring rule S : P × Ω → R̄ is proper relative to the class P if and only if there exists a convex, real-valued function G on P such that

$$S(P, \omega) = G(P) - \int_{\Omega} G^*(P, \omega)\, dP(\omega) + G^*(P, \omega) \qquad (5)$$

for P ∈ P and ω ∈ Ω, where G*(P, ·) : Ω → R̄ is a subtangent of G at the point P ∈ P. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Proof. If the scoring rule S is of the stated form, then the subtangent inequality (4) implies the defining inequality (1), that is, propriety. Conversely, suppose that S is a regular proper scoring rule. Define G : P → R by G(P) = S(P, P) = sup_{Q∈P} S(Q, P), which is the pointwise supremum over a class of convex functions and thus is convex on P. Furthermore, the subtangent inequality (4) holds with G*(P, ω) = S(P, ω). This implies the representation (5) and proves the claim for propriety. By an argument of Hendrickson and Buehler (1971), strict inequality in (1) is equivalent to no subtangent of G at P being a subtangent of G at Q, for P, Q ∈ P and P ≠ Q, which is equivalent to G being strictly convex on P.

Expressed slightly differently, a regular scoring rule S is proper relative to the class P if and only if the expected score function G(P) = S(P, P) is convex and S(P, ·) is a subtangent of G at the point P, for all P ∈ P.

2.2 Information Measures, Bregman Divergences, and Decision Theory

Suppose that the scoring rule S is proper relative to the class P. Following Grünwald and Dawid (2004) and Buja, Stuetzle, and Shen (2005), we call the expected score function

$$G(P) = \sup_{Q \in \mathcal{P}} S(Q, P), \qquad P \in \mathcal{P}, \qquad (6)$$

the information measure or generalized entropy function associated with the scoring rule S. This is the maximally achievable utility; the term entropy function is used as well. If S is regular and proper, then we call

$$d(P, Q) = S(Q, Q) - S(P, Q), \qquad P, Q \in \mathcal{P}, \qquad (7)$$

the associated divergence function. Note the order of the arguments, which differs from previous practice in that the true distribution Q is preceded by an alternative probabilistic forecast P. The divergence function is nonnegative, and if S is strictly proper, then d(P, Q) is strictly positive unless P = Q. If the sample space is finite and the entropy function is sufficiently smooth, then the divergence function becomes the Bregman divergence (Bregman 1967) associated with the convex function G. Bregman divergences play major roles in optimization and have recently attracted the attention of the machine learning community (Collins, Schapire, and Singer 2002). The term Bregman distance is also used, even though d(P, Q) is not necessarily the same as d(Q, P).

An interesting problem is to find conditions under which a divergence function d is a score divergence, in the sense that it admits the representation (7) for a proper scoring rule S, and to describe principled ways of finding such a scoring rule. The landmark work by Savage (1971) provides a necessary condition on a symmetric divergence function d to be a score divergence: if P and Q are concentrated on the same two mutually exclusive events and identified with the respective probabilities p, q ∈ [0, 1], then d(P, Q) reduces to a linear function of (p − q)². Dawid (1998) noted that if d is a score divergence, then d(P, Q) − d(P′, Q) is an affine function of Q for all P, P′ ∈ P, and proved a partial converse.

Friedman (1983) and Nau (1985) studied a looser type of relationship between proper scoring rules and distance measures on classes of probability distributions. They restricted attention to metrics (i.e., distance measures that are symmetric and satisfy the triangle inequality) and called a scoring rule S effective with respect to a metric d if

$$S(P_1, Q) \ge S(P_2, Q) \iff d(P_1, Q) \le d(P_2, Q).$$

Nau (1985) called a metric co-effective if there is a proper scoring rule that is effective with respect to it. His proposition 1 implies that the ℓ1, ℓ∞, and Hellinger distances on spaces of absolutely continuous probability measures are not co-effective.

Sections 3–5 provide numerous examples of proper scoring rules on general sample spaces, along with the associated entropy and divergence functions. For example, the logarithmic score is linked to Shannon entropy and Kullback–Leibler divergence. Dawid (1998, 2006), Grünwald and Dawid (2004), and Buja et al. (2005) have given further examples of proper scoring rules, entropy, and divergence functions, and have elaborated on the connections to the Bregman divergence.

Proper scoring rules occur naturally in statistical decision problems (Dawid 1998). Given an outcome space and an action space, let U(ω, a) be the utility for outcome ω and action a, and let P be a convex class of probability measures on the outcome space. Let a_P denote the Bayes act for P ∈ P. Then the scoring rule

$$S(P, \omega) = U(\omega, a_P)$$

is proper relative to the class P. Indeed,

$$S(Q, Q) = \int U(\omega, a_Q)\, dQ(\omega) \ge \int U(\omega, a_P)\, dQ(\omega) = S(P, Q),$$

by the fact that the optimal Bayesian decision maximizes expected utility. Dawid (2006) has given details and discussed the generality of the construction.


2.3 Skill Scores

In practice, scores are aggregated, and competing forecast procedures are ranked by the average score

$$S_n = \frac{1}{n} \sum_{i=1}^{n} S(P_i, x_i)$$

over a fixed set of forecast situations. We give examples of this in case studies in Sections 6 and 8. Recommendations for choosing a scoring rule have been given by Winkler (1994, 1996), by Buja et al. (2005), and throughout this article.

Scores for competing forecast procedures are directly comparable if they refer to exactly the same set of forecast situations. If scores for distinct sets of situations are compared, then considerable care must be exercised to separate the confounding effects of intrinsic predictability and predictive performance. For instance, there is substantial spatial and temporal variability in the predictability of weather and climate elements (Langland et al. 1999; Campbell and Diebold 2005). Thus a score that is superior for a given location or season might be inferior for another, or vice versa. To address this issue, atmospheric scientists have put forth skill scores of the form

$$S_n^{\mathrm{skill}} = \frac{S_n^{\mathrm{fcst}} - S_n^{\mathrm{ref}}}{S_n^{\mathrm{opt}} - S_n^{\mathrm{ref}}}, \qquad (8)$$

where S_n^fcst is the forecaster's score, S_n^opt refers to a hypothetical ideal or optimal forecast, and S_n^ref is the score for a reference strategy (Murphy 1973; Potts 2003, p. 27; Briggs and Ruppert 2005; Wilks 2006, p. 259). Skill scores are standardized in that (8) takes the value 1 for an optimal forecast, which is typically understood as a point measure in the event or value that materializes, and the value 0 for the reference forecast. Negative values of a skill score indicate forecasts that are of lesser quality than the reference. The reference forecast is typically a climatological forecast, that is, an estimate of the marginal distribution of the predictand. For example, a climatological probabilistic forecast for maximum temperature on Independence Day in Seattle, Washington, might be a smoothed version of the local historic record of July 4 maximum temperatures. Climatological forecasts are independent of the forecast horizon; they are calibrated by construction, but often lack sharpness.

Unfortunately, skill scores of the form (8) are generally improper, even if the underlying scoring rule S is proper. Murphy (1973) studied hedging strategies in the case of the Brier skill score for probability forecasts of a dichotomous event. He showed that the Brier skill score is asymptotically proper, in the sense that the benefits of hedging become negligible as the number of independent forecasts grows. Similar arguments may apply to skill scores based on other proper scoring rules. Mason's (2004) claim of the propriety of the Brier skill score rests on unjustified approximations and generally is incorrect.

3. SCORING RULES FOR CATEGORICAL VARIABLES

We now review the representations of Savage (1971) and Schervish (1989) that characterize scoring rules for probabilistic forecasts of categorical and binary variables, and give examples of proper scoring rules.

3.1 Savage Representation

We consider probabilistic forecasts of a categorical variable. Thus the sample space Ω = {1, …, m} consists of a finite number m of mutually exclusive events, and a probabilistic forecast is a probability vector (p1, …, pm). Using the notation of Section 2, we consider the convex class P = Pm, where

$$\mathcal{P}_m = \{ p = (p_1, \ldots, p_m) : p_1, \ldots, p_m \ge 0,\ p_1 + \cdots + p_m = 1 \}.$$

A scoring rule S can then be identified with a collection of m functions

$$S(\cdot, i) : \mathcal{P}_m \to \bar{\mathbb{R}}, \qquad i = 1, \ldots, m.$$

In other words, if the forecaster quotes the probability vector p and the event i materializes, then his or her reward is S(p, i). Theorem 2 is a special case of Theorem 1 and provides a rigorous version of the Savage (1971) representation of proper scoring rules on finite sample spaces. Our contributions lie in the notion of regularity, the rigorous treatment, and the introduction of appropriate tools for convex analysis (Rockafellar 1970, sects. 23–25). Specifically, let G : Pm → R be a convex function. A vector G′(p) = (G′1(p), …, G′m(p)) is a subgradient of G at the point p ∈ Pm if

$$G(q) \ge G(p) + \langle G'(p),\, q - p \rangle \qquad (9)$$

for all q ∈ Pm, where ⟨·, ·⟩ denotes the standard scalar product. If G is differentiable at an interior point p ∈ Pm, then G′(p) is unique and equals the gradient of G at p. We assume that the components of G′(p) are real-valued, except that we permit G′i(p) = −∞ if pi = 0.

Definition 2. A scoring rule S for categorical forecasts is regular if S(·, i) is real-valued for i = 1, …, m, except possibly that S(p, i) = −∞ if pi = 0.

Regular scoring rules assign finite scores, except that a forecast might receive a score of −∞ if an event claimed to be impossible is realized. The logarithmic scoring rule (Good 1952) provides a prominent example of this.

Theorem 2 (McCarthy; Savage). A regular scoring rule S for categorical forecasts is proper if and only if

$$S(p, i) = G(p) - \langle G'(p),\, p \rangle + G'_i(p) \qquad \text{for } i = 1, \ldots, m, \qquad (10)$$

where G : Pm → R is a convex function and G′(p) is a subgradient of G at the point p, for all p ∈ Pm. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Phrased slightly differently, a regular scoring rule S is proper if and only if the expected score function G(p) = S(p, p) is convex on Pm and the vector with components S(p, i), for i = 1, …, m, is a subgradient of G at the point p, for all p ∈ Pm. In view of these results, every bounded convex function G on Pm generates a regular proper scoring rule. This function G becomes the expected score function, information measure, or entropy function (6) associated with the score. The divergence function (7) is the respective Bregman distance.

We now give a number of examples. The scoring rules in Examples 1–3 are strictly proper. The score in Example 4 is proper but not strictly proper.


Example 1 (Quadratic or Brier score). If G(p) = Σ_{j=1}^m p_j² − 1, then (10) yields the quadratic score or Brier score

$$S(p, i) = -\sum_{j=1}^{m} (\delta_{ij} - p_j)^2 = 2 p_i - \sum_{j=1}^{m} p_j^2 - 1,$$

where δij = 1 if i = j, and δij = 0 otherwise. The associated Bregman divergence is the squared Euclidean distance, d(p, q) = Σ_{j=1}^m (p_j − q_j)². This well-known scoring rule was proposed by Brier (1950); Selten (1998) gave an axiomatic characterization.

Example 2 (Spherical score). Let α > 1 and consider the generalized entropy function G(p) = (Σ_{j=1}^m p_j^α)^{1/α}. This corresponds to the pseudospherical score

$$S(p, i) = \frac{p_i^{\alpha - 1}}{\left( \sum_{j=1}^{m} p_j^\alpha \right)^{(\alpha - 1)/\alpha}},$$

which reduces to the traditional spherical score when α = 2. The associated Bregman divergence is

$$d(p, q) = \left( \sum_{j=1}^{m} q_j^\alpha \right)^{1/\alpha} - \frac{\sum_{j=1}^{m} q_j\, p_j^{\alpha - 1}}{\left( \sum_{j=1}^{m} p_j^\alpha \right)^{(\alpha - 1)/\alpha}}.$$

Example 3 (Logarithmic score). Negative Shannon entropy G(p) = Σ_{j=1}^m p_j log p_j corresponds to the logarithmic score, S(p, i) = log p_i. The associated Bregman distance is the Kullback–Leibler divergence, d(p, q) = Σ_{j=1}^m q_j log(q_j / p_j). [Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule dates back at least to Good (1952). Information-theoretic perspectives and interpretations in terms of gambling returns have been given by Roulston and Smith (2002) and Daley and Vere-Jones (2004). Despite its popularity, the logarithmic score has been criticized for its unboundedness, with Selten (1998, p. 51) arguing that it entails value judgments that are unacceptable. Feuerverger and Rahman (1992) noted a connection to Neyman–Pearson theory and an ensuing optimality property of the logarithmic score.

Example 4 (Zero–one score). The zero–one scoring rule rewards a probabilistic forecast if the mode of the predictive distribution materializes. In case of multiple modes, the reward is reduced proportionally, that is,

$$S(p, i) = \begin{cases} 1 / |M(p)| & \text{if } i \in M(p), \\ 0 & \text{otherwise}, \end{cases}$$

where M(p) = {i : p_i = max_{j=1,…,m} p_j} denotes the set of modes of p. This is also known as the misclassification loss, and the meteorological literature uses the term success rate to denote case-averaged zero–one scores (see, e.g., Toth, Zhu, and Marchok 2001). The associated expected score or generalized entropy function (6) is G(p) = max_{j=1,…,m} p_j, and the divergence function (7) becomes

$$d(p, q) = \max_{j=1,\ldots,m} q_j - \frac{1}{|M(p)|} \sum_{j \in M(p)} q_j.$$

This does not define a Bregman divergence, because the entropy function is neither differentiable nor strictly convex.
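As a minimal illustrative sketch, the scores in Examples 1–4 can be evaluated directly from the foregoing formulas. The Python functions below are positively oriented; the probability vector is an arbitrary example of ours.

```python
import numpy as np

def brier(p, i):
    # Example 1: S(p, i) = 2 p_i - sum_j p_j^2 - 1
    return 2 * p[i] - np.sum(p ** 2) - 1

def pseudospherical(p, i, alpha=2.0):
    # Example 2: S(p, i) = p_i^(alpha-1) / (sum_j p_j^alpha)^((alpha-1)/alpha);
    # alpha = 2 gives the traditional spherical score
    return p[i] ** (alpha - 1) / np.sum(p ** alpha) ** ((alpha - 1) / alpha)

def logarithmic(p, i):
    # Example 3: S(p, i) = log p_i, with score -infinity if p_i = 0
    return np.log(p[i]) if p[i] > 0 else -np.inf

def zero_one(p, i):
    # Example 4: reward 1/|M(p)| if i lies in the set of modes M(p)
    modes = np.flatnonzero(p == p.max())
    return 1.0 / len(modes) if i in modes else 0.0

p = np.array([0.2, 0.5, 0.3])
print(brier(p, 1), pseudospherical(p, 1), logarithmic(p, 1), zero_one(p, 1))
```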

The scoring rules in the foregoing examples are symmetric, in the sense that

$$S((p_1, \ldots, p_m), i) = S((p_{\pi_1}, \ldots, p_{\pi_m}), \pi_i) \qquad (11)$$

for all p ∈ Pm, for all permutations π on m elements, and for all events i = 1, …, m. Winkler (1994, 1996) argued that symmetric rules do not always appropriately reward forecasting skill and called for asymmetric ones, particularly in situations in which skill scores traditionally have been used. Asymmetric proper scoring rules can be generated by applying Theorem 2 to convex functions G that are not invariant under coordinate permutation.

3.2 Schervish Representation

The classical case of a probability forecast for a dichotomous event suggests further discussion. We follow Dawid (1986) in considering the sample space Ω = {1, 0}. A probabilistic forecast is a quoted probability p ∈ [0, 1] for the event to occur. A scoring rule S can be identified with a pair of functions S(·, 1) : [0, 1] → R̄ and S(·, 0) : [0, 1] → R̄. Thus S(p, 1) is the forecaster's reward if he or she quotes p and the event materializes, and S(p, 0) is the reward if he or she quotes p and the event does not materialize. Note the subtle change from the previous section, where we used the convex class P2 = {(p1, p2) ∈ R² : p1 ∈ [0, 1], p2 = 1 − p1}, in place of the unit interval P = [0, 1], to represent probability measures on binary sample spaces.

A scoring rule for binary variables is regular if S(·, 1) and S(·, 0) are real-valued, except possibly that S(0, 1) = −∞ or S(1, 0) = −∞. A variant of Theorem 2 shows that every regular proper scoring rule is of the form

$$S(p, 1) = G(p) + (1 - p)\, G'(p), \qquad S(p, 0) = G(p) - p\, G'(p), \qquad (12)$$

where G : [0, 1] → R is a convex function and G′(p) is a subgradient of G at the point p ∈ [0, 1], in the sense that

$$G(q) \ge G(p) + G'(p)(q - p)$$

for all q ∈ [0, 1]. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex. The subgradient G′(p) is real-valued, except that we permit G′(0) = −∞ and G′(1) = ∞. The function G is the expected score function, G(p) = p S(p, 1) + (1 − p) S(p, 0), and if G is differentiable at an interior point p ∈ (0, 1), then G′(p) is unique and equals the derivative of G at p. Related but slightly less general results were given by Shuford, Albert, and Massengill (1966). Figure 1 provides a geometric interpretation.

The Savage representation (12) implies various interesting properties of regular proper scoring rules. For instance, we conclude from theorem 24.2 of Rockafellar (1970) that

$$S(p, 1) = \lim_{q \to 1} G(q) - \int_p^1 (G'(q) - G'(p))\, dq \qquad (13)$$

for p ∈ (0, 1), and because G′(p) is increasing, S(p, 1) is increasing as well. Similarly, S(p, 0) is decreasing, as would be intuitively expected. The statements hold with proper, increasing, and decreasing replaced by strictly proper, strictly increasing, and strictly decreasing. Alternative proofs of these and other results have been given by Schervish (1989, appendix).


Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast p ∈ [0, 1], the expected score S(p, q) = q S(p, 1) + (1 − q) S(p, 0) equals the ordinate of the tangent to G at p [the solid line with slope G′(p)] when evaluated at q ∈ [0, 1]. In particular, the scores S(p, 0) = G(p) − p G′(p) and S(p, 1) = G(p) + (1 − p) G′(p) can be read off the tangent when evaluated at q = 0 and q = 1. The Bregman divergence d(p, q) = S(q, q) − S(p, q) equals the difference between G and its tangent at p when evaluated at q. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter we let 1{·} denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that S is a regular scoring rule. Then S is proper and such that S(0, 1) = lim_{p→0} S(p, 1) and S(0, 0) = lim_{p→0} S(p, 0), and both S(p, 1) and S(p, 0) are left continuous, if and only if there exists a nonnegative measure ν on (0, 1) such that

$$S(p, 1) = S(1, 1) - \int (1 - c)\, \mathbf{1}\{p \le c\}\, \nu(dc), \qquad S(p, 0) = S(0, 0) - \int c\, \mathbf{1}\{p > c\}\, \nu(dc) \qquad (14)$$

for all p ∈ [0, 1]. The scoring rule is strictly proper if and only if ν assigns positive measure to every open interval.

Sketch of Proof. Suppose that S satisfies the assumptions of the theorem. To prove that S(p, 1) is of the form (14), consider the representation (13), identify the increasing function G′(p) with the left-continuous distribution function of a nonnegative measure ν on (0, 1), and apply the partial integration formula. The proof of the representation for S(p, 0) is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost–loss ratio c ∈ (0, 1) that reflects the relative costs of the two possible types of inferior decision. The measure ν(dc) in Schervish's representation (14) assigns relevance to distinct cost–loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted asymmetric zero–one scores,

$$S_c(p, 1) = (1 - c)\, \mathbf{1}\{p > c\}, \qquad S_c(p, 0) = c\, \mathbf{1}\{p \le c\}, \qquad (15)$$

with a nonnegative mixing measure ν(dc). Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure ν(dc) for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero–one score. If the expected score function G is smooth, then ν(dc) has Lebesgue density G″(c) (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, G(p) = p log p + (1 − p) log(1 − p), and corresponds to the infinite measure with Lebesgue density (c(1 − c))^{−1}.

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule   S(p, 1)                        S(p, 0)                              ν(dc)
Brier          −(1 − p)²                      −p²                                  Uniform
Spherical      p(1 − 2p + 2p²)^{−1/2}         (1 − p)(1 − 2p + 2p²)^{−1/2}         (1 − 2c + 2c²)^{−3/2}
Logarithmic    log p                          log(1 − p)                           (c(1 − c))^{−1}
Zero–one       (1 − c) 1{p > c}               c 1{p ≤ c}                           Point measure in c

Example 5 (Beta family). Let α, β > −1 and consider the two-parameter family

$$S(p, 1) = -\int_p^1 c^{\alpha - 1} (1 - c)^{\beta}\, dc, \qquad S(p, 0) = -\int_0^p c^{\alpha} (1 - c)^{\beta - 1}\, dc,$$

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^{α−1}(1 − c)^{β−1}. This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p, 1) = p − 1 and S(p, 0) = p + log(1 − p) being one such example (α = 1, β = 0).

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0, 1), then

$$S^*(p, 1) = \frac{S(p, 1) - S(c, 1)}{T(c, p)}, \qquad S^*(p, 0) = \frac{S(p, 0) - S(c, 0)}{T(c, p)}, \qquad (16)$$

where T(c, p) = S(0, 0) − S(c, 0) if p ≤ c, and T(c, p) = S(1, 1) − S(c, 1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

$$S^*(p, 1) = \frac{(1 - c)^2 - (1 - p)^2}{c^2\, \mathbf{1}\{p \le c\} + (1 - c)^2\, \mathbf{1}\{p > c\}}, \qquad S^*(p, 0) = \frac{c^2 - p^2}{c^2\, \mathbf{1}\{p \le c\} + (1 - c)^2\, \mathbf{1}\{p > c\}}, \qquad (17)$$

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.
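As an illustration, the standardized score (17) is straightforward to evaluate. The following Python sketch assumes a baseline probability c = .2, matching the choice in Figure 2.

```python
def winkler_score(p, outcome, c=0.2):
    # Standardized asymmetric score (17), the Brier-based special case of (16);
    # positively oriented, with expected score 0 at p = c and maximum 1 at p = 0, 1
    denom = c ** 2 if p <= c else (1 - c) ** 2
    num = (1 - c) ** 2 - (1 - p) ** 2 if outcome == 1 else c ** 2 - p ** 2
    return num / denom

# Quoting the baseline p = c scores 0 whatever materializes;
# sharp, correct forecasts approach the maximum score of 1
for p in (0.2, 0.05, 0.9):
    print(p, winkler_score(p, 1), winkler_score(p, 0))
```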

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let μ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to μ and have μ-density p such that

$$\|p\|_\alpha = \left( \int p(\omega)^\alpha\, \mu(d\omega) \right)^{1/\alpha}$$

is finite. We identify a probabilistic forecast P ∈ L_α with its μ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of μ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω)) / μ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score

$$\mathrm{QS}(p, \omega) = 2 p(\omega) - \|p\|_2^2 \qquad (18)$$

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function, d(p, q) = ‖p − q‖₂², is symmetric. Good (1971) proposed the pseudospherical score

$$\mathrm{PseudoS}(p, \omega) = \frac{p(\omega)^{\alpha - 1}}{\|p\|_\alpha^{\alpha - 1}},$$

which reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that in a measure-theoretic framework is obsolete. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function G(p) = ‖p‖_α and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score

$$\mathrm{LogS}(p, \omega) = \log p(\omega) \qquad (19)$$

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then, under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by μ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score, LinS(p, ω) = p(ω), is not a proper scoring rule, despite its intuitive appeal. For instance, let φ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < (log 2)^{1/2}, then

$$\mathrm{LinS}(u, \varphi) = \frac{1}{(2\pi)^{1/2}}\, \frac{1}{2\varepsilon} \int_{-\varepsilon}^{\varepsilon} e^{-x^2/2}\, dx \; > \; \frac{1}{2 \pi^{1/2}} = \mathrm{LinS}(\varphi, \varphi),$$

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

$$\mathrm{CRPS}(F, x) = -\int_{-\infty}^{\infty} (F(y) - \mathbf{1}\{y \ge x\})^2\, dy \qquad (20)$$

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

$$\mathrm{CRPS}(F, x) = \frac{1}{2} \mathrm{E}_F |X - X'| - \mathrm{E}_F |X - x|, \qquad (21)$$

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean μ and variance σ², then it follows that

$$\mathrm{CRPS}(\mathcal{N}(\mu, \sigma^2), x) = \sigma \left[ \frac{1}{\sqrt{\pi}} - 2 \varphi\!\left( \frac{x - \mu}{\sigma} \right) - \frac{x - \mu}{\sigma} \left( 2 \Phi\!\left( \frac{x - \mu}{\sigma} \right) - 1 \right) \right],$$

where φ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics, in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
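Both evaluation routes admit short implementations. The following Python sketch (assuming NumPy and SciPy) codes the Gaussian closed form above and the sample version based on the representation (21), with E_F|X − X′| computed from the order statistics in O(n log n) operations.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    # Positively oriented CRPS of N(mu, sigma^2) at x, per the closed form above
    z = (x - mu) / sigma
    return sigma * (1 / np.sqrt(np.pi) - 2 * norm.pdf(z) - z * (2 * norm.cdf(z) - 1))

def crps_sample(sample, x):
    # Positively oriented CRPS (21) for an equally weighted forecast sample;
    # E|X - X'| follows from sorted values: sum_{i<j} (x_(j) - x_(i))
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    e_pair = 2.0 * np.sum((2 * np.arange(1, n + 1) - n - 1) * s) / n ** 2
    return 0.5 * e_pair - np.mean(np.abs(s - x))

sample = np.random.default_rng(0).normal(0.0, 1.0, 10_000)
print(crps_gaussian(0.0, 1.0, 0.5), crps_sample(sample, 0.5))  # close agreement
```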

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

$$G(F) = -\int_{-\infty}^{\infty} F(y)(1 - F(y))\, dy = -\frac{1}{2} \mathrm{E}_F |X - X'|,$$

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

$$d(F, G) = \int_{-\infty}^{\infty} (F(y) - G(y))^2\, dy,$$

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

$$\mathrm{CRPS}^*(F, x) = \mathrm{E}_F |X - x| - \frac{1}{2} \mathrm{E}_F |X - X'|,$$

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P ‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score

$$\mathrm{ES}(P, x) = \frac{1}{2} \mathrm{E}_P \|X - X'\|^\beta - \mathrm{E}_P \|X - x\|^\beta, \qquad (22)$$

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

$$\mathrm{ES}(P, x) = -\|\mu_P - x\|^2, \qquad (23)$$

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.
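If P is represented by a simulation sample, the expectations in (22) can be replaced by sample averages. The following Python sketch computes this plug-in estimate of the energy score; the bivariate Gaussian forecast sample is an arbitrary example of ours.

```python
import numpy as np

def energy_score(sample, x, beta=1.0):
    # Plug-in estimate of the energy score (22): the expectations are replaced
    # by averages over a forecast sample (rows are draws from P in R^m)
    s = np.atleast_2d(np.asarray(sample, dtype=float))
    pair = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1) ** beta
    to_x = np.linalg.norm(s - np.asarray(x, dtype=float), axis=-1) ** beta
    return 0.5 * pair.mean() - to_x.mean()

# 1,000 draws from a standard bivariate Gaussian forecast, scored at x = (0.3, -0.2)
rng = np.random.default_rng(0)
print(energy_score(rng.standard_normal((1000, 2)), [0.3, -0.2]))
```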

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m, by defining

$$\mathrm{ES}(P, x) = -\frac{\beta\, 2^{\beta - 2}\, \Gamma(\frac{m}{2} + \frac{\beta}{2})}{\pi^{m/2}\, \Gamma(1 - \frac{\beta}{2})} \int_{\mathbb{R}^m} \frac{|\phi_P(y) - e^{i \langle x, y \rangle}|^2}{\|y\|^{m + \beta}}\, dy, \qquad (24)$$

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

$$S(P, x) = -\log \det \Sigma_P - (x - \mu_P)' \Sigma_P^{-1} (x - \mu_P), \qquad (25)$$

which is linked to the generalized entropy function

$$G(P) = -\log \det \Sigma_P - m$$

and to the divergence function

$$d(P, Q) = \mathrm{tr}(\Sigma_P^{-1} \Sigma_Q) - \log \det(\Sigma_P^{-1} \Sigma_Q) + (\mu_P - \mu_Q)' \Sigma_P^{-1} (\mu_P - \mu_Q) - m.$$


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P ‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x₁, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

$$\mathrm{PMCC} = \sum_{i=1}^{n} (x_i - \mu_i)^2 + \sum_{i=1}^{n} \sigma_i^2,$$

where μ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

$$S(P, x) = -(x - \mu_P)^2 - \sigma_P^2, \qquad (26)$$

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P, that is, a deterministic forecast, rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

$$S(P, x) = -\left( \frac{x - \mu_P}{\sigma_P} \right)^2 - \log \sigma_P^2 \qquad (27)$$

in the case in which m = 1 and the observations are real-valued.
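The impropriety of (26), and the propriety of (27), can be checked numerically. In the following Python sketch, the true belief is taken to be standard Gaussian, a hypothetical choice of ours; shrinking the quoted variance inflates the expected value of (26), whereas (27) is maximized by quoting the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500_000)        # observations from the true belief N(0, 1)

def score_26(x, mu, var):
    # The improper PMCC score (26): -(x - mu_P)^2 - sigma_P^2
    return -(x - mu) ** 2 - var

def score_27(x, mu, var):
    # The proper alternative (27): -((x - mu_P)/sigma_P)^2 - log sigma_P^2
    return -((x - mu) ** 2) / var - np.log(var)

for var in (1.0, 0.25, 1e-6):            # shrinking the quoted variance games (26)
    print(var, score_26(x, 0.0, var).mean(), score_27(x, 0.0, var).mean())
```

The mean of (26) keeps increasing as the quoted variance shrinks toward a point forecast, whereas the mean of (27) peaks at the true variance of 1.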

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a₁, …, a_n ∈ R that sum to 0, and all x₁, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

$$S(P, x) = \frac{1}{2} \mathrm{E}_P\, g(X, X') - \mathrm{E}_P\, g(X, x) \qquad (28)$$

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q. We need to show that

$$-\frac{1}{2} \mathrm{E}_Q\, g(Y, Y') \ge \frac{1}{2} \mathrm{E}_P\, g(X, X') - \mathrm{E}_{P,Q}\, g(X, Y). \qquad (29)$$

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms, as the sketch below illustrates.
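For instance, a generic implementation of the kernel score (28) for a discrete forecast takes only a few lines of Python; taking g(x, x′) = |x − x′| below recovers the CRPS (Example 8, which follows) for a three-point predictive distribution of our choosing.

```python
import numpy as np

def kernel_score(atoms, weights, x, g):
    # Kernel score (28) for a discrete forecast P = sum_k w_k * delta_{a_k}:
    # S(P, x) = 0.5 E_P g(X, X') - E_P g(X, x)
    w = np.asarray(weights, dtype=float)
    pair = np.array([[g(a, b) for b in atoms] for a in atoms])
    return 0.5 * w @ pair @ w - w @ np.array([g(a, x) for a in atoms])

# With g(x, x') = |x - x'|, this is the CRPS of a three-point forecast
print(kernel_score([0.0, 1.0, 2.0], [0.3, 0.4, 0.3], 1.5, lambda a, b: abs(a - b)))
```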

Example 7 (Quadratic or Brier score). Let Ω = {1, 0}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = R and g(x, x′) = |x − x′| for x, x′ ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0, 2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ R^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus

$$S(P, \theta) = \frac{1}{2} \mathrm{E}_P\, \alpha(\Theta, \Theta') - \mathrm{E}_P\, \alpha(\Theta, \theta) \qquad (30)$$

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.

We turn to a far-reaching generalization of the energy score. For x = (x₁, …, x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (Σ_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_∞ = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

$$g(x, x') = \|x - x'\|_\alpha^\beta, \qquad x, x' \in \mathbb{R}^m,$$

is negative definite if and only if the following holds.


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

$$S(P, x) = \frac{1}{2} \mathrm{E}_P \|X - X'\|_\alpha^\beta - \mathrm{E}_P \|X - x\|_\alpha^\beta$$

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P ‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q} ‖X − Y‖_α^β is finite if and only if E_P ‖X‖_α^β and E_Q ‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P ‖X − X′‖_α^β is finite if and only if E_P ‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^(k) of all orders and (−1)^k η^(k)(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

$$S(P, x) = \frac{1}{2} \mathrm{E}_P\, \psi(\|X - X'\|_2^2) - \mathrm{E}_P\, \psi(\|X - x\|_2^2)$$

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P ‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

$$2\, \mathrm{E}_{P,Q}\, g(X, Y) - \mathrm{E}_P\, g(X, X') - \mathrm{E}_Q\, g(Y, Y') \ge 0. \qquad (31)$$

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

$$\mathrm{E}_P\, g(X, -X') \ge \mathrm{E}_P\, g(X, X'). \qquad (32)$$

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

$$2\, \mathrm{E} \|X - Y\|_\alpha^\beta - \mathrm{E} \|X - X'\|_\alpha^\beta - \mathrm{E} \|Y - Y'\|_\alpha^\beta \ge 0 \qquad (33)$$

and

$$\mathrm{E} \|X - X'\|_\alpha^\beta \le \mathrm{E} \|X + X'\|_\alpha^\beta, \qquad (34)$$

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

$$h(x, x') = g(x, -x') - g(x, x'), \qquad x, x' \in \Omega, \qquad (35)$$

is a positive definite kernel, in the sense that h is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a₁, …, a_n ∈ R, and all x₁, …, x_n ∈ Ω. Specifically, under Assumption 1,

$$h(x, x') = \|x + x'\|_\alpha^\beta - \|x - x'\|_\alpha^\beta, \qquad x, x' \in \mathbb{R}^m, \qquad (36)$$

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) equals the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and Σ_{i=1}^n Σ_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c₁, …, c_n ∈ C, and all x₁, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

$$S(P, x) = \mathrm{E}_P\, h(X, x) + \mathrm{E}_P\, h(x, X) - \mathrm{E}_P\, h(X, X') \qquad (37)$$

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x − x′, y⟩} − 1, where x, x′ ∈ R^m. Then (37) reduces to

$$S(P, x) = -\left| \phi_P(y) - e^{i \langle x, y \rangle} \right|^2, \qquad (38)$$

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

$$S(P, x) = -\int_{\mathbb{R}^m} \left| \phi_P(y) - e^{i \langle x, y \rangle} \right|^2 \mu(dy). \qquad (39)$$

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α₁, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r₁, …, r_k and x materializes, then he or she will be rewarded by the score S(r₁, …, r_k; x). We define

$$S(r_1, \ldots, r_k; P) = \int S(r_1, \ldots, r_k; x)\, dP(x)$$

as the expected score under the probability measure P when the forecaster quotes the quantiles r₁, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q₁, …, q_k denote the true P-quantiles at levels α₁, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

$$S(q_1, \ldots, q_k; P) \ge S(r_1, \ldots, r_k; P)$$

for all real numbers r₁, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

$$S(r, x) = \alpha s(r) + (s(x) - s(r))\, \mathbf{1}\{x \le r\} + h(x) \qquad (40)$$

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

$$S(q, P) - S(r, P) = \int_{(r,q]} s(x)\, dP(x) + s(r) P(r) - \alpha s(r) \ge s(r)(P(q) - P(r)) + s(r) P(r) - \alpha s(r) = 0,$$

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

$$S(r, x) = (x - r)(\mathbf{1}\{x \le r\} - \alpha), \qquad (41)$$

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
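A Python sketch of the scoring rule (41), together with a Monte Carlo check that the expected score is maximized at the true quantile, follows; the standard Gaussian example and the level α = .9 are arbitrary choices of ours.

```python
import numpy as np

def quantile_score(r, x, alpha):
    # Positively oriented scoring rule (41): S(r, x) = (x - r)(1{x <= r} - alpha)
    return (x - r) * ((x <= r) - alpha)

# The expected score is maximized near the true 0.9-quantile of N(0, 1),
# which is approximately 1.2816
rng = np.random.default_rng(1)
xs = rng.standard_normal(200_000)
for r in (0.0, 1.0, 1.2816, 2.0):
    print(f"r = {r:6.4f}   mean score = {quantile_score(r, xs, 0.9).mean():8.5f}")
```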

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

$$S(r_1, \ldots, r_k; x) = \sum_{i=1}^{k} \left[ \alpha_i s_i(r_i) + (s_i(x) - s_i(r_i))\, \mathbf{1}\{x \le r_i\} \right] + h(x) \qquad (42)$$

is proper for predicting the quantiles at levels α₁, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α₁ = α/2, α₂ = 1 − α/2, s₁(x) = s₂(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score,

$$S_\alpha^{\mathrm{int}}(l, u; x) = (u - l) + \frac{2}{\alpha}(l - x)\, \mathbf{1}\{x < l\} + \frac{2}{\alpha}(x - u)\, \mathbf{1}\{x > u\}. \qquad (43)$$

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
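The interval score is easily coded. The following Python sketch evaluates (43), with α = .05 corresponding to a central 95% prediction interval.

```python
def interval_score(l, u, x, alpha=0.05):
    # Negatively oriented interval score (43) for the central
    # (1 - alpha) x 100% prediction interval [l, u]: the width plus a
    # penalty of (2/alpha) times the distance by which x misses the interval
    return (u - l) + (2 / alpha) * max(l - x, 0.0) + (2 / alpha) * max(x - u, 0.0)

print(interval_score(-1.96, 1.96, 0.5))   # observation inside: width only
print(interval_score(-1.96, 1.96, 3.0))   # observation outside: width + penalty
```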


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process {X_t : t ∈ Z} defined by

$$X_{t+1} = \frac{1}{2} X_t + \frac{1}{2} X_t \varepsilon_t + \varepsilon_t, \qquad (44)$$

where the ε_t's are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2)X_t and variance (1 + (1/2)X_t)², thereby suggesting the prediction interval

$$I = \left[ \frac{1}{2} X_t - c \left| 1 + \frac{1}{2} X_t \right|, \; \frac{1}{2} X_t + c \left| 1 + \frac{1}{2} X_t \right| \right], \qquad (45)$$

where c = Φ^{−1}(.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5 and 97.5 percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

$$J = [F^{-1}(.025),\ F^{-1}(.975)], \qquad (46)$$

where F denotes the unconditional stationary distribution function of X_t, and

$$K = \left[ \frac{1}{2} X_t - \gamma\!\left( \left| 1 + \frac{1}{2} X_t \right| \right), \; \frac{1}{2} X_t + \gamma\!\left( \left| 1 + \frac{1}{2} X_t \right| \right) \right], \qquad (47)$$

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path {X_t : t = 1, …, 100,001} from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              95.01%               4.00            4.77
J (46)              95.08%               5.45            8.04
K (47)              94.98%               3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
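The experiment is simple to replicate in outline. The following Python sketch simulates the bilinear process (44) and scores the conditional interval I of (45); with a long sample path, the empirical coverage and average interval score should come out close to the values reported for I in Table 2.

```python
import numpy as np
from scipy.stats import norm

def interval_score(l, u, x, alpha=0.05):
    # negatively oriented interval score (43)
    return (u - l) + (2 / alpha) * max(l - x, 0.0) + (2 / alpha) * max(x - u, 0.0)

rng = np.random.default_rng(42)
c = norm.ppf(0.975)
n = 100_000
x, scores, hits = 0.0, [], 0
for _ in range(n):
    eps = rng.standard_normal()
    x_new = 0.5 * x + 0.5 * x * eps + eps      # bilinear process (44)
    half = c * abs(1 + 0.5 * x)                # conditional 95% interval I of (45)
    l, u = 0.5 * x - half, 0.5 * x + half
    scores.append(interval_score(l, u, x_new))
    hits += l <= x_new <= u
    x = x_new
print(hits / n, np.mean(scores))               # coverage and average interval score
```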

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

$$S(F, x) = \int_0^1 S_\alpha(F^{-1}(\alpha), x)\, \nu(d\alpha) \qquad (48)$$

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on R, then the scoring rule

$$S(F, x) = \int_{-\infty}^{\infty} S(F(y), \mathbf{1}\{x \le y\})\, \nu(dy) \qquad (49)$$

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on R^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

$$\mathrm{CRPS}(F, x) = -\int_{\mathbb{R}^m} (F(y) - \mathbf{1}\{x \le y\})^2\, \nu(dy).$$

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X₁, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models H₁ and H₂. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i's are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H₁ against H₂ is

$$B = \frac{P(X | H_1)}{P(X | H_2)}, \qquad (50)$$

where P(X | H_k) = ∏_{i=1}^n P(X_i | H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

$$\log B = \mathrm{LogS}(H_1, X) - \mathrm{LogS}(H_2, X). \qquad (51)$$

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.
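In the no-parameter case, the identity (51) can be verified directly. The following Python sketch compares two fully specified Gaussian forecasting rules, a hypothetical example of ours, and computes the log Bayes factor as a difference of logarithmic scores.

```python
import numpy as np
from scipy.stats import norm

# Two fully specified forecasting rules: H1 = N(0, 1) and H2 = N(0.5, 1)
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 1_000)           # data generated under H1

log_s1 = norm.logpdf(x, 0.0, 1.0).sum()   # LogS(H1, X)
log_s2 = norm.logpdf(x, 0.5, 1.0).sum()   # LogS(H2, X)
print(log_s1 - log_s2)                    # log Bayes factor (51); positive favors H1
```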

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters, θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X | H_k) is the integrated likelihood,

$$P(X | H_k) = \int p(X | \theta_k, H_k)\, p(\theta_k | H_k)\, d\theta_k,$$

where p(X | θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k | H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

$$P(X | H_k) = \prod_{t=1}^{n} P(X_t | X^{t-1}, H_k), \qquad (52)$$

where X^{t−1} = {X₁, …, X_{t−1}} if t ≥ 1, X⁰ is the empty set, and P(X_t | X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

$$P(X_t | X^{t-1}, H_k) = \int p(X_t | \theta_k, H_k)\, P(\theta_k | X^{t-1}, H_k)\, d\theta_k,$$

with P(θ_k | X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let SkB = log P(X|Hk) denote the log-integrated likeli-hood viewed now as a scoring rule To view it as a scoring ruleit helps to rewrite it as

SkB =nsum

t=1

log P(Xt|Xtminus1Hk) (53)

Dawid (1984) showed that S_{kB} is asymptotically equivalent to the plug-in maximum likelihood prequential score

S_{kD} = \sum_{t=1}^n \log P(X_t \mid X^{t-1}, \hat\theta_k^{t-1}),    (54)

where \hat\theta_k^{t-1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t-1}, in the sense that S_{kD}/S_{kB} → 1 as n → ∞. Initial terms for which \hat\theta_k^{t-1} is possibly undefined can be ignored. Dawid also showed that S_{kB} is asymptotically equivalent to the Bayes information criterion (BIC) score,

S_{kBIC} = \sum_{t=1}^n \log P(X_t \mid X^{t-1}, \hat\theta_k^n) - \frac{d_k}{2} \log n,

where d_k = dim(θ_k), in the same sense, namely S_{kBIC}/S_{kB} → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.
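To make these quantities concrete, the following minimal Python sketch (ours, not part of the original development) computes S_{kB}, S_{kD}, and S_{kBIC} for a conjugate Gaussian model with known unit variance and prior θ ~ N(0, 1), for which the predictive distributions in (52) are available in closed form; the data-generating choices and sample size are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(1.0, 1.0, size=n)   # observations; data-generating choice is illustrative

# S_kB of (53): sum of log predictive densities under the conjugate prior theta ~ N(0, 1)
s_b = 0.0
for t in range(n):
    prec = 1.0 + t                       # posterior precision after t observations
    mu_post = x[:t].sum() / prec         # posterior mean
    pred_sd = np.sqrt(1.0 + 1.0 / prec)  # predictive sd: observation plus parameter uncertainty
    s_b += norm.logpdf(x[t], mu_post, pred_sd)

# S_kD of (54): plug-in MLE version; the t = 1 term, where the MLE is undefined, is ignored
s_d = sum(norm.logpdf(x[t], x[:t].mean(), 1.0) for t in range(1, n))

# S_kBIC with d_k = 1
s_bic = norm.logpdf(x, x.mean(), 1.0).sum() - 0.5 * np.log(n)

print(s_b, s_d, s_bic)   # the ratios approach 1 as n grows, per Dawid (1984)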

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S^*_{kB} = \sum_{t=1}^n E_D\big[\log p\big(X_t \mid X(D), H_k\big)\big],    (55)

where D is a random sample from {1, ..., t − 1, t + 1, ..., n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, ..., n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

S^*_{kD} = \sum_{t=1}^n E_D\big[\log p\big(X_t \mid X(D), \hat\theta_k^{(D)}, H_k\big)\big],    (56)

where \hat\theta_k^{(D)} is the MLE of θ_k based on X(D). Terms for which the size of D is small and \hat\theta_k^{(D)} is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating S^*_{kB} or S^*_{kD} by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out,


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
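As a rough sketch of how (56) might be evaluated by Monte Carlo, consider the following Python fragment for a Gaussian location model with known unit variance; the number of Monte Carlo draws, the rule for skipping undefined terms, and the data-generating choices are all illustrative assumptions of ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0.0, 1.0, size=n)   # unordered sample; Gaussian location model, unit variance

n_mc = 50                          # Monte Carlo draws of D per held-out observation
total = 0.0
for t in range(n):
    others = np.delete(np.arange(n), t)
    vals = []
    for _ in range(n_mc):
        m = rng.integers(0, n)     # |D| uniform on {0, 1, ..., n - 1}
        if m < 2:
            continue               # skip terms where the fold is too small to fit
        d = rng.choice(others, size=m, replace=False)
        # plug-in predictive density p(x_t | x(D), theta_hat(D)): N(mean of x(D), 1)
        vals.append(norm.logpdf(x[t], x[d].mean(), 1.0))
    total += np.mean(vals)
print(total / n)                   # Monte Carlo estimate of (56), up to the skipped terms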

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_{kB}, S_{kD}, S_{kBIC}, S^*_{kB}, and S^*_{kD}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r_0 = 1.55.

This underdispersive behavior (that is, observed errors that tend to be larger on average than suggested by the ensemble spread) is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say \varphi_{\mu, r\sigma}; its mean μ is the ensemble mean forecast, and its standard deviation rσ is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score

s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S\big(\varphi_{\mu_i, r\sigma_i}, x_i\big), \qquad r > 0,    (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio r_0 = 1.55 of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5] and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

QS(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2\, dy,

SphS(p, x) = \frac{p(x)}{\big(\int_{-\infty}^{\infty} p(y)^2\, dy\big)^{1/2}},

LogS(p, x) = \log p(x),

CRPS(p, x) = \frac{1}{2} E_p|X - X'| - E_p|X - x|,

LinS(p, x) = p(x),

and

PS(p, x) = \int_{x-1}^{x+1} p(y)\, dy.
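For readers who wish to reproduce the shape of such an experiment, the following Python sketch evaluates the six mean scores on synthetic data with an underdispersed Gaussian forecast-observation relationship. The closed-form ingredients for a Gaussian predictive density (the integral of p(y)^2 and the Gaussian CRPS formula) are standard, but the simulated data are our own illustrative assumption and do not reproduce the verification database.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
mu = rng.normal(1010.0, 8.0, size=n)    # synthetic "ensemble mean" forecasts
sigma = rng.uniform(1.0, 3.0, size=n)   # synthetic "ensemble spreads"
x = rng.normal(mu, 1.55 * sigma)        # observations, underdispersed by construction

def mean_scores(r):
    s = r * sigma                       # predictive standard deviation
    z = (x - mu) / s
    pdf = norm.pdf(x, mu, s)
    int_p2 = 1.0 / (2.0 * s * np.sqrt(np.pi))   # integral of p(y)^2 for a Gaussian density
    crps = s / np.sqrt(np.pi) - s * (z * (2*norm.cdf(z) - 1) + 2*norm.pdf(z))
    scores = dict(QS=2*pdf - int_p2,
                  SphS=pdf / np.sqrt(int_p2),
                  LogS=norm.logpdf(x, mu, s),
                  CRPS=crps,
                  LinS=pdf,
                  PS=norm.cdf(x + 1, mu, s) - norm.cdf(x - 1, mu, s))
    return {k: v.mean() for k, v in scores.items()}

grid = np.linspace(0.1, 5.0, 99)
curves = {k: [mean_scores(r)[k] for r in grid] for k in
          ("QS", "SphS", "LogS", "CRPS", "LinS", "PS")}
print({k: round(grid[int(np.argmax(v))], 2) for k, v in curves.items()})
# the proper scores peak near r = 1.55; LinS and PS collapse toward near-deterministic forecasts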

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      Argmax_r s(r) in eq. (57)    Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                         40s + 6
Spherical score (SphS)     1.84                         108s − 22
Logarithmic score (LogS)   2.41                         s + 13
CRPS                       1.62                         10s + 8
Linear score (LinS)        0.5                          105s − 5
Probability score (PS)     0.2                          60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

the ensemble. These scores are proper. The linear and probability scores were maximized at r = 0.5 and r = 0.2, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = .50 and α = .10. The associated lower and upper prediction bounds l_i and u_i are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean μ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_\alpha(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S^{\mathrm{int}}_\alpha(l_i, u_i; x_i), \qquad r > 0,    (58)

where S^{\mathrm{int}}_\alpha denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = .50 and α = .10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = .50 and α = .10, the score was optimized at r = 1.56 and r = 1.72, respectively.
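A minimal sketch of the computation in (58), assuming the standard form of the interval score (43), width plus 2/α times the amount by which the observation misses the interval, on the same synthetic underdispersed setup as before; all data-generating choices are illustrative.

import numpy as np
from scipy.stats import norm

def interval_score(l, u, x, alpha):
    # width plus (2/alpha) times the amount by which the observation misses the interval
    return (u - l) + (2.0/alpha)*np.maximum(l - x, 0.0) + (2.0/alpha)*np.maximum(x - u, 0.0)

rng = np.random.default_rng(3)
n = 5000
mu = rng.normal(1010.0, 8.0, size=n)
sigma = rng.uniform(1.0, 3.0, size=n)
x = rng.normal(mu, 1.55 * sigma)

def s_alpha(r, alpha):
    l = norm.ppf(alpha/2.0, mu, r*sigma)          # lower bound: alpha/2 quantile
    u = norm.ppf(1.0 - alpha/2.0, mu, r*sigma)    # upper bound: 1 - alpha/2 quantile
    return interval_score(l, u, x, alpha).mean()

grid = np.linspace(0.1, 5.0, 99)
for alpha in (0.50, 0.10):
    best = grid[int(np.argmin([s_alpha(r, alpha) for r in grid]))]
    print(alpha, round(best, 2))   # minimized, since the interval score is negatively oriented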

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X_1, ..., X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by


Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed line) and the 90% central prediction interval (α = .10, solid line; score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

the mean score

S_n(\theta) = \frac{1}{n} \sum_{i=1}^n S(P_\theta, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ_0 denotes the true parameter value, then asymptotic arguments indicate that

\arg\max_\theta S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty.    (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take \hat\theta_n = \arg\max_\theta S_n(\theta) as the respective optimum score estimator. The first four values of the argmax in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox, due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I; \theta) = c\,\lambda(I) - 1\{\theta \in I\},    (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators, based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_\alpha(I; \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta|,    (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes, as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space-Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.



2.3 Skill Scores

In practice, scores are aggregated and competing forecast procedures are ranked by the average score

\bar S_n = \frac{1}{n} \sum_{i=1}^n S(P_i, x_i),

over a fixed set of forecast situations. We give examples of this in case studies in Sections 6 and 8. Recommendations for choosing a scoring rule have been given by Winkler (1994, 1996), by Buja et al. (2005), and throughout this article.

Scores for competing forecast procedures are directly comparable if they refer to exactly the same set of forecast situations. If scores for distinct sets of situations are compared, then considerable care must be exercised to separate the confounding effects of intrinsic predictability and predictive performance. For instance, there is substantial spatial and temporal variability in the predictability of weather and climate elements (Langland et al. 1999; Campbell and Diebold 2005). Thus, a score that is superior for a given location or season might be inferior for another, or vice versa. To address this issue, atmospheric scientists have put forth skill scores of the form

S^{\mathrm{skill}}_n = \frac{S^{\mathrm{fcst}}_n - S^{\mathrm{ref}}_n}{S^{\mathrm{opt}}_n - S^{\mathrm{ref}}_n},    (8)

where S^{\mathrm{fcst}}_n is the forecaster's score, S^{\mathrm{opt}}_n refers to a hypothetical ideal or optimal forecast, and S^{\mathrm{ref}}_n is the score for a reference strategy (Murphy 1973; Potts 2003, p. 27; Briggs and Ruppert 2005; Wilks 2006, p. 259). Skill scores are standardized in that (8) takes the value 1 for an optimal forecast, which is typically understood as a point measure in the event or value that materializes, and the value 0 for the reference forecast. Negative values of a skill score indicate forecasts that are of lesser quality than the reference. The reference forecast is typically a climatological forecast, that is, an estimate of the marginal distribution of the predictand. For example, a climatological probabilistic forecast for maximum temperature on Independence Day in Seattle, Washington, might be a smoothed version of the local historic record of July 4 maximum temperatures. Climatological forecasts are independent of the forecast horizon; they are calibrated by construction, but often lack sharpness.

Unfortunately, skill scores of the form (8) are generally improper, even if the underlying scoring rule S is proper. Murphy (1973) studied hedging strategies in the case of the Brier skill score for probability forecasts of a dichotomous event. He showed that the Brier skill score is asymptotically proper, in the sense that the benefits of hedging become negligible as the number of independent forecasts grows. Similar arguments may apply to skill scores based on other proper scoring rules. Mason's (2004) claim of the propriety of the Brier skill score rests on unjustified approximations and generally is incorrect.

3. SCORING RULES FOR CATEGORICAL VARIABLES

We now review the representations of Savage (1971) and Schervish (1989) that characterize scoring rules for probabilistic forecasts of categorical and binary variables, and give examples of proper scoring rules.

3.1 Savage Representation

We consider probabilistic forecasts of a categorical variable. Thus, the sample space Ω = {1, ..., m} consists of a finite number m of mutually exclusive events, and a probabilistic forecast is a probability vector (p_1, ..., p_m). Using the notation of Section 2, we consider the convex class P = P_m, where

P_m = \{p = (p_1, \dots, p_m) : p_1, \dots, p_m \ge 0,\ p_1 + \cdots + p_m = 1\}.

A scoring rule S can then be identified with a collection of m functions,

S(\cdot, i) : P_m \to \overline{R}, \qquad i = 1, \dots, m.

In other words, if the forecaster quotes the probability vector p and the event i materializes, then his or her reward is S(p, i). Theorem 2 is a special case of Theorem 1 and provides a rigorous version of the Savage (1971) representation of proper scoring rules on finite sample spaces. Our contributions lie in the notion of regularity, the rigorous treatment, and the introduction of appropriate tools for convex analysis (Rockafellar 1970, sects. 23–25). Specifically, let G : P_m → R be a convex function. A vector G'(p) = (G'_1(p), ..., G'_m(p)) is a subgradient of G at the point p ∈ P_m if

G(q) \ge G(p) + \langle G'(p), q - p \rangle    (9)

for all q ∈ P_m, where ⟨·, ·⟩ denotes the standard scalar product. If G is differentiable at an interior point p ∈ P_m, then G'(p) is unique and equals the gradient of G at p. We assume that the components of G'(p) are real-valued, except that we permit G'_i(p) = −∞ if p_i = 0.

Definition 2. A scoring rule S for categorical forecasts is regular if S(·, i) is real-valued for i = 1, ..., m, except possibly that S(p, i) = −∞ if p_i = 0.

Regular scoring rules assign finite scores, except that a forecast might receive a score of −∞ if an event claimed to be impossible is realized. The logarithmic scoring rule (Good 1952) provides a prominent example of this.

Theorem 2 (McCarthy, Savage). A regular scoring rule S for categorical forecasts is proper if and only if

S(p, i) = G(p) - \langle G'(p), p \rangle + G'_i(p) \qquad \text{for } i = 1, \dots, m,    (10)

where G : P_m → R is a convex function and G'(p) is a subgradient of G at the point p, for all p ∈ P_m. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex.

Phrased slightly differently, a regular scoring rule S is proper if and only if the expected score function G(p) = S(p, p) is convex on P_m and the vector with components S(p, i), for i = 1, ..., m, is a subgradient of G at the point p, for all p ∈ P_m. In view of these results, every bounded convex function G on P_m generates a regular proper scoring rule. This function G becomes the expected score function, information measure, or entropy function (6) associated with the score. The divergence function (7) is the respective Bregman distance.
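The construction in Theorem 2 is easy to operationalize: given a convex G and its (sub)gradient, (10) yields a proper scoring rule. The brief Python sketch below, with an illustrative choice of G, recovers the Brier score of Example 1 and checks propriety numerically at randomly drawn probability vectors.

import numpy as np

def score_from_entropy(G, gradG):
    # Savage representation (10): S(p, i) = G(p) - <G'(p), p> + G'_i(p)
    def S(p, i):
        g = gradG(p)
        return G(p) - g @ p + g[i]
    return S

# G(p) = sum_j p_j^2 - 1 recovers the Brier score of Example 1
G = lambda p: float(np.sum(p**2) - 1.0)
gradG = lambda p: 2.0 * p
S = score_from_entropy(G, gradG)

rng = np.random.default_rng(6)
q = rng.dirichlet(np.ones(4))   # distribution of the observation
p = rng.dirichlet(np.ones(4))   # a competing forecast
expected = lambda f: sum(q[i] * S(f, i) for i in range(4))
assert expected(q) >= expected(p)   # propriety: quoting q maximizes the expected score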

We now give a number of examples. The scoring rules in Examples 1–3 are strictly proper. The score in Example 4 is proper, but not strictly proper.


Example 1 (Quadratic or Brier score). If G(p) = \sum_{j=1}^m p_j^2 - 1, then (10) yields the quadratic score or Brier score,

S(p, i) = -\sum_{j=1}^m (\delta_{ij} - p_j)^2 = 2p_i - \sum_{j=1}^m p_j^2 - 1,

where δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise. The associated Bregman divergence is the squared Euclidean distance, d(p, q) = \sum_{j=1}^m (p_j - q_j)^2. This well-known scoring rule was proposed by Brier (1950); Selten (1998) gave an axiomatic characterization.

Example 2 (Spherical score). Let α > 1, and consider the generalized entropy function G(p) = (\sum_{j=1}^m p_j^\alpha)^{1/\alpha}. This corresponds to the pseudospherical score,

S(p, i) = \frac{p_i^{\alpha-1}}{(\sum_{j=1}^m p_j^\alpha)^{(\alpha-1)/\alpha}},

which reduces to the traditional spherical score when α = 2. The associated Bregman divergence is

d(p, q) = \Big(\sum_{j=1}^m q_j^\alpha\Big)^{1/\alpha} - \frac{\sum_{j=1}^m p_j q_j^{\alpha-1}}{(\sum_{j=1}^m q_j^\alpha)^{(\alpha-1)/\alpha}}.

Example 3 (Logarithmic score). Negative Shannon entropy, G(p) = \sum_{j=1}^m p_j \log p_j, corresponds to the logarithmic score, S(p, i) = log p_i. The associated Bregman distance is the Kullback–Leibler divergence, d(p, q) = \sum_{j=1}^m q_j \log(q_j/p_j). [Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule dates back at least to Good (1952). Information-theoretic perspectives and interpretations in terms of gambling returns have been given by Roulston and Smith (2002) and Daley and Vere-Jones (2004). Despite its popularity, the logarithmic score has been criticized for its unboundedness, with Selten (1998, p. 51) arguing that it entails value judgments that are unacceptable. Feuerverger and Rahman (1992) noted a connection to Neyman–Pearson theory and an ensuing optimality property of the logarithmic score.

Example 4 (Zero–one score). The zero–one scoring rule rewards a probabilistic forecast if the mode of the predictive distribution materializes. In case of multiple modes, the reward is reduced proportionally; that is,

S(p, i) = \begin{cases} 1/|M(p)| & \text{if } i \in M(p), \\ 0 & \text{otherwise,} \end{cases}

where M(p) = {i : p_i = max_{j=1,...,m} p_j} denotes the set of modes of p. This is also known as the misclassification loss, and the meteorological literature uses the term success rate to denote case-averaged zero–one scores (see, e.g., Toth, Zhu, and Marchok 2001). The associated expected score or generalized entropy function (6) is G(p) = max_{j=1,...,m} p_j, and the divergence function (7) becomes

d(p, q) = \max_{j=1,\dots,m} q_j - \frac{\sum_{j \in M(p)} q_j}{|M(p)|}.

This does not define a Bregman divergence, because the entropy function is neither differentiable nor strictly convex.

The scoring rules in the foregoing examples are symmetric, in the sense that

S((p_1, \dots, p_m), i) = S((p_{\pi_1}, \dots, p_{\pi_m}), \pi_i)    (11)

for all p ∈ P_m, for all permutations π on m elements, and for all events i = 1, ..., m. Winkler (1994, 1996) argued that symmetric rules do not always appropriately reward forecasting skill, and called for asymmetric ones, particularly in situations in which skill scores traditionally have been used. Asymmetric proper scoring rules can be generated by applying Theorem 2 to convex functions G that are not invariant under coordinate permutation.

3.2 Schervish Representation

The classical case of a probability forecast for a dichotomous event suggests further discussion. We follow Dawid (1986) in considering the sample space Ω = {1, 0}. A probabilistic forecast is a quoted probability p ∈ [0, 1] for the event to occur. A scoring rule S can be identified with a pair of functions, S(·, 1) : [0, 1] → R and S(·, 0) : [0, 1] → R. Thus, S(p, 1) is the forecaster's reward if he or she quotes p and the event materializes, and S(p, 0) is the reward if he or she quotes p and the event does not materialize. Note the subtle change from the previous section, where we used the convex class P_2 = {(p_1, p_2) ∈ R^2 : p_1 ∈ [0, 1], p_2 = 1 − p_1}, in place of the unit interval P = [0, 1], to represent probability measures on binary sample spaces.

A scoring rule for binary variables is regular if S(·, 1) and S(·, 0) are real-valued, except possibly that S(0, 1) = −∞ or S(1, 0) = −∞. A variant of Theorem 2 shows that every regular proper scoring rule is of the form

S(p, 1) = G(p) + (1 - p)\,G'(p),
S(p, 0) = G(p) - p\,G'(p),    (12)

where G : [0, 1] → R is a convex function and G'(p) is a subgradient of G at the point p ∈ [0, 1], in the sense that

G(q) \ge G(p) + G'(p)(q - p)

for all q ∈ [0, 1]. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex. The subgradient G'(p) is real-valued, except that we permit G'(0) = −∞ and G'(1) = ∞. The function G is the expected score function, G(p) = p S(p, 1) + (1 − p) S(p, 0), and if G is differentiable at an interior point p ∈ (0, 1), then G'(p) is unique and equals the derivative of G at p. Related but slightly less general results were given by Shuford, Albert, and Massengil (1966). Figure 1 provides a geometric interpretation.

The Savage representation (12) implies various interesting properties of regular proper scoring rules. For instance, we conclude from theorem 24.2 of Rockafellar (1970) that

S(p, 1) = \lim_{q \to 1} G(q) - \int_p^1 (G'(q) - G'(p))\, dq    (13)

for p ∈ (0, 1), and because G'(p) is increasing, S(p, 1) is increasing as well. Similarly, S(p, 0) is decreasing, as would be intuitively expected. The statements hold with proper, increasing, and decreasing replaced by strictly proper, strictly increasing, and strictly decreasing. Alternative proofs of these and other results have been given by Schervish (1989, the appendix).


Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast p ∈ [0, 1], the expected score S(p, q) = q S(p, 1) + (1 − q) S(p, 0) equals the ordinate of the tangent to G at p [the solid line with slope G'(p)] when evaluated at q ∈ [0, 1]. In particular, the scores S(p, 0) = G(p) − p G'(p) and S(p, 1) = G(p) + (1 − p) G'(p) can be read off the tangent when evaluated at q = 0 and q = 1. The Bregman divergence d(p, q) = S(q, q) − S(p, q) equals the difference between G and its tangent at p, when evaluated at q. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter we let 1{·} denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that S is a regular scoring rule. Then S is proper, and such that S(0, 1) = lim_{p→0} S(p, 1) and S(0, 0) = lim_{p→0} S(p, 0), and both S(p, 1) and S(p, 0) are left continuous, if and only if there exists a nonnegative measure ν on (0, 1) such that

S(p, 1) = S(1, 1) - \int (1 - c)\, 1\{p \le c\}\, \nu(dc),
S(p, 0) = S(0, 0) - \int c\, 1\{p > c\}\, \nu(dc)    (14)

for all p ∈ [0, 1]. The scoring rule is strictly proper if and only if ν assigns positive measure to every open interval.

Sketch of Proof. Suppose that S satisfies the assumptions of the theorem. To prove that S(p, 1) is of the form (14), consider the representation (13), identify the increasing function G'(p) with the left-continuous distribution function of a nonnegative measure ν on (0, 1), and apply the partial integration formula. The proof of the representation for S(p, 0) is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost–loss ratio c ∈ (0, 1) that reflects the relative costs of the two possible types of inferior decision. The measure ν(dc) in Schervish's representation (14) assigns relevance to distinct cost–loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted, asymmetric zero–one scores,

S_c(p, 1) = (1 - c)\, 1\{p > c\}, \qquad S_c(p, 0) = c\, 1\{p \le c\},    (15)

with a nonnegative mixing measure ν(dc). Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure ν(dc) for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero–one score. If the expected score function G is smooth, then ν(dc) has Lebesgue density G''(c) (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, G(p) = p log p + (1 − p) log(1 − p), and corresponds to the infinite measure with Lebesgue density (c(1 − c))^{-1}.

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Example 5 (Beta family). Let α, β > −1, and consider the two-parameter family

S(p, 1) = -\int_p^1 c^{\alpha-1} (1 - c)^{\beta}\, dc,


Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event, and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule   S(p, 1)                      S(p, 0)                            ν(dc)
Brier          −(1 − p)^2                   −p^2                               Uniform
Spherical      p(1 − 2p + 2p^2)^{−1/2}      (1 − p)(1 − 2p + 2p^2)^{−1/2}      (1 − 2c + 2c^2)^{−3/2}
Logarithmic    log p                        log(1 − p)                         (c(1 − c))^{−1}
Zero–one       (1 − c) 1{p > c}             c 1{p ≤ c}                         Point measure in c

S(p, 0) = -\int_0^p c^{\alpha} (1 - c)^{\beta-1}\, dc,

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^{α−1}(1 − c)^{β−1}. This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p, 1) = p − 1 and S(p, 0) = p + log(1 − p) being one such example (α = 1, β = 0).
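A short Python sketch, with quadrature standing in for the closed-form integrals, evaluates members of the beta family directly from the mixing density and confirms that α = β = 0 recovers the logarithmic score, while α = β = 1 yields half the Brier score; the evaluation point p = 0.7 is arbitrary.

import numpy as np
from scipy.integrate import quad

def beta_family(p, alpha, beta):
    # scores obtained from the Schervish mixing density c^(alpha-1) (1-c)^(beta-1)
    s1 = -quad(lambda c: c**(alpha - 1.0) * (1.0 - c)**beta, p, 1.0)[0]
    s0 = -quad(lambda c: c**alpha * (1.0 - c)**(beta - 1.0), 0.0, p)[0]
    return s1, s0

print(beta_family(0.7, 0.0, 0.0))   # (log .7, log .3): the logarithmic score
print(beta_family(0.7, 1.0, 1.0))   # (-(1-p)^2/2, -p^2/2): half the Brier score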

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0, 1), then

S*(p, 1) = (S(p, 1) − S(c, 1))/T(c, p),
S*(p, 0) = (S(p, 0) − S(c, 0))/T(c, p),      (16)

where T(c, p) = S(0, 0) − S(c, 0) if p ≤ c and T(c, p) = S(1, 1) − S(c, 1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

S*(p, 1) = ((1 − c)² − (1 − p)²)/(c² 1{p ≤ c} + (1 − c)² 1{p > c}),
S*(p, 0) = (c² − p²)/(c² 1{p ≤ c} + (1 − c)² 1{p > c}),      (17)

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let μ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to μ and have μ-density p such that

‖p‖_α = (∫ p(ω)^α μ(dω))^{1/α}

is finite. We identify a probabilistic forecast P ∈ L_α with its μ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of μ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω))/μ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score,

QS(p, ω) = 2p(ω) − ‖p‖₂²,      (18)

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function d(p, q) = ‖p − q‖₂² is symmetric. Good (1971) proposed the pseudospherical score,

PseudoS(p, ω) = p(ω)^{α−1}/‖p‖_α^{α−1},

that reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that in a measure-theoretic framework is obsolete. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function G(p) = ‖p‖_α and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score,

LogS(p, ω) = log p(ω),      (19)

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then, under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by μ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value


Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and the Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score, LinS(p, ω) = p(ω), is not a proper scoring rule, despite its intuitive appeal. For instance, let φ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < (log 2)^{1/2}, then

LinS(u, φ) = (1/(2π)^{1/2}) (1/(2ε)) ∫_{−ε}^{ε} e^{−x²/2} dx > 1/(2π^{1/2}) = LinS(φ, φ),

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near, but not identical to, the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and we use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

CRPS(F, x) = −∫_{−∞}^{∞} (F(y) − 1{y ≥ x})² dy      (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (1.7) of Székely and Rizzo (2005),

CRPS(F, x) = (1/2) E_F|X − X′| − E_F|X − x|,      (21)

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean μ and variance σ², then it follows that

CRPS(N(μ, σ²), x) = σ [1/√π − 2φ((x − μ)/σ) − ((x − μ)/σ)(2Φ((x − μ)/σ) − 1)],

where φ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
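These closed-form expressions are straightforward to code. The following sketch, a minimal Python illustration of ours rather than part of the original article, evaluates (21) both for a Gaussian predictive distribution and for a predictive distribution given by a sample, using the order-statistics identity for the mean absolute difference.

```python
# Two ways to evaluate the CRPS in its positive orientation (21): the closed
# form for a Gaussian forecast, and the O(n log n) order-statistics evaluation
# when F is an empirical (ensemble) distribution function.
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def crps_sample(sample, x):
    # E|X - X'| over the empirical distribution, via sorted values:
    # sum over i<j of (x_(j) - x_(i)) equals sum_k (2k - n - 1) x_(k)
    s = np.sort(np.asarray(sample, dtype=float))
    n = s.size
    k = np.arange(1, n + 1)
    mean_abs_diff = 2.0 * np.sum((2 * k - n - 1) * s) / n**2
    return 0.5 * mean_abs_diff - np.mean(np.abs(s - x))

rng = np.random.default_rng(1)
draws = rng.normal(0.0, 1.0, 100000)
print(crps_gaussian(0.0, 1.0, 0.5), crps_sample(draws, 0.5))  # nearly equal
```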

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = −∫_{−∞}^{∞} F(y)(1 − F(y)) dy = −(1/2) E_F|X − X′|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F, G) = ∫_{−∞}^{∞} (F(y) − G(y))² dy,

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

CRPS*(F, x) = E_F|X − x| − (1/2) E_F|X − X′|,

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score,

ES(P, x) = (1/2) E_P‖X − X′‖^β − E_P‖X − x‖^β,      (22)

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

ES(P, x) = −‖μ_P − x‖²,      (23)

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m, by defining

ES(P, x) = −(β 2^{β−2} Γ((m + β)/2)/(π^{m/2} Γ(1 − β/2))) ∫_{R^m} (|φ_P(y) − e^{i⟨x,y⟩}|²/‖y‖^{m+β}) dy,      (24)

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P, x) = −log det Σ_P − (x − μ_P)′ Σ_P^{−1} (x − μ_P),      (25)

which is linked to the generalized entropy function

G(P) = −log det Σ_P − m

and to the divergence function

d(P, Q) = tr(Σ_P^{−1} Σ_Q) − log det(Σ_P^{−1} Σ_Q) + (μ_P − μ_Q)′ Σ_P^{−1} (μ_P − μ_Q) − m.


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x₁, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = Σ_{i=1}^n (x_i − μ_i)² + Σ_{i=1}^n σ_i²,

where μ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − μ_P)² − σ_P²,      (26)

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P, that is, a deterministic forecast, rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − μ_P)/σ_P)² − log σ_P²      (27)

in the case in which m = 1 and the observations are real-valued.
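The impropriety of (26) can be seen in a two-line computation. The following sketch (ours, with a standard Gaussian taken as the true distribution for concreteness) evaluates the expected scores in closed form when the quoted forecast is N(0, s²): the expected value of (26) is maximized as s → 0, whereas (27) is maximized at the true value s = 1.

```python
# Expected scores under true P = N(0,1) for a quoted forecast Q = N(0, s^2):
# score (26) rewards collapsing to a point forecast; score (27) does not.
import numpy as np

s = np.linspace(0.05, 3.0, 500)          # quoted predictive standard deviations
score26 = -1.0 - s**2                    # E_P[-(X - mu_Q)^2 - sigma_Q^2]
score27 = -1.0 / s**2 - np.log(s**2)     # E_P[-((X - mu_Q)/sigma_Q)^2 - log sigma_Q^2]

print("score (26) maximized at s =", s[np.argmax(score26)])  # boundary, s -> 0
print("score (27) maximized at s =", s[np.argmax(score27)])  # close to 1.0
```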

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a₁, …, a_n ∈ R that sum to 0, and all x₁, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P g(X, X′) − E_P g(X, x)      (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q. We need to show that

−(1/2) E_Q g(Y, Y′) ≥ (1/2) E_P g(X, X′) − E_{P,Q} g(X, Y).      (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.

Example 7 (Quadratic or Brier score). Let Ω = {0, 1}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = R and g(x, x′) = |x − x′| for x, x′ ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0, 2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ R^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).
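When P is discrete, as with an ensemble forecast, the kernel score (28) is a finite sum. The following sketch (ours; the ensemble member values are purely illustrative) evaluates the energy score (22) for a small m-variate ensemble with equally weighted members.

```python
# Energy score (22) for a discrete predictive distribution given by an
# equally weighted ensemble; a direct finite-sum evaluation of (28).
import numpy as np

def energy_score(members, x, beta=1.0):
    # members: (n, m) array of ensemble members; x: (m,) verifying observation
    members = np.atleast_2d(members)
    diffs = members[:, None, :] - members[None, :, :]
    e_xx = np.mean(np.linalg.norm(diffs, axis=-1) ** beta)   # E_P ||X - X'||^beta
    e_xy = np.mean(np.linalg.norm(members - x, axis=-1) ** beta)  # E_P ||X - x||^beta
    return 0.5 * e_xx - e_xy

ens = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])  # three members, m = 2
print(energy_score(ens, np.array([0.2, 0.1])))
```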

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus,

S(P, θ) = (1/2) E_P α(Θ, Θ′) − E_P α(Θ, θ)      (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.

We turn to a far-reaching generalization of the energy score. For x = (x₁, …, x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (Σ_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β,   x, x′ ∈ R^m,

is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2) E_P‖X − X′‖_α^β − E_P‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q}‖X − Y‖_α^β is finite if and only if E_P‖X‖_α^β and E_Q‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P‖X − X′‖_α^β is finite if and only if E_P‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^{(k)} of all orders and (−1)^k η^{(k)}(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P ψ(‖X − X′‖₂²) − E_P ψ(‖X − x‖₂²)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0.      (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′).      (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0      (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β,      (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′),   x, x′ ∈ Ω,      (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a₁, …, a_n ∈ R, and all x₁, …, x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β,   x, x′ ∈ R^m,      (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) equals the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and Σ_{i=1}^n Σ_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c₁, …, c_n ∈ C, and all x₁, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′)      (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′, y⟩} − 1, where x, x′ ∈ R^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x,y⟩}|²,      (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{R^m} |φ_P(y) − e^{i⟨x,y⟩}|² μ(dy).      (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function


to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α₁, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r₁, …, r_k and x materializes, then he or she will be rewarded by the score S(r₁, …, r_k; x). We define

S(r₁, …, r_k; P) = ∫ S(r₁, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r₁, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q₁, …, q_k denote the true P-quantiles at levels α₁, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q₁, …, q_k; P) ≥ S(r₁, …, r_k; P)

for all real numbers r₁, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r, x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x)      (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q, P) − S(r, P) = ∫_{(r,q]} s(x) dP(x) + s(r)P(r) − α s(r)
                  ≥ s(r)(P(q) − P(r)) + s(r)P(r) − α s(r)
                  = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r, x) = (x − r)(1{x ≤ r} − α),      (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
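A minimal numerical illustration of propriety, using hypothetical Gaussian data of ours: the mean score (41) over a large sample is maximized when the quoted value r matches the true α-quantile.

```python
# Under a standard Gaussian, the expected quantile score (41) is maximized
# at the true alpha-quantile; here checked by Monte Carlo over a grid of r.
import numpy as np
from scipy.stats import norm

alpha = 0.9
rng = np.random.default_rng(0)
x = rng.normal(size=200000)

def mean_score(r):
    return np.mean((x - r) * ((x <= r) - alpha))

grid = np.linspace(0.5, 2.0, 151)
best = grid[np.argmax([mean_score(r) for r in grid])]
print(best, norm.ppf(alpha))  # both close to 1.2816
```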

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

S(r₁, …, r_k; x) = Σ_{i=1}^k [α_i s_i(r_i) + (s_i(x) − s_i(r_i)) 1{x ≤ r_i}] + h(x)      (42)

is proper for predicting the quantiles at levels α₁, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α₁ = α/2, α₂ = 1 − α/2, s₁(x) = s₂(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score,

S_α^int(l, u; x) = (u − l) + (2/α)(l − x) 1{x < l} + (2/α)(x − u) 1{x > u}.      (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
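The interval score is a one-line function; the following sketch is a direct transcription of (43), with an illustrative call for a hypothetical 90% interval.

```python
# Negatively oriented interval score (43) for a central (1 - alpha) x 100%
# prediction interval [l, u]; lower scores are better.
def interval_score(l, u, x, alpha):
    score = u - l
    if x < l:
        score += (2.0 / alpha) * (l - x)
    elif x > u:
        score += (2.0 / alpha) * (x - u)
    return score

# A 90% interval (alpha = 0.1) missed by the observation:
print(interval_score(-1.64, 1.64, 2.5, 0.1))  # 3.28 + 20 * 0.86 = 20.48
```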


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process (X_t: t ∈ Z) defined by

X_{t+1} = (1/2)X_t + (1/2)X_t ε_t + ε_t,      (44)

where the ε_t's are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2)X_t and variance (1 + (1/2)X_t)², thereby suggesting the prediction interval

I = [(1/2)X_t − c|1 + (1/2)X_t|, (1/2)X_t + c|1 + (1/2)X_t|],      (45)

where c = Φ^{−1}(.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = [F^{−1}(.025), F^{−1}(.975)],      (46)

where F denotes the unconditional stationary distribution function of X_t, and

K = [(1/2)X_t − γ(|1 + (1/2)X_t|), (1/2)X_t + γ(|1 + (1/2)X_t|)],      (47)

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path (X_t: t = 1, …, 100,001) from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast    Empirical coverage    Average width    Average interval score
I (45)               95.01%                4.00             4.77
J (46)               95.08%                5.45             8.04
K (47)               94.98%                3.79             5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
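The experiment is easy to emulate. The following sketch, which is an illustration of ours rather than a reproduction of the original computation, simulates the bilinear process (44) and compares the average interval scores of the forecasts I and K; the forecast J is omitted because it requires the stationary distribution function.

```python
# Simulate the bilinear process (44) and compare one-step-ahead 95% interval
# forecasts I (45) and K (47) using the interval score (43).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 100000
c = norm.ppf(0.975)
x = np.empty(n + 1)
x[0] = 0.0
eps = rng.normal(size=n)
for t in range(n):
    x[t + 1] = 0.5 * x[t] + 0.5 * x[t] * eps[t] + eps[t]

def int_score(l, u, obs, alpha=0.05):
    return (u - l) + (2 / alpha) * np.maximum(l - obs, 0) \
                   + (2 / alpha) * np.maximum(obs - u, 0)

mu = 0.5 * x[:-1]
sig = np.abs(1 + 0.5 * x[:-1])           # conditional standard deviation
obs = x[1:]
print("I:", int_score(mu - c * sig, mu + c * sig, obs).mean())

ratio = np.clip(7.36 / np.maximum(sig, 1e-12), 1.0, None)
gamma = np.where(sig <= 7.36, np.sqrt(2 * np.log(ratio)) * sig, 0.0)
print("K:", int_score(mu - gamma, mu + gamma, obs).mean())  # I scores better
```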

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = ∫_0^1 S_α(F^{−1}(α), x) ν(dα)      (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on R, then the scoring rule

S(F, x) = ∫_{−∞}^{∞} S(F(y), 1{x ≤ y}) ν(dy)      (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.
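As a sanity check on this threshold decomposition, the following sketch (ours; the observation value is arbitrary) numerically integrates the Brier scores over all real-valued thresholds for a standard Gaussian predictive distribution and recovers the closed-form CRPS.

```python
# The CRPS (20) as an integral of Brier scores over thresholds, checked
# against the Gaussian closed form for observation x = 0.5.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = 0.5
integral, _ = quad(lambda y: (norm.cdf(y) - (y >= x)) ** 2, -10, 10, points=[x])
closed_form = 1 / np.sqrt(np.pi) - 2 * norm.pdf(x) - x * (2 * norm.cdf(x) - 1)
print(-integral, closed_form)  # both about -0.3314
```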

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on R^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

CRPS(F, x) = −∫_{R^m} (F(y) − 1{x ≤ y})² ν(dy).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X₁, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models H₁ and H₂. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i's are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H₁ against H₂ is

B = P(X|H₁)/P(X|H₂),      (50)

where P(X|H_k) = ∏_{i=1}^n P(X_i|H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

log B = LogS(H₁, X) − LogS(H₂, X).      (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X|H_k) is the integrated likelihood,

P(X|H_k) = ∫ p(X|θ_k, H_k) p(θ_k|H_k) dθ_k,

where p(X|θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k|H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X|H_k) = ∏_{t=1}^n P(X_t|X^{t−1}, H_k),      (52)

where X^{t−1} = (X₁, …, X_{t−1}) if t ≥ 2, X⁰ is the empty set, and P(X_t|X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t|X^{t−1}, H_k) = ∫ p(X_t|θ_k, H_k) P(θ_k|X^{t−1}, H_k) dθ_k,

with P(θ_k|X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let S_{k,B} = log P(X|H_k) denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

S_{k,B} = Σ_{t=1}^n log P(X_t|X^{t−1}, H_k).      (53)

Dawid (1984) showed that S_{k,B} is asymptotically equivalent to the plug-in maximum likelihood prequential score,

S_{k,D} = Σ_{t=1}^n log P(X_t|X^{t−1}, θ̂_k^{t−1}),      (54)

where θ̂_k^{t−1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t−1}, in the sense that S_{k,D}/S_{k,B} → 1 as n → ∞. Initial terms for which θ̂_k^{t−1} is possibly undefined can be ignored. Dawid also showed that S_{k,B} is asymptotically equivalent to the Bayes information criterion (BIC) score,

S_{k,BIC} = Σ_{t=1}^n log P(X_t|X^{t−1}, θ̂_k^n) − (d_k/2) log n,

where d_k = dim(θ_k), in the same sense, namely S_{k,BIC}/S_{k,B} → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.
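The following sketch (ours) illustrates these asymptotic equivalences in the simplest possible setting, an iid N(θ, 1) model with θ estimated by the running sample mean; the model and data are hypothetical.

```python
# Plug-in prequential score (54) versus the BIC score for an iid N(theta, 1)
# model; for large n both approximate the log-integrated likelihood.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0.7, 1.0, 5000)
n = len(x)

# Prequential score (54); skip t = 1, where the MLE is undefined.
running_mean = np.cumsum(x)[:-1] / np.arange(1, n)
s_d = np.sum(norm.logpdf(x[1:], loc=running_mean, scale=1.0))

# BIC score with d = 1 parameter.
s_bic = np.sum(norm.logpdf(x, loc=x.mean(), scale=1.0)) - 0.5 * np.log(n)

print(s_d, s_bic)  # close for large n
```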

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S*_{k,B} = Σ_{t=1}^n E_D[log p(X_t|X(D), H_k)],      (55)

where D is a random sample from {1, …, t − 1, t + 1, …, n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, …, n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version,

where D is a random sample from 1 t minus 1 t + 1 nthe size of which is a random variable with a discrete uniformdistribution on 01 n minus 1 Dawidrsquos results imply that thisis asymptotically equivalent to the plug-in maximum likelihoodversion

SlowastkD =

nsum

t=1

ED[log p

(Xt|X(D) θ

(D)k Hk

)] (56)

where θ(D)k is the MLE of θk based on X(D) Terms for which

the size of D is small and θ(D)k is possibly undefined can be

ignoredThe formulations (55) and (56) may be useful because they

turn a score that was a sum of nonidentically distributed termsinto one that is a sum of identically distributed exchangeableterms This opens the possibility of evaluating Slowast

kB or SlowastkD

by Monte Carlo which would be a form of cross-validationIn this cross-validation the amount of data left out would berandom rather than fixed leading us to call it random-foldcross-validation Smyth (2000) used the log-likelihood as thecriterion function in cross-validation as here calling the result-ing method cross-validated likelihood but used a fixed hold-out sample size This general approach can be traced back atleast to Geisser and Eddy (1979) One issue in cross-validationgenerally is how much data to leave out different choices leadto different versions of cross-validation such as leave-one-out


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
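For concreteness, the following sketch (ours) evaluates the random-fold score (56) by Monte Carlo for an iid N(θ, 1) model; the number of Monte Carlo replicates per term and the data are illustrative choices.

```python
# Monte Carlo evaluation of the random-fold cross-validation score (56):
# for each t, draw a training set D of uniformly distributed random size
# from the remaining indices, plug in the MLE, and score X_t.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.normal(0.7, 1.0, 200)
n = len(x)

def random_fold_score(x, n_rep=100):
    total = 0.0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        terms = []
        for _ in range(n_rep):
            size = rng.integers(0, n)      # uniform on {0, 1, ..., n - 1}
            if size == 0:
                continue                    # MLE undefined; ignore term
            d = rng.choice(others, size=size, replace=False)
            terms.append(norm.logpdf(x[t], loc=x[d].mean(), scale=1.0))
        total += np.mean(terms)
    return total

print(random_fold_score(x))
```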

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_{k,B}, S_{k,D}, S_{k,BIC}, S*_{k,B}, and S*_{k,D}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r₀ = 1.55.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say φ_{μ,rσ}; its mean μ is the ensemble mean forecast, and its standard deviation rσ is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score,

s(r) = (1/16,015) Σ_{i=1}^{16,015} S(φ_{μ_i, rσ_i}, x_i),   r > 0,      (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio r₀ = 1.55 of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5) and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

QS(p, x) = 2p(x) − ∫_{−∞}^{∞} p(y)² dy,

SphS(p, x) = p(x)/(∫_{−∞}^{∞} p(y)² dy)^{1/2},

LogS(p, x) = log p(x),

CRPS(p, x) = (1/2) E_p|X − X′| − E_p|X − x|,

LinS(p, x) = p(x),

and

PS(p, x) = ∫_{x−1}^{x+1} p(y) dy.
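For a Gaussian predictive density with standard deviation s, all six scores are available in closed form, because ∫ p(y)² dy = 1/(2s√π). The following sketch (ours; the pressure values in the example call are hypothetical) collects them.

```python
# The six density scores in closed form for a Gaussian predictive density
# with mean mu and standard deviation s.
import numpy as np
from scipy.stats import norm

def density_scores(mu, s, x):
    z = (x - mu) / s
    p_x = norm.pdf(x, mu, s)
    p_sq = 1.0 / (2.0 * s * np.sqrt(np.pi))     # int p(y)^2 dy
    return {
        "QS": 2.0 * p_x - p_sq,
        "SphS": p_x / np.sqrt(p_sq),
        "LogS": np.log(p_x),
        "CRPS": s * (1 / np.sqrt(np.pi) - 2 * norm.pdf(z)
                     - z * (2 * norm.cdf(z) - 1)),
        "LinS": p_x,
        "PS": norm.cdf(x + 1, mu, s) - norm.cdf(x - 1, mu, s),
    }

print(density_scores(1013.0, 2.0, 1015.5))  # illustrative values, in mb
```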

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed, so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                       Argmax_r s(r)    Linear transformation plotted in Figure 3
Quadratic score (QS)        2.18             40s + 6
Spherical score (SphS)      1.84             108s − 22
Logarithmic score (LogS)    2.41             s + 13
CRPS                        1.62             10s + 8
Linear score (LinS)         .05              105s − 5
Probability score (PS)      .02              60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

the ensemble. These scores are proper. The linear and probability scores were maximized at r = .05 and r = .02, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but it involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = .50 and α = .10. The associated lower and upper prediction bounds l_i and u_i are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean μ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_α(r) = (1/16,015) Σ_{i=1}^{16,015} S_α^int(l_i, u_i; x_i),   r > 0,      (58)

where S_α^int denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = .50 and α = .10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = .50 and α = .10, the score was optimized at r = 1.56 and r = 1.72.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X₁, …, X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by


Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. (a) Nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50; dashed) and the 90% central prediction interval (α = .10; solid, score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

the mean score

S_n(θ) = (1/n) Σ_{i=1}^n S(P_θ, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ₀ denotes the true parameter value, then asymptotic arguments indicate that

arg max_θ S_n(θ) → θ₀   as n → ∞.      (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take θ̂_n = arg max_θ S_n(θ) as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.
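As an illustration of the optimum score approach, the following sketch (ours) fits a Gaussian model by maximizing the mean CRPS with a generic numerical optimizer; the data are simulated, and the log-variance parameterization is a convenience rather than part of the method.

```python
# Optimum score estimation: fit N(mu, sigma^2) by maximizing the mean CRPS
# over the sample, and compare with the sample mean and standard deviation.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(11)
x = rng.normal(2.0, 1.5, 1000)

def neg_mean_crps(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                  # keep sigma positive
    z = (x - mu) / sigma
    crps = sigma * (1 / np.sqrt(np.pi) - 2 * norm.pdf(z)
                    - z * (2 * norm.cdf(z) - 1))
    return -np.mean(crps)

res = minimize(neg_mean_crps, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))   # close to the sample mean and sd
```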

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I, θ) = cλ(I) − 1{θ ∈ I},      (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and we propose using proper scoring rules to assess interval estimators, based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_α(I, θ) = λ(I) + (2/α) inf_{η∈I} |θ − η|      (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics - Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics,'" in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.


Example 1 (Quadratic or Brier score). If G(p) = Σ_{j=1}^m p_j² − 1, then (10) yields the quadratic score or Brier score,

S(p, i) = −Σ_{j=1}^m (δ_ij − p_j)² = 2p_i − Σ_{j=1}^m p_j² − 1,

where δ_ij = 1 if i = j and δ_ij = 0 otherwise. The associated Bregman divergence is the squared Euclidean distance, d(p, q) = Σ_{j=1}^m (p_j − q_j)². This well-known scoring rule was proposed by Brier (1950); Selten (1998) gave an axiomatic characterization.

Example 2 (Spherical score). Let α > 1 and consider the generalized entropy function G(p) = (Σ_{j=1}^m p_j^α)^{1/α}. This corresponds to the pseudospherical score

S(p, i) = p_i^{α−1} / (Σ_{j=1}^m p_j^α)^{(α−1)/α},

which reduces to the traditional spherical score when α = 2. The associated Bregman divergence is

d(p, q) = (Σ_{j=1}^m q_j^α)^{1/α} − (Σ_{j=1}^m q_j p_j^{α−1}) / (Σ_{j=1}^m p_j^α)^{(α−1)/α}.

Example 3 (Logarithmic score). Negative Shannon entropy, G(p) = Σ_{j=1}^m p_j log p_j, corresponds to the logarithmic score, S(p, i) = log p_i. The associated Bregman divergence is the Kullback–Leibler divergence, d(p, q) = Σ_{j=1}^m q_j log(q_j/p_j). [Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule dates back at least to Good (1952). Information-theoretic perspectives and interpretations in terms of gambling returns have been given by Roulston and Smith (2002) and Daley and Vere-Jones (2004). Despite its popularity, the logarithmic score has been criticized for its unboundedness, with Selten (1998, p. 51) arguing that it entails value judgments that are unacceptable. Feuerverger and Rahman (1992) noted a connection to Neyman–Pearson theory and an ensuing optimality property of the logarithmic score.

Example 4 (Zero–one score). The zero–one scoring rule rewards a probabilistic forecast if the mode of the predictive distribution materializes. In case of multiple modes, the reward is reduced proportionally; that is,

S(p, i) = 1/|M(p)| if i ∈ M(p), and S(p, i) = 0 otherwise,

where M(p) = {i : p_i = max_{j=1,…,m} p_j} denotes the set of modes of p. This is also known as the misclassification loss, and the meteorological literature uses the term success rate to denote case-averaged zero–one scores (see, e.g., Toth, Zhu, and Marchok 2001). The associated expected score or generalized entropy function (6) is G(p) = max_{j=1,…,m} p_j, and the divergence function (7) becomes

d(p, q) = max_{j=1,…,m} q_j − (Σ_{j ∈ M(p)} q_j) / |M(p)|.

This does not define a Bregman divergence, because the entropy function is neither differentiable nor strictly convex.

The scoring rules in the foregoing examples are symmetric, in the sense that

S((p_1, …, p_m), i) = S((p_{π_1}, …, p_{π_m}), π_i)   (11)

for all p ∈ P_m, for all permutations π on m elements, and for all events i = 1, …, m. Winkler (1994, 1996) argued that symmetric rules do not always appropriately reward forecasting skill, and he called for asymmetric ones, particularly in situations in which skill scores traditionally have been used. Asymmetric proper scoring rules can be generated by applying Theorem 2 to convex functions G that are not invariant under coordinate permutation.

3.2 Schervish Representation

The classical case of a probability forecast for a dichotomous event suggests further discussion. We follow Dawid (1986) in considering the sample space Ω = {1, 0}. A probabilistic forecast is a quoted probability p ∈ [0,1] for the event to occur. A scoring rule S can be identified with a pair of functions S(·,1) : [0,1] → [−∞, ∞] and S(·,0) : [0,1] → [−∞, ∞]. Thus S(p,1) is the forecaster's reward if he or she quotes p and the event materializes, and S(p,0) is the reward if he or she quotes p and the event does not materialize. Note the subtle change from the previous section, where we used the convex class P₂ = {(p₁, p₂) ∈ R² : p₁ ∈ [0,1], p₂ = 1 − p₁} in place of the unit interval P = [0,1] to represent probability measures on binary sample spaces.

A scoring rule for binary variables is regular if S(·,1) and S(·,0) are real-valued, except possibly that S(0,1) = −∞ or S(1,0) = −∞. A variant of Theorem 2 shows that every regular proper scoring rule is of the form

S(p,1) = G(p) + (1 − p)G′(p),
S(p,0) = G(p) − pG′(p),   (12)

where G : [0,1] → R is a convex function and G′(p) is a subgradient of G at the point p ∈ [0,1], in the sense that

G(q) ≥ G(p) + G′(p)(q − p)

for all q ∈ [0,1]. The statement holds with proper replaced by strictly proper and convex replaced by strictly convex. The subgradient G′(p) is real-valued, except that we permit G′(0) = −∞ and G′(1) = ∞. The function G is the expected score function, G(p) = pS(p,1) + (1 − p)S(p,0), and if G is differentiable at an interior point p ∈ (0,1), then G′(p) is unique and equals the derivative of G at p. Related but slightly less general results were given by Shuford, Albert, and Massengil (1966). Figure 1 provides a geometric interpretation.

The Savage representation (12) implies various interesting properties of regular proper scoring rules. For instance, we conclude from theorem 24.2 of Rockafellar (1970) that

S(p,1) = lim_{q→1} G(q) − ∫_p^1 (G′(q) − G′(p)) dq   (13)

for p ∈ (0,1), and because G′(p) is increasing, S(p,1) is increasing as well. Similarly, S(p,0) is decreasing, as would be intuitively expected. The statements hold with proper, increasing, and decreasing replaced by strictly proper, strictly increasing, and strictly decreasing. Alternative proofs of these and other results have been given by Schervish (1989, appendix).


Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast p ∈ [0,1], the expected score S(p, q) = qS(p,1) + (1 − q)S(p,0) equals the ordinate of the tangent to G at p [the solid line with slope G′(p)] when evaluated at q ∈ [0,1]. In particular, the scores S(p,0) = G(p) − pG′(p) and S(p,1) = G(p) + (1 − p)G′(p) can be read off the tangent when evaluated at q = 0 and q = 1. The Bregman divergence d(p, q) = S(q, q) − S(p, q) equals the difference between G and its tangent at p when evaluated at q. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter we let 1{·} denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that S is a regular scoring rule. Then S is proper and such that S(0,1) = lim_{p→0} S(p,1) and S(0,0) = lim_{p→0} S(p,0), and both S(p,1) and S(p,0) are left continuous, if and only if there exists a nonnegative measure ν on (0,1) such that

S(p,1) = S(1,1) − ∫ (1 − c) 1{p ≤ c} ν(dc),
S(p,0) = S(0,0) − ∫ c 1{p > c} ν(dc)   (14)

for all p ∈ [0,1]. The scoring rule is strictly proper if and only if ν assigns positive measure to every open interval.

Sketch of Proof. Suppose that S satisfies the assumptions of the theorem. To prove that S(p,1) is of the form (14), consider the representation (13), identify the increasing function G′(p) with the left-continuous distribution function of a nonnegative measure ν on (0,1), and apply the partial integration formula. The proof of the representation for S(p,0) is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost–loss ratio c ∈ (0,1) that reflects the relative costs of the two possible types of inferior decision. The measure ν(dc) in Schervish's representation (14) assigns relevance to distinct cost–loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted asymmetric zero–one scores,

S_c(p,1) = (1 − c) 1{p > c},   S_c(p,0) = c 1{p ≤ c},   (15)

with a nonnegative mixing measure ν(dc). Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure ν(dc) for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero–one score. If the expected score function G is smooth, then ν(dc) has Lebesgue density G″(c) (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, G(p) = p log p + (1 − p) log(1 − p), and corresponds to the infinite measure with Lebesgue density (c(1 − c))^{−1}.
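As a numerical illustration of the representation (14), the sketch below recovers the Brier score by quadrature over the cost–loss ratios; the uniform mixing density G″(c) = 2 follows from G(p) = p² − p, and S(1,1) = 0 is used as in Table 1.

```python
# Quadrature check of the Schervish representation (14) for the Brier score.
import numpy as np
from scipy.integrate import quad

def S1_from_mixture(p, density):
    # S(p, 1) = S(1, 1) - int (1 - c) 1{p <= c} nu(dc), with S(1, 1) = 0
    val, _ = quad(lambda c: (1 - c) * density(c), p, 1)
    return -val

brier_density = lambda c: 2.0    # G''(c) for G(p) = p^2 - p
for p in (0.1, 0.5, 0.9):
    assert np.isclose(S1_from_mixture(p, brier_density), -(1 - p) ** 2)
```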

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule    S(p,1)                        S(p,0)                              ν(dc)
Brier           −(1 − p)²                     −p²                                 uniform
Spherical       p(1 − 2p + 2p²)^{−1/2}        (1 − p)(1 − 2p + 2p²)^{−1/2}        (1 − 2c + 2c²)^{−3/2}
Logarithmic     log p                         log(1 − p)                          (c(1 − c))^{−1}
Zero–one        (1 − c) 1{p > c}              c 1{p ≤ c}                          point measure in c

Example 5 (Beta family). Let α, β > −1 and consider the two-parameter family

S(p,1) = −∫_p^1 c^{α−1}(1 − c)^β dc,
S(p,0) = −∫_0^p c^α(1 − c)^{β−1} dc,

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^{α−1}(1 − c)^{β−1}. This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p,1) = p − 1 and S(p,0) = p + log(1 − p) being one such example (α = 0, β = 1).

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0,1), then

S*(p,1) = (S(p,1) − S(c,1)) / T(c,p),
S*(p,0) = (S(p,0) − S(c,0)) / T(c,p),   (16)

where T(c,p) = S(0,0) − S(c,0) if p ≤ c and T(c,p) = S(1,1) − S(c,1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

S*(p,1) = ((1 − c)² − (1 − p)²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),
S*(p,0) = (c² − p²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),   (17)

with the value of c ∈ (0,1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p,1) and S(p,0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let μ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to μ and have μ-density p such that

‖p‖_α = (∫ p(ω)^α μ(dω))^{1/α}

is finite. We identify a probabilistic forecast P ∈ L_α with its μ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of μ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω))/μ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score,

QS(p, ω) = 2p(ω) − ‖p‖₂²,   (18)

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function, d(p, q) = ‖p − q‖₂², is symmetric. Good (1971) proposed the pseudospherical score,

PseudoS(p, ω) = p(ω)^{α−1} / ‖p‖_α^{α−1},

which reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that in a measure-theoretic framework is obsolete. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function G(p) = ‖p‖_α and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score,

LogS(p, ω) = log p(ω),   (19)

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then, under various names including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by μ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p,1) (solid) and S(p,0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score, LinS(p, ω) = p(ω), is not a proper scoring rule, despite its intuitive appeal. For instance, let ϕ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < (log 2)^{1/2}, then

LinS(u, ϕ) = (2π)^{−1/2} (1/(2ε)) ∫_{−ε}^{ε} e^{−x²/2} dx > 1/(2π^{1/2}) = LinS(ϕ, ϕ),

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.
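The counterexample can be checked in a few lines of Python; the value ε = .5 is an arbitrary choice below the bound (log 2)^{1/2} ≈ .8326.

```python
# Numerical check that the linear score prefers the uniform density u on
# (-eps, eps) over the true standard Gaussian density phi.
import numpy as np
from scipy.stats import norm

eps = 0.5
lin_u = (norm.cdf(eps) - norm.cdf(-eps)) / (2 * eps)   # LinS(u, phi) = E_phi[u(X)]
lin_phi = 1 / (2 * np.pi ** 0.5)                       # LinS(phi, phi) = E_phi[phi(X)]
assert lin_u > lin_phi                                 # impropriety of the linear score
```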

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and we use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

CRPS(F, x) = −∫_{−∞}^{∞} (F(y) − 1{y ≥ x})² dy,   (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

CRPS(F, x) = (1/2) E_F|X − X′| − E_F|X − x|,   (21)

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean μ and variance σ², then it follows that

CRPS(N(μ, σ²), x) = σ [ 1/√π − 2ϕ((x − μ)/σ) − ((x − μ)/σ)(2Φ((x − μ)/σ) − 1) ],

where ϕ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics, in a total of O(n log n) operations (Hersbach 2000, sec. 4b).

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = −∫_{−∞}^{∞} F(y)(1 − F(y)) dy = −(1/2) E_F|X − X′|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F, G) = ∫_{−∞}^{∞} (F(y) − G(y))² dy,

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

CRPS*(F, x) = E_F|X − x| − (1/2) E_F|X − X′|,

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0,2), denote the class of the Borel probability measures P on R^m that are such that E_P‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score,

ES(P, x) = (1/2) E_P‖X − X′‖^β − E_P‖X − x‖^β,   (22)

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0,2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

ES(P, x) = −‖μ_P − x‖²,   (23)

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.

The energy score with index β ∈ (0,2) applies to all Borel probability measures on R^m, by defining

ES(P, x) = −(β 2^{β−2} Γ((m + β)/2)) / (π^{m/2} Γ(1 − β/2)) ∫_{R^m} |φ_P(y) − e^{i⟨x,y⟩}|² / ‖y‖^{m+β} dy,   (24)

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.
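A Monte Carlo sketch of (22) for a forecast given as a finite sample, which fits ensemble and MCMC output; the Gaussian ensemble and the observation below are illustrative values.

```python
# Sample-based energy score (22): 0.5 E||X - X'||^beta - E||X - x||^beta.
import numpy as np

def energy_score(ensemble, x, beta=1.0):
    e = np.atleast_2d(ensemble)               # rows: ensemble members in R^m
    x = np.asarray(x, dtype=float)
    cross = np.mean(np.linalg.norm(e - x, axis=1) ** beta)
    pairs = np.linalg.norm(e[:, None, :] - e[None, :, :], axis=2) ** beta
    return 0.5 * np.mean(pairs) - cross

rng = np.random.default_rng(0)
ens = rng.multivariate_normal(np.zeros(2), np.eye(2), size=500)
print(energy_score(ens, np.array([0.3, -0.2])))   # reduces to the CRPS if m = 1, beta = 1
```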

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question asks for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P, x) = −log det Σ_P − (x − μ_P)′ Σ_P^{−1} (x − μ_P),   (25)

which is linked to the generalized entropy function

G(P) = −log det Σ_P − m

and to the divergence function

d(P, Q) = tr(Σ_P^{−1} Σ_Q) − log det(Σ_P^{−1} Σ_Q) + (μ_P − μ_Q)′ Σ_P^{−1} (μ_P − μ_Q) − m.

[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x₁, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = Σ_{i=1}^n (x_i − μ_i)² + Σ_{i=1}^n σ_i²,

where μ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − μ_P)² − σ_P²,   (26)

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P, that is, a deterministic forecast, rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − μ_P)/σ_P)² − log σ_P²   (27)

in the case in which m = 1 and the observations are real-valued.
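The contrast between (26) and (27) can be seen in a short simulation; the true belief P = N(0,1) and the candidate standard deviations below are illustrative choices.

```python
# Under a true belief N(0, 1), the mean of (26) grows as sigma -> 0 (rewarding
# a collapsed point forecast), whereas the mean of (27) peaks at sigma = 1.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100000)          # observations drawn from P

def mean_score26(mu, sigma):
    return np.mean(-(x - mu) ** 2 - sigma ** 2)

def mean_score27(mu, sigma):
    return np.mean(-((x - mu) / sigma) ** 2 - np.log(sigma ** 2))

sigmas = [0.25, 0.5, 1.0, 2.0]
print([mean_score26(0.0, s) for s in sigmas])   # increases as sigma -> 0
print([mean_score27(0.0, s) for s in sigmas])   # maximized at the true sigma = 1
```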

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a₁, …, a_n ∈ R that sum to 0, and all x₁, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P g(X, X′) − E_P g(X, x)   (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q. We need to show that

−(1/2) E_Q g(Y, Y′) ≥ (1/2) E_P g(X, X′) − E_{P,Q} g(X, Y).   (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.

Example 7 (Quadratic or Brier score). Let Ω = {1, 0}, and suppose that g(0,0) = g(1,1) = 0 and g(0,1) = g(1,0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = R and g(x, x′) = |x − x′| for x, x′ ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0,2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ R^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus,

S(P, θ) = (1/2) E_P α(Θ, Θ′) − E_P α(Θ, θ)   (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables and used Fourier analytic tools to prove the propriety of the score.

We turn to a far-reaching generalization of the energy score. For x = (x₁, …, x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (Σ_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β, x, x′ ∈ R^m,

is negative definite if and only if the following holds.

Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2) E_P‖X − X′‖_α^β − E_P‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q}‖X − Y‖_α^β is finite if and only if E_P‖X‖_α^β and E_Q‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P‖X − X′‖_α^β is finite if and only if E_P‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^{(k)} of all orders and (−1)^k η^{(k)}(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P ψ(‖X − X′‖₂²) − E_P ψ(‖X − x‖₂²)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2 E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0.   (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′).   (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2 E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0   (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β,   (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′), x, x′ ∈ Ω,   (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a₁, …, a_n ∈ R, and all x₁, …, x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β, x, x′ ∈ R^m,   (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) equals the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and Σ_{i=1}^n Σ_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c₁, …, c_n ∈ C, and all x₁, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′)   (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′, y⟩} − 1, where x, x′ ∈ R^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x,y⟩}|²,   (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{R^m} |φ_P(y) − e^{i⟨x,y⟩}|² μ(dy).   (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0,2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α₁, …, α_k ∈ (0,1) are sought. If the forecaster quotes quantiles r₁, …, r_k and x materializes, then he or she will be rewarded by the score S(r₁, …, r_k; x). We define

S(r₁, …, r_k; P) = ∫ S(r₁, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r₁, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q₁, …, q_k denote the true P-quantiles at levels α₁, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q₁, …, q_k; P) ≥ S(r₁, …, r_k; P)

for all real numbers r₁, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r; x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x)   (40)

is proper for predicting the quantile at level α ∈ (0,1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q; P) − S(r; P) = ∫_{(r,q)} s(x) dP(x) + s(r)P(r) − α s(r)
                  ≥ s(r)(P(q) − P(r)) + s(r)P(r) − α s(r)
                  = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r; x) = (x − r)(1{x ≤ r} − α),   (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
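In its negatively oriented form, (41) is the familiar pinball loss; the simulated data and candidate quotes in the sketch below are illustrative, with the mean loss smallest near the true α-quantile.

```python
# Tick (pinball) loss, the negative of the quantile score (41).
import numpy as np

def tick_loss(r, x, alpha):
    # negative of (41): (x - r) * (alpha - 1{x <= r})
    return (x - r) * (alpha - (x <= r))

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 100000)
for r in (0.5, 1.2816, 2.0):        # 1.2816 is the true 90% quantile of N(0, 1)
    print(r, np.mean(tick_loss(r, x, 0.9)))
```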

Corollary 1. If $s_i$ is nondecreasing for $i = 1, \ldots, k$ and $h$ is arbitrary, then the scoring rule

$$S(r_1, \ldots, r_k; x) = \sum_{i=1}^{k} \bigl[ \alpha_i s_i(r_i) + (s_i(x) - s_i(r_i)) \, \mathbf{1}\{x \le r_i\} \bigr] + h(x) \qquad (42)$$

is proper for predicting the quantiles at levels $\alpha_1, \ldots, \alpha_k \in (0, 1)$.

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each $s_i$ is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central $(1 - \alpha) \times 100\%$ prediction interval, with lower and upper endpoints that are the predictive quantiles at level $\alpha/2$ and $1 - \alpha/2$. We denote a scoring rule for the associated interval forecast by $S_\alpha(l, u; x)$, where $l$ and $u$ represent the quoted $\alpha/2$ and $1 - \alpha/2$ quantiles. Thus, if the forecaster quotes the $(1 - \alpha) \times 100\%$ central prediction interval $[l, u]$ and $x$ materializes, then his or her score will be $S_\alpha(l, u; x)$. Putting $\alpha_1 = \alpha/2$, $\alpha_2 = 1 - \alpha/2$, $s_1(x) = s_2(x) = \frac{2}{\alpha}x$, and $h(x) = -\frac{2}{\alpha}x$ in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score

$$S_\alpha^{\mathrm{int}}(l, u; x) = (u - l) + \frac{2}{\alpha}(l - x) \, \mathbf{1}\{x < l\} + \frac{2}{\alpha}(x - u) \, \mathbf{1}\{x > u\}. \qquad (43)$$

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on $\alpha$, if the observation misses the interval. In the case $\alpha = \frac{1}{2}$, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process $\{X_t : t \in \mathbb{Z}\}$ defined by

$$X_{t+1} = \tfrac{1}{2}X_t + \tfrac{1}{2}X_t \varepsilon_t + \varepsilon_t, \qquad (44)$$

where the $\varepsilon_t$'s are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of $X_{t+1}$ given $X_t, X_{t-1}, \ldots$ is Gaussian with mean $\frac{1}{2}X_t$ and variance $(1 + \frac{1}{2}X_t)^2$, thereby suggesting the prediction interval

$$I = \left[ \tfrac{1}{2}X_t - c\left|1 + \tfrac{1}{2}X_t\right|, \; \tfrac{1}{2}X_t + c\left|1 + \tfrac{1}{2}X_t\right| \right], \qquad (45)$$

where $c = \Phi^{-1}(0.975)$. This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted $I$ as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval $I$ is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

$$J = \bigl[ F^{-1}(0.025), \, F^{-1}(0.975) \bigr], \qquad (46)$$

where $F$ denotes the unconditional, stationary distribution function of $X_t$, and

$$K = \left[ \tfrac{1}{2}X_t - \gamma\!\left(\left|1 + \tfrac{1}{2}X_t\right|\right), \; \tfrac{1}{2}X_t + \gamma\!\left(\left|1 + \tfrac{1}{2}X_t\right|\right) \right], \qquad (47)$$

where $\gamma(y) = (2(\log 7.36 - \log y))^{1/2} y$ for $y \le 7.36$ and $\gamma(y) = 0$ otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast $K$ seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path $X_t$, $t = 1, \ldots, 100{,}001$, from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for $X_{t+1}$, where $t = 1, \ldots, 100{,}000$. Table 2 summarizes the results of this experiment. The interval forecasts $I$, $J$, and $K$ all showed close to nominal coverage, with the prediction interval $K$ being sharpest on average. Nevertheless, the classical prediction interval $I$ performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              95.01%               4.00            4.77
J (46)              95.08%               5.45            8.04
K (47)              94.98%               3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
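The experiment can be sketched as follows. The path length matches the text, but the burn-in-free start, the random seed, and the empirical estimation of the stationary quantiles for $J$ are implementation assumptions of ours, so the output will only approximate Table 2.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 100_000
eps = rng.standard_normal(n + 1)
X = np.zeros(n + 1)
for t in range(n):                                  # bilinear process (44)
    X[t + 1] = 0.5 * X[t] + 0.5 * X[t] * eps[t] + eps[t]

c = norm.ppf(0.975)
mean_ = 0.5 * X[:-1]                                # conditional mean
scale_ = np.abs(1.0 + 0.5 * X[:-1])                 # conditional std deviation
x_next = X[1:]

I = (mean_ - c * scale_, mean_ + c * scale_)        # interval (45)
qlo, qhi = np.quantile(X, [0.025, 0.975])           # stationary quantiles, estimated
J = (np.full(n, qlo), np.full(n, qhi))              # interval (46)
gam = np.where(scale_ <= 7.36,
               np.sqrt(2.0 * np.clip(np.log(7.36) - np.log(scale_), 0.0, None)) * scale_,
               0.0)
K = (mean_ - gam, mean_ + gam)                      # interval (47)

def interval_score(l, u, x, alpha=0.05):
    return (u - l) + (2 / alpha) * (l - x) * (x < l) + (2 / alpha) * (x - u) * (x > u)

for name, (l, u) in [("I", I), ("J", J), ("K", K)]:
    print(name,
          round(float(np.mean((x_next >= l) & (x_next <= u))), 4),  # coverage
          round(float(np.mean(u - l)), 2),                          # width
          round(float(np.mean(interval_score(l, u, x_next))), 2))   # interval score
```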

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus, we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if $S_\alpha$ denotes a proper scoring rule for the quantile at level $\alpha$, and $\nu$ is a Borel measure on $(0, 1)$, then the scoring rule

$$S(F; x) = \int_0^1 S_\alpha\bigl(F^{-1}(\alpha); x\bigr) \, \nu(\mathrm{d}\alpha) \qquad (48)$$

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If $S$ denotes a proper scoring rule for probability forecasts, and $\nu$ is a Borel measure on $\mathbb{R}$, then the scoring rule

$$S(F; x) = \int_{-\infty}^{\infty} S\bigl(F(y); \mathbf{1}\{x \le y\}\bigr) \, \nu(\mathrm{d}y) \qquad (49)$$

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which $S$ is the quadratic or Brier score and $\nu$ is the Lebesgue measure. If $S$ is the Brier score and $\nu$ is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If $\mathcal{P}$ denotes the class of the Borel probability measures on $\mathbb{R}^m$, then we identify a probabilistic forecast $P \in \mathcal{P}$ with its cumulative distribution function $F$. A multivariate analog of the CRPS can be defined as

$$\mathrm{CRPS}(F; \mathbf{x}) = -\int_{\mathbb{R}^m} \bigl(F(\mathbf{y}) - \mathbf{1}\{\mathbf{x} \le \mathbf{y}\}\bigr)^2 \, \nu(\mathrm{d}\mathbf{y}).$$

This is a weighted integral of the Brier scores at all $m$-variate thresholds. The Borel measure $\nu$ can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If $\nu$ is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class $\mathcal{P}$.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation: random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample $\mathbf{X} = (X_1, \ldots, X_n)$ of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models $H_1$ and $H_2$. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the $X_i$'s are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for $H_1$ against $H_2$ is

$$B = \frac{P(\mathbf{X} | H_1)}{P(\mathbf{X} | H_2)}, \qquad (50)$$

where $P(\mathbf{X} | H_k) = \prod_{i=1}^{n} P(X_i | H_k)$ for $k = 1, 2$ (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models:

$$\log B = \mathrm{LogS}(H_1, \mathbf{X}) - \mathrm{LogS}(H_2, \mathbf{X}). \qquad (51)$$

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.
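A toy illustration of (50) and (51) in the no-parameter case, with two fully specified Gaussian forecasting rules of our own choosing: the log Bayes factor is exactly the difference of the total logarithmic scores.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=50)                   # data generated under H1

logS_H1 = norm.logpdf(X, loc=0.0, scale=1.0).sum()  # LogS(H1, X)
logS_H2 = norm.logpdf(X, loc=1.0, scale=1.0).sum()  # LogS(H2, X)
log_B = logS_H1 - logS_H2                           # log Bayes factor (51)
print(log_B)                                        # positive values favor H1
```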

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters $\theta_k$ for $H_k$, to be estimated from the data. Then the Bayes factor is still given by (50), but now $P(\mathbf{X} | H_k)$ is the integrated likelihood,

$$P(\mathbf{X} | H_k) = \int p(\mathbf{X} | \theta_k, H_k) \, p(\theta_k | H_k) \, \mathrm{d}\theta_k,$$

where $p(\mathbf{X} | \theta_k, H_k)$ is the (usual) likelihood under model $H_k$ and $p(\theta_k | H_k)$ is the prior distribution of the parameter $\theta_k$.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

$$P(\mathbf{X} | H_k) = \prod_{t=1}^{n} P(X_t | \mathbf{X}^{t-1}, H_k), \qquad (52)$$

where $\mathbf{X}^{t-1} = \{X_1, \ldots, X_{t-1}\}$ if $t \ge 1$, $\mathbf{X}^0$ is the empty set, and $P(X_t | \mathbf{X}^{t-1}, H_k)$ is the predictive distribution of $X_t$ given the past values under $H_k$, namely

$$P(X_t | \mathbf{X}^{t-1}, H_k) = \int p(X_t | \theta_k, H_k) \, P(\theta_k | \mathbf{X}^{t-1}, H_k) \, \mathrm{d}\theta_k,$$

with $P(\theta_k | \mathbf{X}^{t-1}, H_k)$ the posterior distribution of $\theta_k$ given the past observations $\mathbf{X}^{t-1}$.

We let $S_k^{\mathrm{B}} = \log P(\mathbf{X} | H_k)$ denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

$$S_k^{\mathrm{B}} = \sum_{t=1}^{n} \log P(X_t | \mathbf{X}^{t-1}, H_k). \qquad (53)$$

Dawid (1984) showed that $S_k^{\mathrm{B}}$ is asymptotically equivalent to the plug-in maximum likelihood prequential score

$$S_k^{\mathrm{D}} = \sum_{t=1}^{n} \log P\bigl(X_t | \mathbf{X}^{t-1}, \hat{\theta}_k^{\,t-1}\bigr), \qquad (54)$$

where $\hat{\theta}_k^{\,t-1}$ is the maximum likelihood estimator (MLE) of $\theta_k$ based on the past observations $\mathbf{X}^{t-1}$, in the sense that $S_k^{\mathrm{D}} / S_k^{\mathrm{B}} \to 1$ as $n \to \infty$. Initial terms for which $\hat{\theta}_k^{\,t-1}$ is possibly undefined can be ignored. Dawid also showed that $S_k^{\mathrm{B}}$ is asymptotically equivalent to the Bayes information criterion (BIC) score,

$$S_k^{\mathrm{BIC}} = \sum_{t=1}^{n} \log P\bigl(X_t | \mathbf{X}^{t-1}, \hat{\theta}_k^{\,n}\bigr) - \frac{d_k}{2} \log n,$$

where $d_k = \dim(\theta_k)$, in the same sense, namely $S_k^{\mathrm{BIC}} / S_k^{\mathrm{B}} \to 1$ as $n \to \infty$. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

$$S_k^{*\mathrm{B}} = \sum_{t=1}^{n} \mathrm{E}_D\Bigl[\log p\bigl(X_t \,\big|\, \mathbf{X}(D), H_k\bigr)\Bigr], \qquad (55)$$

where $D$ is a random sample from $\{1, \ldots, t-1, t+1, \ldots, n\}$, the size of which is a random variable with a discrete uniform distribution on $\{0, 1, \ldots, n-1\}$. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

$$S_k^{*\mathrm{D}} = \sum_{t=1}^{n} \mathrm{E}_D\Bigl[\log p\bigl(X_t \,\big|\, \mathbf{X}(D), \hat{\theta}_k^{(D)}, H_k\bigr)\Bigr], \qquad (56)$$

where $\hat{\theta}_k^{(D)}$ is the MLE of $\theta_k$ based on $\mathbf{X}(D)$. Terms for which the size of $D$ is small and $\hat{\theta}_k^{(D)}$ is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating $S_k^{*\mathrm{B}}$ or $S_k^{*\mathrm{D}}$ by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random, rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out, 10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
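The following sketch evaluates the plug-in version (56) by Monte Carlo, under assumptions of ours: a Gaussian model with unknown mean and known unit variance, so that the MLE on a subset is the subset mean, and draws of $D$ for which the MLE is undefined are skipped, as suggested above.

```python
import numpy as np
from scipy.stats import norm

def random_fold_cv_score(X, n_draws=200, seed=1):
    rng = np.random.default_rng(seed)
    n = len(X)
    total = 0.0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        terms = []
        for _ in range(n_draws):                  # Monte Carlo over D
            size = rng.integers(0, n)             # uniform on {0, ..., n-1}
            if size == 0:
                continue                          # MLE undefined; ignore
            D = rng.choice(others, size=size, replace=False)
            theta_hat = X[D].mean()               # MLE based on X(D)
            terms.append(norm.logpdf(X[t], loc=theta_hat, scale=1.0))
        total += np.mean(terms)
    return total

rng = np.random.default_rng(7)
X = rng.normal(0.3, 1.0, size=40)
print(random_fold_cv_score(X))
```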

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In $S_k^{\mathrm{B}}$, $S_k^{\mathrm{D}}$, $S_k^{\mathrm{BIC}}$, $S_k^{*\mathrm{B}}$, and $S_k^{*\mathrm{D}}$, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January-June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of $r_0 = 1.55$.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to $r$ times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to $r$ as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say $\varphi_{\mu, r\sigma}$; its mean $\mu$ is the ensemble mean forecast, and its standard deviation $r\sigma$ is the product of the inflation factor $r$ and the standard deviation of the five-member forecast ensemble, $\sigma$. We considered various scoring rules $S$ and computed the average score

$$s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S\bigl(\varphi_{\mu_i, r\sigma_i}, x_i\bigr), \qquad r > 0, \qquad (57)$$

as a function of the inflation factor $r$. The index $i$ refers to the $i$th record in the verification database, and $x_i$ denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect $s(r)$ to be maximized at some $r > 1$, possibly near the observed ratio $r_0 = 1.55$ of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors $r \in (0, 5)$ and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if $p$ denotes the predictive density and $x$ denotes the observed value, then

$$\mathrm{QS}(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2 \, \mathrm{d}y,$$

$$\mathrm{SphS}(p, x) = \frac{p(x)}{\bigl(\int_{-\infty}^{\infty} p(y)^2 \, \mathrm{d}y\bigr)^{1/2}},$$

$$\mathrm{LogS}(p, x) = \log p(x),$$

$$\mathrm{CRPS}(p, x) = \tfrac{1}{2} \mathrm{E}_p|X - X'| - \mathrm{E}_p|X - x|,$$

$$\mathrm{LinS}(p, x) = p(x),$$

and

$$\mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y) \, \mathrm{d}y.$$
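For a Gaussian predictive density, all six scores have simple closed or near-closed forms, which makes an experiment of this kind easy to emulate. The sketch below uses $\int p(y)^2 \, \mathrm{d}y = 1/(2s\sqrt{\pi})$ for a Gaussian density with standard deviation $s$ and the known closed form of the Gaussian CRPS; the synthetic records are our own stand-in for the verification database, constructed to be underdispersed at $r = 1$ in the same way.

```python
import numpy as np
from scipy.stats import norm

def gaussian_scores(mu, s, x):
    """QS, SphS, LogS, CRPS (positively oriented), LinS, and PS for N(mu, s^2)."""
    z = (x - mu) / s
    p_x = norm.pdf(x, mu, s)
    p2 = 1.0 / (2.0 * s * np.sqrt(np.pi))         # int p(y)^2 dy
    crps = -s * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return {"QS": 2 * p_x - p2,
            "SphS": p_x / np.sqrt(p2),
            "LogS": np.log(p_x),
            "CRPS": crps,
            "LinS": p_x,
            "PS": norm.cdf(x + 1, mu, s) - norm.cdf(x - 1, mu, s)}

rng = np.random.default_rng(0)
m = 20000
mu = np.zeros(m)
sigma = np.full(m, 2.13)                          # ensemble spread
x = rng.normal(mu, 1.55 * sigma)                  # truth is underdispersed at r = 1
for r in [0.2, 0.5, 1.0, 1.55, 2.0, 3.0]:
    sc = gaussian_scores(mu, r * sigma, x)
    print(r, {k: round(float(np.mean(v)), 3) for k, v in sc.items()})
# The proper scores peak near r = 1.55; the improper LinS and PS are
# maximized at much smaller r, favoring overconfident forecasts.
```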

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of $r > 1$, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January-July 2000

Score                      Argmax_r s(r) in eq. (57)   Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                        40s + 6
Spherical score (SphS)     1.84                        108s - 22
Logarithmic score (LogS)   2.41                        s + 13
CRPS                       1.62                        10s + 8
Linear score (LinS)        0.5                         105s - 5
Probability score (PS)     0.2                         60s - 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January-July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at $r = 0.5$ and $r = 0.2$, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of $r$. The logarithmic score is strictly proper, but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions, unless the inflation factor $r$ is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central $(1 - \alpha) \times 100\%$ prediction interval, where $\alpha = .50$ and $\alpha = .10$. The associated lower and upper prediction bounds $l_i$ and $u_i$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of a Gaussian distribution with mean $\mu_i$ and standard deviation $r\sigma_i$, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor $r$ in two ways: by computing the empirical coverage of the prediction intervals, and by computing

$$s_\alpha(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S_\alpha^{\mathrm{int}}(l_i, u_i; x_i), \qquad r > 0, \qquad (58)$$

where $S_\alpha^{\mathrm{int}}$ denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with $r$. For $\alpha = .50$ and $\alpha = .10$, the nominal coverage was obtained at $r = 1.78$ and $r = 2.11$, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor $r$. For $\alpha = .50$ and $\alpha = .10$, the score was optimized at $r = 1.56$ and $r = 1.72$.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January-July 2000: (a) nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed) and the 90% central prediction interval (α = .10, solid; score scaled by a factor of .60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model $P_\theta$ based on a sample $X_1, \ldots, X_n$ of identically distributed observations. To estimate $\theta$, we can measure the goodness of fit by the mean score

$$S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, X_i),$$

where $S$ is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If $\theta_0$ denotes the true parameter value, then asymptotic arguments indicate that

$$\arg\max_\theta S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \qquad (59)$$

This suggests a general approach to estimation: Choose a strictly proper scoring rule tailored to the problem at hand, and take $\hat{\theta}_n = \arg\max_\theta S_n(\theta)$ as the respective optimum score estimator. The first four values of the $\arg\max$ in Table 3, for instance, refer to the optimum score estimates of the inflation factor $r$ based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a $\psi$-function of the form $\psi(x) = 2\Phi(\frac{x}{c}) - 1$, where $c$ is a positive constant and $\Phi$ denotes the standard Gaussian cumulative. This provides a smooth version of the $\psi$-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaptation of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.
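A sketch of optimum score estimation based on the CRPS for a Gaussian model, using the closed form of the Gaussian CRPS; the data-generating parameters below are toy choices of ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_mean_crps(params, x):
    """Mean negatively oriented CRPS of N(mu, s^2); minimizing it maximizes S_n."""
    mu, log_s = params
    s = np.exp(log_s)                             # keeps the scale positive
    z = (x - mu) / s
    crps = s * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return float(np.mean(crps))

rng = np.random.default_rng(11)
x = rng.normal(1.0, 2.0, size=2000)
res = minimize(neg_mean_crps, x0=np.array([0.0, 0.0]), args=(x,),
               method="Nelder-Mead")
print(res.x[0], float(np.exp(res.x[1])))          # close to (1.0, 2.0), cf. (59)
```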

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter $\theta$ in a Gaussian population with unknown scale. Under the loss function

$$L(I; \theta) = c\lambda(I) - \mathbf{1}\{\theta \in I\}, \qquad (60)$$

where $c$ is a positive constant and $\lambda(I)$ denotes the Lebesgue measure of the interval estimate $I$, the classical $t$-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators, based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

$$L_\alpha(I; \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta| \qquad (61)$$

and applies to interval estimates with upper and lower exceedance probability $\frac{\alpha}{2} \times 100\%$. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes, as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if $P$ is a Borel probability measure on $\mathbb{R}^m$, then a depth function $D(P, x)$ gives a $P$-based center-outward ordering of points $x \in \mathbb{R}^m$. Formally, this resembles a scoring rule $S(P, x)$ that assigns a $P$-based numerical value to an event $x \in \mathbb{R}^m$. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190-206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686-690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems," Statistical Science, 10, 3-66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113-150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200-217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338-347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1-3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393-3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799-807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406-438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6-16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131-2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141-155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513-519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841-862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253-285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311-354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297-312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278-292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210-218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65-81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52-65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253-263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7-49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396-405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329-352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111-125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985-987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615-1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447-454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165-175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829-838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680-700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153-160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1-11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416-431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119-122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248-249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098-1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107-114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337-339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707-711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192-205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367-1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155-167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620-631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916-1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559-570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73-101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221-233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655-662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725-731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer Mortality in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109-129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33-50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296-1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95-116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563-570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443-454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363-1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247-262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405-414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space-Time Interactions," Statistics & Probability Letters, 61, 411-419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891-1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421-434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087-1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321-3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654-655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215-223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435-455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527-535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747-774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263-281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249-272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13-36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1-27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155-1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653-1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783-801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856-1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564-571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43-62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125-145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63-72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583-616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360-364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263-273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58-80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111-128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463-477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206-213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595-601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375-392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821-2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956-970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073-1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187-191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395-1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1-60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751-758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317-329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511-522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461-482.



Figure 1. Schematic Illustration of the Relationships Between a Smooth Generalized Entropy Function G (solid convex curve) and the Associated Scoring Functions and Bregman Divergence. For any probability forecast $p \in [0, 1]$, the expected score $S(p, q) = qS(p, 1) + (1 - q)S(p, 0)$ equals the ordinate of the tangent to $G$ at $p$ [the solid line with slope $G'(p)$] when evaluated at $q \in [0, 1]$. In particular, the scores $S(p, 0) = G(p) - pG'(p)$ and $S(p, 1) = G(p) + (1 - p)G'(p)$ can be read off the tangent when evaluated at $q = 0$ and $q = 1$. The Bregman divergence $d(p, q) = S(q, q) - S(p, q)$ equals the difference between $G$ and its tangent at $p$ when evaluated at $q$. (For a similar interpretation, see fig. 8 in Buja et al. 2005.)

Schervish (1989, p. 1861) suggested that his theorem 4.2 generalizes the Savage representation. Given Savage's (1971, p. 793) assessment of his representation (9.15) as "figurative," the claim can well be justified. However, in its rigorous form [eq. (12)], the Savage representation is perfectly general.

Hereinafter, we let $\mathbf{1}\{\cdot\}$ denote an indicator function that takes value 1 if the event in brackets is true and 0 otherwise.

Theorem 3 (Schervish). Suppose that $S$ is a regular scoring rule. Then $S$ is proper and such that $S(0, 1) = \lim_{p \to 0} S(p, 1)$ and $S(0, 0) = \lim_{p \to 0} S(p, 0)$, and both $S(p, 1)$ and $S(p, 0)$ are left continuous, if and only if there exists a nonnegative measure $\nu$ on $(0, 1)$ such that

$$S(p, 1) = S(1, 1) - \int (1 - c) \, \mathbf{1}\{p \le c\} \, \nu(\mathrm{d}c),$$
$$S(p, 0) = S(0, 0) - \int c \, \mathbf{1}\{p > c\} \, \nu(\mathrm{d}c) \qquad (14)$$

for all $p \in [0, 1]$. The scoring rule is strictly proper if and only if $\nu$ assigns positive measure to every open interval.

Sketch of Proof. Suppose that $S$ satisfies the assumptions of the theorem. To prove that $S(p, 1)$ is of the form (14), consider the representation (13), identify the increasing function $G'(p)$ with the left-continuous distribution function of a nonnegative measure $\nu$ on $(0, 1)$, and apply the partial integration formula. The proof of the representation for $S(p, 0)$ is analogous. For the proof of the converse, reverse the foregoing steps. The statement for strict propriety follows from well-known properties of convex functions.

A two-decision problem can be characterized by a cost-loss ratio $c \in (0, 1)$ that reflects the relative costs of the two possible types of inferior decision. The measure $\nu(\mathrm{d}c)$ in Schervish's representation (14) assigns relevance to distinct cost-loss ratios. This result also can be interpreted as a Choquet representation, in that every left-continuous bounded scoring rule is equivalent to a mixture of cost-weighted, asymmetric zero-one scores,

$$S_c(p, 1) = (1 - c) \, \mathbf{1}\{p > c\}, \qquad S_c(p, 0) = c \, \mathbf{1}\{p \le c\}, \qquad (15)$$

with a nonnegative mixing measure $\nu(\mathrm{d}c)$. Theorem 3 allows for unbounded scores, requiring a slightly more elaborate statement. Full equivalence to the Savage representation (12) can be achieved if the regularity conditions are relaxed (Schervish 1989; Buja et al. 2005).

Table 1 shows the mixing measure $\nu(\mathrm{d}c)$ for the quadratic or Brier score, the spherical score, the logarithmic score, and the asymmetric zero-one score. If the expected score function $G$ is smooth, then $\nu(\mathrm{d}c)$ has Lebesgue density $G''(c)$ (Buja et al. 2005). For instance, the logarithmic score derives from Shannon entropy, $G(p) = p \log p + (1 - p) \log(1 - p)$, and corresponds to the infinite measure with Lebesgue density $(c(1 - c))^{-1}$.

Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule   S(p, 1)                       S(p, 0)                             ν(dc)
Brier          -(1 - p)^2                    -p^2                                Uniform
Spherical      p(1 - 2p + 2p^2)^{-1/2}       (1 - p)(1 - 2p + 2p^2)^{-1/2}       (1 - 2c + 2c^2)^{-3/2}
Logarithmic    log p                         log(1 - p)                          (c(1 - c))^{-1}
Zero-one       (1 - c) 1{p > c}              c 1{p <= c}                         Point measure in c

Buja et al. (2005) introduced the beta family, a continuous two-parameter family of proper scoring rules that includes both symmetric and asymmetric members and derives from mixing measures of beta type.

Example 5 (Beta family). Let $\alpha, \beta > -1$, and consider the two-parameter family

$$S(p, 1) = -\int_p^1 c^{\alpha - 1}(1 - c)^{\beta} \, \mathrm{d}c, \qquad S(p, 0) = -\int_0^p c^{\alpha}(1 - c)^{\beta - 1} \, \mathrm{d}c,$$

which is of the form (14) for a mixing measure $\nu(\mathrm{d}c)$ with Lebesgue density $c^{\alpha - 1}(1 - c)^{\beta - 1}$. This family includes the logarithmic score ($\alpha = \beta = 0$) and versions of the Brier score ($\alpha = \beta = 1$) and the zero-one score (15) with $c = \frac{1}{2}$ ($\alpha = \beta \to \infty$) as special or limiting cases. Asymmetric members arise when $\alpha \ne \beta$, with the scoring rule $S(p, 1) = p - 1$ and $S(p, 0) = p + \log(1 - p)$ being one such example ($\alpha = 1$, $\beta = 0$).
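A minimal numerical sketch of Example 5: the scores are obtained by integrating the beta-type mixing density, and $\alpha = \beta = 1$ recovers one-half of the Brier score, an equivalent rule in the sense of (2).

```python
import numpy as np
from scipy.integrate import quad

def beta_family_score(p, outcome, alpha, beta):
    """Beta family: S(p, 1) = -int_p^1 c^(a-1)(1-c)^b dc, and analogously S(p, 0)."""
    if outcome == 1:
        val, _ = quad(lambda c: c ** (alpha - 1) * (1 - c) ** beta, p, 1)
    else:
        val, _ = quad(lambda c: c ** alpha * (1 - c) ** (beta - 1), 0, p)
    return -val

for p in [0.1, 0.5, 0.9]:
    print(p,
          round(beta_family_score(p, 1, 1, 1), 6),   # alpha = beta = 1
          round(-0.5 * (1 - p) ** 2, 6))             # half the Brier score -(1-p)^2
```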

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if $S$ is a symmetric proper scoring rule and $c \in (0, 1)$, then

$$S^*(p, 1) = \frac{S(p, 1) - S(c, 1)}{T(c, p)}, \qquad S^*(p, 0) = \frac{S(p, 0) - S(c, 0)}{T(c, p)}, \qquad (16)$$

where $T(c, p) = S(0, 0) - S(c, 0)$ if $p \le c$ and $T(c, p) = S(1, 1) - S(c, 1)$ if $p > c$, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at $p = c$ and a maximum value of 1 at $p = 0$ and $p = 1$.

Example 6 (Winklerrsquos score) Tetlock (2005) explored whatconstitutes good judgment in predicting future political andeconomic events and looked at why experts are often wrong intheir forecasts In evaluating expertsrsquo predictions he adjustedfor the difficulty of the forecast task by using the special caseof (16) that derives from the Brier score that is

S*(p, 1) = ((1 − c)² − (1 − p)²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),   (17)

S*(p, 0) = (c² − p²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.
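As a concrete illustration (our own sketch, not code from the article; the function name winkler_brier is hypothetical), (17) can be implemented in a few lines:

```python
def winkler_brier(p, y, c):
    """Winkler's standardized score (17), derived from the Brier score.

    p : forecast probability in (0, 1)
    y : observed event indicator, 0 or 1
    c : baseline probability in (0, 1)
    """
    # Denominator of (17): c^2 if p <= c, (1 - c)^2 otherwise
    denom = c**2 if p <= c else (1 - c) ** 2
    if y == 1:
        return ((1 - c) ** 2 - (1 - p) ** 2) / denom
    return (c**2 - p**2) / denom
```

By construction, the score equals 0 at p = c and 1 at a categorical forecast that verifies, so a sharp forecast of a rare event is rewarded far more than under the raw Brier score.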

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = .6, and Winkler's standardized score (17) with c = .2.

Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let µ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to µ and have µ-density p such that

‖p‖_α = ( ∫ p(ω)^α µ(dω) )^{1/α}

is finite. We identify a probabilistic forecast P ∈ L_α with its µ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of µ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω)) / µ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score

QS(p, ω) = 2p(ω) − ‖p‖₂²   (18)

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function, d(p, q) = ‖p − q‖₂², is symmetric. Good (1971) proposed the pseudospherical score

PseudoS(p, ω) = p(ω)^{α−1} / ‖p‖_α^{α−1},

which reduces to the spherical score when α = 2. He described original and generalized versions of the score, a distinction that is obsolete in a measure-theoretic framework. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function, G(p) = ‖p‖_α, and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score

LogS(p, ω) = log p(ω)   (19)

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been used widely since then, under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by µ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value


should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score LinS(p, ω) = p(ω) is not a proper scoring rule, despite its intuitive appeal. For instance, let ϕ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < (log 2)^{1/2}, then

LinS(u, ϕ) = (1/(2π)^{1/2}) (1/(2ε)) ∫_{−ε}^{ε} e^{−x²/2} dx > 1/(2π^{1/2}) = LinS(ϕ, ϕ),

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near, but not identical to, the one materializing.


To address this situation, let P consist of the Borel probability measures on ℝ. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and we use standard notation for the elements of the sample space ℝ. The continuous ranked probability score (CRPS) is defined as

CRPS(F, x) = −∫_{−∞}^{∞} (F(y) − 1{y ≥ x})² dy   (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

CRPS(F, x) = (1/2) E_F|X − X′| − E_F|X − x|,   (21)

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean µ and variance σ², then it follows that

CRPS(N(µ, σ²), x) = σ [ 1/√π − 2ϕ((x − µ)/σ) − ((x − µ)/σ)(2Φ((x − µ)/σ) − 1) ],

where ϕ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics, in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
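Both the Gaussian closed form and the order-statistics evaluation of (21) are easy to code. The following is a minimal sketch under the definitions above (positive orientation); the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    """Closed-form CRPS of a Gaussian predictive distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def crps_sample(sample, x):
    """CRPS of an empirical predictive distribution via the kernel
    representation (21), evaluated exactly from the order statistics."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = s.size
    # E_F |X - x| for the empirical distribution
    e_abs_x = np.mean(np.abs(s - x))
    # E_F |X - X'|: sum_{i<j} (s_j - s_i) = sum_i (2i - n - 1) s_i, i = 1..n
    i = np.arange(1, n + 1)
    e_abs_pair = 2.0 * np.sum((2 * i - n - 1) * s) / (n * n)
    return 0.5 * e_abs_pair - e_abs_x
```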

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = −∫_{−∞}^{∞} F(y)(1 − F(y)) dy = −(1/2) E_F|X − X′|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F, G) = ∫_{−∞}^{∞} (F(y) − G(y))² dy,

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

CRPS*(F, x) = E_F|X − x| − (1/2) E_F|X − X′|,

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on ℝ^m that are such that E_P‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score

ES(P, x) = (1/2) E_P‖X − X′‖^β − E_P‖X − x‖^β,   (22)

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in ℝ^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

ES(P, x) = −‖µ_P − x‖²,   (23)

where µ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.
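A Monte Carlo version of (22) for an ensemble forecast can be sketched as follows; this is our own illustration, and the name energy_score is hypothetical. Averaging over all ordered pairs, including the zero diagonal, slightly biases the first term for small ensembles.

```python
import numpy as np

def energy_score(ensemble, x, beta=1.0):
    """Monte Carlo estimate of the energy score (22), positive orientation.

    ensemble : array of shape (n, m), n forecast draws in R^m
    x        : array of shape (m,), the value that materializes
    beta     : index in (0, 2)
    """
    X = np.asarray(ensemble, dtype=float)
    n = X.shape[0]
    # E_P ||X - x||^beta
    term2 = np.mean(np.linalg.norm(X - x, axis=1) ** beta)
    # E_P ||X - X'||^beta over all ordered pairs of ensemble members
    diffs = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) ** beta
    term1 = diffs.sum() / (n * n)
    return 0.5 * term1 - term2
```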

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on ℝ^m, by defining

ES(P, x) = −(β 2^{β−2} Γ((m + β)/2)) / (π^{m/2} Γ(1 − β/2)) ∫_{ℝ^m} |φ_P(y) − e^{i⟨x,y⟩}|² / ‖y‖^{m+β} dy,   (24)

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question concerns proper scoring rules that apply to the Borel probability measures on ℝ^m and depend on the predictive distribution P only through its mean vector µ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P, x) = −log det Σ_P − (x − µ_P)′ Σ_P^{−1} (x − µ_P),   (25)

which is linked to the generalized entropy function

G(P) = −log det Σ_P − m

and to the divergence function

d(P, Q) = tr(Σ_P^{−1} Σ_Q) − log det(Σ_P^{−1} Σ_Q) + (µ_P − µ_Q)′ Σ_P^{−1} (µ_P − µ_Q) − m.


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on µ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x₁, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = Σ_{i=1}^n (x_i − µ_i)² + Σ_{i=1}^n σ_i²,

where µ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − µ_P)² − σ_P²,   (26)

where P has mean µ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P and if he or she wishes to maximize the expected score, then he or she will quote the point measure at µ_P, that is, a deterministic forecast, rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − µ_P)/σ_P)² − log σ_P²   (27)

in the case in which m = 1 and the observations are real-valued.
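The improperness of (26) can be verified in one line. If the forecaster's true belief is a distribution Q with mean µ_Q and variance σ_Q², then the expected score of quoting P is

```latex
\mathbb{E}_{X \sim Q}\, S(P, X)
  = -\,\mathbb{E}_Q (X - \mu_P)^2 - \sigma_P^2
  = -\bigl(\sigma_Q^2 + (\mu_Q - \mu_P)^2\bigr) - \sigma_P^2 ,
```

which is maximized by setting µ_P = µ_Q and σ_P = 0, that is, by the point measure at µ_Q rather than by Q itself.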

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a₁, …, a_n ∈ ℝ that sum to 0, and all x₁, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P g(X, X′) − E_P g(X, x)   (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q, respectively. We need to show that

−(1/2) E_Q g(Y, Y′) ≥ (1/2) E_P g(X, X′) − E_{P,Q} g(X, Y).   (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.
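For a discrete P with a moderate number of atoms, (28) is a short computation. The following sketch is ours; kernel_score is a hypothetical helper, and the commented call recovers the CRPS of Example 8 below.

```python
import numpy as np

def kernel_score(points, weights, x, g):
    """Kernel score (28) for a discrete predictive distribution
    P = sum_k weights[k] * delta_{points[k]}, with negative definite kernel g.

    S(P, x) = (1/2) E_P g(X, X') - E_P g(X, x)
    """
    points = list(points)
    w = np.asarray(weights, dtype=float)
    gXX = np.array([[g(a, b) for b in points] for a in points])
    term1 = 0.5 * w @ gXX @ w                         # (1/2) E_P g(X, X')
    term2 = w @ np.array([g(a, x) for a in points])   # E_P g(X, x)
    return term1 - term2

# Example 8: g(x, x') = |x - x'| recovers the CRPS (21) for a discrete forecast:
# kernel_score([0.0, 1.0, 2.0], [0.3, 0.4, 0.3], 1.5, lambda a, b: abs(a - b))
```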

Example 7 (Quadratic or Brier score). Let Ω = {0, 1}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = ℝ and g(x, x′) = |x − x′| for x, x′ ∈ ℝ in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = ℝ^m, β ∈ (0, 2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ ℝ^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = 𝕊 denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ 𝕊. Let P be a Borel probability measure on 𝕊, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus

S(P, θ) = (1/2) E_P α(Θ, Θ′) − E_P α(Θ, θ)   (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and they used Fourier analytic tools to prove the propriety of the score.

We turn to a far-reaching generalization of the energy score. For x = (x₁, …, x_m) ∈ ℝ^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (Σ_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β,   x, x′ ∈ ℝ^m,

is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2) E_P‖X − X′‖_α^β − E_P‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on ℝ^m for which the expectation E_P‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q}‖X − Y‖_α^β is finite if and only if E_P‖X‖_α^β and E_Q‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P‖X − X′‖_α^β is finite if and only if E_P‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^(k) of all orders and (−1)^k η^(k)(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on ℝ^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P ψ(‖X − X′‖₂²) − E_P ψ(‖X − x‖₂²)

is strictly proper relative to the class of the Borel probability measures P on ℝ^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on ℝ^m for which E_P‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2 E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0.   (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′).   (32)

In particular, if Ω = ℝ^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2 E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0   (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β,   (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′),   x, x′ ∈ Ω,   (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a₁, …, a_n ∈ ℝ, and all x₁, …, x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β,   x, x′ ∈ ℝ^m,   (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) equals the complex conjugate of h(x′, x) for x, x′ ∈ Ω, and Σ_{i=1}^n Σ_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c₁, …, c_n ∈ ℂ, and all x₁, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′)   (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = ℝ^m and y ∈ ℝ^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′, y⟩} − 1, where x, x′ ∈ ℝ^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x,y⟩}|²,   (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ ℝ^m. If we integrate with respect to a nonnegative measure µ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{ℝ^m} |φ_P(y) − e^{i⟨x,y⟩}|² µ(dy).   (39)

If the measure µ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on ℝ^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If µ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α₁, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r₁, …, r_k and x materializes, then he or she will be rewarded by the score S(r₁, …, r_k; x). We define

S(r₁, …, r_k; P) = ∫ S(r₁, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r₁, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on ℝ that have finite moments of all orders and whose distribution function is strictly increasing on ℝ. For P ∈ P, let q₁, …, q_k denote the true P-quantiles at levels α₁, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q₁, …, q_k; P) ≥ S(r₁, …, r_k; P)

for all real numbers r₁, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r, x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x)   (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q, P) − S(r, P) = ∫_{(r,q)} s(x) dP(x) + s(r)P(r) − α s(r)
  ≥ s(r)(P(q) − P(r)) + s(r)P(r) − α s(r) = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r, x) = (x − r)(1{x ≤ r} − α),   (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
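In code, (41) is a one-liner; the sketch below is ours, with quantile_score a hypothetical name. Its negative is the familiar pinball loss of quantile regression.

```python
def quantile_score(r, x, alpha):
    """Proper scoring rule (41) for the alpha-quantile, positive orientation:
    S(r, x) = (x - r) * (1{x <= r} - alpha).  Always <= 0; its negative is
    the tick or check (pinball) loss."""
    return (x - r) * ((x <= r) - alpha)
```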

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

S(r₁, …, r_k; x) = Σ_{i=1}^k [ α_i s_i(r_i) + (s_i(x) − s_i(r_i)) 1{x ≤ r_i} ] + h(x)   (42)

is proper for predicting the quantiles at levels α₁, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles.

Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α₁ = α/2, α₂ = 1 − α/2, s₁(x) = s₂(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score,

S_α^int(l, u; x) = (u − l) + (2/α)(l − x) 1{x < l} + (2/α)(x − u) 1{x > u}.   (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
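A direct implementation of (43) may be useful; this is our own sketch, with interval_score a hypothetical name.

```python
import numpy as np

def interval_score(l, u, x, alpha):
    """Negatively oriented interval score (43) for the central
    (1 - alpha) * 100% prediction interval [l, u]; smaller is better."""
    l, u, x = np.asarray(l), np.asarray(u), np.asarray(x)
    return ((u - l)
            + (2.0 / alpha) * (l - x) * (x < l)
            + (2.0 / alpha) * (x - u) * (x > u))
```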


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process {X_t : t ∈ ℤ} defined by

X_{t+1} = (1/2) X_t + (1/2) X_t ε_t + ε_t,   (44)

where the ε_t are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2)X_t and variance (1 + (1/2)X_t)², thereby suggesting the prediction interval

I = [ (1/2)X_t − c |1 + (1/2)X_t|,  (1/2)X_t + c |1 + (1/2)X_t| ],   (45)

where c = Φ^{−1}(.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = [ F^{−1}(.025), F^{−1}(.975) ],   (46)

where F denotes the unconditional stationary distribution function of X_t, and

K = [ (1/2)X_t − γ(|1 + (1/2)X_t|),  (1/2)X_t + γ(|1 + (1/2)X_t|) ],   (47)

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path {X_t : t = 1, …, 100,001} from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              .9501                4.00            4.77
J (46)              .9508                5.45            8.04
K (47)              .9498                3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
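The experiment can be replicated along the following lines. This is our own self-contained sketch, not the authors' code; it simulates (44) and scores only the conditional interval I of (45), with the interval score of (43) inlined.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T = 100_000
c = norm.ppf(0.975)

# Simulate the bilinear process (44): X_{t+1} = X_t/2 + X_t*eps_t/2 + eps_t.
x = np.zeros(T + 1)
for t in range(T):
    eps = rng.standard_normal()
    x[t + 1] = 0.5 * x[t] + 0.5 * x[t] * eps + eps

# One-step-ahead conditional law is N(x_t/2, (1 + x_t/2)^2), giving I of (45).
center, scale = 0.5 * x[:-1], np.abs(1.0 + 0.5 * x[:-1])
l, u = center - c * scale, center + c * scale
y, alpha = x[1:], 0.05

coverage = np.mean((y >= l) & (y <= u))
width = np.mean(u - l)
# Negatively oriented interval score (43), averaged over the forecasts.
score = np.mean((u - l) + (2 / alpha) * (l - y) * (y < l)
                + (2 / alpha) * (y - u) * (y > u))
print(f"coverage {coverage:.4f}, width {width:.2f}, interval score {score:.2f}")
```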

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = ∫₀¹ S_α(F^{−1}(α), x) ν(dα)   (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on ℝ, then the scoring rule

S(F, x) = ∫_{−∞}^{∞} S(F(y), 1{x ≤ y}) ν(dy)   (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on ℝ^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

CRPS(F, x) = −∫_{ℝ^m} (F(y) − 1{x ≤ y})² ν(dy).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and we propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X₁, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models H₁ and H₂. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H₁ against H₂ is

B = P(X|H₁) / P(X|H₂),   (50)

where P(X|H_k) = ∏_{i=1}^n P(X_i|H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

log B = LogS(H₁, X) − LogS(H₂, X).   (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models; it also applies to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X|H_k) is the integrated likelihood,

P(X|H_k) = ∫ p(X|θ_k, H_k) p(θ_k|H_k) dθ_k,

where p(X|θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k|H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X|H_k) = ∏_{t=1}^n P(X_t|X^{t−1}, H_k),   (52)

where X^{t−1} = (X₁, …, X_{t−1}) if t > 1, X⁰ is the empty set, and P(X_t|X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t|X^{t−1}, H_k) = ∫ p(X_t|θ_k, H_k) P(θ_k|X^{t−1}, H_k) dθ_k,

with P(θ_k|X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let S_k^B = log P(X|H_k) denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

S_k^B = Σ_{t=1}^n log P(X_t|X^{t−1}, H_k).   (53)

Dawid (1984) showed that S_k^B is asymptotically equivalent to the plug-in maximum likelihood prequential score,

S_k^D = Σ_{t=1}^n log P(X_t|X^{t−1}, θ̂_k^{t−1}),   (54)

where θ̂_k^{t−1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t−1}, in the sense that S_k^D/S_k^B → 1 as n → ∞. Initial terms for which θ̂_k^{t−1} is possibly undefined can be ignored. Dawid also showed that S_k^B is asymptotically equivalent to the Bayes information criterion (BIC) score,

S_k^BIC = Σ_{t=1}^n log P(X_t|X^{t−1}, θ̂_k^n) − (d_k/2) log n,

where d_k = dim(θ_k), in the same sense, namely S_k^BIC/S_k^B → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S_k^{*B} = Σ_{t=1}^n E_D[ log p(X_t|X(D), H_k) ],   (55)

where D is a random sample from {1, …, t − 1, t + 1, …, n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, …, n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version,

S_k^{*D} = Σ_{t=1}^n E_D[ log p(X_t|X(D), θ̂_k^{(D)}, H_k) ],   (56)

where θ̂_k^{(D)} is the MLE of θ_k based on X(D). Terms for which the size of D is small and θ̂_k^{(D)} is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating S_k^{*B} or S_k^{*D} by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out, 10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
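As a sketch of how (56) might be evaluated by Monte Carlo, consider the toy model H: X_i ~ N(θ, 1) with the MLE as plug-in. The model choice, the function name random_fold_cv_score, and the restriction to nonempty folds are our assumptions, not part of the article.

```python
import numpy as np
from scipy.stats import norm

def random_fold_cv_score(x, n_draws=200, rng=None):
    """Monte Carlo approximation to the plug-in score (56) for the toy
    Gaussian-location model X_i ~ N(theta, 1), theta estimated by the mean.

    For each t, draw training sets D of random size from the remaining
    indices and score x_t by the log predictive density under the MLE.
    Empty folds, for which the MLE is undefined, are excluded."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    n = x.size
    total, count = 0.0, 0
    for t in range(n):
        rest = np.delete(np.arange(n), t)
        for _ in range(n_draws):
            size = rng.integers(1, n)          # random fold size in {1, ..., n-1}
            D = rng.choice(rest, size=size, replace=False)
            theta_hat = x[D].mean()            # MLE from X(D)
            total += norm.logpdf(x[t], loc=theta_hat, scale=1.0)
            count += 1
    return total / count * n                   # approximates the sum over t
```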

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_k^B, S_k^D, S_k^BIC, S_k^{*B}, and S_k^{*D}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r₀ = 1.55.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say ϕ_{µ,rσ}; its mean, µ, is the ensemble mean forecast, and its standard deviation, rσ, is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score,

s(r) = (1/16,015) Σ_{i=1}^{16,015} S(ϕ_{µ_i, rσ_i}, x_i),   r > 0,   (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio, r₀ = 1.55, of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5) and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

QS(p, x) = 2p(x) − ∫_{−∞}^{∞} p(y)² dy,

SphS(p, x) = p(x) / ( ∫_{−∞}^{∞} p(y)² dy )^{1/2},

LogS(p, x) = log p(x),

CRPS(p, x) = (1/2) E_p|X − X′| − E_p|X − x|,

LinS(p, x) = p(x),

and

PS(p, x) = ∫_{x−1}^{x+1} p(y) dy.

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed, so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      Argmax_r s(r) in eq. (57)   Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                        40s + 6
Spherical score (SphS)     1.84                        108s − 22
Logarithmic score (LogS)   2.41                        s + 13
CRPS                       1.62                        10s + 8
Linear score (LinS)        .05                         105s − 5
Probability score (PS)     .02                         60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

374 Journal of the American Statistical Association March 2007

Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at r = .05 and r = .02, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but it involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions, unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = .50 and α = .10. The associated lower and upper prediction bounds, l_i and u_i, are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean µ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_α(r) = (1/16,015) Σ_{i=1}^{16,015} S_α^int(l_i, u_i; x_i),   r > 0,   (58)

where S_α^int denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = .50 and α = .10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = .50 and α = .10, the score was optimized at r = 1.56 and r = 1.72.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed) and the 90% central prediction interval (α = .10, solid; score scaled by a factor of .60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X₁, …, X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by


the mean score

S_n(θ) = (1/n) Σ_{i=1}^n S(P_θ, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ₀ denotes the true parameter value, then asymptotic arguments indicate that

arg max_θ S_n(θ) → θ₀ as n → ∞.   (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take θ̂_n = arg max_θ S_n(θ) as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaptation of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and they proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I; θ) = cλ(I) − 1{θ ∈ I},   (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and we propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_α(I; θ) = λ(I) + (2/α) inf_{η∈I} |θ − η|   (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and, in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on ℝ^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ ℝ^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ ℝ^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
——— (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
——— (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
——— (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics—Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
——— (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
——— (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
——— (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer Mortality in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and 'the Second Principle of Geostatistics,'" in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
——— (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
——— (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
——— (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
——— (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
——— (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
——— (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.


Table 1. Proper Scoring Rules for Probability Forecasts of a Dichotomous Event and the Respective Mixing Measure or Lebesgue Density in the Schervish Representation (14)

Scoring rule   S(p, 1)                         S(p, 0)                               ν(dc)
Brier          −(1 − p)²                       −p²                                   uniform
Spherical      p(1 − 2p + 2p²)^{−1/2}          (1 − p)(1 − 2p + 2p²)^{−1/2}          (1 − 2c + 2c²)^{−3/2}
Logarithmic    log p                           log(1 − p)                            (c(1 − c))^{−1}
Zero–one       (1 − c) 1{p > c}                c 1{p ≤ c}                            point measure in c

S(p, 0) = −∫_0^p c^α (1 − c)^{β−1} dc,

which is of the form (14) for a mixing measure ν(dc) with Lebesgue density c^{α−1}(1 − c)^{β−1}. This family includes the logarithmic score (α = β = 0) and versions of the Brier score (α = β = 1) and the zero–one score (15) with c = 1/2 (α = β → ∞) as special or limiting cases. Asymmetric members arise when α ≠ β, with the scoring rule S(p, 1) = p − 1 and S(p, 0) = p + log(1 − p) being one such example (α = 0, β = 1).

Winkler (1994) proposed a method for constructing asymmetric scoring rules from symmetric scoring rules. Specifically, if S is a symmetric proper scoring rule and c ∈ (0, 1), then

S*(p, 1) = (S(p, 1) − S(c, 1)) / T(c, p),
S*(p, 0) = (S(p, 0) − S(c, 0)) / T(c, p),    (16)

where T(c, p) = S(0, 0) − S(c, 0) if p ≤ c and T(c, p) = S(1, 1) − S(c, 1) if p > c, is also a proper scoring rule, standardized in the sense that the expected score function attains a minimum value of 0 at p = c and a maximum value of 1 at p = 0 and p = 1.

Example 6 (Winkler's score). Tetlock (2005) explored what constitutes good judgment in predicting future political and economic events and looked at why experts are often wrong in their forecasts. In evaluating experts' predictions, he adjusted for the difficulty of the forecast task by using the special case of (16) that derives from the Brier score, that is,

S*(p, 1) = ((1 − c)² − (1 − p)²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),    (17)
S*(p, 0) = (c² − p²) / (c² 1{p ≤ c} + (1 − c)² 1{p > c}),

with the value of c ∈ (0, 1) adapted to reflect a baseline probability. This was suggested by Winkler (1994, 1996) as an alternative to using skill scores.
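To make the standardization concrete, here is a minimal Python sketch of Winkler's score (17); the function name winkler_score and the baseline default are our own choices for illustration.

```python
def winkler_score(p, x, c=0.5):
    """Winkler's (1994) standardized Brier score (17), positively oriented.

    p : forecast probability in (0, 1)
    x : observed event, 0 or 1
    c : baseline probability in (0, 1)
    """
    denom = c**2 if p <= c else (1.0 - c)**2
    if x == 1:
        return ((1.0 - c)**2 - (1.0 - p)**2) / denom
    return (c**2 - p**2) / denom

# The score is 0 when the forecaster quotes the baseline p = c, and 1 for
# a confident correct forecast, e.g., p = 1 when x = 1.
print(winkler_score(0.9, 1, c=0.2))
```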

Figure 2 shows the expected score or generalized entropy function G(p) and the scoring functions S(p, 1) and S(p, 0) for the quadratic or Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = 0.6, and Winkler's standardized score (17) with c = 0.2.

4. SCORING RULES FOR CONTINUOUS VARIABLES

Bremnes (2004, p. 346) noted that the literature on scoring rules for probabilistic forecasts of continuous variables is sparse. We address this issue in the following.

4.1 Scoring Rules for Density Forecasts

Let μ be a σ-finite measure on the measurable space (Ω, A). For α > 1, let L_α denote the class of probability measures on (Ω, A) that are absolutely continuous with respect to μ and have μ-density p such that

‖p‖_α = (∫ p(ω)^α μ(dω))^{1/α}

is finite. We identify a probabilistic forecast P ∈ L_α with its μ-density p and call p a predictive density or density forecast. Predictive densities are defined only up to a set of μ-measure zero. Whenever appropriate, we follow Bernardo (1979, p. 689) and use the unique version defined by p(ω) = lim_{ρ→0} P(S_ρ(ω))/μ(S_ρ(ω)), where S_ρ(ω) is a sphere of radius ρ centered at ω.

We begin by discussing scoring rules that correspond to Examples 1, 2, and 3. The quadratic score

QS(p, ω) = 2p(ω) − ‖p‖₂²    (18)

is strictly proper relative to the class L₂. It has expected score or generalized entropy function G(p) = ‖p‖₂², and the associated divergence function d(p, q) = ‖p − q‖₂² is symmetric. Good (1971) proposed the pseudospherical score

PseudoS(p, ω) = p(ω)^{α−1} / ‖p‖_α^{α−1},

which reduces to the spherical score when α = 2. He described original and generalized versions of the score—a distinction that is obsolete in a measure-theoretic framework. The pseudospherical score is strictly proper relative to the class L_α. The strict convexity of the associated entropy function G(p) = ‖p‖_α and the nonnegativity of the divergence function are straightforward consequences of the Hölder and Minkowski inequalities.

The logarithmic score,

LogS(p, ω) = log p(ω),    (19)

emerges as a limiting case (α → 1) of the pseudospherical score when suitably scaled. This scoring rule was proposed by Good (1952) and has been widely used since then, under various names, including the predictive deviance (Knorr-Held and Rainer 2001) and the ignorance score (Roulston and Smith 2002). The logarithmic score is strictly proper relative to the class L₁ of the probability measures dominated by μ. The associated expected score function or information measure is negative Shannon entropy, and the divergence function becomes the classical Kullback–Leibler divergence.
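As a quick numerical illustration, the following Python sketch (our own, assuming numpy and scipy are available) evaluates the quadratic score (18), the pseudospherical score, and the logarithmic score (19) for a Gaussian density forecast; the norm ‖p‖_α is approximated by a simple Riemann sum.

```python
import numpy as np
from scipy.stats import norm

def density_scores(mu, sigma, x, alpha=3.0):
    """Quadratic (18), pseudospherical, and logarithmic (19) scores for
    the Gaussian density forecast N(mu, sigma^2) at observation x."""
    p_x = norm.pdf(x, mu, sigma)
    grid = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 100001)
    dy = grid[1] - grid[0]
    pdf = norm.pdf(grid, mu, sigma)
    l2_sq = np.sum(pdf**2) * dy                       # ||p||_2^2
    l_alpha = (np.sum(pdf**alpha) * dy) ** (1.0 / alpha)
    quadratic = 2.0 * p_x - l2_sq                     # (18)
    pseudospherical = (p_x / l_alpha) ** (alpha - 1.0)
    logarithmic = np.log(p_x)                         # (19)
    return quadratic, pseudospherical, logarithmic

print(density_scores(0.0, 1.0, 0.5))
```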

[Figure 2. The expected score or generalized entropy function G(p) (top row) and the scoring functions S(p, 1) (—) and S(p, 0) (- - -) (bottom row) for the Brier score and the logarithmic score (Table 1), the asymmetric zero–one score (15) with c = 0.6, and Winkler's standardized score (17) with c = 0.2.]

Bernardo (1979, p. 689) argued that "when assessing the worthiness of a scientist's final conclusions, only the probability he attaches to a small interval containing the true value should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules S(p, ω) that depend on the predictive density p only through its value at the event ω that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score, in the sense of (2). Consequently, the linear score LinS(p, ω) = p(ω) is not a proper scoring rule, despite its intuitive appeal. For instance, let ϕ and u denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on (−ε, ε). If ε < √(log 2), then

LinS(u, ϕ) = (1/(2π)^{1/2}) (1/(2ε)) ∫_{−ε}^{ε} e^{−x²/2} dx > 1/(2π^{1/2}) = LinS(ϕ, ϕ),

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed, real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast—a member of the class P—with its cumulative distribution function F, and we use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

CRPS(F, x) = −∫_{−∞}^{∞} (F(y) − 1{y ≥ x})² dy    (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

CRPS(F, x) = (1/2) E_F|X − X′| − E_F|X − x|,    (21)

where X and X′ are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean μ and variance σ², then it follows that

CRPS(N(μ, σ²), x) = σ [1/√π − 2ϕ((x − μ)/σ) − ((x − μ)/σ)(2Φ((x − μ)/σ) − 1)],

where ϕ and Φ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
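The closed forms above translate directly into code. Below is a small Python sketch, assuming numpy and scipy are available, with a Gaussian version based on the displayed formula and a sample-based version built from the kernel representation (21); the function names are our own.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    """CRPS (positively oriented) for the forecast N(mu, sigma^2) at observation x."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def crps_sample(sample, x):
    """CRPS via the kernel representation (21); E|X - X'| is computed
    from the sorted sample in O(n log n) operations."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = s.size
    i = np.arange(1, n + 1)
    term1 = 2.0 * np.sum((2 * i - n - 1) * s) / n**2   # E_F |X - X'|
    term2 = np.mean(np.abs(s - x))                     # E_F |X - x|
    return 0.5 * term1 - term2

rng = np.random.default_rng(1)
ens = rng.normal(0.0, 1.0, 1000)
print(crps_gaussian(0.0, 1.0, 0.3), crps_sample(ens, 0.3))  # nearly equal
```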

The CRPS is proper relative to the class P and strictly proper relative to the subclass P₁ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = −∫_{−∞}^{∞} F(y)(1 − F(y)) dy = −(1/2) E_F|X − X′|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F, G) = ∫_{−∞}^{∞} (F(y) − G(y))² dy,

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

CRPS*(F, x) = E_F|X − x| − (1/2) E_F|X − X′|,

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast—that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P‖X‖^β is finite, where ‖·‖ denotes the Euclidean norm. We define the energy score

ES(P, x) = (1/2) E_P‖X − X′‖^β − E_P‖X − x‖^β,    (22)

where X and X′ are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error

ES(P, x) = −‖μ_P − x‖²,    (23)

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P₂.
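For a predictive distribution given by a sample or forecast ensemble in R^m, the energy score (22) can be estimated by replacing the two expectations with empirical means. A Python sketch under that assumption (names ours):

```python
import numpy as np

def energy_score(ens, x, beta=1.0):
    """Energy score (22), positively oriented, for an ensemble forecast.

    ens  : (n, m) array of n ensemble members in R^m
    x    : (m,) array, the realized vector
    beta : index in (0, 2)
    """
    ens = np.atleast_2d(ens)
    # E_P ||X - X'||^beta over all pairs of ensemble members
    diffs = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=-1)
    term1 = np.mean(diffs**beta)
    # E_P ||X - x||^beta
    term2 = np.mean(np.linalg.norm(ens - x, axis=-1) ** beta)
    return 0.5 * term1 - term2

rng = np.random.default_rng(0)
ens = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=500)
print(energy_score(ens, np.array([0.5, -0.2])))
```

For m = 1 and β = 1 this agrees with the sample-based CRPS sketch given earlier, as (22) then reduces to (21).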

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m by defining

ES(P, x) = − (β 2^{β−2} Γ((m + β)/2) / (π^{m/2} Γ(1 − β/2))) ∫_{R^m} (|φ_P(y) − e^{i⟨x,y⟩}|² / ‖y‖^{m+β}) dy,    (24)

where φ_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P, x) = −log det Σ_P − (x − μ_P)′ Σ_P^{−1} (x − μ_P),    (25)

which is linked to the generalized entropy function

G(P) = −log det Σ_P − m

and to the divergence function

d(P, Q) = tr(Σ_P^{−1} Σ_Q) − log det(Σ_P^{−1} Σ_Q) + (μ_P − μ_Q)′ Σ_P^{−1} (μ_P − μ_Q) − m.

[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P₂ of the Borel probability measures P for which E_P‖X‖² is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x_1, …, x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

PMCC = Σ_{i=1}^n (x_i − μ_i)² + Σ_{i=1}^n σ_i²,

where μ_i and σ_i² denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P, x) = −(x − μ_P)² − σ_P²,    (26)

where P has mean μ_P and variance σ_P². The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P—that is, a deterministic forecast—rather than the predictive distribution P. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to

S(P, x) = −((x − μ_P)/σ_P)² − log σ_P²    (27)

in the case in which m = 1 and the observations are real-valued.
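A short simulation makes the impropriety of (26) tangible: with observations drawn from the true P = N(0, 1), quoting a near-point forecast at μ_P beats quoting P itself under (26), whereas the score (27) rewards the truthful forecast. This is a minimal sketch of ours, not part of the paper's case studies.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100_000)   # observations from the true P = N(0, 1)

def score_26(x, mu, var):           # PMCC-type score (26)
    return -(x - mu)**2 - var

def score_27(x, mu, var):           # score (27), the m = 1 case of (25)
    return -((x - mu)**2 / var) - np.log(var)

# Truthful forecast N(0, 1) versus a near-point forecast with variance 0.01
for name, s in [("(26)", score_26), ("(27)", score_27)]:
    truth = np.mean(s(x, 0.0, 1.0))
    point = np.mean(s(x, 0.0, 0.01))
    print(name, "truth:", round(truth, 2), "near-point:", round(point, 2))
# Under (26) the near-point forecast scores higher; under (27) the truth wins.
```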

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j g(x_i, x_j) ≤ 0 for all positive integers n, all a_1, …, a_n ∈ R that sum to 0, and all x_1, …, x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X′ be independent random variables with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P g(X, X′) − E_P g(X, x)    (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X′) is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X′ and Y, Y′ are independent random variates with distribution P and Q, respectively. We need to show that

−(1/2) E_Q g(Y, Y′) ≥ (1/2) E_P g(X, X′) − E_{P,Q} g(X, Y).    (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.
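For such a discrete P, (28) is a finite double sum. A generic Python sketch (the helper name kernel_score is ours):

```python
import numpy as np

def kernel_score(atoms, probs, x, g):
    """Kernel score (28) for a discrete forecast P = sum_i probs[i] * delta(atoms[i]).

    g : negative definite kernel, called as g(a, b)
    """
    atoms = list(atoms)
    probs = np.asarray(probs, dtype=float)
    n = len(atoms)
    term1 = sum(probs[i] * probs[j] * g(atoms[i], atoms[j])
                for i in range(n) for j in range(n))   # E_P g(X, X')
    term2 = sum(probs[i] * g(atoms[i], x) for i in range(n))  # E_P g(X, x)
    return 0.5 * term1 - term2

# With g(a, b) = |a - b|, this is the CRPS (21) of the discrete forecast.
print(kernel_score([0.0, 1.0, 2.0], [0.2, 0.5, 0.3], 1.4, lambda a, b: abs(a - b)))
```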

Example 7 (Quadratic or Brier score). Let Ω = {0, 1}, and suppose that g(0, 0) = g(1, 1) = 0 and g(0, 1) = g(1, 0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = R and g(x, x′) = |x − x′| for x, x′ ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0, 2), and g(x, x′) = ‖x − x′‖^β for x, x′ ∈ R^m, where ‖·‖ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ′) for the angular distance between two points θ, θ′ ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ′ be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus

S(P, θ) = (1/2) E_P α(Θ, Θ′) − E_P α(Θ, θ)    (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and they used Fourier analytic tools to prove the propriety of the score.
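The circular score (30) fits the same template: with angles in radians, the angular distance is α(θ, θ′) = min(|θ − θ′|, 2π − |θ − θ′|). A small ensemble-based Python sketch of ours:

```python
import numpy as np

def angular_distance(t1, t2):
    d = np.abs(t1 - t2) % (2.0 * np.pi)
    return np.minimum(d, 2.0 * np.pi - d)

def circular_crps(ens, theta):
    """Score (30), positively oriented, for an ensemble of angles in radians."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(angular_distance(ens[:, None], ens[None, :]))
    term2 = np.mean(angular_distance(ens, theta))
    return 0.5 * term1 - term2

rng = np.random.default_rng(3)
print(circular_crps(rng.vonmises(0.0, 4.0, 200), 0.4))
```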

We turn to a far-reaching generalization of the energy score. For x = (x_1, …, x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ‖x‖_α = (Σ_{i=1}^m |x_i|^α)^{1/α} if α ∈ (0, ∞) and ‖x‖_∞ = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x′) = ‖x − x′‖_α^β, x, x′ ∈ R^m,

is negative definite if and only if the following holds.

Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P, x) = (1/2) E_P‖X − X′‖_α^β − E_P‖X − x‖_α^β

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P‖X − X′‖_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q}‖X − Y‖_α^β is finite if and only if E_P‖X‖_α^β and E_Q‖Y‖_α^β are finite. In particular, if α ≥ 1, then E_P‖X − X′‖_α^β is finite if and only if E_P‖X‖_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^(k) of all orders and (−1)^k η^(k)(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ′ completely monotone and not constant. For a Borel probability measure P on R^m, let X and X′ be independent random vectors with distribution P. Then the scoring rule

S(P, x) = (1/2) E_P ψ(‖X − X′‖₂²) − E_P ψ(‖X − x‖₂²)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(‖X − X′‖₂²) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P‖X‖₂^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X′) and E_Q g(Y, Y′) are finite, then (29) can be written as a Hoeffding-type inequality,

2E_{P,Q} g(X, Y) − E_P g(X, X′) − E_Q g(Y, Y′) ≥ 0.    (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, then a special case of (31) can be stated as

E_P g(X, −X′) ≥ E_P g(X, X′).    (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2E‖X − Y‖_α^β − E‖X − X′‖_α^β − E‖Y − Y′‖_α^β ≥ 0    (33)

and

E‖X − X′‖_α^β ≤ E‖X + X′‖_α^β,    (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case in which Ω is a group and g satisfies g(x, x′) = g(−x, −x′) for x, x′ ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x′) = g(x, −x′) − g(x, x′), x, x′ ∈ Ω,    (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all positive integers n, all a_1, …, a_n ∈ R, and all x_1, …, x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x′) = ‖x + x′‖_α^β − ‖x − x′‖_α^β, x, x′ ∈ R^m,    (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x′) = h̄(x′, x) for x, x′ ∈ Ω, and Σ_{i=1}^n Σ_{j=1}^n c_i c̄_j h(x_i, x_j) ≥ 0 for all positive integers n, all c_1, …, c_n ∈ C, and all x_1, …, x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P, x) = E_P h(X, x) + E_P h(x, X) − E_P h(X, X′)    (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x′) = e^{i⟨x−x′,y⟩} − 1, where x, x′ ∈ R^m. Then (37) reduces to

S(P, x) = −|φ_P(y) − e^{i⟨x,y⟩}|²,    (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, φ_P, and the characteristic function of the point measure in the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P, x) = −∫_{R^m} |φ_P(y) − e^{i⟨x,y⟩}|² μ(dy).    (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If μ is the infinite measure with Lebesgue density ‖y‖^{−m−β}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α_1, …, α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r_1, …, r_k and x materializes, then he or she will be rewarded by the score S(r_1, …, r_k; x). We define

S(r_1, …, r_k; P) = ∫ S(r_1, …, r_k; x) dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r_1, …, r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q_1, …, q_k denote the true P-quantiles at levels α_1, …, α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q_1, …, q_k; P) ≥ S(r_1, …, r_k; P)

for all real numbers r_1, …, r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r, x) = α s(r) + (s(x) − s(r)) 1{x ≤ r} + h(x)    (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q, P) − S(r, P) = ∫_{(r,q]} s(x) dP(x) + s(r)P(r) − α s(r)
                  ≥ s(r)(P(q) − P(r)) + s(r)P(r) − α s(r)
                  = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r, x) = (x − r)(1{x ≤ r} − α),    (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
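A one-line Python sketch of (41), positively oriented; negating it gives the familiar tick (check, or pinball) loss. The function name is ours.

```python
def quantile_score(r, x, alpha):
    """Proper score (41) for a predictive quantile r at level alpha."""
    return (x - r) * ((1.0 if x <= r else 0.0) - alpha)

# A 90th-percentile forecast of 5.0, with observation 6.2:
print(quantile_score(5.0, 6.2, 0.9))   # (6.2 - 5.0) * (0 - 0.9) = -1.08
```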

Corollary 1. If s_i is nondecreasing for i = 1, …, k and h is arbitrary, then the scoring rule

S(r_1, …, r_k; x) = Σ_{i=1}^k [α_i s_i(r_i) + (s_i(x) − s_i(r_i)) 1{x ≤ r_i}] + h(x)    (42)

is proper for predicting the quantiles at levels α_1, …, α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α_1 = α/2, α_2 = 1 − α/2, s_1(x) = s_2(x) = 2x/α, and h(x) = −2x/α in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score,

S_α^int(l, u; x) = (u − l) + (2/α)(l − x) 1{x < l} + (2/α)(x − u) 1{x > u}.    (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
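The interval score (43) in Python; it is negatively oriented, so smaller is better. The function name is ours.

```python
def interval_score(l, u, x, alpha=0.05):
    """Negatively oriented interval score (43) for the central
    (1 - alpha) * 100% prediction interval [l, u]."""
    score = u - l
    if x < l:
        score += (2.0 / alpha) * (l - x)
    elif x > u:
        score += (2.0 / alpha) * (x - u)
    return score

# A 95% interval [-1.96, 1.96]; a miss at x = 2.5 is penalized heavily.
print(interval_score(-1.96, 1.96, 0.3), interval_score(-1.96, 1.96, 2.5))
```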


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative and possibly simpler, more general, and more rigorous paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process (X_t : t ∈ Z) defined by

X_{t+1} = (1/2)X_t + (1/2)X_t ε_t + ε_t,    (44)

where the ε_t's are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, … is Gaussian with mean (1/2)X_t and variance (1 + (1/2)X_t)², thereby suggesting the prediction interval

I = [(1/2)X_t − c|1 + (1/2)X_t|, (1/2)X_t + c|1 + (1/2)X_t|],    (45)

where c = Φ^{−1}(0.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = [F^{−1}(0.025), F^{−1}(0.975)],    (46)

where F denotes the unconditional stationary distribution function of X_t, and

K = [(1/2)X_t − γ(|1 + (1/2)X_t|), (1/2)X_t + γ(|1 + (1/2)X_t|)],    (47)

where γ(y) = (2(log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path (X_t : t = 1, …, 100,001) from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, …, 100,000. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              95.01%               4.00            4.77
J (46)              95.08%               5.45            8.04
K (47)              94.98%               3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
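A compact Python sketch of ours that mirrors this experiment for the standard interval I; the seed is arbitrary, so the numbers will differ slightly from Table 2 in any given run.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 100_000
c = norm.ppf(0.975)

x = np.empty(n + 1)
x[0] = 0.0
eps = rng.normal(size=n)
for t in range(n):                        # bilinear process (44)
    x[t + 1] = 0.5 * x[t] + 0.5 * x[t] * eps[t] + eps[t]

sd = np.abs(1.0 + 0.5 * x[:-1])           # conditional st. dev. of X_{t+1}
lo, hi = 0.5 * x[:-1] - c * sd, 0.5 * x[:-1] + c * sd
obs = x[1:]

cover = np.mean((obs >= lo) & (obs <= hi))
width = np.mean(hi - lo)
score = np.mean((hi - lo)
                + (2.0 / 0.05) * (lo - obs) * (obs < lo)
                + (2.0 / 0.05) * (obs - hi) * (obs > hi))
print(f"I: coverage {cover:.4f}, width {width:.2f}, interval score {score:.2f}")
```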

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = ∫_0^1 S_α(F^{−1}(α), x) ν(dα)    (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on R, then the scoring rule

S(F, x) = ∫_{−∞}^{∞} S(F(y), 1{x ≤ y}) ν(dy)    (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on R^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

CRPS(F, x) = −∫_{R^m} (F(y) − 1{x ≤ y})² ν(dy).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.
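The threshold construction is easy to verify numerically: the sketch below integrates the Brier scores over a fine grid of thresholds for a Gaussian forecast, which should match the closed-form crps_gaussian value given earlier. The grid limits are arbitrary choices of ours.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0, 1.0, 0.3
y = np.linspace(-10, 10, 200001)                     # threshold grid
dy = y[1] - y[0]
brier = -(norm.cdf(y, mu, sigma) - (y >= x)) ** 2    # Brier score at each threshold
print(np.sum(brier) * dy)                            # integral (20); approx. crps_gaussian(0, 1, 0.3)
```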

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and we propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X_1, …, X_n) of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models, H_1 and H_2. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i's are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H_1 against H_2 is

B = P(X|H_1) / P(X|H_2),    (50)

where P(X|H_k) = Π_{i=1}^n P(X_i|H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

log B = LogS(H_1, X) − LogS(H_2, X).    (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters, estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters, θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X|H_k) is the integrated likelihood,

P(X|H_k) = ∫ p(X|θ_k, H_k) p(θ_k|H_k) dθ_k,

where p(X|θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k|H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X|H_k) = Π_{t=1}^n P(X_t|X^{t−1}, H_k),    (52)

where X^{t−1} = {X_1, …, X_{t−1}} if t ≥ 1, X^0 is the empty set, and P(X_t|X^{t−1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t|X^{t−1}, H_k) = ∫ p(X_t|θ_k, H_k) P(θ_k|X^{t−1}, H_k) dθ_k,

with P(θ_k|X^{t−1}, H_k) the posterior distribution of θ_k given the past observations X^{t−1}.

We let $S_{kB} = \log P(X \mid H_k)$ denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as
\[
S_{kB} = \sum_{t=1}^n \log P(X_t \mid X^{t-1}, H_k). \tag{53}
\]

Dawid (1984) showed that $S_{kB}$ is asymptotically equivalent to the plug-in maximum likelihood prequential score
\[
S_{kD} = \sum_{t=1}^n \log P\bigl(X_t \mid X^{t-1}, \hat{\theta}_k^{t-1}\bigr), \tag{54}
\]
where $\hat{\theta}_k^{t-1}$ is the maximum likelihood estimator (MLE) of $\theta_k$ based on the past observations $X^{t-1}$, in the sense that $S_{kD}/S_{kB} \to 1$ as $n \to \infty$. Initial terms for which $\hat{\theta}_k^{t-1}$ is possibly undefined can be ignored. Dawid also showed that $S_{kB}$ is asymptotically equivalent to the Bayes information criterion (BIC) score,
\[
S_{k\mathrm{BIC}} = \sum_{t=1}^n \log P\bigl(X_t \mid X^{t-1}, \hat{\theta}_k^{n}\bigr) - \frac{d_k}{2} \log n,
\]
where $d_k = \dim(\theta_k)$, in the same sense, namely $S_{k\mathrm{BIC}}/S_{kB} \to 1$ as $n \to \infty$. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.
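A minimal sketch of the plug-in prequential score (54) and the BIC score, under the illustrative assumption of a Gaussian model with unknown mean and known unit variance; the data are simulated, and the first term, where the MLE is undefined, is skipped as noted above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=200)   # hypothetical data
n, d_k = len(x), 1                             # one free parameter (the mean)

# Plug-in prequential score (54): running MLE of the mean, first term skipped
S_D = sum(stats.norm(np.mean(x[:t]), 1.0).logpdf(x[t]) for t in range(1, n))

# BIC score: full-sample MLE plug-in, penalized by (d_k / 2) log n
S_BIC = stats.norm(np.mean(x), 1.0).logpdf(x).sum() - 0.5 * d_k * np.log(n)

print(S_D, S_BIC)   # both asymptotically equivalent to the log-integrated likelihood
```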

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by
\[
S^{*}_{kB} = \sum_{t=1}^n \mathrm{E}_D \bigl[ \log p\bigl(X_t \mid X(D), H_k\bigr) \bigr], \tag{55}
\]
where $D$ is a random sample from $\{1, \ldots, t-1, t+1, \ldots, n\}$, the size of which is a random variable with a discrete uniform distribution on $\{0, 1, \ldots, n-1\}$. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

\[
S^{*}_{kD} = \sum_{t=1}^n \mathrm{E}_D \bigl[ \log p\bigl(X_t \mid X(D), \hat{\theta}_k^{(D)}, H_k\bigr) \bigr], \tag{56}
\]
where $\hat{\theta}_k^{(D)}$ is the MLE of $\theta_k$ based on $X(D)$. Terms for which the size of $D$ is small and $\hat{\theta}_k^{(D)}$ is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating $S^{*}_{kB}$ or $S^{*}_{kD}$ by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random, rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out,


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
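The following sketch indicates how the plug-in version (56) might be evaluated by Monte Carlo, again for a hypothetical Gaussian model with unknown mean and known unit variance. The number of replicates per term is an arbitrary choice of ours, and folds of size 0, for which the MLE is undefined, are skipped.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(1.0, 1.0, size=100)   # hypothetical, unordered data
n = len(x)

def random_fold_score(x, num_reps=50):
    """Monte Carlo sketch of (56): for each t, draw a random subset D of
    the remaining indices, with |D| uniform on {1, ..., n - 1}, fit the
    MLE on x[D], and score the predictive density at x[t]."""
    total = 0.0
    for t in range(n):
        rest = np.delete(np.arange(n), t)
        for _ in range(num_reps):
            size = rng.integers(1, n)                    # random fold size
            D = rng.choice(rest, size=size, replace=False)
            mu_hat = x[D].mean()                         # MLE on the fold
            total += stats.norm(mu_hat, 1.0).logpdf(x[t]) / num_reps
    return total

print(random_fold_score(x))
```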

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In $S_{kB}$, $S_{kD}$, $S_{k\mathrm{BIC}}$, $S^{*}_{kB}$, and $S^{*}_{kD}$, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of $r_0 = 1.55$.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to $r$ times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to $r$ as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say $\varphi_{\mu, r\sigma}$; its mean $\mu$ is the ensemble mean forecast, and its standard deviation $r\sigma$ is the product of the inflation factor $r$ and the standard deviation $\sigma$ of the five-member forecast ensemble. We considered various scoring rules S and computed the average score
\[
s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S\bigl(\varphi_{\mu_i, r\sigma_i}, x_i\bigr), \qquad r > 0, \tag{57}
\]

as a function of the inflation factor $r$. The index $i$ refers to the $i$th record in the verification database, and $x_i$ denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect $s(r)$ to be maximized at some $r > 1$, possibly near the observed ratio $r_0 = 1.55$ of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors $r \in (0, 5)$ and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if $p$ denotes the predictive density and $x$ denotes the observed value, then

\[
\mathrm{QS}(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2 \, dy,
\]
\[
\mathrm{SphS}(p, x) = \frac{p(x)}{\bigl( \int_{-\infty}^{\infty} p(y)^2 \, dy \bigr)^{1/2}},
\]
\[
\mathrm{LogS}(p, x) = \log p(x),
\]
\[
\mathrm{CRPS}(p, x) = \frac{1}{2} \mathrm{E}_p |X - X'| - \mathrm{E}_p |X - x|,
\]
\[
\mathrm{LinS}(p, x) = p(x),
\]
and
\[
\mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y) \, dy.
\]
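For a Gaussian predictive density $\varphi_{\mu, s}$, all six scores are available in closed form, using $\int \varphi_{\mu,s}(y)^2 \, dy = 1/(2s\sqrt{\pi})$ and the closed-form CRPS given in Section 4.2. The following sketch (our own helper, with hypothetical forecast values) evaluates them at a single forecast case:

```python
import numpy as np
from scipy import stats

def scores_gaussian(mu, s, x):
    """The six scores of this section, in positive orientation, for a
    Gaussian predictive density N(mu, s^2); int p(y)^2 dy = 1/(2 s sqrt(pi))."""
    p = stats.norm(mu, s)
    px = p.pdf(x)
    ip2 = 1.0 / (2.0 * s * np.sqrt(np.pi))
    z = (x - mu) / s
    crps = s * (1.0 / np.sqrt(np.pi) - 2.0 * stats.norm.pdf(z)
                - z * (2.0 * stats.norm.cdf(z) - 1.0))
    return {"QS":   2.0 * px - ip2,
            "SphS": px / np.sqrt(ip2),
            "LogS": np.log(px),
            "CRPS": crps,
            "LinS": px,
            "PS":   p.cdf(x + 1.0) - p.cdf(x - 1.0)}

# Hypothetical case: ensemble mean 1010 mb, inflated spread 2.5 mb,
# verifying observation 1012.3 mb
print(scores_gaussian(mu=1010.0, s=2.5, x=1012.3))
```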

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of $r > 1$, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      Argmax_r s(r) in eq. (57)    Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                         40s + 6
Spherical score (SphS)     1.84                         108s − 22
Logarithmic score (LogS)   2.41                         s + 13
CRPS                       1.62                         10s + 8
Linear score (LinS)        0.5                          105s − 5
Probability score (PS)     0.2                          60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast, with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast, with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at $r = 0.5$ and $r = 0.2$, respectively, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion discussed in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of $r$. The logarithmic score is strictly proper but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor $r$ is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central $(1 - \alpha) \times 100\%$ prediction interval, where $\alpha = .50$ and $\alpha = .10$. The associated lower and upper prediction bounds $l_i$ and $u_i$ are the $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles of a Gaussian distribution with mean $\mu_i$ and standard deviation $r\sigma_i$, as described earlier.

We assessed the interval forecasts in their dependence on the inflation factor $r$ in two ways: by computing the empirical coverage of the prediction intervals, and by computing
\[
s_\alpha(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S_\alpha^{\mathrm{int}}(l_i, u_i; x_i), \qquad r > 0, \tag{58}
\]
where $S_\alpha^{\mathrm{int}}$ denotes the negatively oriented interval score (43).

This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with $r$. For $\alpha = .50$ and $\alpha = .10$, the nominal coverage was obtained at $r = 1.78$ and $r = 2.11$, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor $r$. For $\alpha = .50$ and $\alpha = .10$, the score was optimized at $r = 1.56$ and $r = 1.72$.
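A sketch of the computation underlying (58), with hypothetical numbers; the helper implements the negatively oriented interval score (43), which charges the width of the interval plus $2/\alpha$ times the amount, if any, by which the observation misses it.

```python
import numpy as np
from scipy import stats

def interval_score(l, u, x, alpha):
    """Negatively oriented interval score (43): width plus penalties
    for observations below l or above u."""
    return ((u - l)
            + (2.0 / alpha) * np.maximum(l - x, 0.0)
            + (2.0 / alpha) * np.maximum(x - u, 0.0))

# Central (1 - alpha) x 100% interval from a Gaussian predictive
# distribution with mean mu and standard deviation r * sigma
# (all values hypothetical; cf. eq. (58))
mu, sigma, r, alpha, x = 1010.0, 2.0, 1.5, 0.10, 1014.5
l = stats.norm.ppf(alpha / 2.0, loc=mu, scale=r * sigma)
u = stats.norm.ppf(1.0 - alpha / 2.0, loc=mu, scale=r * sigma)
print(interval_score(l, u, x, alpha))
```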

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model $P_\theta$ based on a sample $X_1, \ldots, X_n$ of identically distributed observations. To estimate $\theta$, we can measure the goodness of fit by


Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. (a) Nominal and actual coverage and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed) and the 90% central prediction interval (α = .10, solid; score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast, with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

the mean score
\[
S_n(\theta) = \frac{1}{n} \sum_{i=1}^n S(P_\theta, X_i),
\]

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If $\theta_0$ denotes the true parameter value, then asymptotic arguments indicate that
\[
\arg\max_\theta S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \tag{59}
\]

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take $\hat{\theta}_n = \arg\max_\theta S_n(\theta)$ as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor $r$ based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a $\psi$-function of the form $\psi(x) = 2\Phi(x/c) - 1$, where $c$ is a positive constant and $\Phi$ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the $\psi$-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.
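The following sketch contrasts minimum CRPS estimation with maximum likelihood for a Gaussian model; the data, the starting values, and the use of a general-purpose optimizer are illustrative choices of ours.

```python
import numpy as np
from scipy import stats, optimize

def crps_gaussian(mu, sigma, x):
    """Closed-form CRPS, positively oriented, for N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * stats.norm.pdf(z)
                    - z * (2.0 * stats.norm.cdf(z) - 1.0))

rng = np.random.default_rng(7)
x = rng.normal(3.0, 2.0, size=500)          # hypothetical sample

def neg_mean_crps(theta):
    mu, log_sigma = theta                    # log-parameterize sigma > 0
    return -np.mean(crps_gaussian(mu, np.exp(log_sigma), x))

res = optimize.minimize(neg_mean_crps, x0=[0.0, 0.0])
print("optimum score (CRPS):", res.x[0], np.exp(res.x[1]))
print("maximum likelihood:  ", x.mean(), x.std())
```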

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox, due to J. O. Berger, which concerns interval estimators of the location parameter $\theta$ in a Gaussian population with unknown scale. Under the loss function
\[
L(I; \theta) = c\lambda(I) - \mathbf{1}\{\theta \in I\}, \tag{60}
\]
where $c$ is a positive constant and $\lambda(I)$ denotes the Lebesgue measure of the interval estimate $I$, the classical $t$-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as
\[
L_\alpha(I; \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta| \tag{61}
\]
and applies to interval estimates with upper and lower exceedance probability $\frac{\alpha}{2} \times 100\%$. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if $P$ is a Borel probability measure on $\mathbb{R}^m$, then a depth function $D(P, \mathbf{x})$ gives a $P$-based center-outward ordering of points $\mathbf{x} \in \mathbb{R}^m$. Formally, this resembles a scoring rule $S(P, \mathbf{x})$ that assigns a $P$-based numerical value to an event $\mathbf{x} \in \mathbb{R}^m$. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics–Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart, and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer Mortality in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.



Figure 2. The Expected Score or Generalized Entropy Function G(p) (top row) and the Scoring Functions S(p, 1) (solid) and S(p, 0) (dashed) (bottom row) for the Brier Score and Logarithmic Score (Table 1), the Asymmetric Zero–One Score (15) With c = .6, and Winkler's Standardized Score (17) With c = .2.

should be taken into account." This seems subject to debate, and atmospheric scientists have argued otherwise, putting forth scoring rules that are sensitive to distance (Epstein 1969; Staël von Holstein 1970). That said, Bernardo (1979) studied local scoring rules $S(p, \omega)$ that depend on the predictive density $p$ only through its value at the event $\omega$ that materializes. Assuming regularity conditions, he showed that every proper local scoring rule is equivalent to the logarithmic score in the sense of (2). Consequently, the linear score $\mathrm{LinS}(p, \omega) = p(\omega)$ is not a proper scoring rule, despite its intuitive appeal. For instance, let $\varphi$ and $u$ denote the Lebesgue densities of a standard Gaussian distribution and the uniform distribution on $(-\varepsilon, \varepsilon)$. If $\varepsilon < \sqrt{\log 2}$, then
\[
\mathrm{LinS}(u, \varphi) = \frac{1}{(2\pi)^{1/2}} \cdot \frac{1}{2\varepsilon} \int_{-\varepsilon}^{\varepsilon} e^{-x^2/2} \, dx > \frac{1}{2\pi^{1/2}} = \mathrm{LinS}(\varphi, \varphi),
\]

in violation of propriety. Essentially, the linear score encourages overprediction at the modes of an assessor's true predictive density (Winkler 1969). The probability score of Wilson, Burrows, and Lanzinger (1999) integrates the predictive density over a neighborhood of the observed real-valued quantity. This resembles the linear score, and it is not a proper score either. Dawid (2006) constructed proper scoring rules from improper ones; an interesting question is whether this can be done for the probability score, similar to the way in which the proper quadratic score (18) derives from the linear score.

If Lebesgue densities on the real line are used to predict discrete observations, then the logarithmic score encourages the placement of artificially high density ordinates on the target values in question. This problem emerged in the Evaluating Predictive Uncertainty Challenge at a recent PASCAL Challenges Workshop (Kohonen and Suomela 2006; Quiñonero-Candela, Rasmussen, Sinz, Bousquet, and Schölkopf 2006). It disappears if scores expressed in terms of predictive cumulative distribution functions are used, or if the sample space is reduced to the target values in question.

4.2 Continuous Ranked Probability Score

The restriction to predictive densities is often impractical. For instance, probabilistic quantitative precipitation forecasts involve distributions with a point mass at zero (Krzysztofowicz and Sigrest 1999; Bremnes 2004), and predictive distributions are often expressed in terms of samples, possibly originating from Markov chain Monte Carlo. Thus it seems more compelling to define scoring rules directly in terms of predictive cumulative distribution functions. Furthermore, the aforementioned scores are not sensitive to distance, meaning that no credit is given for assigning high probabilities to values near but not identical to the one materializing.


To address this situation, let $\mathcal{P}$ consist of the Borel probability measures on $\mathbb{R}$. We identify a probabilistic forecast, a member of the class $\mathcal{P}$, with its cumulative distribution function $F$, and we use standard notation for the elements of the sample space $\mathbb{R}$. The continuous ranked probability score (CRPS) is defined as
\[
\mathrm{CRPS}(F, x) = -\int_{-\infty}^{\infty} \bigl( F(y) - \mathbf{1}\{y \ge x\} \bigr)^2 \, dy \tag{20}
\]
and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),
\[
\mathrm{CRPS}(F, x) = \frac{1}{2} \mathrm{E}_F |X - X'| - \mathrm{E}_F |X - x|, \tag{21}
\]

where $X$ and $X'$ are independent copies of a random variable with distribution function $F$ and finite first moment. If the predictive distribution is Gaussian with mean $\mu$ and variance $\sigma^2$, then it follows that
\[
\mathrm{CRPS}\bigl(\mathcal{N}(\mu, \sigma^2), x\bigr) = \sigma \left[ \frac{1}{\sqrt{\pi}} - 2\varphi\left(\frac{x - \mu}{\sigma}\right) - \frac{x - \mu}{\sigma} \left( 2\Phi\left(\frac{x - \mu}{\sigma}\right) - 1 \right) \right],
\]
where $\varphi$ and $\Phi$ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size $n$, then the right side of (20) can be evaluated in terms of the respective order statistics in a total of $O(n \log n)$ operations (Hersbach 2000, sec. 4b).
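For instance, the representation (21) yields a direct sample-based evaluation; the sketch below (our own implementation, not the specific algorithm of Hersbach 2000) uses the order-statistics identity $\mathrm{E}|X - X'| = (2/n^2) \sum_{k=1}^n (2k - n - 1) x_{(k)}$ for the empirical distribution.

```python
import numpy as np

def crps_sample(sample, x):
    """CRPS (21), positively oriented, for the empirical distribution of
    a sample, in O(n log n) operations via order statistics."""
    s = np.sort(np.asarray(sample, dtype=float))
    n = len(s)
    k = np.arange(1, n + 1)
    e_xx = (2.0 / n**2) * np.sum((2 * k - n - 1) * s)   # E|X - X'|
    e_xa = np.mean(np.abs(s - x))                       # E|X - x|
    return 0.5 * e_xx - e_xa

# Sanity check against the Gaussian closed form with a large sample
rng = np.random.default_rng(0)
print(crps_sample(rng.standard_normal(100_000), 0.7))
```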

The CRPS is proper relative to the class $\mathcal{P}$ and strictly proper relative to the subclass $\mathcal{P}_1$ of the Borel probability measures that have finite first moment. The associated expected score function or information measure,
\[
G(F) = -\int_{-\infty}^{\infty} F(y)\bigl(1 - F(y)\bigr) \, dy = -\frac{1}{2} \mathrm{E}_F |X - X'|,
\]
coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,
\[
d(F, G) = \int_{-\infty}^{\infty} \bigl( F(y) - G(y) \bigr)^2 \, dy,
\]

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say $\mathrm{CRPS}^*(F, x) = -\mathrm{CRPS}(F, x)$. The representation (21) then can be written as
\[
\mathrm{CRPS}^*(F, x) = \mathrm{E}_F |X - x| - \frac{1}{2} \mathrm{E}_F |X - X'|,
\]
which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if $F$ is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way to compare deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let $\mathcal{P}_\beta$, $\beta \in (0, 2)$, denote the class of the Borel probability measures $P$ on $\mathbb{R}^m$ that are such that $\mathrm{E}_P \|X\|^\beta$ is finite, where $\|\cdot\|$ denotes the Euclidean norm. We define the energy score,
\[
\mathrm{ES}(P, \mathbf{x}) = \frac{1}{2} \mathrm{E}_P \|X - X'\|^\beta - \mathrm{E}_P \|X - \mathbf{x}\|^\beta, \tag{22}
\]
where $X$ and $X'$ are independent copies of a random vector with distribution $P \in \mathcal{P}_\beta$. This generalizes the CRPS, to which (22) reduces when $\beta = 1$ and $m = 1$, by allowing for an index $\beta \in (0, 2)$ and applying to distributional forecasts of a vector-valued quantity in $\mathbb{R}^m$. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class $\mathcal{P}_\beta$. [For a different and more general argument, see Section 5.1.] In the limiting case $\beta = 2$, the energy score (22) reduces to the negative squared error,
\[
\mathrm{ES}(P, \mathbf{x}) = -\|\boldsymbol{\mu}_P - \mathbf{x}\|^2, \tag{23}
\]
where $\boldsymbol{\mu}_P$ denotes the mean vector of $P$. This scoring rule is regular and proper, but not strictly proper, relative to the class $\mathcal{P}_2$.
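In practice, the expectations in (22) are often replaced by averages over a forecast sample standing in for P, for instance, ensemble or Markov chain Monte Carlo output. A minimal O(n²) sketch, with hypothetical inputs:

```python
import numpy as np

def energy_score(sample, x, beta=1.0):
    """Energy score (22), positively oriented, estimated from a sample
    representing P: (1/2) E||X - X'||^beta - E||X - x||^beta."""
    X = np.asarray(sample, dtype=float)
    n = len(X)
    pair_norms = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    e_xx = np.sum(pair_norms**beta) / (n * n)
    e_xa = np.mean(np.linalg.norm(X - x, axis=1)**beta)
    return 0.5 * e_xx - e_xa

# Example: a sample of size 500 from a bivariate predictive distribution
rng = np.random.default_rng(3)
sample = rng.standard_normal((500, 2))
observation = np.array([0.2, -1.0])
print(energy_score(sample, observation))
```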

The energy score with index $\beta \in (0, 2)$ applies to all Borel probability measures on $\mathbb{R}^m$, by defining
\[
\mathrm{ES}(P, \mathbf{x}) = -\frac{\beta \, 2^{\beta - 2} \, \Gamma\bigl(\frac{m + \beta}{2}\bigr)}{\pi^{m/2} \, \Gamma\bigl(1 - \frac{\beta}{2}\bigr)} \int_{\mathbb{R}^m} \frac{\bigl| \varphi_P(\mathbf{y}) - e^{i \langle \mathbf{x}, \mathbf{y} \rangle} \bigr|^2}{\|\mathbf{y}\|^{m + \beta}} \, d\mathbf{y}, \tag{24}
\]
where $\varphi_P$ denotes the characteristic function of $P$. If $P$ belongs to $\mathcal{P}_\beta$, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of $P$ and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question is that for proper scoring rules that apply to the Borel probability measures on $\mathbb{R}^m$ and depend on the predictive distribution $P$ only through its mean vector $\boldsymbol{\mu}_P$ and dispersion or covariance matrix $\Sigma_P$. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule
\[
S(P, \mathbf{x}) = -\log \det \Sigma_P - (\mathbf{x} - \boldsymbol{\mu}_P)' \Sigma_P^{-1} (\mathbf{x} - \boldsymbol{\mu}_P), \tag{25}
\]
which is linked to the generalized entropy function
\[
G(P) = -\log \det \Sigma_P - m
\]
and to the divergence function
\[
d(P, Q) = \mathrm{tr}\bigl(\Sigma_P^{-1} \Sigma_Q\bigr) - \log \det\bigl(\Sigma_P^{-1} \Sigma_Q\bigr) + (\boldsymbol{\mu}_P - \boldsymbol{\mu}_Q)' \Sigma_P^{-1} (\boldsymbol{\mu}_P - \boldsymbol{\mu}_Q) - m.
\]


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class $\mathcal{P}_2$ of the Borel probability measures $P$ for which $\mathrm{E}_P \|X\|^2$ is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on $\boldsymbol{\mu}_P$ and $\Sigma_P$ only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).
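A direct implementation of (25) needs only the mean vector and covariance matrix of the predictive distribution; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def dawid_sebastiani_score(mu, Sigma, x):
    """Score (25): -log det(Sigma) - (x - mu)' Sigma^{-1} (x - mu)."""
    resid = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -logdet - resid @ np.linalg.solve(Sigma, resid)

mu = np.array([1.0, 2.0])                        # hypothetical mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                   # hypothetical covariance
x = np.array([0.5, 2.5])                         # observed vector
print(dawid_sebastiani_score(mu, Sigma, x))
```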

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data $x_1, \ldots, x_n$. The predictive model choice criterion (PMCC) assesses the model fit through the quantity
\[
\mathrm{PMCC} = \sum_{i=1}^n (x_i - \mu_i)^2 + \sum_{i=1}^n \sigma_i^2,
\]

where $\mu_i$ and $\sigma_i^2$ denote the expected value and the variance of a replicate variable $X_i$, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score
\[
S(P, x) = -(x - \mu_P)^2 - \sigma_P^2, \tag{26}
\]
where $P$ has mean $\mu_P$ and variance $\sigma_P^2$. The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is $P$, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at $\mu_P$, that is, a deterministic forecast, rather than the predictive distribution $P$. This suggests that the predictive model choice criterion should be replaced by a criterion based on the scoring rule (25), which reduces to
\[
S(P, x) = -\left( \frac{x - \mu_P}{\sigma_P} \right)^2 - \log \sigma_P^2 \tag{27}
\]
in the case in which $m = 1$ and the observations are real-valued.
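A quick numerical check makes the contrast visible: under a true standard Gaussian, the expected value of (26) for a forecast N(0, s²) is −(1 + s²), which grows as s shrinks to 0, whereas the expected value of (27) is −1/s² − log s², which is maximized at s = 1. A simulation sketch (settings are ours):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.standard_normal(200_000)   # observations from the true N(0, 1)

for s in [0.1, 0.5, 1.0, 2.0]:     # candidate predictive N(0, s^2)
    score_26 = np.mean(-(x - 0.0)**2 - s**2)                 # eq. (26)
    score_27 = np.mean(-((x - 0.0) / s)**2 - np.log(s**2))   # eq. (27)
    print(f"s = {s:3.1f}:  (26) = {score_26:8.3f},  (27) = {score_27:8.3f}")
# (26) is largest for the smallest s, rewarding a deterministic forecast;
# (27) is largest at s = 1, the truth.
```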

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let $\Omega$ be a nonempty set. A real-valued function $g$ on $\Omega \times \Omega$ is said to be a negative definite kernel if it is symmetric in its arguments and $\sum_{i=1}^n \sum_{j=1}^n a_i a_j g(x_i, x_j) \le 0$ for all positive integers $n$, all $a_1, \ldots, a_n \in \mathbb{R}$ that sum to 0, and all $x_1, \ldots, x_n \in \Omega$. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let $\Omega$ be a Hausdorff space, and let $g$ be a nonnegative, continuous negative definite kernel on $\Omega \times \Omega$. For a Borel probability measure $P$ on $\Omega$, let $X$ and $X'$ be independent random variables with distribution $P$. Then the scoring rule
\[
S(P, x) = \frac{1}{2} \mathrm{E}_P g(X, X') - \mathrm{E}_P g(X, x) \tag{28}
\]
is proper relative to the class of the Borel probability measures $P$ on $\Omega$ for which the expectation $\mathrm{E}_P g(X, X')$ is finite.

Proof. Let $P$ and $Q$ be Borel probability measures on $\Omega$, and suppose that $X, X'$ and $Y, Y'$ are independent random variates with distribution $P$ and $Q$, respectively. We need to show that
\[
-\frac{1}{2} \mathrm{E}_Q g(Y, Y') \ge \frac{1}{2} \mathrm{E}_P g(X, X') - \mathrm{E}_{P,Q} \, g(X, Y). \tag{29}
\]
If the expectation $\mathrm{E}_{P,Q} \, g(X, Y)$ is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case, we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if $P$ is discrete and has only a moderate number of atoms.
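For a discrete P with atoms $x_1, \ldots, x_J$ and probabilities $p_1, \ldots, p_J$, the expectations in (28) are finite sums. A generic sketch (our helper; here exercised with g(x, x′) = |x − x′|, which by Example 8 below recovers the CRPS of a discrete forecast):

```python
import numpy as np

def kernel_score(atoms, probs, x, g):
    """Kernel score (28) for a discrete predictive distribution:
    (1/2) E_P g(X, X') - E_P g(X, x)."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    gXX = np.array([[g(a, b) for b in atoms] for a in atoms])
    e_xx = probs @ gXX @ probs                         # E_P g(X, X')
    e_xa = probs @ np.array([g(a, x) for a in atoms])  # E_P g(X, x)
    return 0.5 * e_xx - e_xa

# Discrete forecast with three atoms, scored against the observation 1.4
print(kernel_score([0.0, 1.0, 2.0], [0.2, 0.5, 0.3], 1.4,
                   g=lambda a, b: abs(a - b)))
```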

Example 7 (Quadratic or Brier score). Let $\Omega = \{1, 0\}$, and suppose that $g(0, 0) = g(1, 1) = 0$ and $g(0, 1) = g(1, 0) = 1$. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If $\Omega = \mathbb{R}$ and $g(x, x') = |x - x'|$ for $x, x' \in \mathbb{R}$ in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If $\Omega = \mathbb{R}^m$, $\beta \in (0, 2)$, and $g(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^\beta$ for $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^m$, where $\|\cdot\|$ denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let $\Omega = \mathbb{S}$ denote the circle and write $\alpha(\theta, \theta')$ for the angular distance between two points $\theta, \theta' \in \mathbb{S}$. Let $P$ be a Borel probability measure on $\mathbb{S}$, and let $\Theta$ and $\Theta'$ be independent random variates with distribution $P$. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus
\[
S(P, \theta) = \frac{1}{2} \mathrm{E}_P \alpha(\Theta, \Theta') - \mathrm{E}_P \alpha(\Theta, \theta) \tag{30}
\]
defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.
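A sample-based sketch of (30), with the angular distance obtained by wrapping differences to [0, π]; the von Mises ensemble of directions is a hypothetical example:

```python
import numpy as np

def angular_distance(theta1, theta2):
    """Angular distance on the circle, in radians, with values in [0, pi]."""
    d = np.abs(theta1 - theta2) % (2.0 * np.pi)
    return np.minimum(d, 2.0 * np.pi - d)

def circular_crps(sample, theta):
    """Score (30), estimated from a sample of directions representing P:
    (1/2) E_P alpha(Theta, Theta') - E_P alpha(Theta, theta)."""
    s = np.asarray(sample, dtype=float)
    e_tt = np.mean(angular_distance(s[:, None], s[None, :]))
    e_ta = np.mean(angular_distance(s, theta))
    return 0.5 * e_tt - e_ta

# Hypothetical ensemble of wind directions (radians) and an observation
rng = np.random.default_rng(5)
directions = rng.vonmises(mu=0.5, kappa=4.0, size=200)
print(circular_crps(directions, 0.8))
```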

We turn to a far-reaching generalization of the energy score. For $\mathbf{x} = (x_1, \ldots, x_m) \in \mathbb{R}^m$ and $\alpha \in (0, \infty]$, define the vector norm $\|\mathbf{x}\|_\alpha = \bigl( \sum_{i=1}^m |x_i|^\alpha \bigr)^{1/\alpha}$ if $\alpha \in (0, \infty)$ and $\|\mathbf{x}\|_\alpha = \max_{1 \le i \le m} |x_i|$ if $\alpha = \infty$. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if $\alpha \in (0, \infty]$ and $\beta > 0$, then the kernel
\[
g(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_\alpha^\beta, \qquad \mathbf{x}, \mathbf{x}' \in \mathbb{R}^m,
\]
is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) $m = 1$, $\alpha \in (0, \infty]$, and $\beta \in (0, 2]$; (b) $m \ge 2$, $\alpha \in (0, 2]$, and $\beta \in (0, \alpha]$; or (c) $m = 2$, $\alpha \in (2, \infty]$, and $\beta \in (0, 1]$.

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule
\[
S(P, \mathbf{x}) = \frac{1}{2} \mathrm{E}_P \|X - X'\|_\alpha^\beta - \mathrm{E}_P \|X - \mathbf{x}\|_\alpha^\beta
\]
is proper relative to the class of the Borel probability measures $P$ on $\mathbb{R}^m$ for which the expectation $\mathrm{E}_P \|X - X'\|_\alpha^\beta$ is finite. If $m = 1$ or $\alpha = 2$, then we recover the energy score; if $m \ge 2$ and $\alpha \ne 2$, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if $\alpha \ge 1$, then $\mathrm{E}_{P,Q} \|X - Y\|_\alpha^\beta$ is finite if and only if $\mathrm{E}_P \|X\|_\alpha^\beta$ and $\mathrm{E}_Q \|Y\|_\alpha^\beta$ are finite. In particular, if $\alpha \ge 1$, then $\mathrm{E}_P \|X - X'\|_\alpha^\beta$ is finite if and only if $\mathrm{E}_P \|X\|_\alpha^\beta$ is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function $\eta$ on $(0, \infty)$ is said to be completely monotone if it has derivatives $\eta^{(k)}$ of all orders and $(-1)^k \eta^{(k)}(t) \ge 0$ for all nonnegative integers $k$ and all $t > 0$.

Theorem 5. Let $\psi$ be a continuous function on $[0, \infty)$ with $-\psi'$ completely monotone and not constant. For a Borel probability measure $P$ on $\mathbb{R}^m$, let $X$ and $X'$ be independent random vectors with distribution $P$. Then the scoring rule
\[
S(P, \mathbf{x}) = \frac{1}{2} \mathrm{E}_P \psi\bigl( \|X - X'\|_2^2 \bigr) - \mathrm{E}_P \psi\bigl( \|X - \mathbf{x}\|_2^2 \bigr)
\]
is strictly proper relative to the class of the Borel probability measures $P$ on $\mathbb{R}^m$ for which $\mathrm{E}_P \psi\bigl( \|X - X'\|_2^2 \bigr)$ is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if $\psi(t) = t^{\beta/2}$ for $\beta \in (0, 2)$, then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures $P$ on $\mathbb{R}^m$ for which $\mathrm{E}_P \|X\|_2^\beta$ is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations $\mathrm{E}_P\,g(\mathbf{X}, \mathbf{X}')$ and $\mathrm{E}_Q\,g(\mathbf{Y}, \mathbf{Y}')$ are finite, then (29) can be written as a Hoeffding-type inequality,

$$2\,\mathrm{E}_{P,Q}\,g(\mathbf{X}, \mathbf{Y}) - \mathrm{E}_P\,g(\mathbf{X}, \mathbf{X}') - \mathrm{E}_Q\,g(\mathbf{Y}, \mathbf{Y}') \ge 0. \qquad (31)$$

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if $g$ is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if $\Omega$ is a group and the negative definite function $g$ satisfies $g(x, x') = g(-x, -x')$ for $x, x' \in \Omega$, then a special case of (31) can be stated as

$$\mathrm{E}_P\,g(\mathbf{X}, -\mathbf{X}') \ge \mathrm{E}_P\,g(\mathbf{X}, \mathbf{X}'). \qquad (32)$$

In particular, if $\Omega = \mathbb{R}^m$ and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

$$2\,\mathrm{E}\,\|\mathbf{X} - \mathbf{Y}\|_{\alpha}^{\beta} - \mathrm{E}\,\|\mathbf{X} - \mathbf{X}'\|_{\alpha}^{\beta} - \mathrm{E}\,\|\mathbf{Y} - \mathbf{Y}'\|_{\alpha}^{\beta} \ge 0 \qquad (33)$$

and

$$\mathrm{E}\,\|\mathbf{X} - \mathbf{X}'\|_{\alpha}^{\beta} \le \mathrm{E}\,\|\mathbf{X} + \mathbf{X}'\|_{\alpha}^{\beta}, \qquad (34)$$

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which $\Omega$ is a group and $g$ satisfies $g(x, x') = g(-x, -x')$ for $x, x' \in \Omega$, the argument leading to Theorem 2.3 of Buja et al. (1994) and Theorem 4 of Ma (2003) implies that

$$h(x, x') = g(x, -x') - g(x, x'), \qquad x, x' \in \Omega, \qquad (35)$$

is a positive definite kernel, in the sense that $h$ is symmetric in its arguments and $\sum_{i=1}^n \sum_{j=1}^n a_i a_j\, h(x_i, x_j) \ge 0$ for all positive integers $n$, all $a_1, \ldots, a_n \in \mathbb{R}$, and all $x_1, \ldots, x_n \in \Omega$. Specifically, under Assumption 1,

$$h(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} + \mathbf{x}'\|_{\alpha}^{\beta} - \|\mathbf{x} - \mathbf{x}'\|_{\alpha}^{\beta}, \qquad \mathbf{x}, \mathbf{x}' \in \mathbb{R}^m, \qquad (36)$$

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).
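Inequalities (33) and (34) are easy to illustrate, although of course not to prove, by simulation. A minimal sketch under case (b) of Assumption 1, with $\alpha = 2$, $\beta = 1$, and arbitrarily chosen Gaussian $P$ and $Q$ (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, alpha, beta = 3, 100_000, 2.0, 1.0   # case (b) of Assumption 1

X, Xp = rng.normal(size=(2, n, m))             # X, X' iid from P
Y, Yp = rng.normal(1.0, 2.0, size=(2, n, m))   # Y, Y' iid from Q

def e_norm(D):
    """Monte Carlo estimate of E ||D||_alpha^beta over the rows of D."""
    return np.mean(np.sum(np.abs(D) ** alpha, axis=1) ** (beta / alpha))

print(2 * e_norm(X - Y) - e_norm(X - Xp) - e_norm(Y - Yp) >= 0)  # (33)
print(e_norm(X - Xp) <= e_norm(X + Xp))                          # (34)
```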

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function $h$ on $\Omega \times \Omega$ is said to be a positive definite kernel if it is Hermitian, that is, $h(x, x') = \overline{h(x', x)}$ for $x, x' \in \Omega$, and $\sum_{i=1}^n \sum_{j=1}^n c_i \bar{c}_j\, h(x_i, x_j) \ge 0$ for all positive integers $n$, all $c_1, \ldots, c_n \in \mathbb{C}$, and all $x_1, \ldots, x_n \in \Omega$. The general idea (Dawid 1998, 2006) is that if $h$ is continuous and positive definite, then

$$S(P, x) = \mathrm{E}_P\,h(X, x) + \mathrm{E}_P\,h(x, X) - \mathrm{E}_P\,h(X, X') \qquad (37)$$

defines a proper scoring rule. If $h$ is positive definite, then $g = -h$ is negative definite; thus, if $h$ is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let $\Omega = \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^m$, and consider the positive definite kernel $h(\mathbf{x}, \mathbf{x}') = e^{i\langle \mathbf{x} - \mathbf{x}', \mathbf{y}\rangle} - 1$, where $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^m$. Then (37) reduces to

$$S(P, \mathbf{x}) = -\left|\varphi_P(\mathbf{y}) - e^{i\langle \mathbf{x}, \mathbf{y}\rangle}\right|^2, \qquad (38)$$

that is, the negative squared distance between the characteristic function of the predictive distribution, $\varphi_P$, and the characteristic function of the point measure in the value that materializes, evaluated at $\mathbf{y} \in \mathbb{R}^m$. If we integrate with respect to a nonnegative measure $\mu(\mathrm{d}\mathbf{y})$, then the scoring rule (38) generalizes to

$$S(P, \mathbf{x}) = -\int_{\mathbb{R}^m} \left|\varphi_P(\mathbf{y}) - e^{i\langle \mathbf{x}, \mathbf{y}\rangle}\right|^2 \mu(\mathrm{d}\mathbf{y}). \qquad (39)$$

If the measure $\mu$ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on $\mathbb{R}^m$. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function to define metrics for probability measures. If $\mu$ is the infinite measure with Lebesgue density $\|\mathbf{y}\|^{-m-\beta}$, where $\beta \in (0, 2)$, then the scoring rule (39) is equivalent to the Euclidean energy score (24).
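For a predictive distribution given by a sample, the score (38) can be evaluated by replacing $\varphi_P$ with the empirical characteristic function. A sketch (NumPy assumed; names are ours):

```python
import numpy as np

def cf_score(sample, x, y):
    """Score (38): negative squared modulus of the difference between the
    predictive characteristic function (estimated empirically from `sample`)
    and that of the point measure in x, evaluated at frequency y."""
    sample = np.atleast_2d(np.asarray(sample, dtype=float))
    phi_P = np.mean(np.exp(1j * sample @ y))
    phi_x = np.exp(1j * np.dot(x, y))
    return -np.abs(phi_P - phi_x) ** 2

rng = np.random.default_rng(0)
P_sample = rng.normal(size=(1000, 2))   # forecast: standard bivariate normal
x = np.array([0.3, -0.5])               # verifying observation
y = np.array([1.0, 2.0])                # frequency at which (38) is evaluated
print(cf_score(P_sample, x, y))
```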

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels $\alpha_1, \ldots, \alpha_k \in (0, 1)$ are sought. If the forecaster quotes quantiles $r_1, \ldots, r_k$ and $x$ materializes, then he or she will be rewarded by the score $S(r_1, \ldots, r_k; x)$. We define

$$S(r_1, \ldots, r_k; P) = \int S(r_1, \ldots, r_k; x)\,\mathrm{d}P(x)$$

as the expected score under the probability measure $P$ when the forecaster quotes the quantiles $r_1, \ldots, r_k$. To avoid technical complications, we suppose that $P$ belongs to the convex class $\mathcal{P}$ of Borel probability measures on $\mathbb{R}$ that have finite moments of all orders and whose distribution function is strictly increasing on $\mathbb{R}$. For $P \in \mathcal{P}$, let $q_1, \ldots, q_k$ denote the true $P$-quantiles at levels $\alpha_1, \ldots, \alpha_k$. Following Cervera and Muñoz (1996), we say that a scoring rule $S$ is proper if

$$S(q_1, \ldots, q_k; P) \ge S(r_1, \ldots, r_k; P)$$

for all real numbers $r_1, \ldots, r_k$ and for all probability measures $P \in \mathcal{P}$. If $S$ is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume $P$-integrability whenever appropriate. Essentially, we require that the functions $s(x)$ and $h(x)$ in (40) and (42) be $P$-measurable and grow at most polynomially in $x$. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If $s$ is nondecreasing and $h$ is arbitrary, then the scoring rule

$$S(r; x) = \alpha\,s(r) + (s(x) - s(r))\,\mathbf{1}\{x \le r\} + h(x) \qquad (40)$$

is proper for predicting the quantile at level $\alpha \in (0, 1)$.

Proof. Let $q$ be the unique $\alpha$-quantile of the probability measure $P \in \mathcal{P}$. We identify $P$ with the associated distribution function, so that $P(q) = \alpha$. If $r < q$, then

$$S(q; P) - S(r; P) = \int_{(r,q]} s(x)\,\mathrm{d}P(x) + s(r)P(r) - \alpha s(r) \ge s(r)(P(q) - P(r)) + s(r)P(r) - \alpha s(r) = 0,$$

as desired. If $r > q$, then an analogous argument applies.

If $s(x) = x$ and $h(x) = -\alpha x$, then we obtain the scoring rule

$$S(r; x) = (x - r)(\mathbf{1}\{x \le r\} - \alpha), \qquad (41)$$

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
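In code, the rule (41) is a one-liner, and its propriety can be checked empirically: for simulated draws from $P$, the mean score is maximized near the true quantile. A sketch assuming NumPy; the value $\Phi^{-1}(.9) \approx 1.2816$ is the standard normal quantile used for comparison:

```python
import numpy as np

def quantile_score(r, x, alpha):
    """Proper scoring rule (41) for the quantile at level alpha,
    in positive (reward) orientation."""
    return (x - r) * ((x <= r) - alpha)

# For P = N(0, 1) and alpha = .9, the expected score is maximized
# near the true quantile 1.2816
rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
grid = np.linspace(0.5, 2.0, 301)
scores = [quantile_score(r, x, 0.9).mean() for r in grid]
print(grid[int(np.argmax(scores))])   # close to 1.2816
```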

Corollary 1. If $s_i$ is nondecreasing for $i = 1, \ldots, k$ and $h$ is arbitrary, then the scoring rule

$$S(r_1, \ldots, r_k; x) = \sum_{i=1}^k \left[\alpha_i s_i(r_i) + (s_i(x) - s_i(r_i))\,\mathbf{1}\{x \le r_i\}\right] + h(x) \qquad (42)$$

is proper for predicting the quantiles at levels $\alpha_1, \ldots, \alpha_k \in (0, 1)$.

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each $s_i$ is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central $(1 - \alpha) \times 100\%$ prediction interval, with lower and upper endpoints that are the predictive quantiles at level $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$. We denote a scoring rule for the associated interval forecast by $S_{\alpha}(l, u; x)$, where $l$ and $u$ represent the quoted $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles. Thus, if the forecaster quotes the $(1 - \alpha) \times 100\%$ central prediction interval $[l, u]$ and $x$ materializes, then his or her score will be $S_{\alpha}(l, u; x)$. Putting $\alpha_1 = \frac{\alpha}{2}$, $\alpha_2 = 1 - \frac{\alpha}{2}$, $s_1(x) = s_2(x) = \frac{2x}{\alpha}$, and $h(x) = -\frac{2x}{\alpha}$ in (42) and reversing the sign of the scoring rule yields the negatively oriented interval score

$$S_{\alpha}^{\mathrm{int}}(l, u; x) = (u - l) + \frac{2}{\alpha}(l - x)\,\mathbf{1}\{x < l\} + \frac{2}{\alpha}(x - u)\,\mathbf{1}\{x > u\}. \qquad (43)$$

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on $\alpha$, if the observation misses the interval. In the case $\alpha = \frac{1}{2}$, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
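The interval score (43) is equally simple to implement. A sketch (NumPy assumed; `interval_score` is our name for the routine):

```python
import numpy as np

def interval_score(l, u, x, alpha):
    """Negatively oriented interval score (43) for the central
    (1 - alpha) x 100% prediction interval [l, u]."""
    l, u, x = map(np.asarray, (l, u, x))
    width = u - l
    below = (2.0 / alpha) * (l - x) * (x < l)   # penalty if x undershoots
    above = (2.0 / alpha) * (x - u) * (x > u)   # penalty if x overshoots
    return width + below + above

# A 90% interval [1.2, 4.5] scored against three observations
print(interval_score(1.2, 4.5, np.array([0.9, 3.0, 5.0]), alpha=0.10))
```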


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative, and possibly simpler, more general, and more rigorous, paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process $\{X_t : t \in \mathbb{Z}\}$ defined by

$$X_{t+1} = \tfrac{1}{2}X_t + \tfrac{1}{2}X_t \varepsilon_t + \varepsilon_t, \qquad (44)$$

where the $\varepsilon_t$'s are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of $X_{t+1}$ given $X_t, X_{t-1}, \ldots$ is Gaussian with mean $\frac{1}{2}X_t$ and variance $(1 + \frac{1}{2}X_t)^2$, thereby suggesting the prediction interval

$$I = \left[ \tfrac{1}{2}X_t - c\left|1 + \tfrac{1}{2}X_t\right|,\ \tfrac{1}{2}X_t + c\left|1 + \tfrac{1}{2}X_t\right| \right], \qquad (45)$$

where $c = \Phi^{-1}(.975)$. This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted $I$ as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval $I$ is the standard interval because its lower and upper endpoints are the 2.5 and 97.5 percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

$$J = \left[ F^{-1}(.025),\ F^{-1}(.975) \right], \qquad (46)$$

where $F$ denotes the unconditional stationary distribution function of $X_t$, and

$$K = \left[ \tfrac{1}{2}X_t - \gamma\!\left(\left|1 + \tfrac{1}{2}X_t\right|\right),\ \tfrac{1}{2}X_t + \gamma\!\left(\left|1 + \tfrac{1}{2}X_t\right|\right) \right], \qquad (47)$$

where $\gamma(y) = (2(\log 7.36 - \log y))^{1/2}\,y$ for $y \le 7.36$ and $\gamma(y) = 0$ otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast $K$ seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path $\{X_t : t = 1, \ldots, 100{,}001\}$ from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for $X_{t+1}$, where $t = 1, \ldots, 100{,}000$. Table 2 summarizes the results of this experiment. The interval forecasts $I$, $J$, and $K$ all showed close to nominal coverage, with the prediction interval $K$ being sharpest on average. Nevertheless, the classical prediction interval $I$ performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast    Empirical coverage    Average width    Average interval score
I (45)               95.01%                4.00             4.77
J (46)               95.08%                5.45             8.04
K (47)               94.98%                3.79             5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
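The experiment is easy to replicate in outline. The following sketch (NumPy and SciPy assumed; our construction) simulates the bilinear process (44) and scores the conditional interval $I$ of (45); with a different seed, the exact numbers will differ slightly from the $I$ row of Table 2:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
T = 100_000
c = norm.ppf(0.975)

# Simulate the bilinear process (44)
x = np.zeros(T + 1)
eps = rng.standard_normal(T)
for t in range(T):
    x[t + 1] = 0.5 * x[t] * (1.0 + eps[t]) + eps[t]

# One-step-ahead 95% prediction interval I in (45)
mu = 0.5 * x[:-1]
sd = np.abs(1.0 + 0.5 * x[:-1])
l, u = mu - c * sd, mu + c * sd
obs, alpha = x[1:], 0.05

coverage = np.mean((obs >= l) & (obs <= u))
score = (u - l) + (2 / alpha) * ((l - obs) * (obs < l) + (obs - u) * (obs > u))
print(coverage, (u - l).mean(), score.mean())   # compare with Table 2, row I
```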

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus, we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if $S_{\alpha}$ denotes a proper scoring rule for the quantile at level $\alpha$, and $\nu$ is a Borel measure on $(0, 1)$, then the scoring rule

$$S(F, x) = \int_0^1 S_{\alpha}(F^{-1}(\alpha); x)\,\nu(\mathrm{d}\alpha) \qquad (48)$$

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If $S$ denotes a proper scoring rule for probability forecasts and $\nu$ is a Borel measure on $\mathbb{R}$, then the scoring rule

$$S(F, x) = \int_{-\infty}^{\infty} S(F(y), \mathbf{1}\{x \le y\})\,\nu(\mathrm{d}y) \qquad (49)$$

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which $S$ is the quadratic or Brier score and $\nu$ is the Lebesgue measure. If $S$ is the Brier score and $\nu$ is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If $\mathcal{P}$ denotes the class of the Borel probability measures on $\mathbb{R}^m$, then we identify a probabilistic forecast $P \in \mathcal{P}$ with its cumulative distribution function $F$. A multivariate analog of the CRPS can be defined as

$$\mathrm{CRPS}(F, \mathbf{x}) = -\int_{\mathbb{R}^m} (F(\mathbf{y}) - \mathbf{1}\{\mathbf{x} \le \mathbf{y}\})^2\,\nu(\mathrm{d}\mathbf{y}).$$

This is a weighted integral of the Brier scores at all $m$-variate thresholds. The Borel measure $\nu$ can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If $\nu$ is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class $\mathcal{P}$.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample $\mathbf{X} = (X_1, \ldots, X_n)$ of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models $H_1$ and $H_2$. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the $X_i$'s are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for $H_1$ against $H_2$ is

$$B = \frac{P(\mathbf{X}|H_1)}{P(\mathbf{X}|H_2)}, \qquad (50)$$

where $P(\mathbf{X}|H_k) = \prod_{i=1}^n P(X_i|H_k)$ for $k = 1, 2$ (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

$$\log B = \mathrm{LogS}(H_1, \mathbf{X}) - \mathrm{LogS}(H_2, \mathbf{X}). \qquad (51)$$

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters $\theta_k$ for $H_k$, to be estimated from the data. Then the Bayes factor is still given by (50), but now $P(\mathbf{X}|H_k)$ is the integrated likelihood,

$$P(\mathbf{X}|H_k) = \int p(\mathbf{X}|\theta_k, H_k)\,p(\theta_k|H_k)\,\mathrm{d}\theta_k,$$

where $p(\mathbf{X}|\theta_k, H_k)$ is the (usual) likelihood under model $H_k$ and $p(\theta_k|H_k)$ is the prior distribution of the parameter $\theta_k$.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

$$P(\mathbf{X}|H_k) = \prod_{t=1}^n P(X_t|\mathbf{X}^{t-1}, H_k), \qquad (52)$$

where $\mathbf{X}^{t-1} = \{X_1, \ldots, X_{t-1}\}$ if $t \ge 1$, $\mathbf{X}^0$ is the empty set, and $P(X_t|\mathbf{X}^{t-1}, H_k)$ is the predictive distribution of $X_t$ given the past values under $H_k$, namely

$$P(X_t|\mathbf{X}^{t-1}, H_k) = \int p(X_t|\theta_k, H_k)\,P(\theta_k|\mathbf{X}^{t-1}, H_k)\,\mathrm{d}\theta_k,$$

with $P(\theta_k|\mathbf{X}^{t-1}, H_k)$ the posterior distribution of $\theta_k$ given the past observations $\mathbf{X}^{t-1}$.

We let $S_{k\mathrm{B}} = \log P(\mathbf{X}|H_k)$ denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

$$S_{k\mathrm{B}} = \sum_{t=1}^n \log P(X_t|\mathbf{X}^{t-1}, H_k). \qquad (53)$$

Dawid (1984) showed that $S_{k\mathrm{B}}$ is asymptotically equivalent to the plug-in maximum likelihood prequential score

$$S_{k\mathrm{D}} = \sum_{t=1}^n \log P(X_t|\mathbf{X}^{t-1}, \hat{\theta}_k^{\,t-1}), \qquad (54)$$

where $\hat{\theta}_k^{\,t-1}$ is the maximum likelihood estimator (MLE) of $\theta_k$ based on the past observations $\mathbf{X}^{t-1}$, in the sense that $S_{k\mathrm{D}}/S_{k\mathrm{B}} \to 1$ as $n \to \infty$. Initial terms for which $\hat{\theta}_k^{\,t-1}$ is possibly undefined can be ignored. Dawid also showed that $S_{k\mathrm{B}}$ is asymptotically equivalent to the Bayes information criterion (BIC) score,

$$S_{k\mathrm{BIC}} = \sum_{t=1}^n \log P(X_t|\mathbf{X}^{t-1}, \hat{\theta}_k^{\,n}) - \frac{d_k}{2}\,\log n,$$

where $d_k = \dim(\theta_k)$, in the same sense, namely $S_{k\mathrm{BIC}}/S_{k\mathrm{B}} \to 1$ as $n \to \infty$. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

$$S^*_{k\mathrm{B}} = \sum_{t=1}^n \mathrm{E}_D\big[\log p\big(X_t|\mathbf{X}(D), H_k\big)\big], \qquad (55)$$

where $D$ is a random sample from $\{1, \ldots, t-1, t+1, \ldots, n\}$, the size of which is a random variable with a discrete uniform distribution on $\{0, 1, \ldots, n-1\}$. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

$$S^*_{k\mathrm{D}} = \sum_{t=1}^n \mathrm{E}_D\big[\log p\big(X_t|\mathbf{X}(D), \hat{\theta}_k^{(D)}, H_k\big)\big], \qquad (56)$$

where $\hat{\theta}_k^{(D)}$ is the MLE of $\theta_k$ based on $\mathbf{X}(D)$. Terms for which the size of $D$ is small and $\hat{\theta}_k^{(D)}$ is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating $S^*_{k\mathrm{B}}$ or $S^*_{k\mathrm{D}}$ by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out, 10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
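To make the idea concrete, here is a Monte Carlo sketch of the plug-in version (56) for a simple Gaussian model (our construction for illustration; NumPy and SciPy assumed). Folds whose size would leave the MLE undefined are excluded, as suggested above:

```python
import numpy as np
from scipy.stats import norm

def random_fold_cv_logscore(x, n_draws=200, rng=None):
    """Monte Carlo sketch of S*_kD in (56) for a Gaussian model: for each t,
    average the plug-in log predictive density of x[t] over random subsets D
    of the remaining data, with the fold size drawn uniformly."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    n = len(x)
    total = 0.0
    for t in range(n):
        rest = np.delete(x, t)
        draws = []
        for _ in range(n_draws):
            size = rng.integers(2, n)                   # random fold size
            D = rng.choice(rest, size=size, replace=False)
            mu, sigma = D.mean(), D.std()               # Gaussian MLEs
            draws.append(norm.logpdf(x[t], mu, max(sigma, 1e-8)))
        total += np.mean(draws)
    return total

rng = np.random.default_rng(7)
data = rng.normal(10.0, 2.0, size=50)
print(random_fold_cv_logscore(data, rng=rng))
```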

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus, we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In $S_{k\mathrm{B}}$, $S_{k\mathrm{D}}$, $S_{k\mathrm{BIC}}$, $S^*_{k\mathrm{B}}$, and $S^*_{k\mathrm{D}}$, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of $r_0 = 1.55$.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to $r$ times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to $r$ as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say $\varphi_{\mu, r\sigma}$; its mean $\mu$ is the ensemble mean forecast, and its standard deviation $r\sigma$ is the product of the inflation factor $r$ and the standard deviation of the five-member forecast ensemble, $\sigma$. We considered various scoring rules $S$ and computed the average score,

$$s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S(\varphi_{\mu_i, r\sigma_i}, x_i), \qquad r > 0, \qquad (57)$$

as a function of the inflation factor $r$. The index $i$ refers to the $i$th record in the verification database, and $x_i$ denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect $s(r)$ to be maximized at some $r > 1$, possibly near the observed ratio $r_0 = 1.55$ of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors $r \in (0, 5)$ and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if $p$ denotes the predictive density and $x$ denotes the observed value, then

$$\mathrm{QS}(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2\,\mathrm{d}y,$$

$$\mathrm{SphS}(p, x) = p(x)\Big/\left(\int_{-\infty}^{\infty} p(y)^2\,\mathrm{d}y\right)^{1/2},$$

$$\mathrm{LogS}(p, x) = \log p(x),$$

$$\mathrm{CRPS}(p, x) = \frac{1}{2}\,\mathrm{E}_p|X - X'| - \mathrm{E}_p|X - x|,$$

$$\mathrm{LinS}(p, x) = p(x),$$

and

$$\mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y)\,\mathrm{d}y.$$
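For the Gaussian predictive density used here, each of these scores is available in closed form, given that $\int p(y)^2\,\mathrm{d}y = 1/(2\sigma\sqrt{\pi})$ for a Gaussian density with standard deviation $\sigma$. A sketch of the score computations entering (57) (NumPy and SciPy assumed; function and key names are ours):

```python
import numpy as np
from scipy.stats import norm

def density_scores(mu, sigma, x):
    """The six scores above for a Gaussian predictive density N(mu, sigma^2),
    evaluated at the observation x."""
    z = (x - mu) / sigma
    p_x = norm.pdf(x, mu, sigma)
    l2sq = 1.0 / (2.0 * sigma * np.sqrt(np.pi))   # integral of p(y)^2 dy
    return {
        "QS":   2.0 * p_x - l2sq,
        "SphS": p_x / np.sqrt(l2sq),
        "LogS": norm.logpdf(x, mu, sigma),
        "CRPS": sigma * (1/np.sqrt(np.pi) - 2*norm.pdf(z)
                         - z*(2*norm.cdf(z) - 1)),
        "LinS": p_x,
        "PS":   norm.cdf(x + 1, mu, sigma) - norm.cdf(x - 1, mu, sigma),
    }

print(density_scores(mu=1013.0, sigma=2.0, x=1015.5))
```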

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57), plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of $r > 1$, thereby confirming the underdispersive character of the ensemble. These scores are proper. The linear and probability scores were maximized at $r = .05$ and $r = .02$, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of $r$. The logarithmic score is strictly proper, but involves a harsh penalty for low probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor $r$ is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      argmax_r s(r) in eq. (57)    Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                         40s + 6
Spherical score (SphS)     1.84                         108s − 22
Logarithmic score (LogS)   2.41                         s + 13
CRPS                       1.62                         10s + 8
Linear score (LinS)        .05                          105s − 5
Probability score (PS)     .02                          60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central $(1 - \alpha) \times 100\%$ prediction interval, where $\alpha = .50$ and $\alpha = .10$. The associated lower and upper prediction bounds $l_i$ and $u_i$ are the $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles of a Gaussian distribution with mean $\mu_i$ and standard deviation $r\sigma_i$, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor $r$ in two ways: by computing the empirical coverage of the prediction intervals and by computing

$$s_{\alpha}(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S_{\alpha}^{\mathrm{int}}(l_i, u_i; x_i), \qquad r > 0, \qquad (58)$$

where $S_{\alpha}^{\mathrm{int}}$ denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with $r$. For $\alpha = .50$ and $\alpha = .10$, the nominal coverage was obtained at $r = 1.78$ and $r = 2.11$, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor $r$. For $\alpha = .50$ and $\alpha = .10$, the score was optimized at $r = 1.56$ and $r = 1.72$.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed) and the 90% central prediction interval (α = .10, solid; score scaled by a factor of .60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model $P_{\theta}$ based on a sample $X_1, \ldots, X_n$ of identically distributed observations. To estimate $\theta$, we can measure the goodness of fit by the mean score

$$S_n(\theta) = \frac{1}{n} \sum_{i=1}^n S(P_{\theta}, X_i),$$

where $S$ is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If $\theta_0$ denotes the true parameter value, then asymptotic arguments indicate that

$$\arg\max_{\theta} S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \qquad (59)$$

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take $\hat{\theta}_n = \arg\max_{\theta} S_n(\theta)$ as the respective optimum score estimator. The first four values of the argmax in Table 3, for instance, refer to the optimum score estimates of the inflation factor $r$ based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a $\psi$-function of the form $\psi(x) = 2\Phi(x/c) - 1$, where $c$ is a positive constant and $\Phi$ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the $\psi$-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his Lemma 1.

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.
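As a concrete illustration of (59), the following sketch (our construction; NumPy and SciPy assumed) fits a Gaussian model by minimum CRPS estimation, that is, by maximizing the mean of the closed-form Gaussian CRPS over $(\mu, \sigma)$. The log-parameterization of $\sigma$ merely keeps the optimizer in the feasible region:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gauss(mu, sigma, x):
    """Closed-form CRPS (positive orientation) for N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return sigma * (1/np.sqrt(np.pi) - 2*norm.pdf(z) - z*(2*norm.cdf(z) - 1))

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1000)

# Optimum score estimation: maximize the mean CRPS over (mu, log sigma)
res = minimize(lambda th: -np.mean(crps_gauss(th[0], np.exp(th[1]), x)),
               x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the true values 5 and 2
```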

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter $\theta$ in a Gaussian population with unknown scale. Under the loss function

$$L(I; \theta) = c\,\lambda(I) - \mathbf{1}\{\theta \in I\}, \qquad (60)$$

where $c$ is a positive constant and $\lambda(I)$ denotes the Lebesgue measure of the interval estimate $I$, the classical $t$-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

$$L_{\alpha}(I; \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta| \qquad (61)$$

and applies to interval estimates with upper and lower exceedance probability $\frac{\alpha}{2} \times 100\%$. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if $P$ is a Borel probability measure on $\mathbb{R}^m$, then a depth function $D(P, \mathbf{x})$ gives a $P$-based center-outward ordering of points $\mathbf{x} \in \mathbb{R}^m$. Formally, this resembles a scoring rule $S(P, \mathbf{x})$ that assigns a $P$-based numerical value to an event $\mathbf{x} \in \mathbb{R}^m$. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.


Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
(1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
(1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
(2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
(1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
(1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
(1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.


Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
(1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgement, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
(2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
(1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
(1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
(1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
(1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.


To address this situation, let P consist of the Borel probability measures on R. We identify a probabilistic forecast, a member of the class P, with its cumulative distribution function F, and we use standard notation for the elements of the sample space R. The continuous ranked probability score (CRPS) is defined as

\mathrm{CRPS}(F,x) = -\int_{-\infty}^{\infty} \bigl( F(y) - \mathbf{1}\{ y \ge x \} \bigr)^2 \, dy \qquad (20)

and corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds (Matheson and Winkler 1976; Hersbach 2000).

Applications of the CRPS have been hampered by a lack of readily computable solutions to the integral in (20), and the use of numerical quadrature rules has been proposed instead (Staël von Holstein 1977; Unger 1985). However, the integral often can be evaluated in closed form. By lemma 2.2 of Baringhaus and Franz (2004) or identity (17) of Székely and Rizzo (2005),

\mathrm{CRPS}(F,x) = \tfrac{1}{2} \, \mathrm{E}_F |X - X'| - \mathrm{E}_F |X - x|, \qquad (21)

where X and X' are independent copies of a random variable with distribution function F and finite first moment. If the predictive distribution is Gaussian with mean \mu and variance \sigma^2, then it follows that

\mathrm{CRPS}(\mathcal{N}(\mu,\sigma^2), x) = \sigma \left[ \frac{1}{\sqrt{\pi}} - 2\,\varphi\!\left( \frac{x-\mu}{\sigma} \right) - \frac{x-\mu}{\sigma} \left( 2\,\Phi\!\left( \frac{x-\mu}{\sigma} \right) - 1 \right) \right],

where \varphi and \Phi denote the probability density function and the cumulative distribution function of a standard Gaussian variable. If the predictive distribution takes the form of a sample of size n, then the right side of (20) can be evaluated in terms of the respective order statistics in a total of O(n log n) operations (Hersbach 2000, sec. 4b).
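Both closed forms are easy to put into practice. The following is a minimal sketch, assuming NumPy and SciPy; the function names crps_gaussian and crps_sample are ours, and the sample version evaluates (21) for the empirical distribution of an ensemble via the order statistics:

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    # Positively oriented CRPS of the Gaussian forecast N(mu, sigma^2),
    # using the closed form displayed above.
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def crps_sample(sample, x):
    # Positively oriented CRPS (21) for an ensemble forecast, treated as
    # the empirical distribution of `sample`; O(n log n) via sorting.
    s = np.sort(np.asarray(sample, dtype=float))
    n = s.size
    i = np.arange(1, n + 1)
    e_xx = 2.0 * np.sum((2 * i - n - 1) * s) / n**2  # E|X - X'|
    e_xy = np.mean(np.abs(s - x))                    # E|X - x|
    return 0.5 * e_xx - e_xy
```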

The CRPS is proper relative to the class P and strictly proper relative to the subclass P_1 of the Borel probability measures that have finite first moment. The associated expected score function or information measure,

G(F) = -\int_{-\infty}^{\infty} F(y) (1 - F(y)) \, dy = -\tfrac{1}{2} \, \mathrm{E}_F |X - X'|,

coincides with the negative selectivity function (Matheron 1984), and the respective divergence function,

d(F,G) = \int_{-\infty}^{\infty} (F(y) - G(y))^2 \, dy,

is symmetric and of the Cramér–von Mises type.

The CRPS lately has attracted renewed interest in the atmospheric sciences community (Hersbach 2000; Candille and Talagrand 2005; Gneiting, Raftery, Westveld, and Goldman 2005; Grimit, Gneiting, Berrocal, and Johnson 2006; Wilks 2006, pp. 302–303). It is typically used in negative orientation, say CRPS*(F, x) = −CRPS(F, x). The representation (21) then can be written as

\mathrm{CRPS}^*(F,x) = \mathrm{E}_F |X - x| - \tfrac{1}{2} \, \mathrm{E}_F |X - X'|,

which sheds new light on the score. In negative orientation, the CRPS can be reported in the same unit as the observations, and it generalizes the absolute error, to which it reduces if F is a deterministic forecast, that is, a point measure. Thus the CRPS provides a direct way of comparing deterministic and probabilistic forecasts.

4.3 Energy Score

We introduce a generalization of the CRPS that draws on Székely's (2003) statistical energy perspective. Let P_β, β ∈ (0, 2), denote the class of the Borel probability measures P on R^m that are such that E_P ||X||^β is finite, where ||·|| denotes the Euclidean norm. We define the energy score,

\mathrm{ES}(P,x) = \tfrac{1}{2} \, \mathrm{E}_P \|X - X'\|^{\beta} - \mathrm{E}_P \|X - x\|^{\beta}, \qquad (22)

where X and X' are independent copies of a random vector with distribution P ∈ P_β. This generalizes the CRPS, to which (22) reduces when β = 1 and m = 1, by allowing for an index β ∈ (0, 2) and applying to distributional forecasts of a vector-valued quantity in R^m. Theorem 1 of Székely (2003) shows that the energy score is strictly proper relative to the class P_β. [For a different and more general argument, see Section 5.1.] In the limiting case β = 2, the energy score (22) reduces to the negative squared error,

\mathrm{ES}(P,x) = -\|\mu_P - x\|^2, \qquad (23)

where μ_P denotes the mean vector of P. This scoring rule is regular and proper, but not strictly proper, relative to the class P_2.
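For a forecast given by a finite ensemble, the expectations in (22) are simple averages, so the energy score can be evaluated directly. A minimal sketch, assuming NumPy; the function name energy_score is ours, and the O(n^2) pairwise sum is adequate for moderate ensemble sizes:

```python
import numpy as np

def energy_score(ens, x, beta=1.0):
    # Positively oriented energy score (22) for an ensemble forecast.
    # `ens` is an (n, m) array of ensemble members; `x` is an m-vector.
    ens = np.atleast_2d(np.asarray(ens, dtype=float))
    n = ens.shape[0]
    # E_P ||X - x||^beta under the empirical ensemble distribution
    e_xy = np.mean(np.linalg.norm(ens - x, axis=1) ** beta)
    # E_P ||X - X'||^beta over all ordered pairs of members
    pair = np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2) ** beta
    e_xx = np.sum(pair) / n**2
    return 0.5 * e_xx - e_xy
```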

The energy score with index β ∈ (0, 2) applies to all Borel probability measures on R^m, by defining

\mathrm{ES}(P,x) = -\frac{\beta \, 2^{\beta-2} \, \Gamma\!\left(\frac{m+\beta}{2}\right)}{\pi^{m/2} \, \Gamma\!\left(1 - \frac{\beta}{2}\right)} \int_{\mathbb{R}^m} \frac{\bigl| \phi_P(y) - e^{i \langle x, y \rangle} \bigr|^2}{\|y\|^{m+\beta}} \, dy, \qquad (24)

where \phi_P denotes the characteristic function of P. If P belongs to P_β, then theorem 1 of Székely (2003) implies the equality of the right sides in (22) and (24). Essentially, the score computes a weighted distance between the characteristic function of P and the characteristic function of the point measure at the value that materializes.

4.4 Scoring Rules That Depend on First and Second Moments Only

An interesting question concerns proper scoring rules that apply to the Borel probability measures on R^m and depend on the predictive distribution P only through its mean vector μ_P and dispersion or covariance matrix Σ_P. Dawid (1998) and Dawid and Sebastiani (1999) studied proper scoring rules of this type. A particularly appealing example is the scoring rule

S(P,x) = -\log \det \Sigma_P - (x - \mu_P)' \Sigma_P^{-1} (x - \mu_P), \qquad (25)

which is linked to the generalized entropy function

G(P) = -\log \det \Sigma_P - m

and to the divergence function

d(P,Q) = \mathrm{tr}(\Sigma_P^{-1} \Sigma_Q) - \log \det (\Sigma_P^{-1} \Sigma_Q) + (\mu_P - \mu_Q)' \Sigma_P^{-1} (\mu_P - \mu_Q) - m.


[Note the order of the arguments in the definition (7) of the divergence function.] This scoring rule is proper, but not strictly proper, relative to the class P_2 of the Borel probability measures P for which E_P ||X||^2 is finite. It is strictly proper relative to any convex class of probability measures characterized by the first two moments, such as the Gaussian measures, for which (25) is equivalent to the logarithmic score (19). For other examples of scoring rules that depend on μ_P and Σ_P only, see (23) and the right column of table 1 of Dawid and Sebastiani (1999).

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) has lately attracted the attention of the statistical community. Suppose that we fit a predictive model to observed real-valued data x_1, ..., x_n. The predictive model choice criterion (PMCC) assesses the model fit through the quantity

\mathrm{PMCC} = \sum_{i=1}^{n} (x_i - \mu_i)^2 + \sum_{i=1}^{n} \sigma_i^2,

where μ_i and σ_i^2 denote the expected value and the variance of a replicate variable X_i, given the model and the observations. Within the framework of scoring rules, the PMCC corresponds to the positively oriented score

S(P,x) = -(x - \mu_P)^2 - \sigma_P^2, \qquad (26)

where P has mean μ_P and variance σ_P^2. The scoring rule (26) depends on the predictive distribution through its first two moments only, but it is improper: if the forecaster's true belief is P, and if he or she wishes to maximize the expected score, then he or she will quote the point measure at μ_P (that is, a deterministic forecast) rather than the predictive distribution P. This suggests that the predictive model choice criterion be replaced by a criterion based on the scoring rule (25), which reduces to

S(P,x) = -\left( \frac{x - \mu_P}{\sigma_P} \right)^2 - \log \sigma_P^2 \qquad (27)

in the case in which m = 1 and the observations are real-valued.
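The contrast between (26) and (27) is easy to check numerically. The following quick illustration is our own, not from the paper, and assumes NumPy; it confirms that the expected value of (26) rewards a collapsing predictive variance, whereas (27) is maximized by the honest one:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100000)  # observations drawn from N(0, 1)

def s_pmcc(mu, sigma, x):
    return -(x - mu)**2 - sigma**2                    # score (26); improper

def s_ds(mu, sigma, x):
    return -((x - mu) / sigma)**2 - np.log(sigma**2)  # score (27); proper

for sigma in (0.01, 0.5, 1.0, 2.0):
    print(sigma, s_pmcc(0.0, sigma, x).mean(), s_ds(0.0, sigma, x).mean())
# Under (26), the mean score increases as sigma -> 0, favoring a
# deterministic forecast; under (27), it peaks at the true sigma = 1.
```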

5. KERNEL SCORES, NEGATIVE AND POSITIVE DEFINITE FUNCTIONS, AND INEQUALITIES OF HOEFFDING TYPE

In this section we use negative definite functions to construct proper scoring rules and present expectation inequalities that are of independent interest.

5.1 Kernel Scores

Let Ω be a nonempty set. A real-valued function g on Ω × Ω is said to be a negative definite kernel if it is symmetric in its arguments and \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, g(x_i, x_j) \le 0 for all positive integers n, all a_1, ..., a_n ∈ R that sum to 0, and all x_1, ..., x_n ∈ Ω. Numerous examples of negative definite kernels have been given by Berg, Christensen, and Ressel (1984) and the references cited therein.

We now give the key result of this section, which generalizes a kernel construction of Eaton (1982, p. 335). The term kernel score was coined by Dawid (2006).

Theorem 4. Let Ω be a Hausdorff space, and let g be a nonnegative, continuous negative definite kernel on Ω × Ω. For a Borel probability measure P on Ω, let X and X' be independent random variables with distribution P. Then the scoring rule

S(P,x) = \tfrac{1}{2} \, \mathrm{E}_P \, g(X, X') - \mathrm{E}_P \, g(X, x) \qquad (28)

is proper relative to the class of the Borel probability measures P on Ω for which the expectation E_P g(X, X') is finite.

Proof. Let P and Q be Borel probability measures on Ω, and suppose that X, X' and Y, Y' are independent random variates with distribution P and Q, respectively. We need to show that

-\tfrac{1}{2} \, \mathrm{E}_Q \, g(Y, Y') \ge \tfrac{1}{2} \, \mathrm{E}_P \, g(X, X') - \mathrm{E}_{P,Q} \, g(X, Y). \qquad (29)

If the expectation E_{P,Q} g(X, Y) is infinite, then the inequality is trivially satisfied; if it is finite, then theorem 2.1 of Berg et al. (1984, p. 235) implies (29).

Next we give examples of scoring rules that admit a kernel representation. In each case, we equip the sample space with the standard topology. Note that evaluating the kernel scores is straightforward if P is discrete and has only a moderate number of atoms.

Example 7 (Quadratic or Brier score). Let Ω = {0, 1}, and suppose that g(0,0) = g(1,1) = 0 and g(0,1) = g(1,0) = 1. Then (28) recovers the quadratic or Brier score.

Example 8 (CRPS). If Ω = R and g(x, x') = |x − x'| for x, x' ∈ R in Theorem 4, we obtain the CRPS (21).

Example 9 (Energy score). If Ω = R^m, β ∈ (0, 2), and g(x, x') = ||x − x'||^β for x, x' ∈ R^m, where ||·|| denotes the Euclidean norm, then (28) recovers the energy score (22).

Example 10 (CRPS for circular variables). We let Ω = S denote the circle and write α(θ, θ') for the angular distance between two points θ, θ' ∈ S. Let P be a Borel probability measure on S, and let Θ and Θ' be independent random variates with distribution P. By theorem 1 of Gneiting (1998), angular distance is a negative definite kernel. Thus,

S(P, \theta) = \tfrac{1}{2} \, \mathrm{E}_P \, \alpha(\Theta, \Theta') - \mathrm{E}_P \, \alpha(\Theta, \theta) \qquad (30)

defines a proper scoring rule relative to the class of the Borel probability measures on the circle. Grimit et al. (2006) introduced (30) as an analog of the CRPS (21) that applies to directional variables, and used Fourier analytic tools to prove the propriety of the score.
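As a concrete instance of the remark preceding Example 7, the kernel score (30) for a discrete (ensemble) forecast of a direction reduces to two averages of angular distances. A minimal sketch, assuming NumPy and angles in radians; the function names are ours:

```python
import numpy as np

def angular_distance(t1, t2):
    # Angular distance on the circle, with values in [0, pi].
    d = np.abs(t1 - t2) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def circular_crps(ens, theta):
    # Positively oriented kernel score (30) for an ensemble of directions.
    ens = np.asarray(ens, dtype=float)
    n = ens.size
    e_tt = np.sum(angular_distance(ens[:, None], ens[None, :])) / n**2
    e_to = np.mean(angular_distance(ens, theta))
    return 0.5 * e_tt - e_to
```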

We turn to a far-reaching generalization of the energy score. For x = (x_1, ..., x_m) ∈ R^m and α ∈ (0, ∞], define the vector norm ||x||_α = (\sum_{i=1}^{m} |x_i|^α)^{1/α} if α ∈ (0, ∞) and ||x||_α = max_{1≤i≤m} |x_i| if α = ∞. Schoenberg's theorem (Berg et al. 1984, p. 74) and a strand of literature culminating in the work of Koldobskiĭ (1992) and Zastavnyi (1993) imply that if α ∈ (0, ∞] and β > 0, then the kernel

g(x, x') = \|x - x'\|_{\alpha}^{\beta}, \qquad x, x' \in \mathbb{R}^m,

is negative definite if and only if the following holds:


Assumption 1. Suppose that (a) m = 1, α ∈ (0, ∞], and β ∈ (0, 2]; (b) m ≥ 2, α ∈ (0, 2], and β ∈ (0, α]; or (c) m = 2, α ∈ (2, ∞], and β ∈ (0, 1].

Example 11 (Non-Euclidean energy score). Under Assumption 1, the scoring rule

S(P,x) = \tfrac{1}{2} \, \mathrm{E}_P \|X - X'\|_{\alpha}^{\beta} - \mathrm{E}_P \|X - x\|_{\alpha}^{\beta}

is proper relative to the class of the Borel probability measures P on R^m for which the expectation E_P ||X − X'||_α^β is finite. If m = 1 or α = 2, then we recover the energy score; if m ≥ 2 and α ≠ 2, then we obtain non-Euclidean analogs. Mattner (1997, sec. 5.2) showed that if α ≥ 1, then E_{P,Q} ||X − Y||_α^β is finite if and only if E_P ||X||_α^β and E_Q ||Y||_α^β are finite. In particular, if α ≥ 1, then E_P ||X − X'||_α^β is finite if and only if E_P ||X||_α^β is finite.

The following result sharpens Theorem 4 in the crucial case of Euclidean sample spaces and spherically symmetric negative definite functions. Recall that a function η on (0, ∞) is said to be completely monotone if it has derivatives η^{(k)} of all orders and (−1)^k η^{(k)}(t) ≥ 0 for all nonnegative integers k and all t > 0.

Theorem 5. Let ψ be a continuous function on [0, ∞) with −ψ' completely monotone and not constant. For a Borel probability measure P on R^m, let X and X' be independent random vectors with distribution P. Then the scoring rule

S(P,x) = \tfrac{1}{2} \, \mathrm{E}_P \, \psi(\|X - X'\|_2^2) - \mathrm{E}_P \, \psi(\|X - x\|_2^2)

is strictly proper relative to the class of the Borel probability measures P on R^m for which E_P ψ(||X − X'||_2^2) is finite.

The proof of this result is immediate from theorem 2.2 of Mattner (1997). In particular, if ψ(t) = t^{β/2} for β ∈ (0, 2), then Theorem 5 ensures the strict propriety of the energy score relative to the class of the Borel probability measures P on R^m for which E_P ||X||_2^β is finite.

5.2 Inequalities of Hoeffding Type and Positive Definite Kernels

A number of side results seem to be of independent interest, even though they are easy consequences of previous work. Briefly, if the expectations E_P g(X, X') and E_Q g(Y, Y') are finite, then (29) can be written as a Hoeffding-type inequality,

2 \, \mathrm{E}_{P,Q} \, g(X, Y) - \mathrm{E}_P \, g(X, X') - \mathrm{E}_Q \, g(Y, Y') \ge 0. \qquad (31)

Theorem 1 of Székely and Rizzo (2005) provides a nearly identical result and a converse: if g is not negative definite, then there are counterexamples to (31), and the respective scoring rule is improper. Furthermore, if Ω is a group and the negative definite function g satisfies g(x, x') = g(−x, −x') for x, x' ∈ Ω, then a special case of (31) can be stated as

\mathrm{E}_P \, g(X, -X') \ge \mathrm{E}_P \, g(X, X'). \qquad (32)

In particular, if Ω = R^m and Assumption 1 holds, then inequalities (31) and (32) apply and reduce to

2 \, \mathrm{E} \|X - Y\|_{\alpha}^{\beta} - \mathrm{E} \|X - X'\|_{\alpha}^{\beta} - \mathrm{E} \|Y - Y'\|_{\alpha}^{\beta} \ge 0 \qquad (33)

and

\mathrm{E} \|X - X'\|_{\alpha}^{\beta} \le \mathrm{E} \|X + X'\|_{\alpha}^{\beta}, \qquad (34)

thereby generalizing results of Buja, Logan, Reeds, and Shepp (1994), Székely (2003), and Baringhaus and Franz (2004).

In the foregoing case, in which Ω is a group and g satisfies g(x, x') = g(−x, −x') for x, x' ∈ Ω, the argument leading to theorem 2.3 of Buja et al. (1994) and theorem 4 of Ma (2003) implies that

h(x, x') = g(x, -x') - g(x, x'), \qquad x, x' \in \Omega, \qquad (35)

is a positive definite kernel, in the sense that h is symmetric in its arguments and \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, h(x_i, x_j) \ge 0 for all positive integers n, all a_1, ..., a_n ∈ R, and all x_1, ..., x_n ∈ Ω. Specifically, under Assumption 1,

h(x, x') = \|x + x'\|_{\alpha}^{\beta} - \|x - x'\|_{\alpha}^{\beta}, \qquad x, x' \in \mathbb{R}^m, \qquad (36)

is a positive definite kernel, a result that extends and completes the aforementioned theorem of Buja et al. (1994).

5.3 Constructions With Complex-Valued Kernels

With suitable modifications, the foregoing results allow for complex-valued kernels. A complex-valued function h on Ω × Ω is said to be a positive definite kernel if it is Hermitian, that is, h(x, x') = \overline{h(x', x)} for x, x' ∈ Ω, and \sum_{i=1}^{n} \sum_{j=1}^{n} c_i \bar{c}_j \, h(x_i, x_j) \ge 0 for all positive integers n, all c_1, ..., c_n ∈ C, and all x_1, ..., x_n ∈ Ω. The general idea (Dawid 1998, 2006) is that if h is continuous and positive definite, then

S(P,x) = \mathrm{E}_P \, h(X, x) + \mathrm{E}_P \, h(x, X) - \mathrm{E}_P \, h(X, X') \qquad (37)

defines a proper scoring rule. If h is positive definite, then g = −h is negative definite; thus, if h is real-valued and sufficiently regular, then the scoring rules (37) and (28) are equivalent.

In the next example we discuss scoring rules for Borel probability measures and observations on Euclidean spaces. However, the representation (37) allows for the construction of proper scoring rules in more general settings, such as probabilistic forecasts of structured data, including strings, sequences, graphs, and sets, based on positive definite kernels defined on such structures (Hofmann, Schölkopf, and Smola 2005).

Example 12. Let Ω = R^m and y ∈ R^m, and consider the positive definite kernel h(x, x') = e^{i⟨x−x', y⟩} − 1, where x, x' ∈ R^m. Then (37) reduces to

S(P,x) = -\bigl| \phi_P(y) - e^{i \langle x, y \rangle} \bigr|^2, \qquad (38)

that is, the negative squared distance between the characteristic function of the predictive distribution, \phi_P, and the characteristic function of the point measure at the value that materializes, evaluated at y ∈ R^m. If we integrate with respect to a nonnegative measure μ(dy), then the scoring rule (38) generalizes to

S(P,x) = -\int_{\mathbb{R}^m} \bigl| \phi_P(y) - e^{i \langle x, y \rangle} \bigr|^2 \, \mu(dy). \qquad (39)

If the measure μ is finite and assigns positive mass to all intervals, then this scoring rule is strictly proper relative to the class of the Borel probability measures on R^m. Eaton, Giovagnoli, and Sebastiani (1996) used the associated divergence function


to define metrics for probability measures. If μ is the infinite measure with Lebesgue density \|y\|^{-m-\beta}, where β ∈ (0, 2), then the scoring rule (39) is equivalent to the Euclidean energy score (24).

6. SCORING RULES FOR QUANTILE AND INTERVAL FORECASTS

Occasionally, full predictive distributions are difficult to specify, and the forecaster might quote predictive quantiles, such as value at risk in financial applications (Duffie and Pan 1997), or prediction intervals (Christoffersen 1998) only.

6.1 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantity that take the form of predictive quantiles. Specifically, suppose that the quantiles at the levels α_1, ..., α_k ∈ (0, 1) are sought. If the forecaster quotes quantiles r_1, ..., r_k and x materializes, then he or she will be rewarded by the score S(r_1, ..., r_k; x). We define

S(r_1, \ldots, r_k; P) = \int S(r_1, \ldots, r_k; x) \, dP(x)

as the expected score under the probability measure P when the forecaster quotes the quantiles r_1, ..., r_k. To avoid technical complications, we suppose that P belongs to the convex class P of Borel probability measures on R that have finite moments of all orders and whose distribution function is strictly increasing on R. For P ∈ P, let q_1, ..., q_k denote the true P-quantiles at levels α_1, ..., α_k. Following Cervera and Muñoz (1996), we say that a scoring rule S is proper if

S(q_1, \ldots, q_k; P) \ge S(r_1, \ldots, r_k; P)

for all real numbers r_1, ..., r_k and for all probability measures P ∈ P. If S is proper, then the forecaster who wishes to maximize the expected score is encouraged to be honest and to volunteer his or her true beliefs.

To avoid technical overhead, we tacitly assume P-integrability whenever appropriate. Essentially, we require that the functions s(x) and h(x) in (40) and (42) be P-measurable and grow at most polynomially in x. Theorem 6 addresses the prediction of a single quantile; Corollary 1 turns to the general case.

Theorem 6. If s is nondecreasing and h is arbitrary, then the scoring rule

S(r; x) = \alpha s(r) + (s(x) - s(r)) \, \mathbf{1}\{ x \le r \} + h(x) \qquad (40)

is proper for predicting the quantile at level α ∈ (0, 1).

Proof. Let q be the unique α-quantile of the probability measure P ∈ P. We identify P with the associated distribution function, so that P(q) = α. If r < q, then

S(q; P) - S(r; P) = \int_{(r,q)} s(x) \, dP(x) + s(r) P(r) - \alpha s(r)
\ge s(r) (P(q) - P(r)) + s(r) P(r) - \alpha s(r) = 0,

as desired. If r > q, then an analogous argument applies.

If s(x) = x and h(x) = −αx, then we obtain the scoring rule

S(r; x) = (x - r) \left( \mathbf{1}\{ x \le r \} - \alpha \right), \qquad (41)

which has been proposed by Koenker and Machado (1999), Taylor (1999), Giacomini and Komunjer (2005), Theis (2005, p. 232), and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performance in meteorological and financial applications. In negative orientation, the econometric literature refers to the scoring rule (41) as the tick or check loss function.
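In code, (41) is a one-liner, and its propriety can be verified by a quick Monte Carlo check. A minimal sketch, assuming NumPy and SciPy; the function name quantile_score is ours:

```python
import numpy as np
from scipy.stats import norm

def quantile_score(r, x, alpha):
    # Positively oriented quantile score (41) for a quoted alpha-quantile r.
    x = np.asarray(x, dtype=float)
    return (x - r) * ((x <= r) - alpha)

# The expected score is maximized at the true quantile:
rng = np.random.default_rng(0)
x = rng.normal(size=200000)
grid = np.linspace(-3.0, 3.0, 601)
best = grid[np.argmax([quantile_score(r, x, 0.9).mean() for r in grid])]
print(best, norm.ppf(0.9))  # both close to 1.2816
```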

Corollary 1. If s_i is nondecreasing for i = 1, ..., k and h is arbitrary, then the scoring rule

S(r_1, \ldots, r_k; x) = \sum_{i=1}^{k} \bigl[ \alpha_i s_i(r_i) + (s_i(x) - s_i(r_i)) \, \mathbf{1}\{ x \le r_i \} \bigr] + h(x) \qquad (42)

is proper for predicting the quantiles at levels α_1, ..., α_k ∈ (0, 1).

Cervera and Muñoz (1996, pp. 515 and 519) proved Corollary 1 in the special case in which each s_i is linear. They asked whether the resulting rules are the only proper ones for quantiles. Our results give a negative answer; that is, the class of proper scoring rules for quantiles is considerably larger than anticipated by Cervera and Muñoz. We do not know whether or not (40) and (42) provide the general form of proper scoring rules for quantiles.

6.2 Interval Score

Interval forecasts form a crucial special case of quantile prediction. We consider the classical case of the central (1 − α) × 100% prediction interval, with lower and upper endpoints that are the predictive quantiles at level α/2 and 1 − α/2. We denote a scoring rule for the associated interval forecast by S_α(l, u; x), where l and u represent the quoted α/2 and 1 − α/2 quantiles. Thus, if the forecaster quotes the (1 − α) × 100% central prediction interval [l, u] and x materializes, then his or her score will be S_α(l, u; x). Putting α_1 = α/2, α_2 = 1 − α/2, s_1(x) = s_2(x) = 2x/α, and h(x) = −2x/α in (42), and reversing the sign of the scoring rule, yields the negatively oriented interval score,

S_{\alpha}^{\mathrm{int}}(l, u; x) = (u - l) + \frac{2}{\alpha} (l - x) \, \mathbf{1}\{ x < l \} + \frac{2}{\alpha} (x - u) \, \mathbf{1}\{ x > u \}. \qquad (43)

This scoring rule has intuitive appeal and can be traced back to Dunsmore (1968), Winkler (1972), and Winkler and Murphy (1979). The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval. In the case α = 1/2, Hamill and Wilks (1995, p. 622) used a scoring rule that is equivalent to the interval score. They noted that "a strategy for gaming [...] was not obvious," thereby conjecturing propriety, which is confirmed by the foregoing. We anticipate novel applications, particularly for the evaluation of volatility forecasts in computational finance.
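The interval score (43) is straightforward to vectorize. A minimal sketch, assuming NumPy; the function name interval_score is ours, and it is reused in the simulation sketch in Section 6.3:

```python
import numpy as np

def interval_score(l, u, x, alpha):
    # Negatively oriented interval score (43): interval width plus
    # penalties, scaled by 2/alpha, when the observation x falls outside.
    x = np.asarray(x, dtype=float)
    return ((u - l)
            + (2.0 / alpha) * np.maximum(l - x, 0.0)
            + (2.0 / alpha) * np.maximum(x - u, 0.0))
```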


6.3 Case Study: Interval Forecasts for a Conditionally Heteroscedastic Process

This section illustrates the use of the interval score in a time series context. Kabaila (1999) called for rigorous ways of specifying prediction intervals for conditionally heteroscedastic processes and proposed a relevance criterion in terms of conditional coverage and width dependence. We contend that the notion of proper scoring rules provides an alternative, and possibly simpler, more general, and more rigorous, paradigm. The prediction intervals that we deem appropriate derive from the true conditional distribution, as implied by the data-generating mechanism, and optimize the expected value of all proper scoring rules.

To fix the idea, consider the stationary bilinear process {X_t : t ∈ Z} defined by

X_{t+1} = \tfrac{1}{2} X_t + \tfrac{1}{2} X_t \varepsilon_t + \varepsilon_t, \qquad (44)

where the ε_t's are independent standard Gaussian random variates. Kabaila and He (2001) studied central one-step-ahead prediction intervals at the 95% level. The process is Markovian, and the conditional distribution of X_{t+1} given X_t, X_{t−1}, ... is Gaussian with mean (1/2)X_t and variance (1 + (1/2)X_t)^2, thereby suggesting the prediction interval

I = \left[ \tfrac{1}{2} X_t - c \left| 1 + \tfrac{1}{2} X_t \right|, \; \tfrac{1}{2} X_t + c \left| 1 + \tfrac{1}{2} X_t \right| \right], \qquad (45)

where c = Φ^{−1}(.975). This interval satisfies the relevance property of Kabaila (1999), and Kabaila and He (2001) adopted I as the standard prediction interval. We agree with this choice, but we prefer the aforementioned more direct justification: the prediction interval I is the standard interval because its lower and upper endpoints are the 2.5% and 97.5% percentiles of the true conditional distribution function. Kabaila and He considered two alternative prediction intervals,

J = \left[ F^{-1}(.025), \; F^{-1}(.975) \right], \qquad (46)

where F denotes the unconditional, stationary distribution function of X_t, and

K = \left[ \tfrac{1}{2} X_t - \gamma\!\left( \left| 1 + \tfrac{1}{2} X_t \right| \right), \; \tfrac{1}{2} X_t + \gamma\!\left( \left| 1 + \tfrac{1}{2} X_t \right| \right) \right], \qquad (47)

where γ(y) = (2 (log 7.36 − log y))^{1/2} y for y ≤ 7.36 and γ(y) = 0 otherwise. This choice minimizes the expected width of the prediction interval under the constraint of nominal coverage. However, the interval forecast K seems misguided, in that it collapses to a point forecast when the conditional predictive variance is highest.

We generated a sample path {X_t : t = 1, ..., 100,001} from the bilinear process (44) and considered sequential one-step-ahead interval forecasts for X_{t+1}, where t = 1, ..., 100,000; a simulation sketch follows Table 2. Table 2 summarizes the results of this experiment. The interval forecasts I, J, and K all showed close to nominal coverage, with the prediction interval K being sharpest on average. Nevertheless, the classical prediction interval I performed best in terms of the interval score.

Table 2. Comparison of One-Step-Ahead 95% Interval Forecasts for the Stationary Bilinear Process (44)

Interval forecast   Empirical coverage   Average width   Average interval score
I (45)              95.01%               4.00            4.77
J (46)              95.08%               5.45            8.04
K (47)              94.98%               3.79            5.32

NOTE: The table shows the empirical coverage, the average width, and the average value of the negatively oriented interval score (43) for the prediction intervals I, J, and K, in 100,000 sequential forecasts in a sample path of length 100,001. See text for details.
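An experiment of this flavor can be reproduced in a few lines. The following is our own sketch, assuming NumPy and SciPy and the interval_score function from Section 6.2; it simulates (44) and scores the intervals I and K (the interval J additionally requires the unconditional stationary distribution):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
T = 100000
c = norm.ppf(0.975)

# Simulate the bilinear process (44).
x = np.zeros(T + 1)
for t in range(T):
    eps = rng.standard_normal()
    x[t + 1] = 0.5 * x[t] + 0.5 * x[t] * eps + eps

def gamma(y):
    # Width function for the interval K; zero for y > 7.36.
    val = 2.0 * (np.log(7.36) - np.log(np.maximum(y, 1e-300)))
    return np.where(y <= 7.36, np.sqrt(np.maximum(val, 0.0)) * y, 0.0)

xt, xt1 = x[:-1], x[1:]
mu, sd = 0.5 * xt, np.abs(1.0 + 0.5 * xt)
intervals = {
    "I": (mu - c * sd, mu + c * sd),        # eq. (45)
    "K": (mu - gamma(sd), mu + gamma(sd)),  # eq. (47)
}
for name, (l, u) in intervals.items():
    cover = np.mean((xt1 >= l) & (xt1 <= u))
    width = np.mean(u - l)
    score = np.mean(interval_score(l, u, xt1, alpha=0.05))
    print(name, round(cover, 4), round(width, 2), round(score, 2))
```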

6.4 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function is equivalent to specifying all predictive quantiles; thus, we can build scoring rules for predictive distributions from scoring rules for quantiles. Matheson and Winkler (1976) and Cervera and Muñoz (1996) suggested ways of doing this. Specifically, if S_α denotes a proper scoring rule for the quantile at level α, and ν is a Borel measure on (0, 1), then the scoring rule

S(F, x) = \int_0^1 S_{\alpha}(F^{-1}(\alpha), x) \, \nu(d\alpha) \qquad (48)

is proper, subject to regularity and integrability constraints.

Similarly, we can build scoring rules for predictive distributions from scoring rules for binary probability forecasts. If S denotes a proper scoring rule for probability forecasts, and ν is a Borel measure on R, then the scoring rule

S(F, x) = \int_{-\infty}^{\infty} S(F(y), \mathbf{1}\{ x \le y \}) \, \nu(dy) \qquad (49)

is proper, subject to integrability constraints (Matheson and Winkler 1976; Gerds 2002). The CRPS (20) corresponds to the special case in (49) in which S is the quadratic or Brier score and ν is the Lebesgue measure. If S is the Brier score and ν is a sum of point measures, then the ranked probability score (Epstein 1969) emerges.

The construction carries over to multivariate settings. If P denotes the class of the Borel probability measures on R^m, then we identify a probabilistic forecast P ∈ P with its cumulative distribution function F. A multivariate analog of the CRPS can be defined as

\mathrm{CRPS}(F, \mathbf{x}) = -\int_{\mathbb{R}^m} \bigl( F(\mathbf{y}) - \mathbf{1}\{ \mathbf{x} \le \mathbf{y} \} \bigr)^2 \, \nu(d\mathbf{y}).

This is a weighted integral of the Brier scores at all m-variate thresholds. The Borel measure ν can be chosen to encourage the forecaster to concentrate his or her efforts on the important ones. If ν is a finite measure that dominates the Lebesgue measure, then this scoring rule is strictly proper relative to the class P.

7. SCORING RULES, BAYES FACTORS, AND RANDOM-FOLD CROSS-VALIDATION

We now relate proper scoring rules to Bayes factors and to cross-validation, and we propose a novel form of cross-validation, random-fold cross-validation.


7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample X = (X_1, ..., X_n) of values to be forecast. Suppose also that we have two forecasting rules, based on probabilistic models H_1 and H_2. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the X_i's are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for H_1 against H_2 is

B = \frac{P(X \mid H_1)}{P(X \mid H_2)}, \qquad (50)

where P(X \mid H_k) = \prod_{i=1}^{n} P(X_i \mid H_k) for k = 1, 2 (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

\log B = \mathrm{LogS}(H_1, X) - \mathrm{LogS}(H_2, X). \qquad (51)

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters θ_k for H_k, to be estimated from the data. Then the Bayes factor is still given by (50), but now P(X | H_k) is the integrated likelihood,

P(X \mid H_k) = \int p(X \mid \theta_k, H_k) \, p(\theta_k \mid H_k) \, d\theta_k,

where p(X | θ_k, H_k) is the (usual) likelihood under model H_k and p(θ_k | H_k) is the prior distribution of the parameter θ_k.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

P(X \mid H_k) = \prod_{t=1}^{n} P(X_t \mid X^{t-1}, H_k), \qquad (52)

where X^{t-1} = (X_1, ..., X_{t-1}) if t ≥ 1, X^0 is the empty set, and P(X_t | X^{t-1}, H_k) is the predictive distribution of X_t given the past values under H_k, namely

P(X_t \mid X^{t-1}, H_k) = \int p(X_t \mid \theta_k, H_k) \, P(\theta_k \mid X^{t-1}, H_k) \, d\theta_k,

with P(θ_k | X^{t-1}, H_k) the posterior distribution of θ_k given the past observations X^{t-1}.

We let S_{k,B} = log P(X | H_k) denote the log-integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

S_{k,B} = \sum_{t=1}^{n} \log P(X_t \mid X^{t-1}, H_k). \qquad (53)

Dawid (1984) showed that S_{k,B} is asymptotically equivalent to the plug-in maximum likelihood prequential score,

S_{k,D} = \sum_{t=1}^{n} \log P(X_t \mid X^{t-1}, \hat{\theta}_k^{t-1}), \qquad (54)

where \hat{\theta}_k^{t-1} is the maximum likelihood estimator (MLE) of θ_k based on the past observations X^{t-1}, in the sense that S_{k,D}/S_{k,B} → 1 as n → ∞. Initial terms for which \hat{\theta}_k^{t-1} is possibly undefined can be ignored. Dawid also showed that S_{k,B} is asymptotically equivalent to the Bayes information criterion (BIC) score,

S_{k,\mathrm{BIC}} = \sum_{t=1}^{n} \log P(X_t \mid X^{t-1}, \hat{\theta}_k^{n}) - \frac{d_k}{2} \log n,

where d_k = dim(θ_k), in the same sense, namely S_{k,BIC}/S_{k,B} → 1 as n → ∞. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

S_{k,B}^{*} = \sum_{t=1}^{n} \mathrm{E}_D \bigl[ \log p(X_t \mid X(D), H_k) \bigr], \qquad (55)

where D is a random sample from {1, ..., t − 1, t + 1, ..., n}, the size of which is a random variable with a discrete uniform distribution on {0, 1, ..., n − 1}. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version,

S_{k,D}^{*} = \sum_{t=1}^{n} \mathrm{E}_D \bigl[ \log p(X_t \mid X(D), \hat{\theta}_k^{(D)}, H_k) \bigr], \qquad (56)

where \hat{\theta}_k^{(D)} is the MLE of θ_k based on X(D). Terms for which the size of D is small and \hat{\theta}_k^{(D)} is possibly undefined can be ignored.

The formulations (55) and (56) may be useful, because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating S_{k,B}^{*} or S_{k,D}^{*} by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random, rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out,


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
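A Monte Carlo evaluation of (56) is simple to sketch. The following toy illustration is our own; it assumes NumPy and SciPy and a Gaussian model with unknown mean and known unit variance, so that the plug-in MLE is the fold average, and it ignores folds too small for the MLE to be defined:

```python
import numpy as np
from scipy.stats import norm

def random_fold_cv_score(x, n_draws=2000, rng=None):
    # Monte Carlo estimate of (56): draw a held-out index t and a random
    # fold D of random size, fit by maximum likelihood on X(D), and score
    # the held-out observation with the logarithmic score.
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    n = x.size
    total = 0.0
    for _ in range(n_draws):
        t = rng.integers(n)            # held-out observation
        size = rng.integers(2, n)      # random fold size (>= 2, so MLE exists)
        pool = np.delete(np.arange(n), t)
        d = rng.choice(pool, size=size, replace=False)
        mu_hat = x[d].mean()           # plug-in MLE from X(D)
        total += norm.logpdf(x[t], loc=mu_hat, scale=1.0)
    return n * total / n_draws         # sum over t, estimated by Monte Carlo
```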

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus, we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In S_{k,B}, S_{k,D}, S_{k,BIC}, S_{k,B}^{*}, and S_{k,D}^{*}, we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of r_0 = 1.55.

This underdispersive behavior, that is, observed errors that tend to be larger on average than suggested by the ensemble spread, is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say \varphi_{\mu, r\sigma}; its mean, μ, is the ensemble mean forecast, and its standard deviation, rσ, is the product of the inflation factor r and the standard deviation of the five-member forecast ensemble, σ. We considered various scoring rules S and computed the average score,

s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S(\varphi_{\mu_i, r\sigma_i}, x_i), \qquad r > 0, \qquad (57)

as a function of the inflation factor r. The index i refers to the ith record in the verification database, and x_i denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect s(r) to be maximized at some r > 1, possibly near the observed ratio, r_0 = 1.55, of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors r ∈ (0, 5) and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if p denotes the predictive density and x denotes the observed value, then

\mathrm{QS}(p, x) = 2 p(x) - \int_{-\infty}^{\infty} p(y)^2 \, dy,

\mathrm{SphS}(p, x) = p(x) \Big/ \left( \int_{-\infty}^{\infty} p(y)^2 \, dy \right)^{1/2},

\mathrm{LogS}(p, x) = \log p(x),

\mathrm{CRPS}(p, x) = \tfrac{1}{2} \, \mathrm{E}_p |X - X'| - \mathrm{E}_p |X - x|,

\mathrm{LinS}(p, x) = p(x),

and

\mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y) \, dy.
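For a Gaussian predictive density, all of these scores are available in closed form, because \int \varphi_{\mu,\sigma}(y)^2 \, dy = 1/(2\sigma\sqrt{\pi}). A minimal sketch, assuming NumPy and SciPy and reusing the crps_gaussian function from the CRPS sketch above; the function name density_scores is ours:

```python
import numpy as np
from scipy.stats import norm

def density_scores(mu, sigma, x):
    # QS, SphS, LogS, and CRPS for the Gaussian predictive density
    # N(mu, sigma^2), using int p(y)^2 dy = 1 / (2 sigma sqrt(pi)).
    p_x = norm.pdf(x, mu, sigma)
    p2 = 1.0 / (2.0 * sigma * np.sqrt(np.pi))
    return {
        "QS": 2.0 * p_x - p2,
        "SphS": p_x / np.sqrt(p2),
        "LogS": np.log(p_x),
        "CRPS": crps_gaussian(mu, sigma, x),
    }
```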

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed, so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57), plus 6. Clearly, transformed and original scores are equivalent in the sense of (2).

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      Argmax_r s(r) in eq. (57)   Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                        40s + 6
Spherical score (SphS)     1.84                        108s − 22
Logarithmic score (LogS)   2.41                        s + 13
CRPS                       1.62                        10s + 8
Linear score (LinS)        0.5                         105s − 5
Probability score (PS)     0.2                         60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of the ensemble. These scores are proper. The linear and probability scores were maximized at r = 0.5 and r = 0.2, respectively, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions, unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central (1 − α) × 100% prediction interval, where α = .50 and α = .10. The associated lower and upper prediction bounds, l_i and u_i, are the α/2 and 1 − α/2 quantiles of a Gaussian distribution with mean μ_i and standard deviation rσ_i, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor r in two ways: by computing the empirical coverage of the prediction intervals, and by computing

s_{\alpha}(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S_{\alpha}^{\mathrm{int}}(l_i, u_i; x_i), \qquad r > 0, \qquad (58)

where S_α^int denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For α = .50 and α = .10, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For α = .50 and α = .10, the score was optimized at r = 1.56 and r = 1.72, respectively.

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model P_θ based on a sample X_1, ..., X_n of identically distributed observations. To estimate θ, we can measure the goodness of fit by


Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed) and the 90% central prediction interval (α = .10, solid; score scaled by a factor of .60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

the mean score,

S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_{\theta}, X_i),

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ_0 denotes the true parameter value, then asymptotic arguments indicate that

\arg\max_{\theta} S_n(\theta) \to \theta_0 \quad \text{as } n \to \infty. \qquad (59)

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take \hat{\theta}_n = \arg\max_{\theta} S_n(\theta) as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.
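Optimum score estimation is easy to carry out numerically. The following toy sketch is our own, not from the paper; it fits a Gaussian by maximizing the mean CRPS (that is, minimum CRPS estimation, in the spirit of Gneiting et al. 2005), reusing the crps_gaussian function from the CRPS sketch above, and it assumes NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize

def fit_gaussian_min_crps(x):
    # Optimum score estimation (59): maximize the mean CRPS of the
    # parametric forecast N(mu, sigma^2) over (mu, log sigma).
    x = np.asarray(x, dtype=float)
    def neg_mean_crps(params):
        mu, log_sigma = params
        return -np.mean(crps_gaussian(mu, np.exp(log_sigma), x))
    res = minimize(neg_mean_crps, x0=[x.mean(), np.log(x.std())],
                   method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])  # (mu_hat, sigma_hat)
```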

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I; \theta) = c \, \lambda(I) - \mathbf{1}\{ \theta \in I \}, \qquad (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and we propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_{\alpha}(I; \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta| \qquad (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1, we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function, D(P, x), gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule, S(P, x), that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.

Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.

Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.

Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.

Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.

Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.

Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.


Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.

Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.

Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.

Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.

Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.

Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.

Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.

Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.

Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.

Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.

Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.

Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.

Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.

Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.

Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.

Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.

Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.

Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.

Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.

Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.

Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.

Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.

Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.

Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.

Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.

Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.

Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615–1632.

Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.

Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.

Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.

Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the UK Economy," Journal of the American Statistical Association, 98, 829–838.

Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.

Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.

Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.

Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.

Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.

Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.

Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.

Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.

Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.

Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.

Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart, and Winston, pp. 337–339.

Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.

Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.

Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.

Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.

Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.

Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.

Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.

Hendrickson A D and Buehler R J (1971) ldquoProper Scores for ProbabilityForecastersrdquo The Annals of Mathematical Statistics 42 1916ndash1921

Hersbach H (2000) ldquoDecomposition of the Continuous Ranked Probabil-ity Score for Ensemble Prediction Systemsrdquo Weather and Forecasting 15559ndash570

Hofmann T Schoumllkopf B and Smola A (2005) ldquoA Review of RKHS Meth-ods in Machine Learningrdquo preprint

Huber P J (1964) ldquoRobust Estimation of a Location Parameterrdquo The Annalsof Mathematical Statistics 35 73ndash101

(1967) ldquoThe Behavior of Maximum Likelihood Estimates Under Non-Standard Conditionsrdquo in Proceedings of the Fifth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam and J NeymanBerkeley CA University of California Press pp 221ndash233

(1981) Robust Statistics New York WileyJeffreys H (1939) Theory of Probability Oxford UK Oxford University

PressJolliffe I T (2006) ldquoUncertainty and Inference for Verification Measuresrdquo

Weather and Forecasting in pressJolliffe I T and Stephenson D B (eds) (2003) Forecast Verification

A Practicionerrsquos Guide in Atmospheric Science Chichester UK WileyKabaila P (1999) ldquoThe Relevance Property for Prediction Intervalsrdquo Journal

of Time Series Analysis 20 655ndash662Kabaila P and He Z (2001) ldquoOn Prediction Intervals for Conditionally Het-

eroscedastic Processesrdquo Journal of Time Series Analysis 22 725ndash731Kass R E and Raftery A E (1995) ldquoBayes Factorsrdquo Journal of the American

Statistical Association 90 773ndash795Knorr-Held L and Rainer E (2001) ldquoProjections of Lung Cancer in West

Germany A Case Study in Bayesian Predictionrdquo Biostatistics 2 109ndash129Koenker R and Bassett G (1978) ldquoRegression Quantilesrdquo Econometrica 46

33ndash50

378 Journal of the American Statistical Association March 2007

Koenker R and Machado J A F (1999) ldquoGoodness-of-Fit and Related Infer-ence Processes for Quantile Regressionrdquo Journal of the American StatisticalAssociation 94 1296ndash1310

Kohonen J and Suomela J (2006) ldquoLessons Learned in the Challenge Mak-ing Predictions and Scoring Themrdquo in Machine Learning Challenges Eval-uating Predictive Uncertainty Visual Object Classification and RecognizingTextual Entailment eds J Quinonero-Candela I Dagan B Magnini andF drsquoAlcheacute-Buc Berlin Springer-Verlag pp 95ndash116

Koldobskiı A L (1992) ldquoSchoenbergrsquos Problem on Positive Definite Func-tionsrdquo St Petersburg Mathematical Journal 3 563ndash570

Krzysztofowicz R and Sigrest A A (1999) ldquoComparative Verification ofGuidance and Local Quantitative Precipitation Forecasts Calibration Analy-sesrdquo Weather and Forecasting 14 443ndash454

Langland R H Toth Z Gelaro R Szunyogh I Shapiro M A MajumdarS J Morss R E Rohaly G D Velden C Bond N and BishopC H (1999) ldquoThe North Pacific Experiment (NORPEX-98) Targeted Ob-servations for Improved North American Weather Forecastsrdquo Bulletin of theAmerican Meteorological Society 90 1363ndash1384

Laud P W and Ibrahim J G (1995) ldquoPredictive Model Selectionrdquo Journalof the Royal Statistical Society Ser B 57 247ndash262

Lehmann E and Casella G (1998) Theory of Point Estimation (2nd ed)New York Springer

Liu R Y (1990) ldquoOn a Notion of Data Depth Based on Random SimplicesrdquoThe Annals of Statistics 18 405ndash414

Ma C (2003) ldquoNonstationary Covariance Functions That Model SpacendashTimeInteractionsrdquo Statistics amp Probability Letters 61 411ndash419

Mason S J (2004) ldquoOn Using Climatology as a Reference Strategy in theBrier and Ranked Probability Skill Scoresrdquo Monthly Weather Review 1321891ndash1895

Matheron G (1984) ldquoThe Selectivity of the Distributions and the lsquoSecondPrinciple of Geostatisticsrsquo rdquo in Geostatistics for Natural Resources Charac-terization eds G Verly M David and A G Journel Dordrecht Reidelpp 421ndash434

Matheson J E and Winkler R L (1976) ldquoScoring Rules for ContinuousProbability Distributionsrdquo Management Science 22 1087ndash1096

Mattner L (1997) ldquoStrict Definiteness via Complete Monotonicity of Inte-gralsrdquo Transactions of the American Mathematical Society 349 3321ndash3342

McCarthy J (1956) ldquoMeasures of the Value of Informationrdquo Proceedings ofthe National Academy of Sciences 42 654ndash655

Murphy A H (1973) ldquoHedging and Skill Scores for Probability ForecastsrdquoJournal of Applied Meteorology 12 215ndash223

Murphy A H and Winkler R L (1992) ldquoDiagnostic Verification of Proba-bility Forecastsrdquo International Journal of Forecasting 7 435ndash455

Nau R F (1985) ldquoShould Scoring Rules Be lsquoEffectiversquordquo Management Sci-ence 31 527ndash535

Palmer T N (2002) ldquoThe Economic Value of Ensemble Forecasts as a Toolfor Risk Assessment From Days to Decadesrdquo Quarterly Journal of the RoyalMeteorological Society 128 747ndash774

Pepe M S (2003) The Statistical Evaluation of Medical Tests for Classifica-tion and Prediction Oxford UK Oxford University Press

Perlman M D (1972) ldquoOn the Strong Consistency of Approximate MaximumLikelihood Estimatorsrdquo in Proceedings of the Sixth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam J Neyman andE L Scott Berkeley CA University of California Press pp 263ndash281

Pfanzagl J (1969) ldquoOn the Measurability and Consistency of Minimum Con-trast Estimatesrdquo Metrika 14 249ndash272

Potts J (2003) ldquoBasic Conceptsrdquo in Forecast Verification A PracticionerrsquosGuide in Atmospheric Science eds I T Jolliffe and D B Stephenson Chich-ester UK Wiley pp 13ndash36

Quintildeonero-Candela J Rasmussen C E Sinz F Bousquet O andSchoumllkopf B (2006) ldquoEvaluating Predictive Uncertainty Challengerdquo in Ma-chine Learning Challenges Evaluating Predictive Uncertainty Visual Ob-ject Classification and Recognizing Textual Entailment eds J Quinonero-Candela I Dagan B Magnini and F drsquoAlcheacute-Buc Berlin Springerpp 1ndash27

Raftery A E Gneiting T Balabdaoui F and Polakowski M (2005) ldquoUs-ing Bayesian Model Averaging to Calibrate Forecast Ensemblesrdquo MonthlyWeather Review 133 1155ndash1174

Rockafellar R T (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Roulston M S and Smith L A (2002) ldquoEvaluating Probabilistic ForecastsUsing Information Theoryrdquo Monthly Weather Review 130 1653ndash1660

Savage L J (1971) ldquoElicitation of Personal Probabilities and ExpectationsrdquoJournal of the American Statistical Association 66 783ndash801

Schervish M J (1989) ldquoA General Method for Comparing Probability Asses-sorsrdquo The Annals of Statistics 17 1856ndash1879

Schumacher M Graf E and Gerds T (2003) ldquoHow to Assess PrognosticModels for Survival Data A Case Study in Oncologyrdquo Methods of Informa-tion in Medicine 42 564ndash571

Schwarz G (1978) ldquoEstimating the Dimension of a Modelrdquo The Annals ofStatistics 6 461ndash464

Selten R (1998) ldquoAxiomatic Characterization of the Quadratic Scoring RulerdquoExperimental Economics 1 43ndash62

Shuford E H Albert A and Massengil H E (1966) ldquoAdmissible Probabil-ity Measurement Proceduresrdquo Psychometrika 31 125ndash145

Smyth P (2000) ldquoModel Selection for Probabilistic Clustering Using Cross-Validated Likelihoodrdquo Statistics and Computing 10 63ndash72

Spiegelhalter D J Best N G Carlin B R and van der Linde A (2002)ldquoBayesian Measures of Model Complexity and Fitrdquo (with discussion and re-joinder) Journal of the Royal Statistical Society Ser B 64 583ndash616

Staeumll von Holstein C-A S (1970) ldquoA Family of Strictly Proper ScoringRules Which Are Sensitive to Distancerdquo Journal of Applied Meteorology9 360ndash364

(1977) ldquoThe Continuous Ranked Probability Score in Practicerdquo in De-cision Making and Change in Human Affairs eds H Jungermann and G deZeeuw Dordrecht Reidel pp 263ndash273

Szeacutekely G J (2003) ldquoE-Statistics The Energy of Statistical Samplesrdquo Tech-nical Report 2003-16 Bowling Green State University Dept of Mathematicsand Statistics

Szeacutekely G J and Rizzo M L (2005) ldquoA New Test for Multivariate Normal-ityrdquo Journal of Multivariate Analysis 93 58ndash80

Taylor J W (1999) ldquoEvaluating Volatility and Interval Forecastsrdquo Journal ofForecasting 18 111ndash128

Tetlock P E (2005) Political Expert Judgement Princeton NJ Princeton Uni-versity Press

Theis S (2005) ldquoDeriving Probabilistic Short-Range Forecasts From aDeterministic High-Resolution Modelrdquo unpublished doctoral dissertationRheinische Friedrich-Wilhelms-Universitaumlt Bonn Germany Mathematisch-Naturwissenschaftliche Fakultaumlt

Toth Z Zhu Y and Marchok T (2001) ldquoThe Use of Ensembles to IdentifyForecasts With Small and Large Uncertaintyrdquo Weather and Forecasting 16463ndash477

Unger D A (1985) ldquoA Method to Estimate the Continuous Ranked Probabil-ity Scorerdquo in Preprints of the Ninth Conference on Probability and Statisticsin Atmospheric Sciences Virginia Beach Virginia Boston American Mete-orological Society pp 206ndash213

Wald A (1949) ldquoNote on the Consistency of the Maximum Likelihood Esti-materdquo The Annals of Mathematical Statistics 20 595ndash601

Weigend A S and Shi S (2000) ldquoPredicting Daily Probability Distributionsof SampP500 Returnsrdquo Journal of Forecasting 19 375ndash392

Wilks D S (2002) ldquoSmoothing Forecast Ensembles With Fitted ProbabilityDistributionsrdquo Quarterly Journal of the Royal Meteorological Society 1282821ndash2836

(2006) Statistical Methods in the Atmospheric Sciences (2nd ed)Amsterdam Elsevier

Wilson L J Burrows W R and Lanzinger A (1999) ldquoA Strategy for Verifi-cation of Weather Element Forecasts From an Ensemble Prediction SystemrdquoMonthly Weather Review 127 956ndash970

Winkler R L (1969) ldquoScoring Rules and the Evaluation of Probability Asses-sorsrdquo Journal of the American Statistical Association 64 1073ndash1078

(1972) ldquoA Decision-Theoretic Approach to Interval Estimationrdquo Jour-nal of the American Statistical Association 67 187ndash191

(1994) ldquoEvaluating Probabilities Asymmetric Scoring Rulesrdquo Man-agement Science 40 1395ndash1405

(1996) ldquoScoring Rules and the Evaluation of Probabilitiesrdquo (with dis-cussion and reply) Test 5 1ndash60

Winkler R L and Murphy A H (1968) ldquolsquoGoodrsquo Probability AssessorsrdquoJournal of Applied Meteorology 7 751ndash758

(1979) ldquoThe Use of Probabilities in Forecasts of Maximum and Min-imum Temperaturesrdquo Meteorological Magazine 108 317ndash329

Zastavnyi V P (1993) ldquoPositive Definite Functions Depending on the NormrdquoRussian Journal of Mathematical Physics 1 511ndash522

Zuo Y and Serfling R (2000) ldquoGeneral Notions of Statistical Depth Func-tionsrdquo The Annals of Statistics 28 461ndash482

Page 10: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster

368 Journal of the American Statistical Association March 2007

[Note the order of the arguments in the definition (7) of thedivergence function] This scoring rule is proper but not strictlyproper relative to the class P2 of the Borel probability measuresP for which EPX2 is finite It is strictly proper relative to anyconvex class of probability measures characterized by the firsttwo moments such as the Gaussian measures for which (25) isequivalent to the logarithmic score (19) For other examples ofscoring rules that depend on microP and P only see (23) and theright column of table 1 of Dawid and Sebastiani (1999)

The predictive model choice criterion of Laud and Ibrahim(1995) and Gelfand and Ghosh (1998) has lately attracted theattention of the statistical community Suppose that we fit a pre-dictive model to observed real-valued data x1 xn The pre-dictive model choice criterion (PMCC) assesses the model fitthrough the quantity

PMCC =nsum

i=1

(xi minus microi)2 +

nsum

i=1

σ 2i

where microi and σ 2i denote the expected value and the variance of

a replicate variable Xi given the model and the observationsWithin the framework of scoring rules the PMCC correspondsto the positively oriented score

S(P x) = minus(x minus microP)2 minus σ 2P (26)

where P has mean microP and variance σ 2P The scoring rule (26)

depends on the predictive distribution through its first two mo-ments only but it is improper if the forecasterrsquos true belief is Pand if he or she wishes to maximize the expected score thenhe or she will quote the point measure at microPmdashthat is a de-terministic forecastmdashrather than the predictive distribution PThis suggests that the predictive model choice criterion shouldbe replaced by a criterion based on the scoring rule (25) whichreduces to

S(P x) = minus(

x minus microP

σP

)2

minus logσ 2P (27)

in the case in which m = 1 and the observations are real-valued

5 KERNEL SCORES NEGATIVE AND POSITIVEDEFINITE FUNCTIONS AND INEQUALITIES

OF HOEFFDING TYPE

In this section we use negative definite functions to constructproper scoring rules and present expectation inequalities thatare of independent interest

51 Kernel Scores

Let be a nonempty set A real-valued function g on times

is said to be a negative definite kernel if it is symmetric in itsarguments and

sumni=1

sumnj=1 aiajg(xi xj) le 0 for all positive inte-

gers n all a1 an isin R that sum to 0 and all x1 xn isin Numerous examples of negative definite kernels have beengiven by Berg Christensen and Ressel (1984) and the refer-ences cited therein

We now give the key result of this section which generalizesa kernel construction of Eaton (1982 p 335) The term kernelscore was coined by Dawid (2006)

Theorem 4 Let be a Hausdorff space and let g be a non-negative continuous negative definite kernel on times For aBorel probability measure P on let X and Xprime be independentrandom variables with distribution P Then the scoring rule

S(P x) = 1

2EPg(XXprime) minus EPg(X x) (28)

is proper relative to the class of the Borel probability mea-sures P on for which the expectation EPg(XXprime) is finite

Proof Let P and Q be Borel probability measures on andsuppose that XXprime and YY prime are independent random variateswith distribution P and Q We need to show that

minus1

2EQg(YY prime) ge 1

2EPg(XXprime) minus EPQg(XY) (29)

If the expectation EPQg(XY) is infinite then the inequalityis trivially satisfied if it is finite then theorem 21 of Berget al (1984 p 235) implies (29)

Next we give examples of scoring rules that admit a kernelrepresentation In each case we equip the sample space withthe standard topology Note that evaluating the kernel scores isstraightforward if P is discrete and has only a moderate numberof atoms

Example 7 (Quadratic or Brier score) Let = 10 andsuppose that g(00) = g(11) = 0 and g(01) = g(10) = 1Then (28) recovers the quadratic or Brier score

Example 8 (CRPS) If = R and g(x xprime) = |x minus xprime| forx xprime isin R in Theorem 4 we obtain the CRPS (21)

Example 9 (Energy score) If = Rm β isin (02) and

g(xxprime) = x minus xprimeβ for xxprime isin Rm where middot denotes the

Euclidean norm then (28) recovers the energy score (22)

Example 10 (CRPS for circular variables) We let = S de-note the circle and write α(θ θ prime) for the angular distance be-tween two points θ θ prime isin S Let P be a Borel probability mea-sure on S and let and prime be independent random variateswith distribution P By theorem 1 of Gneiting (1998) angulardistance is a negative definite kernel Thus

S(P θ) = 1

2EPα(prime) minus EPα( θ) (30)

defines a proper scoring rule relative to the class of the Borelprobability measures on the circle Grimit et al (2006) intro-duced (30) as an analog of the CRPS (21) that applies to di-rectional variables and used Fourier analytic tools to prove thepropriety of the score

We turn to a far-reaching generalization of the energyscore For x = (x1 xm) isin R

m and α isin (0infin] definethe vector norm xα = (

summi=1 |xi|α)1α if α isin (0infin) and

xα = max1leilem |xi| if α = infin Schoenbergrsquos theorem (Berget al 1984 p 74) and a strand of literature culminating in thework of Koldobskiı (1992) and Zastavnyi (1993) imply that ifα isin (0infin] and β gt 0 then the kernel

g(xxprime) = x minus xprimeβα xxprime isin R

m

is negative definite if and only if the following holds

Gneiting and Raftery Proper Scoring Rules 369

Assumption 1 Suppose that (a) m = 1 α isin (0infin] and β isin(02] (b) m ge 2 α isin (02] and β isin (0 α] or (c) m = 2 α isin(2infin] and β isin (01]

Example 11 (Non-Euclidean energy score) Under Assump-tion 1 the scoring rule

S(Px) = 1

2EPX minus Xprimeβ

α minus EPX minus xβα

is proper relative to the class of the Borel probability mea-sures P on R

m for which the expectation EPX minus Xprimeβα is fi-

nite If m = 1 or α = 2 then we recover the energy score ifm ge 2 and α = 2 then we obtain non-Euclidean analogs Mat-tner (1997 sec 52) showed that if α ge 1 then EPQX minus Yβ

α

is finite if and only if EPXβα and EQYβ

α are finite In partic-

ular if α ge 1 then EPX minus Xprimeβα is finite if and only if EPXβ

α

is finite

The following result sharpens Theorem 4 in the crucial caseof Euclidean sample spaces and spherically symmetric negativedefinite functions Recall that a function η on (0infin) is saidto be completely monotone if it has derivatives η(k) of all ordersand (minus1)kη(k)(t) ge 0 for all nonnegative integers k and all t gt 0

Theorem 5 Let ψ be a continuous function on [0infin) withminusψ prime completely monotone and not constant For a Borel prob-ability measure P on R

m let X and Xprime be independent randomvectors with distribution P Then the scoring rule

S(Px) = 1

2EPψ(X minus Xprime2

2) minus EPψ(X minus x22)

is strictly proper relative to the class of the Borel probabilitymeasures P on R

m for which EPψ(X minus Xprime22) is finite

The proof of this result is immediate from theorem 22 ofMattner (1997) In particular if ψ(t) = tβ2 for β isin (02) thenTheorem 5 ensures the strict propriety of the energy score rela-tive to the class of the Borel probability measures P on R

m forwhich EPXβ

2 is finite

52 Inequalities of Hoeffding Type and PositiveDefinite Kernels

A number of side results seem to be of independent inter-est even though they are easy consequences of previous workBriefly if the expectations EPg(XXprime) and EPg(YY prime) are finitethen (29) can be written as a Hoeffding-type inequality

2EPQg(XY) minus EPg(XXprime) minus EQg(YY prime) ge 0 (31)

Theorem 1 of Szeacutekely and Rizzo (2005) provides a nearly iden-tical result and a converse If g is not negative definite thenthere are counterexamples to (31) and the respective scoringrule is improper Furthermore if is a group and the negativedefinite function g satisfies g(x xprime) = g(minusxminusxprime) for x xprime isin then a special case of (31) can be stated as

EPg(XminusXprime) ge EPg(XXprime) (32)

In particular if = Rm and Assumption 1 holds then inequal-

ities (31) and (32) apply and reduce to

2EX minus Yβα minus EX minus Xprimeβ

α minus EY minus Yprimeβα ge 0 (33)

and

EX minus Xprimeβα le EX + Xprimeβ

α (34)

thereby generalizing results of Buja Logan Reeds and Shepp(1994) Szeacutekely (2003) and Baringhaus and Franz (2004)

In the foregoing case in which is a group and g satisfiesg(x xprime) = g(minusxminusxprime) for x xprime isin the argument leading to the-orem 23 of Buja et al (1994) and theorem 4 of Ma (2003)implies that

h(x xprime) = g(xminusxprime) minus g(x xprime) x xprime isin (35)

is a positive definite kernel in the sense that h is symmetric inits arguments and

sumni=1

sumnj=1 aiajh(xi xj) ge 0 for all positive

integers n all a1 an isin R and all x1 xn isin Specifi-cally under Assumption 1

h(xxprime) = x + xprimeβα minus x minus xprimeβ

α xxprime isin Rm (36)

is a positive definite kernel a result that extends and completesthe aforementioned theorem of Buja et al (1994)

53 Constructions With Complex-Valued Kernels

With suitable modifications the foregoing results allow forcomplex-valued kernels A complex-valued function h on times is said to be a positive definite kernel if it is Hermitian that ish(x xprime) = h(xprime x) for x xprime isin and

sumni=1

sumnj=1 cicjh(xi xj) ge 0

for all positive integers n all c1 cn isin C and all x1 xn isin The general idea (Dawid 1998 2006) is that if h is continu-ous and positive definite then

S(P x) = EPh(X x) + EPh(xX) minus EPh(XXprime) (37)

defines a proper scoring rule If h is positive definite then g =minush is negative definite thus if h is real-valued and sufficientlyregular then the scoring rules (37) and (28) are equivalent

In the next example we discuss scoring rules for Borel prob-ability measures and observations on Euclidean spaces How-ever the representation (37) allows for the construction ofproper scoring rules in more general settings such as prob-abilistic forecasts of structured data including strings se-quences graphs and sets based on positive definite kernelsdefined on such structures (Hofmann Schoumllkopf and Smola2005)

Example 12 Let = Rm and y isin R

m and consider the pos-itive definite kernel h(xxprime) = ei〈xminusxprimey〉 minus 1 where xxprime isin R

mThen (37) reduces to

S(Px) = minus∣∣φP(y) minus ei〈xy〉∣∣2 (38)

that is the negative squared distance between the characteristicfunction of the predictive distribution φP and the characteris-tic function of the point measures in the value that materializesevaluated at y isin R

m If we integrate with respect to a nonnega-tive measure micro(dy) then the scoring rule (38) generalizes to

S(Px) = minusint

Rm

∣∣φP(y) minus ei〈xy〉∣∣2micro(dy) (39)

If the measure micro is finite and assigns positive mass to all inter-vals then this scoring rule is strictly proper relative to the classof the Borel probability measures on R

m Eaton Giovagnoliand Sebastiani (1996) used the associated divergence function

370 Journal of the American Statistical Association March 2007

to define metrics for probability measures If micro is the infinitemeasure with Lebesgue density yminusmminusβ where β isin (02)then the scoring rule (39) is equivalent to the Euclidean energyscore (24)

6 SCORING RULES FOR QUANTILE ANDINTERVAL FORECASTS

Occasionally full predictive distributions are difficult tospecify and the forecaster might quote predictive quantilessuch as value at risk in financial applications (Duffie and Pan1997) or prediction intervals (Christoffersen 1998) only

61 Proper Scoring Rules for Quantiles

We consider probabilistic forecasts of a continuous quantitythat take the form of predictive quantiles Specifically supposethat the quantiles at the levels α1 αk isin (01) are soughtIf the forecaster quotes quantiles r1 rk and x materializesthen he or she will be rewarded by the score S(r1 rk x) Wedefine

S(r1 rkP) =int

S(r1 rk x)dP(x)

as the expected score under the probability measure P whenthe forecaster quotes the quantiles r1 rk To avoid technicalcomplications we suppose that P belongs to the convex class Pof Borel probability measures on R that have finite moments ofall orders and whose distribution function is strictly increasingon R For P isin P let q1 qk denote the true P-quantiles atlevels α1 αk Following Cervera and Muntildeoz (1996) we saythat a scoring rule S is proper if

S(q1 qkP) ge S(r1 rkP)

for all real numbers r1 rk and for all probability measuresP isin P If S is proper then the forecaster who wishes to maxi-mize the expected score is encouraged to be honest and to vol-unteer his or her true beliefs

To avoid technical overhead we tacitly assume P-integrabil-ity whenever appropriate Essentially we require that the func-tions s(x) and h(x) in (40) and (42) be P-measurable and growat most polynomially in x Theorem 6 addresses the predictionof a single quantile Corollary 1 turns to the general case

Theorem 6 If s is nondecreasing and h is arbitrary then thescoring rule

S(r x) = αs(r) + (s(x) minus s(r))1x le r + h(x) (40)

is proper for predicting the quantile at level α isin (01)

Proof Let q be the unique α-quantile of the probability mea-sure P isinP We identify P with the associated distribution func-tion so that P(q) = α If r lt q then

S(qP) minus S(rP)

=int

(rq)

s(x)dP(x) + s(r)P(r) minus αs(r)

ge s(r)(P(q) minus P(r)) + s(r)P(r) minus αs(r)

= 0

as desired If r gt q then an analogous argument applies

If s(x) = x and h(x) = minusαx then we obtain the scoring rule

S(r x) = (x minus r)(1x le r minus α) (41)

which has been proposed by Koenker and Machado (1999)Taylor (1999) Giacomini and Komunjer (2005) Theis (2005p 232) and Friederichs and Hense (2006) for measuring in-sample goodness of fit and out-of-sample forecast performancein meteorological and financial applications In negative orien-tation the econometric literature refers to the scoring rule (41)as the tick or check loss function

Corollary 1 If si is nondecreasing for i = 1 k and h isarbitrary then the scoring rule

S(r1 rk x)

=ksum

i=1

[αisi(ri) + (si(x) minus si(ri))1x le ri

] + h(x) (42)

is proper for predicting the quantiles at levels α1 αk isin(01)

Cervera and Muntildeoz (1996 pp 515 and 519) proved Corol-lary 1 in the special case in which each si is linear They askedwhether the resulting rules are the only proper ones for quan-tiles Our results give a negative answer that is the class ofproper scoring rules for quantiles is considerably larger thananticipated by Cervera and Muntildeoz We do not know whetheror not (40) and (42) provide the general form of proper scoringrules for quantiles

62 Interval Score

Interval forecasts form a crucial special case of quantile pre-diction We consider the classical case of the central (1 minus α) times100 prediction interval with lower and upper endpoints thatare the predictive quantiles at level α

2 and 1 minus α2 We denote a

scoring rule for the associated interval forecast by Sα(lu x)where l and u represent for the quoted α

2 and 1 minus α2 quantiles

Thus if the forecaster quotes the (1minusα)times100 central predic-tion interval [lu] and x materializes then his or her score willbe Sα(lu x) Putting α1 = α

2 α2 = 1 minus α2 s1(x) = s2(x) = 2 x

α

and h(x) = minus2 xα

in (42) and reversing the sign of the scoringrule yields the negatively oriented interval score

Sintα (lu x)

= (u minus l) + 2

α(l minus x)1x lt l + 2

α(x minus u)1x gt u (43)

This scoring rule has intuitive appeal and can be traced backto Dunsmore (1968) Winkler (1972) and Winkler and Mur-phy (1979) The forecaster is rewarded for narrow predictionintervals and he or she incurs a penalty the size of which de-pends on α if the observation misses the interval In the caseα = 1

2 Hamill and Wilks (1995 p 622) used a scoring rule thatis equivalent to the interval score They noted that ldquoa strategyfor gaming [ ] was not obviousrdquo thereby conjecturing propri-ety which is confirmed by the foregoing We anticipate novelapplications particularly for the evaluation of volatility fore-casts in computational finance

Gneiting and Raftery Proper Scoring Rules 371

63 Case Study Interval Forecasts for a ConditionallyHeteroscedastic Process

This section illustrates the use of the interval score in atime series context Kabaila (1999) called for rigorous ways ofspecifying prediction intervals for conditionally heteroscedasticprocesses and proposed a relevance criterion in terms of con-ditional coverage and width dependence We contend that thenotion of proper scoring rules provides an alternative and pos-sibly simpler more general and more rigorous paradigm Theprediction intervals that we deem appropriate derive from thetrue conditional distribution as implied by the data-generatingmechanism and optimize the expected value of all proper scor-ing rules

To fix the idea consider the stationary bilinear processXt t isin Z defined by

Xt+1 = 1

2Xt + 1

2Xtεt + εt (44)

where the εtrsquos are independent standard Gaussian random vari-ates Kabaila and He (2001) studied central one-step-ahead pre-diction intervals at the 95 level The process is Markovianand the conditional distribution of Xt+1 given XtXtminus1 isGaussian with mean 1

2 Xt and variance (1 + 12 Xt)

2 thereby sug-gesting the prediction interval

I =[

1

2Xt minus c

∣∣∣∣1 + 1

2Xt

∣∣∣∣1

2Xt + c

∣∣∣∣1 + 1

2Xt

∣∣∣∣

] (45)

where c = minus1(975) This interval satisfies the relevance prop-erty of Kabaila (1999) and Kabaila and He (2001) adopted Ias the standard prediction interval We agree with this choicebut we prefer the aforementioned more direct justification theprediction interval I is the standard interval because its lowerand upper endpoints are the 25 and 975 percentiles of thetrue conditional distribution function Kabaila and He consid-ered two alternative prediction intervals

J = [Fminus1(025)Fminus1(975)] (46)

where F denotes the unconditional stationary distribution func-tion of Xt and

K =[

1

2Xt minus γ

(∣∣∣∣1 + 1

2Xt

∣∣∣∣

)

1

2Xt + γ

(∣∣∣∣1 + 1

2Xt

∣∣∣∣

)] (47)

where γ (y) = (2(log 736minus log y))12y for y le 736 and γ (y) =0 otherwise This choice minimizes the expected width of theprediction interval under the constraint of nominal coverageHowever the interval forecast K seems misguided in that itcollapses to a point forecast when the conditional predictivevariance is highest

We generated a sample path Xt t = 1 100001 fromthe bilinear process (44) and considered sequential one-step-ahead interval forecasts for Xt+1 where t = 1 100000Table 2 summarizes the results of this experiment The inter-val forecasts I J and K all showed close to nominal coveragewith the prediction interval K being sharpest on average Nev-ertheless the classical prediction interval I performed best interms of the interval score

Table 2 Comparison of One-Step-Ahead 95 Interval Forecasts forthe Stationary Bilinear Process (44)

Interval Empirical Average Averageforecast coverage width interval score

I (45) 9501 400 477J (46) 9508 545 804K (47) 9498 379 532

NOTE The table shows the empirical coverage the average width and the average value ofthe negatively oriented interval score (43) for the prediction intervals I J and K in 100000sequential forecasts in a sample path of length 100001 See text for details

64 Scoring Rules for Distributional Forecasts

Specifying a predictive cumulative distribution function isequivalent to specifying all predictive quantiles thus we canbuild scoring rules for predictive distributions from scoringrules for quantiles Matheson and Winkler (1976) and Cerveraand Muntildeoz (1996) suggested ways of doing this Specificallyif Sα denotes a proper scoring rule for the quantile at level α

and ν is a Borel measure on (01) then the scoring rule

S(F x) =int 1

0Sα(Fminus1(α) x)ν(dα) (48)

is proper subject to regularity and integrability constraintsSimilarly we can build scoring rules for predictive distrib-

utions from scoring rules for binary probability forecasts If Sdenotes a proper scoring rule for probability forecasts and ν isa Borel measure on R then the scoring rule

S(F x) =int infin

minusinfinS(F(y)1x le y)ν(dy) (49)

is proper subject to integrability constraints (Matheson andWinkler 1976 Gerds 2002) The CRPS (20) corresponds to thespecial case in (49) in which S is the quadratic or Brier scoreand ν is the Lebesgue measure If S is the Brier score and ν

is a sum of point measures then the ranked probability score(Epstein 1969) emerges

The construction carries over to multivariate settings If Pdenotes the class of the Borel probability measures on R

m thenwe identify a probabilistic forecast P isin P with its cumulativedistribution function F A multivariate analog of the CRPS canbe defined as

CRPS(Fx) = minusint

Rm(F(y) minus 1x le y)2ν(dy)

This is a weighted integral of the Brier scores at all m-variatethresholds The Borel measure ν can be chosen to encouragethe forecaster to concentrate his or her efforts on the impor-tant ones If ν is a finite measure that dominates the Lebesguemeasure then this scoring rule is strictly proper relative to theclass P

7 SCORING RULES BAYES FACTORS ANDRANDOMndashFOLD CROSSndashVALIDATION

We now relate proper scoring rules to Bayes factors and tocross-validation and propose a novel form of cross-validationrandom-fold cross-validation

372 Journal of the American Statistical Association March 2007

71 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by proba-bilistic models and the standard Bayesian approach to compar-ing probabilistic models is by Bayes factors Suppose that wehave a sample X = (X1 Xn) of values to be forecast Sup-pose also that we have two forecasting rules based on proba-bilistic models H1 and H2 So far in this article we have concen-trated on the situation where the forecasting rule is completelyspecified before any of the Xirsquos are observed that is there areno parameters to be estimated from the data being forecast Inthat situation the Bayes factor for H1 against H2 is

B = P(X|H1)

P(X|H2) (50)

where P(X|Hk) = prodni=1 P(Xi|Hk) for k = 12 (Jeffreys 1939

Kass and Raftery 1995)Thus if the logarithmic score is used then the log Bayes

factor is the difference of the scores for the two models

log B = LogS(H1X) minus LogS(H2X) (51)

This was pointed out by Good (1952) who called the log Bayesfactor the weight of evidence It establishes two connections(1) the Bayes factor is equivalent to the logarithmic score in thisno-parameter case and (2) the Bayes factor applies more gener-ally than merely to the comparison of parametric probabilisticmodels but also to the comparison of probabilistic forecastingrules of any kind

So far in this article we have taken probabilistic forecasts tobe fully specified but often they are specified only up to un-known parameters estimated from the data Now suppose thatthe forecasting rules considered are specified only up to un-known parameters θk for Hk to be estimated from the dataThen the Bayes factor is still given by (50) but now P(X|Hk) isthe integrated likelihood

P(X|Hk) =int

p(X|θkHk)p(θk|Hk)dθk

where p(X|θkHk) is the (usual) likelihood under model Hk andp(θk|Hk) is the prior distribution of the parameter θk

Dawid (1984) showed that when the data come in a partic-ular order such as time order the integrated likelihood can bereformulated in predictive terms

P(X|Hk) =nprod

t=1

P(Xt|Xtminus1Hk) (52)

where Xtminus1 = X1 Xtminus1 if t ge 1 X0 is the empty set andP(Xt|Xtminus1Hk) is the predictive distribution of Xt given the pastvalues under Hk namely

P(Xt|Xtminus1Hk) =int

p(Xt|θkHk)P(θk|Xtminus1Hk)dθk

with P(θk|Xtminus1Hk) the posterior distribution of θk given thepast observations Xtminus1

We let SkB = log P(X|Hk) denote the log-integrated likeli-hood viewed now as a scoring rule To view it as a scoring ruleit helps to rewrite it as

SkB =nsum

t=1

log P(Xt|Xtminus1Hk) (53)

Dawid (1984) showed that SkB is asymptotically equivalent tothe plug-in maximum likelihood prequential score

SkD =nsum

t=1

log P(Xt|Xtminus1 θ tminus1k ) (54)

where θ tminus1k is the maximum likelihood estimator (MLE) of

θk based on the past observations Xtminus1 in the sense thatSkDSkB rarr 1 as n rarr infin Initial terms for which θ tminus1

k is pos-sibly undefined can be ignored Dawid also showed that SkB

is asymptotically equivalent to the Bayes information criterion(BIC) score

SkBIC =nsum

t=1

log P(Xt|Xtminus1 θnk ) minus dk

2log n

where dk = dim(θk) in the same sense namely SkBICSkB rarr1 as n rarr infin This justifies using the BIC for comparing fore-casting rules extending the previous justification of Schwarz(1978) which related only to comparing models

These results have two limitations however First they as-sume that the data come in a particular order Second they useonly the logarithmic score not other scores that might be moreappropriate for the task at hand We now briefly consider howthese limitations might be addressed

72 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered We can replace (53)by

SlowastkB =

nsum

t=1

ED[log p

(Xt|X(D)Hk

)] (55)

where D is a random sample from 1 t minus 1 t + 1 nthe size of which is a random variable with a discrete uniformdistribution on 01 n minus 1 Dawidrsquos results imply that thisis asymptotically equivalent to the plug-in maximum likelihoodversion

SlowastkD =

nsum

t=1

ED[log p

(Xt|X(D) θ

(D)k Hk

)] (56)

where θ(D)k is the MLE of θk based on X(D) Terms for which

the size of D is small and θ(D)k is possibly undefined can be

ignoredThe formulations (55) and (56) may be useful because they

turn a score that was a sum of nonidentically distributed termsinto one that is a sum of identically distributed exchangeableterms This opens the possibility of evaluating Slowast

kB or SlowastkD

by Monte Carlo which would be a form of cross-validationIn this cross-validation the amount of data left out would berandom rather than fixed leading us to call it random-foldcross-validation Smyth (2000) used the log-likelihood as thecriterion function in cross-validation as here calling the result-ing method cross-validated likelihood but used a fixed hold-out sample size This general approach can be traced back atleast to Geisser and Eddy (1979) One issue in cross-validationgenerally is how much data to leave out different choices leadto different versions of cross-validation such as leave-one-out

Gneiting and Raftery Proper Scoring Rules 373

10-fold and so on Considering versions of cross-validation inthe context of scoring rules may shed some light on this issue

We have seen by (51) that when there are no parameters beingestimated the Bayes factor is equivalent to the difference inthe logarithmic score Thus we could replace the logarithmicscore by another proper score and the difference in scores couldbe viewed as a kind of predictive Bayes factor with a differenttype of score In SkB SkD SkBIC Slowast

kB and SlowastkD we could

replace the terms in the sums (each of which has the form of alogarithmic score) by another proper scoring rule such as theCRPS and we conjecture that similar asymptotic equivalenceswould remain valid

8 CASE STUDY PROBABILISTIC FORECASTS OFSEAndashLEVEL PRESSURE OVER THE NORTH

AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and theproperties of scoring rules and to demonstrate the importanceof propriety

81 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on en-semble prediction systems Ensemble systems typically gener-ate a set of perturbations of the best estimate of the current stateof the atmosphere run each of them forward in time using a nu-merical weather prediction model and use the resulting set offorecasts as a sample from the predictive distribution of futureweather quantities (Palmer 2002 Gneiting and Raftery 2005)

Grimit and Mass (2002) described the University of Wash-ington ensemble prediction system over the Pacific Northwestwhich covers Oregon Washington British Columbia and partsof the Pacific Ocean This is a five-member ensemble com-prising distinct runs of the MM5 numerical weather predictionmodel with initial conditions taken from distinct national andinternational weather centers We consider 48-hour-ahead fore-casts of sea-level pressure in JanuaryndashJune 2000 the same pe-riod as that on which the work of Grimit and Mass was basedThe unit used is the millibar (mb) Our analysis builds on a ver-ification data base of 16015 records scattered over the NorthAmerican Pacific Northwest and the aforementioned 6-monthperiod Each record consists of the five ensemble member fore-casts and the associated verifying observation The root meansquared error of the ensemble mean forecast was 330 mb andthe square root of the average variance of the five-member fore-cast ensemble was 213 mb resulting in a ratio of r0 = 155

This underdispersive behaviormdashthat is observed errors thattend to be larger on average than suggested by the ensemblespreadmdashis typical of ensemble systems and seems unavoidablegiven that ensembles capture only some of the sources of uncer-tainty (Raftery Gneiting Balabdaoui and Polakowski 2005)Thus to obtain calibrated predictive distributions it seems nec-essary to carry out some form of statistical postprocessing Onenatural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian centered at the en-semble mean forecast and with predictive standard deviationequal to r times the standard deviation of the forecast ensembleDensity forecasts of this type were proposed by Deacutequeacute Royerand Stroe (1994) and Wilks (2002) Following Wilks we referto r as an inflation factor

82 Evaluation of Density Forecasts

In the aforementioned approach the predictive density isGaussian say ϕmicrorσ its mean micro is the ensemble mean fore-cast and its standard deviation rσ is the product of the in-flation factor r and the standard deviation of the five-memberforecast ensemble σ We considered various scoring rules Sand computed the average score

s(r) = 1

16015

16015sum

i=1

S(ϕmicroirσi xi

) r gt 0 (57)

as a function of the inflation factor r The index i refers to theith record in the verification database and xi denotes the valuethat materialized Given the underdispersive character of the en-semble system we expect s(r) to be maximized at some r gt 1possibly near the observed ratio r0 = 155 of the root meansquared error of the ensemble mean forecast over the squareroot of the average ensemble variance

We computed the mean score (57) for inflation factors r isin(05) and for the quadratic score (QS) spherical score (SphS)logarithmic score (LogS) CRPS linear score (LinS) and prob-ability score (PS) as defined in Section 4 Briefly if p denotesthe predictive density and x denotes the observed value then

QS(p x) = 2p(x) minusint infin

minusinfinp(y)2 dy

SphS(p x) = p(x)(int infin

minusinfinp(y)2 dy

)12

LogS(p x) = log p(x)

CRPS(p x) = 1

2Ep|X minus Xprime| minus Ep|X minus x|

LinS(p x) = p(x)

and

PS(p x) =int x+1

xminus1p(y)dy

Figure 3 and Table 3 summarize the results of this experimentThe scores shown in the figure are linearly transformed so thatthe graphs can be compared side by side and the transforma-tions are listed in the rightmost column of the table In the caseof the quadratic score for instance we plotted 40 times thevalue in (57) plus 6 Clearly transformed and original scoresare equivalent in the sense of (2) The quadratic score sphericalscore logarithmic score and CRPS were maximized at valuesof r gt 1 thereby confirming the underdispersive character of

Table 3 Probabilistic Forecasts of Sea-Level Pressure Over the NorthAmerican Pacific Northwest in JanuaryndashJuly 2000

Argmaxr s( r ) Linear transformationScore in eq (57) plotted in Figure 3

Quadratic score (QS) 218 40s + 6Spherical score (SphS) 184 108s minus 22Logarithmic score (LogS) 241 s + 13CRPS 162 10s + 8

Linear score (LinS) 05 105s minus 5Probability score (PS) 02 60s minus 5

NOTE The predictive density is Gaussian centered at the ensemble mean forecast and withpredictive standard deviation equal to r times the standard deviation of the forecast ensemble

374 Journal of the American Statistical Association March 2007

Figure 3 Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in JanuaryndashJuly 2000 The scores are shownas a function of the inflation factor r where the predictive density is Gaussian centered at the ensemble mean forecast and with predictive standarddeviation equal to r times the standard deviation of the forecast ensemble The scores were subject to linear transformations as detailed in Table 3

the ensemble These scores are proper The linear and proba-bility scores were maximized at r = 05 and r = 02 therebysuggesting ignorable forecast uncertainty and essentially deter-ministic forecasts The latter two scores have intuitive appealand the probability score has been used to assess forecast en-sembles (Wilson et al 1999) However they are improper andtheir use may result in misguided scientific inferences as in thisexperiment A similar comment applies to the predictive modelchoice criterion given in Section 44

It is interesting to observe that the logarithmic score gave thehighest maximizing value of r The logarithmic score is strictlyproper but involves a harsh penalty for low probability eventsand thus is highly sensitive to extreme cases Our verificationdatabase includes a number of low-spread cases for which theensemble variance implodes The logarithmic score penalizesthe resulting predictions unless the inflation factor r is largeWeigend and Shi (2000 p 382) noted similar concerns andconsidered the use of trimmed means when computing the log-arithmic score In our experience the CRPS is less sensitive toextreme cases or outliers and provides an attractive alternative

83 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide intervalforecasts We considered the central (1 minusα)times 100 predictioninterval where α = 50 and α = 10 The associated lower andupper prediction bounds li and ui are the α

2 and 1 minus α2 quantiles

of a Gaussian distribution with mean microi and standard deviationrσi as described earlier We assessed the interval forecasts in

their dependence on the inflation factor r in two ways by com-puting the empirical coverage of the prediction intervals and bycomputing

sα(r) = 1

16015

16015sum

i=1

Sintα (liui xi) r gt 0 (58)

where Sintα denotes the negatively oriented interval score (43)

This scoring rule assesses both calibration and sharpness byrewarding narrow prediction intervals and penalizing intervalsmissed by the observation Figure 4(a) shows the empirical cov-erage of the interval forecasts Clearly the coverage increaseswith r For α = 50 and α = 10 the nominal coverage was ob-tained at r = 178 and r = 211 which confirms the underdis-persive character of the ensemble Figure 4(b) shows the inter-val score (58) as a function of the inflation factor r For α = 50and α = 10 the score was optimized at r = 156 and r = 172

9 OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimationproblems where they provide attractive loss and utility func-tions that can be adapted to the problem at hand

91 Point Estimation

We return to the generic estimation problem described inSection 1 Suppose that we wish to fit a parametric model Pθ

based on a sample X1 Xn of identically distributed obser-vations To estimate θ we can measure the goodness of fit by

Gneiting and Raftery Proper Scoring Rules 375

(a) (b)

Figure 4 Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in JanuaryndashJuly 2000 (a) Nominal and actualcoverage and (b) the negatively oriented interval score (58) for the 50 central prediction interval (α = 50 - - -) and the 90 central predictioninterval (α = 10 mdash score scaled by a factor of 60) The predictive density is Gaussian centered at the ensemble mean forecast and withpredictive standard deviation equal to r times the standard deviation of the forecast ensemble

the mean score

Sn(θ) = 1

n

nsum

i=1

S(Pθ Xi)

where S is a scoring rule that is strictly proper relative to a con-vex class of probability measures that contains the parametricmodel If θ0 denotes the true parameter value then asymptoticarguments indicate that

arg maxθSn(θ) rarr θ0 as n rarr infin (59)

This suggests a general approach to estimation Choose astrictly proper scoring rule tailored to the problem at hand andtake θn = arg maxθSn(θ) as the respective optimum score es-timator The first four values of the arg max in Table 3 forinstance refer to the optimum score estimates of the infla-tion factor r based on the logarithmic score spherical scorequadratic score and CRPS Pfanzagl (1969) and Birgeacute andMassart (1993) studied optimum score estimators under theheading of minimum contrast estimators This class includesmany of the most popular estimators in various situations suchas MLEs least squares and other estimators of regression mod-els and estimators for mixture models or deconvolution Pfan-zagl (1969) proved rigorous versions of the consistency result(59) and Birgeacute and Massart (1993) related rates of convergenceto the entropy structure of the parameter space Maximum like-lihood estimation forms the special case of optimum score esti-mation based on the logarithmic score and optimum score es-timation forms a special case of M-estimation (Huber 1964)in that the function to be optimized derives from a strictlyproper scoring rule When estimating the location parameter in

a Gaussian population with known variance for example theoptimum score estimator based on the CRPS amounts to an M-estimator with a ψ -function of the form ψ(x) = 2( x

c ) minus 1where c is a positive constant and denotes the standardGaussian cumulative This provides a smooth version of the ψ -function for Huberrsquos (1964) robust minimax estimator (see Hu-ber 1981 p 208) Asymptotic results for M-estimators such asthe consistency theorems of Huber (1967) and Perlman (1972)then apply to optimum scores estimators as well Waldrsquos (1949)classical proof of the consistency of MLEs relies heavily on thestrict propriety of the logarithmic score which is proved in hislemma 1

The appeal of optimum score estimation lies in the potentialadaption of the scoring rule to the problem at hand Gneitinget al (2005) estimated a predictive regression model using theoptimum score estimator based on the CRPSmdasha choice mo-tivated by the meteorological problem They showed empiri-cally that such an approach can yield better predictive resultsthan approaches using maximum likelihood plug-in estimatesThis agrees with the findings of Copas (1983) and Friedman(1989) who showed that the use of maximum likelihood andleast squares plug-in estimates can be suboptimal in predictionproblems Buja et al (2005) argued that strictly proper scor-ing rules are the natural loss functions or fitting criteria in bi-nary class probability estimation and proposed tailoring scor-ing rules in situations in which false positives and false nega-tives have different cost implications

92 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression us-ing an optimum score estimator based on the proper scoringrule (41)

376 Journal of the American Statistical Association March 2007

93 Interval Estimation

We now turn to interval estimation Casella Hwang andRobert (1993 p 141) pointed out that ldquothe question of measur-ing optimality (either frequentist or Bayesian) of a set estimatoragainst a loss criterion combining size and coverage does notyet have a satisfactory answerrdquo

Their work was motivated by an apparent paradox due toJ O Berger which concerns interval estimators of the loca-tion parameter θ in a Gaussian population with unknown scaleUnder the loss function

L(I θ) = cλ(I) minus 1θ isin I (60)

where c is a positive constant and λ(I) denotes the Lebesguemeasure of the interval estimate I the classical t-interval isdominated by a misguided interval estimate that shrinks to thesample mean in the cases of the highest uncertainty Casellaet al (1993 p 145) commented that ldquowe have a case wherea disconcerting rule dominates a time honored procedure Theonly reasonable conclusion is that there is a problem with theloss functionrdquo We concur and propose using proper scoringrules to assess interval estimators based on a loss criterion thatcombines width and coverage

Specifically we contend that a meaningful comparison of in-terval estimators requires either equal coverage or equal widthThe loss function (60) applies to all set estimates regardlessof coverage and size which seems unnecessarily ambitiousInstead we focus attention on interval estimators with equalnominal coverage and use the negatively oriented interval score(43) This loss function can be written as

Lα(I θ) = λ(I) + 2

αinfηisinI

|θ minus η| (61)

and applies to interval estimates with upper and lower ex-ceedance probability α

2 times 100 This approach can again betraced back to Dunsmore (1968) and Winkler (1972) and avoidsparadoxes as a consequence of the propriety of the intervalscore Compared with (60) the loss function (61) provides amore flexible assessment of the coverage by taking the distancebetween the interval estimate and the estimand into account

10 AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the atten-tion of a broad statistical and general scientific audience Properscoring rules lie at the heart of much statistical theory and prac-tice and we have demonstrated ways in which they bear on pre-diction and estimation We close with a succinct necessarilyincomplete and subjective discussion of directions for futurework

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Mathematische Fakultät, Albert-Ludwigs-Universität Freiburg, Germany.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Sonderforschungsbereich 368, Ludwig-Maximilians-Universität Munich, Germany.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space-Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Dept. of Mathematics and Statistics, Bowling Green State University.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Mathematisch-Naturwissenschaftliche Fakultät, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.



The appeal of optimum score estimation lies in the potentialadaption of the scoring rule to the problem at hand Gneitinget al (2005) estimated a predictive regression model using theoptimum score estimator based on the CRPSmdasha choice mo-tivated by the meteorological problem They showed empiri-cally that such an approach can yield better predictive resultsthan approaches using maximum likelihood plug-in estimatesThis agrees with the findings of Copas (1983) and Friedman(1989) who showed that the use of maximum likelihood andleast squares plug-in estimates can be suboptimal in predictionproblems Buja et al (2005) argued that strictly proper scor-ing rules are the natural loss functions or fitting criteria in bi-nary class probability estimation and proposed tailoring scor-ing rules in situations in which false positives and false nega-tives have different cost implications

92 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression us-ing an optimum score estimator based on the proper scoringrule (41)

376 Journal of the American Statistical Association March 2007

93 Interval Estimation

We now turn to interval estimation Casella Hwang andRobert (1993 p 141) pointed out that ldquothe question of measur-ing optimality (either frequentist or Bayesian) of a set estimatoragainst a loss criterion combining size and coverage does notyet have a satisfactory answerrdquo

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

L(I, θ) = c λ(I) − 1{θ ∈ I}, (60)

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

L_α(I, θ) = λ(I) + (2/α) inf_{η ∈ I} |θ − η|, (61)

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.
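For example, the following sketch (ours; the simulation setup and names are illustrative) evaluates the classical t-interval for a Gaussian mean under the loss (61), averaging over simulated samples. Competing interval estimators with equal nominal coverage can then be ranked by their mean interval scores:

```python
import numpy as np
from scipy.stats import t as student_t

def interval_score(lo, hi, theta, alpha):
    # Negatively oriented interval score (43)/(61): width of [lo, hi] plus
    # (2/alpha) times the distance from theta to the interval.
    dist = np.maximum(lo - theta, 0.0) + np.maximum(theta - hi, 0.0)
    return (hi - lo) + (2.0 / alpha) * dist

rng = np.random.default_rng(2)
alpha, n = 0.10, 10          # 90% interval, samples of size 10
theta = 0.0                  # true location parameter
scores = []
for _ in range(10_000):
    x = rng.normal(loc=theta, size=n)
    half = student_t.ppf(1 - alpha / 2, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    scores.append(interval_score(x.mean() - half, x.mean() + half, theta, alpha))
print("mean interval score of the t-interval:", np.mean(scores))
```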

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics: Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Downscaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart, and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.

where S is a scoring rule that is strictly proper relative to a con-vex class of probability measures that contains the parametricmodel If θ0 denotes the true parameter value then asymptoticarguments indicate that

arg maxθSn(θ) rarr θ0 as n rarr infin (59)

This suggests a general approach to estimation Choose astrictly proper scoring rule tailored to the problem at hand andtake θn = arg maxθSn(θ) as the respective optimum score es-timator The first four values of the arg max in Table 3 forinstance refer to the optimum score estimates of the infla-tion factor r based on the logarithmic score spherical scorequadratic score and CRPS Pfanzagl (1969) and Birgeacute andMassart (1993) studied optimum score estimators under theheading of minimum contrast estimators This class includesmany of the most popular estimators in various situations suchas MLEs least squares and other estimators of regression mod-els and estimators for mixture models or deconvolution Pfan-zagl (1969) proved rigorous versions of the consistency result(59) and Birgeacute and Massart (1993) related rates of convergenceto the entropy structure of the parameter space Maximum like-lihood estimation forms the special case of optimum score esti-mation based on the logarithmic score and optimum score es-timation forms a special case of M-estimation (Huber 1964)in that the function to be optimized derives from a strictlyproper scoring rule When estimating the location parameter in

a Gaussian population with known variance for example theoptimum score estimator based on the CRPS amounts to an M-estimator with a ψ -function of the form ψ(x) = 2( x

c ) minus 1where c is a positive constant and denotes the standardGaussian cumulative This provides a smooth version of the ψ -function for Huberrsquos (1964) robust minimax estimator (see Hu-ber 1981 p 208) Asymptotic results for M-estimators such asthe consistency theorems of Huber (1967) and Perlman (1972)then apply to optimum scores estimators as well Waldrsquos (1949)classical proof of the consistency of MLEs relies heavily on thestrict propriety of the logarithmic score which is proved in hislemma 1

The appeal of optimum score estimation lies in the potentialadaption of the scoring rule to the problem at hand Gneitinget al (2005) estimated a predictive regression model using theoptimum score estimator based on the CRPSmdasha choice mo-tivated by the meteorological problem They showed empiri-cally that such an approach can yield better predictive resultsthan approaches using maximum likelihood plug-in estimatesThis agrees with the findings of Copas (1983) and Friedman(1989) who showed that the use of maximum likelihood andleast squares plug-in estimates can be suboptimal in predictionproblems Buja et al (2005) argued that strictly proper scor-ing rules are the natural loss functions or fitting criteria in bi-nary class probability estimation and proposed tailoring scor-ing rules in situations in which false positives and false nega-tives have different cost implications

92 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression us-ing an optimum score estimator based on the proper scoringrule (41)

376 Journal of the American Statistical Association March 2007

93 Interval Estimation

We now turn to interval estimation Casella Hwang andRobert (1993 p 141) pointed out that ldquothe question of measur-ing optimality (either frequentist or Bayesian) of a set estimatoragainst a loss criterion combining size and coverage does notyet have a satisfactory answerrdquo

Their work was motivated by an apparent paradox due toJ O Berger which concerns interval estimators of the loca-tion parameter θ in a Gaussian population with unknown scaleUnder the loss function

L(I θ) = cλ(I) minus 1θ isin I (60)

where c is a positive constant and λ(I) denotes the Lebesguemeasure of the interval estimate I the classical t-interval isdominated by a misguided interval estimate that shrinks to thesample mean in the cases of the highest uncertainty Casellaet al (1993 p 145) commented that ldquowe have a case wherea disconcerting rule dominates a time honored procedure Theonly reasonable conclusion is that there is a problem with theloss functionrdquo We concur and propose using proper scoringrules to assess interval estimators based on a loss criterion thatcombines width and coverage

Specifically we contend that a meaningful comparison of in-terval estimators requires either equal coverage or equal widthThe loss function (60) applies to all set estimates regardlessof coverage and size which seems unnecessarily ambitiousInstead we focus attention on interval estimators with equalnominal coverage and use the negatively oriented interval score(43) This loss function can be written as

Lα(I θ) = λ(I) + 2

αinfηisinI

|θ minus η| (61)

and applies to interval estimates with upper and lower ex-ceedance probability α

2 times 100 This approach can again betraced back to Dunsmore (1968) and Winkler (1972) and avoidsparadoxes as a consequence of the propriety of the intervalscore Compared with (60) the loss function (61) provides amore flexible assessment of the coverage by taking the distancebetween the interval estimate and the estimand into account

10 AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the atten-tion of a broad statistical and general scientific audience Properscoring rules lie at the heart of much statistical theory and prac-tice and we have demonstrated ways in which they bear on pre-diction and estimation We close with a succinct necessarilyincomplete and subjective discussion of directions for futurework

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies.

Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.


Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
(1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
(1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
(2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics–Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
(1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
(1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
(1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.


Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
(1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
(2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
(1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
(1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
(1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
(1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.




Hendrickson A D and Buehler R J (1971) ldquoProper Scores for ProbabilityForecastersrdquo The Annals of Mathematical Statistics 42 1916ndash1921

Hersbach H (2000) ldquoDecomposition of the Continuous Ranked Probabil-ity Score for Ensemble Prediction Systemsrdquo Weather and Forecasting 15559ndash570

Hofmann T Schoumllkopf B and Smola A (2005) ldquoA Review of RKHS Meth-ods in Machine Learningrdquo preprint

Huber P J (1964) ldquoRobust Estimation of a Location Parameterrdquo The Annalsof Mathematical Statistics 35 73ndash101

(1967) ldquoThe Behavior of Maximum Likelihood Estimates Under Non-Standard Conditionsrdquo in Proceedings of the Fifth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam and J NeymanBerkeley CA University of California Press pp 221ndash233

(1981) Robust Statistics New York WileyJeffreys H (1939) Theory of Probability Oxford UK Oxford University

PressJolliffe I T (2006) ldquoUncertainty and Inference for Verification Measuresrdquo

Weather and Forecasting in pressJolliffe I T and Stephenson D B (eds) (2003) Forecast Verification

A Practicionerrsquos Guide in Atmospheric Science Chichester UK WileyKabaila P (1999) ldquoThe Relevance Property for Prediction Intervalsrdquo Journal

of Time Series Analysis 20 655ndash662Kabaila P and He Z (2001) ldquoOn Prediction Intervals for Conditionally Het-

eroscedastic Processesrdquo Journal of Time Series Analysis 22 725ndash731Kass R E and Raftery A E (1995) ldquoBayes Factorsrdquo Journal of the American

Statistical Association 90 773ndash795Knorr-Held L and Rainer E (2001) ldquoProjections of Lung Cancer in West

Germany A Case Study in Bayesian Predictionrdquo Biostatistics 2 109ndash129Koenker R and Bassett G (1978) ldquoRegression Quantilesrdquo Econometrica 46

33ndash50

378 Journal of the American Statistical Association March 2007

Koenker R and Machado J A F (1999) ldquoGoodness-of-Fit and Related Infer-ence Processes for Quantile Regressionrdquo Journal of the American StatisticalAssociation 94 1296ndash1310

Kohonen J and Suomela J (2006) ldquoLessons Learned in the Challenge Mak-ing Predictions and Scoring Themrdquo in Machine Learning Challenges Eval-uating Predictive Uncertainty Visual Object Classification and RecognizingTextual Entailment eds J Quinonero-Candela I Dagan B Magnini andF drsquoAlcheacute-Buc Berlin Springer-Verlag pp 95ndash116

Koldobskiı A L (1992) ldquoSchoenbergrsquos Problem on Positive Definite Func-tionsrdquo St Petersburg Mathematical Journal 3 563ndash570

Krzysztofowicz R and Sigrest A A (1999) ldquoComparative Verification ofGuidance and Local Quantitative Precipitation Forecasts Calibration Analy-sesrdquo Weather and Forecasting 14 443ndash454

Langland R H Toth Z Gelaro R Szunyogh I Shapiro M A MajumdarS J Morss R E Rohaly G D Velden C Bond N and BishopC H (1999) ldquoThe North Pacific Experiment (NORPEX-98) Targeted Ob-servations for Improved North American Weather Forecastsrdquo Bulletin of theAmerican Meteorological Society 90 1363ndash1384

Laud P W and Ibrahim J G (1995) ldquoPredictive Model Selectionrdquo Journalof the Royal Statistical Society Ser B 57 247ndash262

Lehmann E and Casella G (1998) Theory of Point Estimation (2nd ed)New York Springer

Liu R Y (1990) ldquoOn a Notion of Data Depth Based on Random SimplicesrdquoThe Annals of Statistics 18 405ndash414

Ma C (2003) ldquoNonstationary Covariance Functions That Model SpacendashTimeInteractionsrdquo Statistics amp Probability Letters 61 411ndash419

Mason S J (2004) ldquoOn Using Climatology as a Reference Strategy in theBrier and Ranked Probability Skill Scoresrdquo Monthly Weather Review 1321891ndash1895

Matheron G (1984) ldquoThe Selectivity of the Distributions and the lsquoSecondPrinciple of Geostatisticsrsquo rdquo in Geostatistics for Natural Resources Charac-terization eds G Verly M David and A G Journel Dordrecht Reidelpp 421ndash434

Matheson J E and Winkler R L (1976) ldquoScoring Rules for ContinuousProbability Distributionsrdquo Management Science 22 1087ndash1096

Mattner L (1997) ldquoStrict Definiteness via Complete Monotonicity of Inte-gralsrdquo Transactions of the American Mathematical Society 349 3321ndash3342

McCarthy J (1956) ldquoMeasures of the Value of Informationrdquo Proceedings ofthe National Academy of Sciences 42 654ndash655

Murphy A H (1973) ldquoHedging and Skill Scores for Probability ForecastsrdquoJournal of Applied Meteorology 12 215ndash223

Murphy A H and Winkler R L (1992) ldquoDiagnostic Verification of Proba-bility Forecastsrdquo International Journal of Forecasting 7 435ndash455

Nau R F (1985) ldquoShould Scoring Rules Be lsquoEffectiversquordquo Management Sci-ence 31 527ndash535

Palmer T N (2002) ldquoThe Economic Value of Ensemble Forecasts as a Toolfor Risk Assessment From Days to Decadesrdquo Quarterly Journal of the RoyalMeteorological Society 128 747ndash774

Pepe M S (2003) The Statistical Evaluation of Medical Tests for Classifica-tion and Prediction Oxford UK Oxford University Press

Perlman M D (1972) ldquoOn the Strong Consistency of Approximate MaximumLikelihood Estimatorsrdquo in Proceedings of the Sixth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam J Neyman andE L Scott Berkeley CA University of California Press pp 263ndash281

Pfanzagl J (1969) ldquoOn the Measurability and Consistency of Minimum Con-trast Estimatesrdquo Metrika 14 249ndash272

Potts J (2003) ldquoBasic Conceptsrdquo in Forecast Verification A PracticionerrsquosGuide in Atmospheric Science eds I T Jolliffe and D B Stephenson Chich-ester UK Wiley pp 13ndash36

Quintildeonero-Candela J Rasmussen C E Sinz F Bousquet O andSchoumllkopf B (2006) ldquoEvaluating Predictive Uncertainty Challengerdquo in Ma-chine Learning Challenges Evaluating Predictive Uncertainty Visual Ob-ject Classification and Recognizing Textual Entailment eds J Quinonero-Candela I Dagan B Magnini and F drsquoAlcheacute-Buc Berlin Springerpp 1ndash27

Raftery A E Gneiting T Balabdaoui F and Polakowski M (2005) ldquoUs-ing Bayesian Model Averaging to Calibrate Forecast Ensemblesrdquo MonthlyWeather Review 133 1155ndash1174

Rockafellar R T (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Roulston M S and Smith L A (2002) ldquoEvaluating Probabilistic ForecastsUsing Information Theoryrdquo Monthly Weather Review 130 1653ndash1660

Savage L J (1971) ldquoElicitation of Personal Probabilities and ExpectationsrdquoJournal of the American Statistical Association 66 783ndash801

Schervish M J (1989) ldquoA General Method for Comparing Probability Asses-sorsrdquo The Annals of Statistics 17 1856ndash1879

Schumacher M Graf E and Gerds T (2003) ldquoHow to Assess PrognosticModels for Survival Data A Case Study in Oncologyrdquo Methods of Informa-tion in Medicine 42 564ndash571

Schwarz G (1978) ldquoEstimating the Dimension of a Modelrdquo The Annals ofStatistics 6 461ndash464

Selten R (1998) ldquoAxiomatic Characterization of the Quadratic Scoring RulerdquoExperimental Economics 1 43ndash62

Shuford E H Albert A and Massengil H E (1966) ldquoAdmissible Probabil-ity Measurement Proceduresrdquo Psychometrika 31 125ndash145

Smyth P (2000) ldquoModel Selection for Probabilistic Clustering Using Cross-Validated Likelihoodrdquo Statistics and Computing 10 63ndash72

Spiegelhalter D J Best N G Carlin B R and van der Linde A (2002)ldquoBayesian Measures of Model Complexity and Fitrdquo (with discussion and re-joinder) Journal of the Royal Statistical Society Ser B 64 583ndash616

Staeumll von Holstein C-A S (1970) ldquoA Family of Strictly Proper ScoringRules Which Are Sensitive to Distancerdquo Journal of Applied Meteorology9 360ndash364

(1977) ldquoThe Continuous Ranked Probability Score in Practicerdquo in De-cision Making and Change in Human Affairs eds H Jungermann and G deZeeuw Dordrecht Reidel pp 263ndash273

Szeacutekely G J (2003) ldquoE-Statistics The Energy of Statistical Samplesrdquo Tech-nical Report 2003-16 Bowling Green State University Dept of Mathematicsand Statistics

Szeacutekely G J and Rizzo M L (2005) ldquoA New Test for Multivariate Normal-ityrdquo Journal of Multivariate Analysis 93 58ndash80

Taylor J W (1999) ldquoEvaluating Volatility and Interval Forecastsrdquo Journal ofForecasting 18 111ndash128

Tetlock P E (2005) Political Expert Judgement Princeton NJ Princeton Uni-versity Press

Theis S (2005) ldquoDeriving Probabilistic Short-Range Forecasts From aDeterministic High-Resolution Modelrdquo unpublished doctoral dissertationRheinische Friedrich-Wilhelms-Universitaumlt Bonn Germany Mathematisch-Naturwissenschaftliche Fakultaumlt

Toth Z Zhu Y and Marchok T (2001) ldquoThe Use of Ensembles to IdentifyForecasts With Small and Large Uncertaintyrdquo Weather and Forecasting 16463ndash477

Unger D A (1985) ldquoA Method to Estimate the Continuous Ranked Probabil-ity Scorerdquo in Preprints of the Ninth Conference on Probability and Statisticsin Atmospheric Sciences Virginia Beach Virginia Boston American Mete-orological Society pp 206ndash213

Wald A (1949) ldquoNote on the Consistency of the Maximum Likelihood Esti-materdquo The Annals of Mathematical Statistics 20 595ndash601

Weigend A S and Shi S (2000) ldquoPredicting Daily Probability Distributionsof SampP500 Returnsrdquo Journal of Forecasting 19 375ndash392

Wilks D S (2002) ldquoSmoothing Forecast Ensembles With Fitted ProbabilityDistributionsrdquo Quarterly Journal of the Royal Meteorological Society 1282821ndash2836

(2006) Statistical Methods in the Atmospheric Sciences (2nd ed)Amsterdam Elsevier

Wilson L J Burrows W R and Lanzinger A (1999) ldquoA Strategy for Verifi-cation of Weather Element Forecasts From an Ensemble Prediction SystemrdquoMonthly Weather Review 127 956ndash970

Winkler R L (1969) ldquoScoring Rules and the Evaluation of Probability Asses-sorsrdquo Journal of the American Statistical Association 64 1073ndash1078

(1972) ldquoA Decision-Theoretic Approach to Interval Estimationrdquo Jour-nal of the American Statistical Association 67 187ndash191

(1994) ldquoEvaluating Probabilities Asymmetric Scoring Rulesrdquo Man-agement Science 40 1395ndash1405

(1996) ldquoScoring Rules and the Evaluation of Probabilitiesrdquo (with dis-cussion and reply) Test 5 1ndash60

Winkler R L and Murphy A H (1968) ldquolsquoGoodrsquo Probability AssessorsrdquoJournal of Applied Meteorology 7 751ndash758

(1979) ldquoThe Use of Probabilities in Forecasts of Maximum and Min-imum Temperaturesrdquo Meteorological Magazine 108 317ndash329

Zastavnyi V P (1993) ldquoPositive Definite Functions Depending on the NormrdquoRussian Journal of Mathematical Physics 1 511ndash522

Zuo Y and Serfling R (2000) ldquoGeneral Notions of Statistical Depth Func-tionsrdquo The Annals of Statistics 28 461ndash482

Page 14: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster

372 Journal of the American Statistical Association March 2007

7.1 Logarithmic Score and Bayes Factors

Probabilistic forecasting rules are often generated by probabilistic models, and the standard Bayesian approach to comparing probabilistic models is by Bayes factors. Suppose that we have a sample $X = (X_1, \ldots, X_n)$ of values to be forecast. Suppose also that we have two forecasting rules based on probabilistic models $H_1$ and $H_2$. So far in this article we have concentrated on the situation where the forecasting rule is completely specified before any of the $X_i$'s are observed; that is, there are no parameters to be estimated from the data being forecast. In that situation, the Bayes factor for $H_1$ against $H_2$ is

$$
B = \frac{P(X \mid H_1)}{P(X \mid H_2)}, \tag{50}
$$

where $P(X \mid H_k) = \prod_{i=1}^{n} P(X_i \mid H_k)$ for $k = 1, 2$ (Jeffreys 1939; Kass and Raftery 1995).

Thus, if the logarithmic score is used, then the log Bayes factor is the difference of the scores for the two models,

$$
\log B = \mathrm{LogS}(H_1, X) - \mathrm{LogS}(H_2, X). \tag{51}
$$

This was pointed out by Good (1952), who called the log Bayes factor the weight of evidence. It establishes two connections: (1) the Bayes factor is equivalent to the logarithmic score in this no-parameter case, and (2) the Bayes factor applies more generally than merely to the comparison of parametric probabilistic models, but also to the comparison of probabilistic forecasting rules of any kind.
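As a concrete illustration, the following sketch (ours; the two Gaussian forecasting rules and all numbers are illustrative assumptions, not from the paper) verifies (51) numerically for fully specified models, computing the log Bayes factor as a difference of total logarithmic scores.

```python
# Sketch: for fully specified models, the log Bayes factor in (50)-(51)
# equals the difference in total logarithmic scores. Illustrative models:
# H1 = N(0, 1), H2 = N(0.5, 1.5^2); data generated under H1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200)  # sample X_1, ..., X_n

log_score_H1 = norm.logpdf(x, loc=0.0, scale=1.0).sum()  # LogS(H1, X)
log_score_H2 = norm.logpdf(x, loc=0.5, scale=1.5).sum()  # LogS(H2, X)

log_B = log_score_H1 - log_score_H2  # log Bayes factor, eq. (51)
print(f"log B = {log_B:.2f}")        # positive values favor H1
```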

So far in this article we have taken probabilistic forecasts to be fully specified, but often they are specified only up to unknown parameters estimated from the data. Now suppose that the forecasting rules considered are specified only up to unknown parameters $\theta_k$ for $H_k$, to be estimated from the data. Then the Bayes factor is still given by (50), but now $P(X \mid H_k)$ is the integrated likelihood

$$
P(X \mid H_k) = \int p(X \mid \theta_k, H_k)\, p(\theta_k \mid H_k)\, d\theta_k,
$$

where $p(X \mid \theta_k, H_k)$ is the (usual) likelihood under model $H_k$ and $p(\theta_k \mid H_k)$ is the prior distribution of the parameter $\theta_k$.

Dawid (1984) showed that when the data come in a particular order, such as time order, the integrated likelihood can be reformulated in predictive terms,

$$
P(X \mid H_k) = \prod_{t=1}^{n} P(X_t \mid X^{t-1}, H_k), \tag{52}
$$

where $X^{t-1} = (X_1, \ldots, X_{t-1})$, with $X^{0}$ the empty set, and $P(X_t \mid X^{t-1}, H_k)$ is the predictive distribution of $X_t$ given the past values under $H_k$, namely

$$
P(X_t \mid X^{t-1}, H_k) = \int p(X_t \mid \theta_k, H_k)\, P(\theta_k \mid X^{t-1}, H_k)\, d\theta_k,
$$

with $P(\theta_k \mid X^{t-1}, H_k)$ the posterior distribution of $\theta_k$ given the past observations $X^{t-1}$.

We let $S_{kB} = \log P(X \mid H_k)$ denote the log integrated likelihood, viewed now as a scoring rule. To view it as a scoring rule, it helps to rewrite it as

$$
S_{kB} = \sum_{t=1}^{n} \log P(X_t \mid X^{t-1}, H_k). \tag{53}
$$

Dawid (1984) showed that $S_{kB}$ is asymptotically equivalent to the plug-in maximum likelihood prequential score

$$
S_{kD} = \sum_{t=1}^{n} \log P\bigl(X_t \mid X^{t-1}, \hat{\theta}_k^{\,t-1}\bigr), \tag{54}
$$

where $\hat{\theta}_k^{\,t-1}$ is the maximum likelihood estimator (MLE) of $\theta_k$ based on the past observations $X^{t-1}$, in the sense that $S_{kD}/S_{kB} \to 1$ as $n \to \infty$. Initial terms for which $\hat{\theta}_k^{\,t-1}$ is possibly undefined can be ignored. Dawid also showed that $S_{kB}$ is asymptotically equivalent to the Bayes information criterion (BIC) score

$$
S_{k\mathrm{BIC}} = \sum_{t=1}^{n} \log P\bigl(X_t \mid X^{t-1}, \hat{\theta}_k^{\,n}\bigr) - \frac{d_k}{2} \log n,
$$

where $d_k = \dim(\theta_k)$, in the same sense, namely $S_{k\mathrm{BIC}}/S_{kB} \to 1$ as $n \to \infty$. This justifies using the BIC for comparing forecasting rules, extending the previous justification of Schwarz (1978), which related only to comparing models.
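A minimal sketch of these prequential computations, under an assumed toy model (a Gaussian location model with known unit variance and a conjugate Gaussian prior; none of these modeling choices come from the paper): it evaluates $S_{kB}$ via (53), the plug-in score (54), and the BIC-type score, which should agree to first order for large $n$.

```python
# Sketch: prequential evaluation (53)-(54) and the BIC approximation for a
# toy model H: X_t ~ N(theta, 1) with conjugate prior theta ~ N(0, 10^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(loc=1.0, scale=1.0, size=n)

tau2 = 10.0**2  # prior variance
S_B = 0.0       # log integrated likelihood via (53)
S_D = 0.0       # plug-in prequential score (54), skipping t = 1
for t in range(n):
    past = x[:t]
    # Posterior of theta given the past observations (Normal-Normal conjugacy)
    post_var = 1.0 / (1.0 / tau2 + t)
    post_mean = post_var * past.sum()
    # Predictive distribution of X_t given the past: N(post_mean, 1 + post_var)
    S_B += norm.logpdf(x[t], loc=post_mean, scale=np.sqrt(1.0 + post_var))
    if t >= 1:  # MLE undefined before any data; ignore the initial term
        S_D += norm.logpdf(x[t], loc=past.mean(), scale=1.0)

# BIC-type score: maximized log-likelihood minus (d/2) log n, with d = 1
S_BIC = norm.logpdf(x, loc=x.mean(), scale=1.0).sum() - 0.5 * np.log(n)

print(S_B, S_D, S_BIC)  # the three totals agree to first order for large n
```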

These results have two limitations, however. First, they assume that the data come in a particular order. Second, they use only the logarithmic score, not other scores that might be more appropriate for the task at hand. We now briefly consider how these limitations might be addressed.

7.2 Scoring Rules and Random-Fold Cross-Validation

Suppose now that the data are unordered. We can replace (53) by

$$
S^*_{kB} = \sum_{t=1}^{n} \mathrm{E}_D\bigl[\log p\bigl(X_t \mid X^{(D)}, H_k\bigr)\bigr], \tag{55}
$$

where $D$ is a random sample from $\{1, \ldots, t-1, t+1, \ldots, n\}$, the size of which is a random variable with a discrete uniform distribution on $\{0, 1, \ldots, n-1\}$. Dawid's results imply that this is asymptotically equivalent to the plug-in maximum likelihood version

$$
S^*_{kD} = \sum_{t=1}^{n} \mathrm{E}_D\bigl[\log p\bigl(X_t \mid X^{(D)}, \hat{\theta}_k^{(D)}, H_k\bigr)\bigr], \tag{56}
$$

where $\hat{\theta}_k^{(D)}$ is the MLE of $\theta_k$ based on $X^{(D)}$. Terms for which the size of $D$ is small and $\hat{\theta}_k^{(D)}$ is possibly undefined can be ignored.

The formulations (55) and (56) may be useful because they turn a score that was a sum of nonidentically distributed terms into one that is a sum of identically distributed, exchangeable terms. This opens the possibility of evaluating $S^*_{kB}$ or $S^*_{kD}$ by Monte Carlo, which would be a form of cross-validation. In this cross-validation, the amount of data left out would be random rather than fixed, leading us to call it random-fold cross-validation. Smyth (2000) used the log-likelihood as the criterion function in cross-validation, as here, calling the resulting method cross-validated likelihood, but used a fixed holdout sample size. This general approach can be traced back at least to Geisser and Eddy (1979). One issue in cross-validation generally is how much data to leave out; different choices lead to different versions of cross-validation, such as leave-one-out, 10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.
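The following sketch (ours) estimates the plug-in random-fold score (56) by Monte Carlo for the same assumed Gaussian toy model as above; the fold $D$ and its uniformly distributed size follow the description above, and draws for which the MLE is undefined are ignored.

```python
# Sketch: a Monte Carlo estimate of the random-fold cross-validation score
# (56) for the toy model X ~ N(theta, 1); illustrative, not from the paper.
import numpy as np
from scipy.stats import norm

def random_fold_score(x, n_draws=100, rng=None):
    """Estimate S*_D by averaging plug-in log scores over random folds D."""
    rng = np.random.default_rng(rng)
    n = len(x)
    total = 0.0
    for t in range(n):
        others = np.delete(np.arange(n), t)
        draws = 0.0
        m = 0
        for _ in range(n_draws):
            size = rng.integers(0, n)            # |D| uniform on {0, ..., n-1}
            if size == 0:
                continue                         # MLE undefined; ignore term
            D = rng.choice(others, size=size, replace=False)
            theta_hat = x[D].mean()              # MLE from the fold X^(D)
            draws += norm.logpdf(x[t], loc=theta_hat, scale=1.0)
            m += 1
        if m > 0:
            total += draws / m
    return total

x = np.random.default_rng(1).normal(1.0, 1.0, size=60)
print(random_fold_score(x, rng=2))
```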

We have seen by (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In $S_{kB}$, $S_{kD}$, $S_{k\mathrm{BIC}}$, $S^*_{kB}$, and $S^*_{kD}$ we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of $r_0 = 1.55$.
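For concreteness, a small sketch of the ratio computation, assuming a hypothetical array `ens` of ensemble member forecasts and a vector `obs` of verifying observations (names are ours, for illustration):

```python
# Sketch: computing the dispersion ratio r0 from a verification dataset;
# `ens` is a hypothetical (n_records, 5) array of ensemble forecasts and
# `obs` the corresponding verifying observations.
import numpy as np

def dispersion_ratio(ens, obs):
    rmse = np.sqrt(np.mean((ens.mean(axis=1) - obs) ** 2))  # RMSE of ensemble mean
    spread = np.sqrt(np.mean(ens.var(axis=1, ddof=1)))      # sqrt of average ensemble variance
    return rmse / spread                                     # r0 > 1 indicates underdispersion
```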

This underdispersive behavior (that is, observed errors that tend to be larger on average than suggested by the ensemble spread) is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to r as an inflation factor.

8.2 Evaluation of Density Forecasts

In the aforementioned approach, the predictive density is Gaussian, say $\varphi_{\mu, r\sigma}$; its mean $\mu$ is the ensemble mean forecast, and its standard deviation $r\sigma$ is the product of the inflation factor $r$ and the standard deviation of the five-member forecast ensemble, $\sigma$. We considered various scoring rules S and computed the average score

$$
s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S\bigl(\varphi_{\mu_i, r\sigma_i}, x_i\bigr), \qquad r > 0, \tag{57}
$$

as a function of the inflation factor $r$. The index $i$ refers to the $i$th record in the verification database, and $x_i$ denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect $s(r)$ to be maximized at some $r > 1$, possibly near the observed ratio $r_0 = 1.55$ of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors $r \in (0, 5)$ and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if $p$ denotes the predictive density and $x$ denotes the observed value, then

$$
\mathrm{QS}(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2\,dy,
$$
$$
\mathrm{SphS}(p, x) = \frac{p(x)}{\bigl(\int_{-\infty}^{\infty} p(y)^2\,dy\bigr)^{1/2}},
$$
$$
\mathrm{LogS}(p, x) = \log p(x),
$$
$$
\mathrm{CRPS}(p, x) = \tfrac{1}{2}\,\mathrm{E}_p|X - X'| - \mathrm{E}_p|X - x|,
$$
$$
\mathrm{LinS}(p, x) = p(x),
$$
and
$$
\mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y)\,dy.
$$
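For the Gaussian predictive densities used here, each of these scores has a closed form; the sketch below (our derivation, using $\int \varphi_{\mu,\sigma}(y)^2\,dy = 1/(2\sigma\sqrt{\pi})$ and $\mathrm{E}|X - X'| = 2\sigma/\sqrt{\pi}$ for $X, X' \sim N(\mu, \sigma^2)$ independent) evaluates them, along with the mean score (57) for a given inflation factor $r$.

```python
# Sketch: closed forms of the six scores for a Gaussian predictive density
# N(mu, sigma^2), as needed to evaluate the mean score (57).
import numpy as np
from scipy.stats import norm

def scores_gaussian(mu, sigma, x):
    z = (x - mu) / sigma
    p_x = norm.pdf(x, mu, sigma)
    p_sq = 1.0 / (2.0 * sigma * np.sqrt(np.pi))      # integral of p(y)^2
    qs = 2.0 * p_x - p_sq
    sphs = p_x / np.sqrt(p_sq)
    logs = norm.logpdf(x, mu, sigma)
    e_abs = sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z))
    crps = sigma / np.sqrt(np.pi) - e_abs            # positively oriented, as above
    lins = p_x
    ps = norm.cdf(x + 1.0, mu, sigma) - norm.cdf(x - 1.0, mu, sigma)
    return dict(QS=qs, SphS=sphs, LogS=logs, CRPS=crps, LinS=lins, PS=ps)

# Mean score s(r) over verification records, eq. (57), for inflation factor r:
def mean_score(mu, sigma, x, r, key):
    return np.mean([scores_gaussian(m, r * s, xi)[key]
                    for m, s, xi in zip(mu, sigma, x)])
```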

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of r > 1, thereby confirming the underdispersive character of the ensemble. These scores are proper. The linear and probability scores were maximized at r = 0.5 and r = 0.2, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000

Score                      argmax_r s(r) in eq. (57)    Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                         40s + 6
Spherical score (SphS)     1.84                         108s − 22
Logarithmic score (LogS)   2.41                         s + 13
CRPS                       1.62                         10s + 8
Linear score (LinS)        0.5                          105s − 5
Probability score (PS)     0.2                          60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.

[Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.]

It is interesting to observe that the logarithmic score gave the highest maximizing value of r. The logarithmic score is strictly proper, but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor r is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central $(1 - \alpha) \times 100\%$ prediction interval, where $\alpha = .50$ and $\alpha = .10$. The associated lower and upper prediction bounds $l_i$ and $u_i$ are the $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$ quantiles of a Gaussian distribution with mean $\mu_i$ and standard deviation $r\sigma_i$, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor $r$ in two ways: by computing the empirical coverage of the prediction intervals, and by computing

$$
s_\alpha(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S^{\mathrm{int}}_\alpha(l_i, u_i; x_i), \qquad r > 0, \tag{58}
$$

where $S^{\mathrm{int}}_\alpha$ denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with r. For $\alpha = .50$ and $\alpha = .10$, the nominal coverage was obtained at r = 1.78 and r = 2.11, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor r. For $\alpha = .50$ and $\alpha = .10$, the score was optimized at r = 1.56 and r = 1.72.

[Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50) and the 90% central prediction interval (α = .10; score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.]
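A sketch of this computation (ours), writing the negatively oriented interval score for bounds l and u as the interval width plus penalties of size (2/α) times the amount by which the observation misses the interval:

```python
# Sketch: the negatively oriented interval score used in (58), with central
# (1 - alpha) x 100% prediction bounds from the Gaussian predictive density.
import numpy as np
from scipy.stats import norm

def interval_score(l, u, x, alpha):
    """S_alpha^int(l, u; x): width plus penalties for missed observations."""
    return ((u - l)
            + (2.0 / alpha) * np.maximum(l - x, 0.0)   # observation below l
            + (2.0 / alpha) * np.maximum(x - u, 0.0))  # observation above u

def mean_interval_score(mu, sigma, x, r, alpha):
    l = norm.ppf(alpha / 2.0, mu, r * sigma)           # lower bound l_i
    u = norm.ppf(1.0 - alpha / 2.0, mu, r * sigma)     # upper bound u_i
    return np.mean(interval_score(l, u, x, alpha))     # s_alpha(r), eq. (58)
```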

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules also are of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model $P_\theta$ based on a sample $X_1, \ldots, X_n$ of identically distributed observations. To estimate $\theta$, we can measure the goodness of fit by the mean score

$$
S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, X_i),
$$

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If $\theta_0$ denotes the true parameter value, then asymptotic arguments indicate that

$$
\arg\max_\theta\, S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \tag{59}
$$

This suggests a general approach to estimation: Choose a strictly proper scoring rule tailored to the problem at hand, and take $\hat{\theta}_n = \arg\max_\theta S_n(\theta)$ as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a $\psi$-function of the form $\psi(x) = 2\Phi(x/c) - 1$, where c is a positive constant and $\Phi$ denotes the standard Gaussian cumulative. This provides a smooth version of the $\psi$-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaption of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.
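As a sketch of optimum score estimation (ours, not the implementation of Gneiting et al. 2005), the following minimizes the mean negatively oriented CRPS of a Gaussian model over its mean and standard deviation; with the logarithmic score in place of the CRPS, the same routine would yield the MLE.

```python
# Sketch: optimum score estimation by minimum CRPS for a Gaussian model,
# using the closed-form CRPS of N(mu, sigma^2); illustrative assumptions only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu, sigma, x):
    # Negatively oriented CRPS of N(mu, sigma^2) at observation x
    z = (x - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

x = np.random.default_rng(3).normal(2.0, 1.5, size=200)
res = minimize(lambda th: crps_gaussian(th[0], np.exp(th[1]), x).mean(),
               x0=np.array([0.0, 0.0]))                 # log-scale keeps sigma > 0
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the MLE here, but need not be in general
```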

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression, using an optimum score estimator based on the proper scoring rule (41).


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox, due to J. O. Berger, which concerns interval estimators of the location parameter $\theta$ in a Gaussian population with unknown scale. Under the loss function

$$
L(I, \theta) = c\,\lambda(I) - \mathbf{1}\{\theta \in I\}, \tag{60}
$$

where c is a positive constant and $\lambda(I)$ denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators, based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

$$
L_\alpha(I, \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta|, \tag{61}
$$

and applies to interval estimates with upper and lower exceedance probability $\frac{\alpha}{2} \times 100\%$. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and it avoids paradoxes, as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage, by taking the distance between the interval estimate and the estimand into account.

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on $\mathbb{R}^m$, then a depth function $D(P, x)$ gives a P-based center-outward ordering of points $x \in \mathbb{R}^m$. Formally, this resembles a scoring rule $S(P, x)$ that assigns a P-based numerical value to an event $x \in \mathbb{R}^m$. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics – Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart, and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics,'" in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Political Expert Judgement, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.

Page 15: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster


10-fold, and so on. Considering versions of cross-validation in the context of scoring rules may shed some light on this issue.

We have seen in (51) that when there are no parameters being estimated, the Bayes factor is equivalent to the difference in the logarithmic score. Thus, we could replace the logarithmic score by another proper score, and the difference in scores could be viewed as a kind of predictive Bayes factor with a different type of score. In $S_{kB}$, $S_{kD}$, $S_{kBIC}$, $S^{*}_{kB}$, and $S^{*}_{kD}$ we could replace the terms in the sums (each of which has the form of a logarithmic score) by another proper scoring rule, such as the CRPS, and we conjecture that similar asymptotic equivalences would remain valid.
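To make the conjectured construction concrete, the following minimal sketch (ours, not from the original article; the predictive distributions and data are synthetic) compares two fixed Gaussian predictive distributions by their cumulative score difference, once with the logarithmic score, where the difference corresponds to the log Bayes factor in the no-estimated-parameters case discussed above, and once with the CRPS, giving the conjectured analog.

```python
import numpy as np
from scipy import stats

def crps_gaussian(mu, sigma, x):
    """Positively oriented CRPS of N(mu, sigma^2) at x (closed form)."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi)
                    - z * (2.0 * stats.norm.cdf(z) - 1.0)
                    - 2.0 * stats.norm.pdf(z))

def score_difference(score, model_a, model_b, xs):
    """Sum of score(A, x) - score(B, x) over the realizations xs."""
    return sum(score(*model_a, x) - score(*model_b, x) for x in xs)

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, size=200)  # observations
log_score = lambda mu, sigma, x: stats.norm.logpdf(x, mu, sigma)

# Log-score difference: the log Bayes factor for fixed predictive distributions.
print(score_difference(log_score, (0.0, 1.0), (0.5, 1.0), xs))
# CRPS-based difference: the conjectured predictive Bayes factor analog.
print(score_difference(crps_gaussian, (0.0, 1.0), (0.5, 1.0), xs))
```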

8. CASE STUDY: PROBABILISTIC FORECASTS OF SEA-LEVEL PRESSURE OVER THE NORTH AMERICAN PACIFIC NORTHWEST

Our goals in this case study are to illustrate the use and the properties of scoring rules and to demonstrate the importance of propriety.

8.1 Probabilistic Weather Forecasting Using Ensembles

Operational probabilistic weather forecasts are based on ensemble prediction systems. Ensemble systems typically generate a set of perturbations of the best estimate of the current state of the atmosphere, run each of them forward in time using a numerical weather prediction model, and use the resulting set of forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002; Gneiting and Raftery 2005).

Grimit and Mass (2002) described the University of Washington ensemble prediction system over the Pacific Northwest, which covers Oregon, Washington, British Columbia, and parts of the Pacific Ocean. This is a five-member ensemble comprising distinct runs of the MM5 numerical weather prediction model, with initial conditions taken from distinct national and international weather centers. We consider 48-hour-ahead forecasts of sea-level pressure in January–June 2000, the same period as that on which the work of Grimit and Mass was based. The unit used is the millibar (mb). Our analysis builds on a verification database of 16,015 records scattered over the North American Pacific Northwest and the aforementioned 6-month period. Each record consists of the five ensemble member forecasts and the associated verifying observation. The root mean squared error of the ensemble mean forecast was 3.30 mb, and the square root of the average variance of the five-member forecast ensemble was 2.13 mb, resulting in a ratio of $r_0 = 1.55$.

This underdispersive behavior (that is, observed errors that tend to be larger on average than suggested by the ensemble spread) is typical of ensemble systems and seems unavoidable, given that ensembles capture only some of the sources of uncertainty (Raftery, Gneiting, Balabdaoui, and Polakowski 2005). Thus, to obtain calibrated predictive distributions, it seems necessary to carry out some form of statistical postprocessing. One natural approach is to take the predictive distribution for sea-level pressure at any given site as Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to $r$ times the standard deviation of the forecast ensemble. Density forecasts of this type were proposed by Déqué, Royer, and Stroe (1994) and Wilks (2002). Following Wilks, we refer to $r$ as an inflation factor.
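As an illustration of this postprocessing recipe, here is a minimal sketch (our own; the ensemble values are made up) that maps a five-member forecast ensemble and an inflation factor $r$ to the parameters of the Gaussian predictive density.

```python
import numpy as np

def gaussian_postprocess(ensemble, r):
    """Return (mu, r * sigma): the Gaussian predictive distribution is
    N(mu, (r * sigma)^2), where mu and sigma are the mean and standard
    deviation of the ensemble member forecasts."""
    ensemble = np.asarray(ensemble, dtype=float)
    mu = ensemble.mean()
    sigma = ensemble.std(ddof=1)  # sample standard deviation of the members
    return mu, r * sigma

members = [1012.3, 1010.8, 1013.1, 1011.5, 1012.0]  # sea-level pressure, mb
mu, s = gaussian_postprocess(members, r=1.55)
print(f"predictive distribution: N({mu:.2f}, {s:.2f}^2)")
```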

8.2 Evaluation of Density Forecasts

In the aforementioned approach the predictive density is Gaussian, say $\varphi_{\mu, r\sigma}$; its mean $\mu$ is the ensemble mean forecast, and its standard deviation $r\sigma$ is the product of the inflation factor $r$ and the standard deviation $\sigma$ of the five-member forecast ensemble. We considered various scoring rules $S$ and computed the average score

$$ s(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S(\varphi_{\mu_i, r\sigma_i}, x_i), \qquad r > 0, \tag{57} $$

as a function of the inflation factor $r$. The index $i$ refers to the $i$th record in the verification database, and $x_i$ denotes the value that materialized. Given the underdispersive character of the ensemble system, we expect $s(r)$ to be maximized at some $r > 1$, possibly near the observed ratio $r_0 = 1.55$ of the root mean squared error of the ensemble mean forecast over the square root of the average ensemble variance.

We computed the mean score (57) for inflation factors $r \in (0, 5)$ and for the quadratic score (QS), spherical score (SphS), logarithmic score (LogS), CRPS, linear score (LinS), and probability score (PS), as defined in Section 4. Briefly, if $p$ denotes the predictive density and $x$ denotes the observed value, then

$$ \mathrm{QS}(p, x) = 2p(x) - \int_{-\infty}^{\infty} p(y)^2 \, dy, $$

$$ \mathrm{SphS}(p, x) = \frac{p(x)}{\left( \int_{-\infty}^{\infty} p(y)^2 \, dy \right)^{1/2}}, $$

$$ \mathrm{LogS}(p, x) = \log p(x), $$

$$ \mathrm{CRPS}(p, x) = \frac{1}{2} \, \mathrm{E}_p |X - X'| - \mathrm{E}_p |X - x|, $$

$$ \mathrm{LinS}(p, x) = p(x), $$

and

$$ \mathrm{PS}(p, x) = \int_{x-1}^{x+1} p(y) \, dy. $$
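For a Gaussian predictive density these integrals are available in closed form, with $\int \varphi_{\mu,\sigma}(y)^2 \, dy = 1/(2\sigma\sqrt{\pi})$, so the six scores and the mean score (57) are straightforward to compute. The following sketch is ours, not part of the original study; the verification data are simulated with a deliberately underdispersive setup (true spread twice the nominal $\sigma$), so proper scores should prefer $r > 1$. It evaluates $s(r)$ on a grid and reports the maximizing inflation factor for each score.

```python
import numpy as np
from scipy import stats

SQRT_PI = np.sqrt(np.pi)

# The six scores above for a Gaussian predictive density N(mu, sigma^2),
# all positively oriented, using int p(y)^2 dy = 1 / (2 * sigma * sqrt(pi)).
def qs(mu, sigma, x):
    return 2.0 * stats.norm.pdf(x, mu, sigma) - 1.0 / (2.0 * sigma * SQRT_PI)

def sphs(mu, sigma, x):
    return stats.norm.pdf(x, mu, sigma) * np.sqrt(2.0 * sigma * SQRT_PI)

def logs(mu, sigma, x):
    return stats.norm.logpdf(x, mu, sigma)

def crps(mu, sigma, x):
    z = (x - mu) / sigma
    return sigma * (1.0 / SQRT_PI - z * (2.0 * stats.norm.cdf(z) - 1.0)
                    - 2.0 * stats.norm.pdf(z))

def lins(mu, sigma, x):
    return stats.norm.pdf(x, mu, sigma)

def ps(mu, sigma, x):
    return stats.norm.cdf(x + 1.0, mu, sigma) - stats.norm.cdf(x - 1.0, mu, sigma)

def mean_score(score, mu, sigma, x, r):
    """The average score s(r) of (57) over a verification set."""
    return np.mean(score(mu, r * sigma, x))

# Synthetic stand-in for the verification set (underdispersive by design).
rng = np.random.default_rng(0)
n = 10_000
mu = rng.normal(1010.0, 5.0, size=n)   # ensemble mean forecasts
sigma = np.full(n, 2.13)               # ensemble spreads
x = rng.normal(mu, 2.0 * sigma)        # verifying observations

grid = np.linspace(0.1, 5.0, 50)
for name, score in [("QS", qs), ("SphS", sphs), ("LogS", logs),
                    ("CRPS", crps), ("LinS", lins), ("PS", ps)]:
    best = grid[np.argmax([mean_score(score, mu, sigma, x, r) for r in grid])]
    print(f"{name}: argmax_r s(r) = {best:.2f}")
```

In this synthetic setup the proper scores are maximized near the true inflation factor, whereas the linear and probability scores favor the smallest $r$ on the grid, mirroring the behavior reported next.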

Figure 3 and Table 3 summarize the results of this experiment. The scores shown in the figure are linearly transformed so that the graphs can be compared side by side, and the transformations are listed in the rightmost column of the table. In the case of the quadratic score, for instance, we plotted 40 times the value in (57) plus 6. Clearly, transformed and original scores are equivalent in the sense of (2). The quadratic score, spherical score, logarithmic score, and CRPS were maximized at values of $r > 1$, thereby confirming the underdispersive character of the ensemble.

Table 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–June 2000

Score                      Argmax_r s(r) in eq. (57)    Linear transformation plotted in Figure 3
Quadratic score (QS)       2.18                         40s + 6
Spherical score (SphS)     1.84                         108s − 22
Logarithmic score (LogS)   2.41                         s + 13
CRPS                       1.62                         10s + 8
Linear score (LinS)        0.5                          105s − 5
Probability score (PS)     0.2                          60s − 5

NOTE: The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.


Figure 3. Probabilistic Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–June 2000. The scores are shown as a function of the inflation factor r, where the predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble. The scores were subject to linear transformations, as detailed in Table 3.

These scores are proper. The linear and probability scores were maximized at $r = 0.5$ and $r = 0.2$, thereby suggesting ignorable forecast uncertainty and essentially deterministic forecasts. The latter two scores have intuitive appeal, and the probability score has been used to assess forecast ensembles (Wilson et al. 1999). However, they are improper, and their use may result in misguided scientific inferences, as in this experiment. A similar comment applies to the predictive model choice criterion given in Section 4.4.

It is interesting to observe that the logarithmic score gave the highest maximizing value of $r$. The logarithmic score is strictly proper but involves a harsh penalty for low-probability events and thus is highly sensitive to extreme cases. Our verification database includes a number of low-spread cases for which the ensemble variance implodes. The logarithmic score penalizes the resulting predictions unless the inflation factor $r$ is large. Weigend and Shi (2000, p. 382) noted similar concerns and considered the use of trimmed means when computing the logarithmic score. In our experience, the CRPS is less sensitive to extreme cases or outliers and provides an attractive alternative.

8.3 Evaluation of Interval Forecasts

The aforementioned predictive densities also provide interval forecasts. We considered the central $(1 - \alpha) \times 100\%$ prediction interval, where $\alpha = 0.50$ and $\alpha = 0.10$. The associated lower and upper prediction bounds $l_i$ and $u_i$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of a Gaussian distribution with mean $\mu_i$ and standard deviation $r\sigma_i$, as described earlier. We assessed the interval forecasts in their dependence on the inflation factor $r$ in two ways: by computing the empirical coverage of the prediction intervals, and by computing

$$ s_\alpha(r) = \frac{1}{16{,}015} \sum_{i=1}^{16{,}015} S_\alpha^{\mathrm{int}}(l_i, u_i; x_i), \qquad r > 0, \tag{58} $$

where $S_\alpha^{\mathrm{int}}$ denotes the negatively oriented interval score (43). This scoring rule assesses both calibration and sharpness, by rewarding narrow prediction intervals and penalizing intervals missed by the observation. Figure 4(a) shows the empirical coverage of the interval forecasts. Clearly, the coverage increases with $r$. For $\alpha = 0.50$ and $\alpha = 0.10$, the nominal coverage was obtained at $r = 1.78$ and $r = 2.11$, which confirms the underdispersive character of the ensemble. Figure 4(b) shows the interval score (58) as a function of the inflation factor $r$. For $\alpha = 0.50$ and $\alpha = 0.10$, the score was optimized at $r = 1.56$ and $r = 1.72$.

Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–June 2000. (a) Nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = 0.50, dashed line) and the 90% central prediction interval (α = 0.10, solid line; score scaled by a factor of 60). The predictive density is Gaussian, centered at the ensemble mean forecast and with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.
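The computation behind (58) is equally simple. The sketch below is ours, with synthetic data; it takes the interval score in the form $S_\alpha^{\mathrm{int}}(l, u; x) = (u - l) + (2/\alpha)(l - x)\mathbf{1}\{x < l\} + (2/\alpha)(x - u)\mathbf{1}\{x > u\}$, which we take to be the negatively oriented interval score (43) of the text, and reports empirical coverage and mean interval score for a few inflation factors.

```python
import numpy as np
from scipy import stats

def interval_score(l, u, x, alpha):
    """Negatively oriented interval score: width plus miss penalties."""
    width = u - l
    below = (2.0 / alpha) * np.maximum(l - x, 0.0)
    above = (2.0 / alpha) * np.maximum(x - u, 0.0)
    return width + below + above

def central_interval(mu, sigma, r, alpha):
    """Central (1 - alpha) prediction interval of N(mu, (r * sigma)^2)."""
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return mu - z * r * sigma, mu + z * r * sigma

rng = np.random.default_rng(0)
mu = rng.normal(1010.0, 5.0, size=10_000)
sigma = np.full_like(mu, 2.13)
x = rng.normal(mu, 2.0 * sigma)  # underdispersive setup, as before

for alpha in (0.50, 0.10):
    for r in (1.0, 1.5, 2.0):
        l, u = central_interval(mu, sigma, r, alpha)
        coverage = np.mean((x >= l) & (x <= u))
        score = np.mean(interval_score(l, u, x, alpha))
        print(f"alpha={alpha:.2f} r={r:.1f}: coverage={coverage:.3f}, "
              f"mean interval score={score:.2f}")
```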

9. OPTIMUM SCORE ESTIMATION

Strictly proper scoring rules are also of interest in estimation problems, where they provide attractive loss and utility functions that can be adapted to the problem at hand.

9.1 Point Estimation

We return to the generic estimation problem described in Section 1. Suppose that we wish to fit a parametric model $P_\theta$ based on a sample $X_1, \ldots, X_n$ of identically distributed observations. To estimate $\theta$, we can measure the goodness of fit by the mean score

$$ S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, X_i), $$

where $S$ is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If $\theta_0$ denotes the true parameter value, then asymptotic arguments indicate that

$$ \arg\max_\theta S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \tag{59} $$

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take $\theta_n = \arg\max_\theta S_n(\theta)$ as the respective optimum score estimator. The first four values of the argmax in Table 3, for instance, refer to the optimum score estimates of the inflation factor $r$ based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a $\psi$-function of the form $\psi(x) = 2\Phi(x/c) - 1$, where $c$ is a positive constant and $\Phi$ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the $\psi$-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.
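As a concrete instance of the optimum score approach, the following sketch (ours; the data are simulated) fits the location and scale of a Gaussian model by maximizing the mean CRPS, that is, by minimum CRPS estimation, and compares the result with the maximum likelihood estimate.

```python
import numpy as np
from scipy import stats, optimize

def crps_gaussian(mu, sigma, x):
    """Positively oriented CRPS of N(mu, sigma^2) at x (closed form)."""
    z = (x - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - z * (2.0 * stats.norm.cdf(z) - 1.0)
                    - 2.0 * stats.norm.pdf(z))

def optimum_score_fit(x):
    """theta_n = argmax_theta (1/n) sum_i CRPS(P_theta, X_i)."""
    def neg_mean_crps(theta):
        mu, log_sigma = theta  # optimize log(sigma) to keep sigma positive
        return -np.mean(crps_gaussian(mu, np.exp(log_sigma), x))
    res = optimize.minimize(neg_mean_crps, x0=[np.median(x), 0.0])
    mu, log_sigma = res.x
    return mu, np.exp(log_sigma)

rng = np.random.default_rng(42)
x = rng.normal(3.0, 2.0, size=1_000)
print("minimum CRPS estimate:", optimum_score_fit(x))
print("maximum likelihood estimate:", (x.mean(), x.std()))
```

For clean Gaussian data the two estimates nearly coincide; the point of the optimum score approach is that the criterion, not the model, can be matched to the forecasting task.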

The appeal of optimum score estimation lies in the potential adaptation of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).
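The scoring rule (41) is not restated here; up to orientation it corresponds to the piecewise-linear quantile (pinball) loss that underlies Koenker and Bassett's estimator. A minimal sketch of ours, with simulated data, showing that minimizing the mean pinball loss over a constant recovers the empirical quantile:

```python
import numpy as np

def pinball_loss(q, x, alpha):
    """Mean check-function loss rho_alpha(x - q) of quantile regression."""
    u = x - q
    return np.mean(np.where(u >= 0.0, alpha * u, (alpha - 1.0) * u))

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=5_000)
alpha = 0.9

grid = np.linspace(x.min(), x.max(), 2_000)
q_hat = grid[np.argmin([pinball_loss(q, x, alpha) for q in grid])]
print(q_hat, np.quantile(x, alpha))  # the two should nearly agree
```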


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter $\theta$ in a Gaussian population with unknown scale. Under the loss function

$$ L(I, \theta) = c \, \lambda(I) - \mathbf{1}\{\theta \in I\}, \tag{60} $$

where $c$ is a positive constant and $\lambda(I)$ denotes the Lebesgue measure of the interval estimate $I$, the classical $t$-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

$$ L_\alpha(I, \theta) = \lambda(I) + \frac{2}{\alpha} \inf_{\eta \in I} |\theta - \eta| \tag{61} $$

and applies to interval estimates with upper and lower exceedance probability $(\alpha/2) \times 100\%$. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage by taking the distance between the interval estimate and the estimand into account.
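For completeness, a small sketch of ours, with illustrative numbers, of the loss (61), written with the distance term $\inf_{\eta \in I} |\theta - \eta|$ expanded for an interval $I = [l, u]$:

```python
import numpy as np

def loss_61(lower, upper, theta, alpha):
    """Width plus (2 / alpha) times the distance from theta to [lower, upper]."""
    width = upper - lower
    dist = np.maximum(lower - theta, 0.0) + np.maximum(theta - upper, 0.0)
    return width + (2.0 / alpha) * dist

theta = 0.3
print(loss_61(-1.0, 1.0, theta, alpha=0.10))  # covering interval: loss = width
print(loss_61(0.0, 0.0, theta, alpha=0.10))   # degenerate interval: penalized
```

Unlike (60), this loss charges a degenerate interval for its distance to the estimand, so shrinking to a point cannot dominate an honest interval.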

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if $P$ is a Borel probability measure on $\mathbb{R}^m$, then a depth function $D(P, x)$ gives a $P$-based center-outward ordering of points $x \in \mathbb{R}^m$. Formally, this resembles a scoring rule $S(P, x)$ that assigns a $P$-based numerical value to an event $x \in \mathbb{R}^m$. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
——— (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
——— (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
——— (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics – Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.

Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
——— (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
——— (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
——— (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.


Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengill, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
——— (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
——— (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
——— (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
——— (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
——— (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
——— (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.


Good I J (1952) ldquoRational Decisionsrdquo Journal of the Royal Statistical Soci-ety Ser B 14 107ndash114

(1971) Comment on ldquoMeasuring Information and Uncertaintyrdquo byR J Buehler in Foundations of Statistical Inference eds V P Godambeand D A Sprott Toronto Holt Rinehart and Winston pp 337ndash339

Granger C W J (2006) ldquoPreface Some Thoughts on the Future of Forecast-ingrdquo Oxford Bulletin of Economics and Statistics 67S 707ndash711

Grimit E P Gneiting T Berrocal V J and Johnson N A (2006) ldquoTheContinuous Ranked Probability Score for Circular Variables and Its Applica-tion to Mesoscale Forecast Ensemble Verificationrdquo Quarterly Journal of theRoyal Meteorological Society in press

Grimit E P and Mass C F (2002) ldquoInitial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwestrdquo Weather and Forecast-ing 17 192ndash205

Gruumlnwald P D and Dawid A P (2004) ldquoGame Theory Maximum EntropyMinimum Discrepancy and Robust Bayesian Decision Theoryrdquo The Annalsof Statistics 32 1367ndash1433

Gschloumlszligl S and Czado C (2005) ldquoSpatial Modelling of Claim Frequencyand Claim Size in Insurancerdquo Discussion Paper 461 Ludwig-Maximilians-Universitaumlt Munich Germany Sonderforschungsbereich 368

Hamill T M (1999) ldquoHypothesis Tests for Evaluating Numerical PrecipitationForecastsrdquo Weather and Forecasting 14 155ndash167

Hamill T M and Wilks D S (1995) ldquoA Probabilistic Forecast Contest andthe Difficulty in Assessing Short-Range Forecast Uncertaintyrdquo Weather andForecasting 10 620ndash631

Hendrickson A D and Buehler R J (1971) ldquoProper Scores for ProbabilityForecastersrdquo The Annals of Mathematical Statistics 42 1916ndash1921

Hersbach H (2000) ldquoDecomposition of the Continuous Ranked Probabil-ity Score for Ensemble Prediction Systemsrdquo Weather and Forecasting 15559ndash570

Hofmann T Schoumllkopf B and Smola A (2005) ldquoA Review of RKHS Meth-ods in Machine Learningrdquo preprint

Huber P J (1964) ldquoRobust Estimation of a Location Parameterrdquo The Annalsof Mathematical Statistics 35 73ndash101

(1967) ldquoThe Behavior of Maximum Likelihood Estimates Under Non-Standard Conditionsrdquo in Proceedings of the Fifth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam and J NeymanBerkeley CA University of California Press pp 221ndash233

(1981) Robust Statistics New York WileyJeffreys H (1939) Theory of Probability Oxford UK Oxford University

PressJolliffe I T (2006) ldquoUncertainty and Inference for Verification Measuresrdquo

Weather and Forecasting in pressJolliffe I T and Stephenson D B (eds) (2003) Forecast Verification

A Practicionerrsquos Guide in Atmospheric Science Chichester UK WileyKabaila P (1999) ldquoThe Relevance Property for Prediction Intervalsrdquo Journal

of Time Series Analysis 20 655ndash662Kabaila P and He Z (2001) ldquoOn Prediction Intervals for Conditionally Het-

eroscedastic Processesrdquo Journal of Time Series Analysis 22 725ndash731Kass R E and Raftery A E (1995) ldquoBayes Factorsrdquo Journal of the American

Statistical Association 90 773ndash795Knorr-Held L and Rainer E (2001) ldquoProjections of Lung Cancer in West

Germany A Case Study in Bayesian Predictionrdquo Biostatistics 2 109ndash129Koenker R and Bassett G (1978) ldquoRegression Quantilesrdquo Econometrica 46

33ndash50

378 Journal of the American Statistical Association March 2007

Koenker R and Machado J A F (1999) ldquoGoodness-of-Fit and Related Infer-ence Processes for Quantile Regressionrdquo Journal of the American StatisticalAssociation 94 1296ndash1310

Kohonen J and Suomela J (2006) ldquoLessons Learned in the Challenge Mak-ing Predictions and Scoring Themrdquo in Machine Learning Challenges Eval-uating Predictive Uncertainty Visual Object Classification and RecognizingTextual Entailment eds J Quinonero-Candela I Dagan B Magnini andF drsquoAlcheacute-Buc Berlin Springer-Verlag pp 95ndash116

Koldobskiı A L (1992) ldquoSchoenbergrsquos Problem on Positive Definite Func-tionsrdquo St Petersburg Mathematical Journal 3 563ndash570

Krzysztofowicz R and Sigrest A A (1999) ldquoComparative Verification ofGuidance and Local Quantitative Precipitation Forecasts Calibration Analy-sesrdquo Weather and Forecasting 14 443ndash454

Langland R H Toth Z Gelaro R Szunyogh I Shapiro M A MajumdarS J Morss R E Rohaly G D Velden C Bond N and BishopC H (1999) ldquoThe North Pacific Experiment (NORPEX-98) Targeted Ob-servations for Improved North American Weather Forecastsrdquo Bulletin of theAmerican Meteorological Society 90 1363ndash1384

Laud P W and Ibrahim J G (1995) ldquoPredictive Model Selectionrdquo Journalof the Royal Statistical Society Ser B 57 247ndash262

Lehmann E and Casella G (1998) Theory of Point Estimation (2nd ed)New York Springer

Liu R Y (1990) ldquoOn a Notion of Data Depth Based on Random SimplicesrdquoThe Annals of Statistics 18 405ndash414

Ma C (2003) ldquoNonstationary Covariance Functions That Model SpacendashTimeInteractionsrdquo Statistics amp Probability Letters 61 411ndash419

Mason S J (2004) ldquoOn Using Climatology as a Reference Strategy in theBrier and Ranked Probability Skill Scoresrdquo Monthly Weather Review 1321891ndash1895

Matheron G (1984) ldquoThe Selectivity of the Distributions and the lsquoSecondPrinciple of Geostatisticsrsquo rdquo in Geostatistics for Natural Resources Charac-terization eds G Verly M David and A G Journel Dordrecht Reidelpp 421ndash434

Matheson J E and Winkler R L (1976) ldquoScoring Rules for ContinuousProbability Distributionsrdquo Management Science 22 1087ndash1096

Mattner L (1997) ldquoStrict Definiteness via Complete Monotonicity of Inte-gralsrdquo Transactions of the American Mathematical Society 349 3321ndash3342

McCarthy J (1956) ldquoMeasures of the Value of Informationrdquo Proceedings ofthe National Academy of Sciences 42 654ndash655

Murphy A H (1973) ldquoHedging and Skill Scores for Probability ForecastsrdquoJournal of Applied Meteorology 12 215ndash223

Murphy A H and Winkler R L (1992) ldquoDiagnostic Verification of Proba-bility Forecastsrdquo International Journal of Forecasting 7 435ndash455

Nau R F (1985) ldquoShould Scoring Rules Be lsquoEffectiversquordquo Management Sci-ence 31 527ndash535

Palmer T N (2002) ldquoThe Economic Value of Ensemble Forecasts as a Toolfor Risk Assessment From Days to Decadesrdquo Quarterly Journal of the RoyalMeteorological Society 128 747ndash774

Pepe M S (2003) The Statistical Evaluation of Medical Tests for Classifica-tion and Prediction Oxford UK Oxford University Press

Perlman M D (1972) ldquoOn the Strong Consistency of Approximate MaximumLikelihood Estimatorsrdquo in Proceedings of the Sixth Berkeley Symposium onMathematical Statistics and Probability I eds L M Le Cam J Neyman andE L Scott Berkeley CA University of California Press pp 263ndash281

Pfanzagl J (1969) ldquoOn the Measurability and Consistency of Minimum Con-trast Estimatesrdquo Metrika 14 249ndash272

Potts J (2003) ldquoBasic Conceptsrdquo in Forecast Verification A PracticionerrsquosGuide in Atmospheric Science eds I T Jolliffe and D B Stephenson Chich-ester UK Wiley pp 13ndash36

Quintildeonero-Candela J Rasmussen C E Sinz F Bousquet O andSchoumllkopf B (2006) ldquoEvaluating Predictive Uncertainty Challengerdquo in Ma-chine Learning Challenges Evaluating Predictive Uncertainty Visual Ob-ject Classification and Recognizing Textual Entailment eds J Quinonero-Candela I Dagan B Magnini and F drsquoAlcheacute-Buc Berlin Springerpp 1ndash27

Raftery A E Gneiting T Balabdaoui F and Polakowski M (2005) ldquoUs-ing Bayesian Model Averaging to Calibrate Forecast Ensemblesrdquo MonthlyWeather Review 133 1155ndash1174

Rockafellar R T (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Roulston M S and Smith L A (2002) ldquoEvaluating Probabilistic ForecastsUsing Information Theoryrdquo Monthly Weather Review 130 1653ndash1660

Savage L J (1971) ldquoElicitation of Personal Probabilities and ExpectationsrdquoJournal of the American Statistical Association 66 783ndash801

Schervish M J (1989) ldquoA General Method for Comparing Probability Asses-sorsrdquo The Annals of Statistics 17 1856ndash1879

Schumacher M Graf E and Gerds T (2003) ldquoHow to Assess PrognosticModels for Survival Data A Case Study in Oncologyrdquo Methods of Informa-tion in Medicine 42 564ndash571

Schwarz G (1978) ldquoEstimating the Dimension of a Modelrdquo The Annals ofStatistics 6 461ndash464

Selten R (1998) ldquoAxiomatic Characterization of the Quadratic Scoring RulerdquoExperimental Economics 1 43ndash62

Shuford E H Albert A and Massengil H E (1966) ldquoAdmissible Probabil-ity Measurement Proceduresrdquo Psychometrika 31 125ndash145

Smyth P (2000) ldquoModel Selection for Probabilistic Clustering Using Cross-Validated Likelihoodrdquo Statistics and Computing 10 63ndash72

Spiegelhalter D J Best N G Carlin B R and van der Linde A (2002)ldquoBayesian Measures of Model Complexity and Fitrdquo (with discussion and re-joinder) Journal of the Royal Statistical Society Ser B 64 583ndash616

Staeumll von Holstein C-A S (1970) ldquoA Family of Strictly Proper ScoringRules Which Are Sensitive to Distancerdquo Journal of Applied Meteorology9 360ndash364

(1977) ldquoThe Continuous Ranked Probability Score in Practicerdquo in De-cision Making and Change in Human Affairs eds H Jungermann and G deZeeuw Dordrecht Reidel pp 263ndash273

Szeacutekely G J (2003) ldquoE-Statistics The Energy of Statistical Samplesrdquo Tech-nical Report 2003-16 Bowling Green State University Dept of Mathematicsand Statistics

Szeacutekely G J and Rizzo M L (2005) ldquoA New Test for Multivariate Normal-ityrdquo Journal of Multivariate Analysis 93 58ndash80

Taylor J W (1999) ldquoEvaluating Volatility and Interval Forecastsrdquo Journal ofForecasting 18 111ndash128

Tetlock P E (2005) Political Expert Judgement Princeton NJ Princeton Uni-versity Press

Theis S (2005) ldquoDeriving Probabilistic Short-Range Forecasts From aDeterministic High-Resolution Modelrdquo unpublished doctoral dissertationRheinische Friedrich-Wilhelms-Universitaumlt Bonn Germany Mathematisch-Naturwissenschaftliche Fakultaumlt

Toth Z Zhu Y and Marchok T (2001) ldquoThe Use of Ensembles to IdentifyForecasts With Small and Large Uncertaintyrdquo Weather and Forecasting 16463ndash477

Unger D A (1985) ldquoA Method to Estimate the Continuous Ranked Probabil-ity Scorerdquo in Preprints of the Ninth Conference on Probability and Statisticsin Atmospheric Sciences Virginia Beach Virginia Boston American Mete-orological Society pp 206ndash213

Wald A (1949) ldquoNote on the Consistency of the Maximum Likelihood Esti-materdquo The Annals of Mathematical Statistics 20 595ndash601

Weigend A S and Shi S (2000) ldquoPredicting Daily Probability Distributionsof SampP500 Returnsrdquo Journal of Forecasting 19 375ndash392

Wilks D S (2002) ldquoSmoothing Forecast Ensembles With Fitted ProbabilityDistributionsrdquo Quarterly Journal of the Royal Meteorological Society 1282821ndash2836

(2006) Statistical Methods in the Atmospheric Sciences (2nd ed)Amsterdam Elsevier

Wilson L J Burrows W R and Lanzinger A (1999) ldquoA Strategy for Verifi-cation of Weather Element Forecasts From an Ensemble Prediction SystemrdquoMonthly Weather Review 127 956ndash970

Winkler R L (1969) ldquoScoring Rules and the Evaluation of Probability Asses-sorsrdquo Journal of the American Statistical Association 64 1073ndash1078

(1972) ldquoA Decision-Theoretic Approach to Interval Estimationrdquo Jour-nal of the American Statistical Association 67 187ndash191

(1994) ldquoEvaluating Probabilities Asymmetric Scoring Rulesrdquo Man-agement Science 40 1395ndash1405

(1996) ldquoScoring Rules and the Evaluation of Probabilitiesrdquo (with dis-cussion and reply) Test 5 1ndash60

Winkler R L and Murphy A H (1968) ldquolsquoGoodrsquo Probability AssessorsrdquoJournal of Applied Meteorology 7 751ndash758

(1979) ldquoThe Use of Probabilities in Forecasts of Maximum and Min-imum Temperaturesrdquo Meteorological Magazine 108 317ndash329

Zastavnyi V P (1993) ldquoPositive Definite Functions Depending on the NormrdquoRussian Journal of Mathematical Physics 1 511ndash522

Zuo Y and Serfling R (2000) ldquoGeneral Notions of Statistical Depth Func-tionsrdquo The Annals of Statistics 28 461ndash482

Page 17: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster

Gneiting and Raftery Proper Scoring Rules 375

[Figure 4. Interval Forecasts of Sea-Level Pressure Over the North American Pacific Northwest in January–July 2000: (a) nominal and actual coverage, and (b) the negatively oriented interval score (58) for the 50% central prediction interval (α = .50, dashed line) and the 90% central prediction interval (α = .10, solid line; score scaled by a factor of .60). The predictive density is Gaussian, centered at the ensemble mean forecast, with predictive standard deviation equal to r times the standard deviation of the forecast ensemble.]

the mean score

$$S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, X_i),$$

where S is a scoring rule that is strictly proper relative to a convex class of probability measures that contains the parametric model. If θ0 denotes the true parameter value, then asymptotic arguments indicate that

$$\arg\max_{\theta}\, S_n(\theta) \to \theta_0 \qquad \text{as } n \to \infty. \tag{59}$$

This suggests a general approach to estimation: choose a strictly proper scoring rule tailored to the problem at hand, and take θn = arg maxθ Sn(θ) as the respective optimum score estimator. The first four values of the arg max in Table 3, for instance, refer to the optimum score estimates of the inflation factor r based on the logarithmic score, spherical score, quadratic score, and CRPS. Pfanzagl (1969) and Birgé and Massart (1993) studied optimum score estimators under the heading of minimum contrast estimators. This class includes many of the most popular estimators in various situations, such as MLEs, least squares and other estimators of regression models, and estimators for mixture models or deconvolution. Pfanzagl (1969) proved rigorous versions of the consistency result (59), and Birgé and Massart (1993) related rates of convergence to the entropy structure of the parameter space. Maximum likelihood estimation forms the special case of optimum score estimation based on the logarithmic score, and optimum score estimation forms a special case of M-estimation (Huber 1964), in that the function to be optimized derives from a strictly proper scoring rule. When estimating the location parameter in a Gaussian population with known variance, for example, the optimum score estimator based on the CRPS amounts to an M-estimator with a ψ-function of the form ψ(x) = 2Φ(x/c) − 1, where c is a positive constant and Φ denotes the standard Gaussian cumulative distribution function. This provides a smooth version of the ψ-function for Huber's (1964) robust minimax estimator (see Huber 1981, p. 208). Asymptotic results for M-estimators, such as the consistency theorems of Huber (1967) and Perlman (1972), then apply to optimum score estimators as well. Wald's (1949) classical proof of the consistency of MLEs relies heavily on the strict propriety of the logarithmic score, which is proved in his lemma 1.

The appeal of optimum score estimation lies in the potential adaptation of the scoring rule to the problem at hand. Gneiting et al. (2005) estimated a predictive regression model using the optimum score estimator based on the CRPS, a choice motivated by the meteorological problem. They showed empirically that such an approach can yield better predictive results than approaches using maximum likelihood plug-in estimates. This agrees with the findings of Copas (1983) and Friedman (1989), who showed that the use of maximum likelihood and least squares plug-in estimates can be suboptimal in prediction problems. Buja et al. (2005) argued that strictly proper scoring rules are the natural loss functions or fitting criteria in binary class probability estimation, and proposed tailoring scoring rules in situations in which false positives and false negatives have different cost implications.
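To make the optimum score approach concrete, the following minimal sketch (in Python; the simulated data, the known unit scale, and the function names are illustrative assumptions rather than part of the paper) estimates a Gaussian location parameter by minimum CRPS. It uses the closed-form expression for the CRPS of a normal predictive distribution given by Gneiting et al. (2005); minimizing the mean of the negatively oriented CRPS is equivalent to maximizing Sn.

```python
# A minimal sketch of optimum score (minimum CRPS) estimation for the
# location of a Gaussian predictive distribution with known unit scale.
# The closed-form CRPS of a normal forecast follows Gneiting et al. (2005);
# the data are simulated for illustration.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def crps_normal(mu, sigma, x):
    """Negatively oriented CRPS of the N(mu, sigma^2) forecast at observation x."""
    z = (x - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)   # hypothetical sample

# Optimum score estimator: minimize the mean CRPS (equivalently, maximize S_n).
res = minimize_scalar(lambda mu: crps_normal(mu, 1.0, x).mean())
print(res.x)  # close to the true location 2.0
```

The same pattern yields the other optimum score estimators: replacing crps_normal with the logarithmic, spherical, or quadratic score changes only the objective function.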

9.2 Quantile Estimation

Koenker and Bassett (1978) proposed quantile regression using an optimum score estimator based on the proper scoring rule (41).
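As a sketch of what this amounts to in practice (the simulated data, the quantile level α = .90, and the use of a generic optimizer are illustrative assumptions; production code would use a linear programming formulation), the Koenker–Bassett estimator minimizes the mean check loss ρα(u) = u(α − 1{u < 0}), which corresponds, up to orientation, to the piecewise-linear special case of a proper scoring rule for the α-quantile.

```python
# A minimal sketch of quantile regression via the check (pinball) loss,
# the loss-function counterpart of a proper scoring rule for quantiles;
# the data and the use of Nelder-Mead are illustrative.
import numpy as np
from scipy.optimize import minimize

def check_loss(u, alpha):
    """rho_alpha(u) = u * (alpha - 1{u < 0}), the Koenker-Bassett check function."""
    return u * (alpha - (u < 0))

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])

alpha = 0.9  # target quantile level
objective = lambda beta: check_loss(y - X @ beta, alpha).mean()
beta_hat = minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x
print(beta_hat)  # approximates the conditional 0.9-quantile line
```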


9.3 Interval Estimation

We now turn to interval estimation. Casella, Hwang, and Robert (1993, p. 141) pointed out that "the question of measuring optimality (either frequentist or Bayesian) of a set estimator against a loss criterion combining size and coverage does not yet have a satisfactory answer."

Their work was motivated by an apparent paradox due to J. O. Berger, which concerns interval estimators of the location parameter θ in a Gaussian population with unknown scale. Under the loss function

$$L(I, \theta) = c\,\lambda(I) - \mathbf{1}\{\theta \in I\}, \tag{60}$$

where c is a positive constant and λ(I) denotes the Lebesgue measure of the interval estimate I, the classical t-interval is dominated by a misguided interval estimate that shrinks to the sample mean in the cases of the highest uncertainty. Casella et al. (1993, p. 145) commented that "we have a case where a disconcerting rule dominates a time honored procedure. The only reasonable conclusion is that there is a problem with the loss function." We concur, and propose using proper scoring rules to assess interval estimators based on a loss criterion that combines width and coverage.

Specifically, we contend that a meaningful comparison of interval estimators requires either equal coverage or equal width. The loss function (60) applies to all set estimates, regardless of coverage and size, which seems unnecessarily ambitious. Instead, we focus attention on interval estimators with equal nominal coverage and use the negatively oriented interval score (43). This loss function can be written as

$$L_\alpha(I, \theta) = \lambda(I) + \frac{2}{\alpha}\, \inf_{\eta \in I} |\theta - \eta| \tag{61}$$

and applies to interval estimates with upper and lower exceedance probability (α/2) × 100%. This approach can again be traced back to Dunsmore (1968) and Winkler (1972), and avoids paradoxes as a consequence of the propriety of the interval score. Compared with (60), the loss function (61) provides a more flexible assessment of the coverage by taking the distance between the interval estimate and the estimand into account.
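For a closed interval I = [l, u], the infimum in (61) is the distance from θ to the nearer endpoint whenever θ falls outside I, so the loss reduces to the familiar form of the interval score: the width plus a penalty of 2/α per unit by which the interval misses the estimand. A minimal sketch (the numerical values are illustrative):

```python
# The negatively oriented interval score (61) for a closed interval [l, u]:
# width plus (2/alpha) times the distance by which the interval misses theta.
import numpy as np

def interval_score(l, u, theta, alpha):
    width = u - l
    miss = np.maximum(l - theta, 0) + np.maximum(theta - u, 0)
    return width + (2 / alpha) * miss

# Illustrative values for a central 90% interval (alpha = 0.10).
print(interval_score(l=1.2, u=3.8, theta=2.0, alpha=0.10))  # covered: score = width = 2.6
print(interval_score(l=1.2, u=3.8, theta=4.3, alpha=0.10))  # missed by 0.5: 2.6 + 10.0 = 12.6
```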

10. AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the attention of a broad statistical and general scientific audience. Proper scoring rules lie at the heart of much statistical theory and practice, and we have demonstrated ways in which they bear on prediction and estimation. We close with a succinct, necessarily incomplete, and subjective discussion of directions for future work.

Theoretically, the relationships between proper scoring rules and divergence functions are not fully understood. The Savage representation (10), Schervish's Choquet-type representation (14), and the underlying geometric arguments surely allow generalizations, and the characterization of proper scoring rules for quantiles remains open. Little is known about the propriety of skill scores, despite Murphy's (1973) pioneering work and their ubiquitous use by meteorologists. Briggs and Ruppert (2005) have argued that skill score departures from propriety do little harm. Although we tend to agree, there is a need for follow-up studies. Diebold and Mariano (1995), Hamill (1999), Briggs (2005), Briggs and Ruppert (2005), and Jolliffe (2006) have developed formal tests of forecast performance, skill, and value. This is a promising avenue for future work, particularly in concert with biomedical applications (Pepe 2003; Schumacher, Graf, and Gerds 2003). Proper scoring rules form key tools within the broader framework of diagnostic forecast evaluation (Murphy and Winkler 1992; Gneiting et al. 2006), and in addition to hydrometeorological and biomedical uses, we see a wealth of potential applications in computational finance.

Guidelines for the selection of scoring rules are in strong demand, both for the assessment of predictive performance and in optimum score approaches to estimation. The tailoring approach of Buja et al. (2005) applies to binary class probability estimation, and we wonder whether it can be generalized. Last but not least, we anticipate novel applications of proper scoring rules in model selection and model diagnosis problems, particularly in prequential (Dawid 1984) and cross-validatory frameworks, and including Bayesian posterior predictive distributions and Markov chain Monte Carlo output (Gschlößl and Czado 2005). More traditional approaches to model selection, such as Bayes factors (Kass and Raftery 1995), the Akaike information criterion, the BIC, and the deviance information criterion (Spiegelhalter, Best, Carlin, and van der Linde 2002), are likelihood-based and relate to the logarithmic scoring rule, as discussed in Section 7. We would like to know more about their relationships to cross-validatory approaches based directly on proper scoring rules, including but not limited to the logarithmic rule.

APPENDIX: STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide useful tools in nonparametric inference for multivariate data. In Section 1 we hinted at a superficial analogy to scoring rules. Specifically, if P is a Borel probability measure on R^m, then a depth function D(P, x) gives a P-based center-outward ordering of points x ∈ R^m. Formally, this resembles a scoring rule S(P, x) that assigns a P-based numerical value to an event x ∈ R^m. Liu (1990) and Zuo and Serfling (2000) have listed desirable properties of depth functions, including maximality at the center, monotonicity relative to the deepest point, affine invariance, and vanishing at infinity. The latter two properties are not necessarily defendable requirements for scoring rules; conversely, propriety is irrelevant for depth functions.
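As a small illustration of these properties (not taken from the paper), Mahalanobis depth is a standard example from the depth literature; the sketch below summarizes P by its mean vector and covariance matrix, an assumption made purely for simplicity.

```python
# An illustrative sketch: Mahalanobis depth is a standard depth function
# that is maximal at the center, affine invariant, and vanishes at infinity.
import numpy as np

def mahalanobis_depth(x, mean, cov):
    """D(P, x) = 1 / (1 + (x - mean)' cov^{-1} (x - mean))."""
    d = x - mean
    return 1.0 / (1.0 + d @ np.linalg.solve(cov, d))

mean = np.zeros(2)
cov = np.eye(2)
print(mahalanobis_depth(np.array([0.0, 0.0]), mean, cov))  # 1.0, maximal at the center
print(mahalanobis_depth(np.array([3.0, 4.0]), mean, cov))  # about 0.038, decays toward 0
```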

[Received December 2005. Revised September 2006.]

REFERENCES

Baringhaus, L., and Franz, C. (2004), "On a New Multivariate Two-Sample Test," Journal of Multivariate Analysis, 88, 190–206.
Bauer, H. (2001), Measure and Integration Theory, Berlin: Walter de Gruyter.
Berg, C., Christensen, J. P. R., and Ressel, P. (1984), Harmonic Analysis on Semigroups, New York: Springer-Verlag.
Bernardo, J. M. (1979), "Expected Information as Expected Utility," The Annals of Statistics, 7, 686–690.
Bernardo, J. M., and Smith, A. F. M. (1994), Bayesian Theory, New York: Wiley.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computing and Stochastic Systems," Statistical Science, 10, 3–66.
Birgé, L., and Massart, P. (1993), "Rates of Convergence for Minimum Contrast Estimators," Probability Theory and Related Fields, 97, 113–150.
Bregman, L. M. (1967), "The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming," USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Bremnes, J. B. (2004), "Probabilistic Forecasts of Precipitation in Terms of Quantiles Using NWP Model Output," Monthly Weather Review, 132, 338–347.
Brier, G. W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.
Briggs, W. (2005), "A General Method of Incorporating Forecast Cost and Loss in Value Scores," Monthly Weather Review, 133, 3393–3397.
Briggs, W., and Ruppert, D. (2005), "Assessing the Skill of Yes/No Predictions," Biometrics, 61, 799–807.
Buja, A., Logan, B. F., Reeds, J. A., and Shepp, L. A. (1994), "Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling," The Annals of Statistics, 22, 406–438.
Buja, A., Stuetzle, W., and Shen, Y. (2005), "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," manuscript, available at www-stat.wharton.upenn.edu/~buja.
Campbell, S. D., and Diebold, F. X. (2005), "Weather Forecasting for Weather Derivatives," Journal of the American Statistical Association, 100, 6–16.
Candille, G., and Talagrand, O. (2005), "Evaluation of Probabilistic Prediction Systems for a Scalar Variable," Quarterly Journal of the Royal Meteorological Society, 131, 2131–2150.
Casella, G., Hwang, J. T. G., and Robert, C. (1993), "A Paradox in Decision-Theoretic Interval Estimation," Statistica Sinica, 3, 141–155.
Cervera, J. L., and Muñoz, J. (1996), "Proper Scoring Rules for Fractiles," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 513–519.
Christoffersen, P. F. (1998), "Evaluating Interval Forecasts," International Economic Review, 39, 841–862.
Collins, M., Schapire, R. E., and Singer, J. (2002), "Logistic Regression, AdaBoost and Bregman Distances," Machine Learning, 48, 253–285.
Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.
Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.
Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.
Dawid, A. P. (1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.
Dawid, A. P. (1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.
Dawid, A. P. (2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.
Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.
Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.
Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.
Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.
Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.
Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.
Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.
Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.
Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics–Theory and Methods, 21, 1615–1632.
Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.
Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.
Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.
Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.
Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.
Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.
Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.
Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.
Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.
Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.
Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.
Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.
Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.
Good, I. J. (1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart and Winston, pp. 337–339.
Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.
Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.
Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.
Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.
Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 368.
Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.
Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.
Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.
Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.
Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.
Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.
Huber, P. J. (1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.
Huber, P. J. (1981), Robust Statistics, New York: Wiley.
Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.
Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.
Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.
Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.
Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.
Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.
Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.
Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.
Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.
Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.
Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 90, 1363–1384.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.
Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.
Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.
Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.
Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.
Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.
Mattner, L. (1997), "Strict Definiteness via Complete Monotonicity of Integrals," Transactions of the American Mathematical Society, 349, 3321–3342.
McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.
Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.
Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.
Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?," Management Science, 31, 527–535.
Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.
Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.
Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.
Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.
Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.
Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.
Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.
Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.
Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.
Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.
Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.
Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.
Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.
Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.
Spiegelhalter, D. J., Best, N. G., Carlin, B. R., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.
Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.
Staël von Holstein, C.-A. S. (1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.
Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.
Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.
Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.
Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.
Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.
Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.
Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.
Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.
Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P500 Returns," Journal of Forecasting, 19, 375–392.
Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.
Wilks, D. S. (2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.
Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.
Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.
Winkler, R. L. (1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.
Winkler, R. L. (1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.
Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.
Winkler, R. L., and Murphy, A. H. (1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.
Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.
Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.

Page 18: Strictly Proper Scoring Rules, Prediction, and Estimation...a predictive distribution (Bernardo and Smith 1994). We take scoring rules to be positively oriented rewards that a forecaster

376 Journal of the American Statistical Association March 2007

93 Interval Estimation

We now turn to interval estimation Casella Hwang andRobert (1993 p 141) pointed out that ldquothe question of measur-ing optimality (either frequentist or Bayesian) of a set estimatoragainst a loss criterion combining size and coverage does notyet have a satisfactory answerrdquo

Their work was motivated by an apparent paradox due toJ O Berger which concerns interval estimators of the loca-tion parameter θ in a Gaussian population with unknown scaleUnder the loss function

L(I θ) = cλ(I) minus 1θ isin I (60)

where c is a positive constant and λ(I) denotes the Lebesguemeasure of the interval estimate I the classical t-interval isdominated by a misguided interval estimate that shrinks to thesample mean in the cases of the highest uncertainty Casellaet al (1993 p 145) commented that ldquowe have a case wherea disconcerting rule dominates a time honored procedure Theonly reasonable conclusion is that there is a problem with theloss functionrdquo We concur and propose using proper scoringrules to assess interval estimators based on a loss criterion thatcombines width and coverage

Specifically we contend that a meaningful comparison of in-terval estimators requires either equal coverage or equal widthThe loss function (60) applies to all set estimates regardlessof coverage and size which seems unnecessarily ambitiousInstead we focus attention on interval estimators with equalnominal coverage and use the negatively oriented interval score(43) This loss function can be written as

Lα(I θ) = λ(I) + 2

αinfηisinI

|θ minus η| (61)

and applies to interval estimates with upper and lower ex-ceedance probability α

2 times 100 This approach can again betraced back to Dunsmore (1968) and Winkler (1972) and avoidsparadoxes as a consequence of the propriety of the intervalscore Compared with (60) the loss function (61) provides amore flexible assessment of the coverage by taking the distancebetween the interval estimate and the estimand into account

10 AVENUES FOR FUTURE WORK

Our paper aimed to bring proper scoring rules to the atten-tion of a broad statistical and general scientific audience Properscoring rules lie at the heart of much statistical theory and prac-tice and we have demonstrated ways in which they bear on pre-diction and estimation We close with a succinct necessarilyincomplete and subjective discussion of directions for futurework

Theoretically the relationships between proper scoring rulesand divergence functions are not fully understood The Sav-age representation (10) Schervishrsquos Choquet-type representa-tion (14) and the underlying geometric arguments surely allowgeneralizations and the characterization of proper scoring rulesfor quantiles remains open Little is known about the propri-ety of skill scores despite Murphyrsquos (1973) pioneering workand their ubiquitous use by meteorologists Briggs and Ruppert(2005) have argued that skill score departures from proprietydo little harm Although we tend to agree there is a need forfollow-up studies Diebold and Mariano (1995) Hamill (1999)

Briggs (2005) Briggs and Ruppert (2005) and Jolliffe (2006)have developed formal tests of forecast performance skill andvalue This is a promising avenue for future work particu-larly in concert with biomedical applications (Pepe 2003 Schu-macher Graf and Gerds 2003) Proper scoring rules form keytools within the broader framework of diagnostic forecast eval-uation (Murphy and Winkler 1992 Gneiting et al 2006) and inaddition to hydrometeorological and biomedical uses we see awealth of potential applications in computational finance

Guidelines for the selection of scoring rules are in strong de-mand both for the assessment of predictive performance andin optimum score approaches to estimation The tailoring ap-proach of Buja et al (2005) applies to binary class probabil-ity estimation and we wonder whether it can be generalizedLast but not least we anticipate novel applications of properscoring rules in model selection and model diagnosis problemsparticularly in prequential (Dawid 1984) and cross-validatoryframeworks and including Bayesian posterior predictive distri-butions and Markov chain Monte Carlo output (Gschloumlszligl andCzado 2005) More traditional approaches to model selectionsuch as Bayes factors (Kass and Raftery 1995) the Akaike in-formation criterion the BIC and the deviance information cri-terion (Spiegelhalter Best Carlin and van der Linde 2002) arelikelihood-based and relate to the logarithmic scoring rule asdiscussed in Section 7 We would like to know more about theirrelationships to cross-validatory approaches based directly onproper scoring rules including but not limited to the logarith-mic rule

APPENDIX STATISTICAL DEPTH FUNCTIONS

Statistical depth functions (Zuo and Serfling 2000) provide usefultools in nonparametric inference for multivariate data In Section 1we hinted at a superficial analogy to scoring rules Specifically if Pis a Borel probability measure on R

m then a depth function D(Px)

gives a P-based center-outward ordering of points x isin Rm Formally

this resembles a scoring rule S(Px) that assigns a P-based numericalvalue to an event x isin R

m Liu (1990) and Zuo and Serfling (2000) havelisted desirable properties of depth functions including maximality atthe center monotonicity relative to the deepest point affine invarianceand vanishing at infinity The latter two properties are not necessarilydefendable requirements for scoring rules conversely propriety is ir-relevant for depth functions

[Received December 2005 Revised September 2006]

REFERENCES

Baringhaus L and Franz C (2004) ldquoOn a New Multivariate Two-SampleTestrdquo Journal of Multivariate Analysis 88 190ndash206

Bauer H (2001) Measure and Integration Theory Berlin Walter de GruijterBerg C Christensen J P R and Ressel P (1984) Harmonic Analysis on

Semigroups New York Springer-VerlagBernardo J M (1979) ldquoExpected Information as Expected Utilityrdquo The An-

nals of Statistics 7 686ndash690Bernardo J M and Smith A F M (1994) Bayesian Theory New York Wi-

leyBesag J Green P Higdon D and Mengersen K (1995) ldquoBayesian Com-

puting and Stochastic Systemsrdquo Statistical Science 10 3ndash66Birgeacute L and Massart P (1993) ldquoRates of Convergence for Minimum Contrast

Estimatorsrdquo Probability Theory and Related Fields 97 113ndash150Bregman L M (1967) ldquoThe Relaxation Method of Finding the Common Point

of Convex Sets and Its Application to the Solution of Problems in Convex Pro-grammingrdquo USSR Computational Mathematics and Mathematical Physics 7200ndash217

Gneiting and Raftery Proper Scoring Rules 377

Bremnes J B (2004) ldquoProbabilistic Forecasts of Precipitation in Termsof Quantiles Using NWP Model Outputrdquo Monthly Weather Review 132338ndash347

Brier G W (1950) ldquoVerification of Forecasts Expressed in Terms of Probabil-ityrdquo Monthly Weather Review 78 1ndash3

Briggs W (2005) ldquoA General Method of Incorporating Forecast Cost and Lossin Value Scoresrdquo Monthly Weather Review 133 3393ndash3397

Briggs W and Ruppert D (2005) ldquoAssessing the Skill of YesNo Predic-tionsrdquo Biometrics 61 799ndash807

Buja A Logan B F Reeds J A and Shepp L A (1994) ldquoInequalitiesand Positive-Definite Functions Arising From a Problem in MultidimensionalScalingrdquo The Annals of Statistics 22 406ndash438

Buja A Stuetzle W and Shen Y (2005) ldquoLoss Functions for Binary ClassProbability Estimation and Classification Structure and Applicationsrdquo man-uscript available at www-statwhartonupennedu~buja

Campbell S D and Diebold F X (2005) ldquoWeather Forecasting for WeatherDerivativesrdquo Journal of the American Statistical Association 100 6ndash16

Candille G and Talagrand O (2005) ldquoEvaluation of Probabilistic PredictionSystems for a Scalar Variablerdquo Quarterly Journal of the Royal Meteorologi-cal Society 131 2131ndash2150

Casella G Hwang J T G and Robert C (1993) ldquoA Paradox in Decision-Theoretic Interval Estimationrdquo Statistica Sinica 3 141ndash155

Cervera J L and Muntildeoz J (1996) ldquoProper Scoring Rules for Fractilesrdquo inBayesian Statistics 5 eds J M Bernardo J O Berger A P Dawid andA F M Smith Oxford UK Oxford University Press pp 513ndash519

Christoffersen P F (1998) ldquoEvaluating Interval Forecastsrdquo International Eco-nomic Review 39 841ndash862

Collins M Schapire R E and Singer J (2002) ldquoLogistic RegressionAdaBoost and Bregman Distancesrdquo Machine Learning 48 253ndash285

Copas, J. B. (1983), "Regression, Prediction and Shrinkage," Journal of the Royal Statistical Society, Ser. B, 45, 311–354.

Daley, D. J., and Vere-Jones, D. (2004), "Scoring Probability Forecasts for Point Processes: The Entropy Score and Information Gain," Journal of Applied Probability, 41A, 297–312.

Dawid, A. P. (1984), "Statistical Theory: The Prequential Approach," Journal of the Royal Statistical Society, Ser. A, 147, 278–292.

(1986), "Probability Forecasting," in Encyclopedia of Statistical Sciences, Vol. 7, eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 210–218.

(1998), "Coherent Measures of Discrepancy, Uncertainty and Dependence, With Applications to Bayesian Predictive Experimental Design," Research Report 139, University College London, Dept. of Statistical Science.

(2006), "The Geometry of Proper Scoring Rules," Research Report 268, University College London, Dept. of Statistical Science.

Dawid, A. P., and Sebastiani, P. (1999), "Coherent Dispersion Criteria for Optimal Experimental Design," The Annals of Statistics, 27, 65–81.

Déqué, M., Royer, J. T., and Stroe, R. (1994), "Formulation of Gaussian Probability Forecasts Based on Model Extended-Range Integrations," Tellus, Ser. A, 46, 52–65.

Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, 13, 253–263.

Duffie, D., and Pan, J. (1997), "An Overview of Value at Risk," Journal of Derivatives, 4, 7–49.

Dunsmore, I. R. (1968), "A Bayesian Approach to Calibration," Journal of the Royal Statistical Society, Ser. B, 30, 396–405.

Eaton, M. L. (1982), "A Method for Evaluating Improper Prior Distributions," in Statistical Decision Theory and Related Topics III, eds. S. S. Gupta and J. O. Berger, New York: Academic Press, pp. 329–352.

Eaton, M. L., Giovagnoli, A., and Sebastiani, P. (1996), "A Predictive Approach to the Bayesian Design Problem With Application to Normal Regression Models," Biometrika, 83, 111–125.

Epstein, E. S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.

Feuerverger, A., and Rahman, S. (1992), "Some Aspects of Probability Forecasting," Communications in Statistics—Theory and Methods, 21, 1615–1632.

Friederichs, P., and Hense, A. (2006), "Statistical Down-Scaling of Extreme Precipitation Events Using Censored Quantile Regression," Monthly Weather Review, in press.

Friedman, D. (1983), "Effective Scoring Rules for Probabilistic Forecasts," Management Science, 29, 447–454.

Friedman, J. H. (1989), "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165–175.

Garratt, A., Lee, K., Pesaran, M. H., and Shin, Y. (2003), "Forecast Uncertainties in Macroeconomic Modelling: An Application to the U.K. Economy," Journal of the American Statistical Association, 98, 829–838.

Garthwaite, P. H., Kadane, J. B., and O'Hagan, A. (2005), "Statistical Methods for Eliciting Probability Distributions," Journal of the American Statistical Association, 100, 680–700.

Geisser, S., and Eddy, W. F. (1979), "A Predictive Approach to Model Selection," Journal of the American Statistical Association, 74, 153–160.

Gelfand, A. E., and Ghosh, S. K. (1998), "Model Choice: A Minimum Posterior Predictive Loss Approach," Biometrika, 85, 1–11.

Gerds, T. (2002), "Nonparametric Efficient Estimation of Prediction Error for Incomplete Data Models," unpublished doctoral dissertation, Albert-Ludwigs-Universität Freiburg, Germany, Mathematische Fakultät.

Giacomini, R., and Komunjer, I. (2005), "Evaluation and Combination of Conditional Quantile Forecasts," Journal of Business & Economic Statistics, 23, 416–431.

Gneiting, T. (1998), "Simple Tests for the Validity of Correlation Function Models on the Circle," Statistics & Probability Letters, 39, 119–122.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2006), "Probabilistic Forecasts, Calibration and Sharpness," Journal of the Royal Statistical Society, Ser. B, in press.

Gneiting, T., and Raftery, A. E. (2005), "Weather Forecasting With Ensemble Methods," Science, 310, 248–249.

Gneiting, T., Raftery, A. E., Balabdaoui, F., and Westveld, A. (2003), "Verifying Probabilistic Forecasts: Calibration and Sharpness," presented at the Workshop on Ensemble Forecasting, Val-Morin, Québec.

Gneiting, T., Raftery, A. E., Westveld, A., and Goldman, T. (2005), "Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation," Monthly Weather Review, 133, 1098–1118.

Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107–114.

(1971), Comment on "Measuring Information and Uncertainty," by R. J. Buehler, in Foundations of Statistical Inference, eds. V. P. Godambe and D. A. Sprott, Toronto: Holt, Rinehart, and Winston, pp. 337–339.

Granger, C. W. J. (2006), "Preface: Some Thoughts on the Future of Forecasting," Oxford Bulletin of Economics and Statistics, 67S, 707–711.

Grimit, E. P., Gneiting, T., Berrocal, V. J., and Johnson, N. A. (2006), "The Continuous Ranked Probability Score for Circular Variables and Its Application to Mesoscale Forecast Ensemble Verification," Quarterly Journal of the Royal Meteorological Society, in press.

Grimit, E. P., and Mass, C. F. (2002), "Initial Results of a Mesoscale Short-Range Ensemble System Over the Pacific Northwest," Weather and Forecasting, 17, 192–205.

Grünwald, P. D., and Dawid, A. P. (2004), "Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory," The Annals of Statistics, 32, 1367–1433.

Gschlößl, S., and Czado, C. (2005), "Spatial Modelling of Claim Frequency and Claim Size in Insurance," Discussion Paper 461, Ludwig-Maximilians-Universität Munich, Germany, Sonderforschungsbereich 386.

Hamill, T. M. (1999), "Hypothesis Tests for Evaluating Numerical Precipitation Forecasts," Weather and Forecasting, 14, 155–167.

Hamill, T. M., and Wilks, D. S. (1995), "A Probabilistic Forecast Contest and the Difficulty in Assessing Short-Range Forecast Uncertainty," Weather and Forecasting, 10, 620–631.

Hendrickson, A. D., and Buehler, R. J. (1971), "Proper Scores for Probability Forecasters," The Annals of Mathematical Statistics, 42, 1916–1921.

Hersbach, H. (2000), "Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems," Weather and Forecasting, 15, 559–570.

Hofmann, T., Schölkopf, B., and Smola, A. (2005), "A Review of RKHS Methods in Machine Learning," preprint.

Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73–101.

(1967), "The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 221–233.

(1981), Robust Statistics, New York: Wiley.

Jeffreys, H. (1939), Theory of Probability, Oxford, U.K.: Oxford University Press.

Jolliffe, I. T. (2006), "Uncertainty and Inference for Verification Measures," Weather and Forecasting, in press.

Jolliffe, I. T., and Stephenson, D. B. (eds.) (2003), Forecast Verification: A Practitioner's Guide in Atmospheric Science, Chichester, U.K.: Wiley.

Kabaila, P. (1999), "The Relevance Property for Prediction Intervals," Journal of Time Series Analysis, 20, 655–662.

Kabaila, P., and He, Z. (2001), "On Prediction Intervals for Conditionally Heteroscedastic Processes," Journal of Time Series Analysis, 22, 725–731.

Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.

Knorr-Held, L., and Rainer, E. (2001), "Projections of Lung Cancer in West Germany: A Case Study in Bayesian Prediction," Biostatistics, 2, 109–129.

Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33–50.

Koenker, R., and Machado, J. A. F. (1999), "Goodness-of-Fit and Related Inference Processes for Quantile Regression," Journal of the American Statistical Association, 94, 1296–1310.

Kohonen, J., and Suomela, J. (2006), "Lessons Learned in the Challenge: Making Predictions and Scoring Them," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer-Verlag, pp. 95–116.

Koldobskiĭ, A. L. (1992), "Schoenberg's Problem on Positive Definite Functions," St. Petersburg Mathematical Journal, 3, 563–570.

Krzysztofowicz, R., and Sigrest, A. A. (1999), "Comparative Verification of Guidance and Local Quantitative Precipitation Forecasts: Calibration Analyses," Weather and Forecasting, 14, 443–454.

Langland, R. H., Toth, Z., Gelaro, R., Szunyogh, I., Shapiro, M. A., Majumdar, S. J., Morss, R. E., Rohaly, G. D., Velden, C., Bond, N., and Bishop, C. H. (1999), "The North Pacific Experiment (NORPEX-98): Targeted Observations for Improved North American Weather Forecasts," Bulletin of the American Meteorological Society, 80, 1363–1384.

Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247–262.

Lehmann, E., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.

Liu, R. Y. (1990), "On a Notion of Data Depth Based on Random Simplices," The Annals of Statistics, 18, 405–414.

Ma, C. (2003), "Nonstationary Covariance Functions That Model Space–Time Interactions," Statistics & Probability Letters, 61, 411–419.

Mason, S. J. (2004), "On Using Climatology as a Reference Strategy in the Brier and Ranked Probability Skill Scores," Monthly Weather Review, 132, 1891–1895.

Matheron, G. (1984), "The Selectivity of the Distributions and the 'Second Principle of Geostatistics'," in Geostatistics for Natural Resources Characterization, eds. G. Verly, M. David, and A. G. Journel, Dordrecht: Reidel, pp. 421–434.

Matheson, J. E., and Winkler, R. L. (1976), "Scoring Rules for Continuous Probability Distributions," Management Science, 22, 1087–1096.

Mattner, L. (1997), "Strict Definiteness of Integrals via Complete Monotonicity of Derivatives," Transactions of the American Mathematical Society, 349, 3321–3342.

McCarthy, J. (1956), "Measures of the Value of Information," Proceedings of the National Academy of Sciences, 42, 654–655.

Murphy, A. H. (1973), "Hedging and Skill Scores for Probability Forecasts," Journal of Applied Meteorology, 12, 215–223.

Murphy, A. H., and Winkler, R. L. (1992), "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435–455.

Nau, R. F. (1985), "Should Scoring Rules Be 'Effective'?" Management Science, 31, 527–535.

Palmer, T. N. (2002), "The Economic Value of Ensemble Forecasts as a Tool for Risk Assessment: From Days to Decades," Quarterly Journal of the Royal Meteorological Society, 128, 747–774.

Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, U.K.: Oxford University Press.

Perlman, M. D. (1972), "On the Strong Consistency of Approximate Maximum Likelihood Estimators," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I, eds. L. M. Le Cam, J. Neyman, and E. L. Scott, Berkeley, CA: University of California Press, pp. 263–281.

Pfanzagl, J. (1969), "On the Measurability and Consistency of Minimum Contrast Estimates," Metrika, 14, 249–272.

Potts, J. (2003), "Basic Concepts," in Forecast Verification: A Practitioner's Guide in Atmospheric Science, eds. I. T. Jolliffe and D. B. Stephenson, Chichester, U.K.: Wiley, pp. 13–36.

Quiñonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Schölkopf, B. (2006), "Evaluating Predictive Uncertainty Challenge," in Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment, eds. J. Quiñonero-Candela, I. Dagan, B. Magnini, and F. d'Alché-Buc, Berlin: Springer, pp. 1–27.

Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005), "Using Bayesian Model Averaging to Calibrate Forecast Ensembles," Monthly Weather Review, 133, 1155–1174.

Rockafellar, R. T. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Roulston, M. S., and Smith, L. A. (2002), "Evaluating Probabilistic Forecasts Using Information Theory," Monthly Weather Review, 130, 1653–1660.

Savage, L. J. (1971), "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Association, 66, 783–801.

Schervish, M. J. (1989), "A General Method for Comparing Probability Assessors," The Annals of Statistics, 17, 1856–1879.

Schumacher, M., Graf, E., and Gerds, T. (2003), "How to Assess Prognostic Models for Survival Data: A Case Study in Oncology," Methods of Information in Medicine, 42, 564–571.

Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.

Selten, R. (1998), "Axiomatic Characterization of the Quadratic Scoring Rule," Experimental Economics, 1, 43–62.

Shuford, E. H., Albert, A., and Massengil, H. E. (1966), "Admissible Probability Measurement Procedures," Psychometrika, 31, 125–145.

Smyth, P. (2000), "Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood," Statistics and Computing, 10, 63–72.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian Measures of Model Complexity and Fit" (with discussion and rejoinder), Journal of the Royal Statistical Society, Ser. B, 64, 583–616.

Staël von Holstein, C.-A. S. (1970), "A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance," Journal of Applied Meteorology, 9, 360–364.

(1977), "The Continuous Ranked Probability Score in Practice," in Decision Making and Change in Human Affairs, eds. H. Jungermann and G. de Zeeuw, Dordrecht: Reidel, pp. 263–273.

Székely, G. J. (2003), "E-Statistics: The Energy of Statistical Samples," Technical Report 2003-16, Bowling Green State University, Dept. of Mathematics and Statistics.

Székely, G. J., and Rizzo, M. L. (2005), "A New Test for Multivariate Normality," Journal of Multivariate Analysis, 93, 58–80.

Taylor, J. W. (1999), "Evaluating Volatility and Interval Forecasts," Journal of Forecasting, 18, 111–128.

Tetlock, P. E. (2005), Expert Political Judgment, Princeton, NJ: Princeton University Press.

Theis, S. (2005), "Deriving Probabilistic Short-Range Forecasts From a Deterministic High-Resolution Model," unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, Mathematisch-Naturwissenschaftliche Fakultät.

Toth, Z., Zhu, Y., and Marchok, T. (2001), "The Use of Ensembles to Identify Forecasts With Small and Large Uncertainty," Weather and Forecasting, 16, 463–477.

Unger, D. A. (1985), "A Method to Estimate the Continuous Ranked Probability Score," in Preprints of the Ninth Conference on Probability and Statistics in Atmospheric Sciences, Virginia Beach, Virginia, Boston: American Meteorological Society, pp. 206–213.

Wald, A. (1949), "Note on the Consistency of the Maximum Likelihood Estimate," The Annals of Mathematical Statistics, 20, 595–601.

Weigend, A. S., and Shi, S. (2000), "Predicting Daily Probability Distributions of S&P 500 Returns," Journal of Forecasting, 19, 375–392.

Wilks, D. S. (2002), "Smoothing Forecast Ensembles With Fitted Probability Distributions," Quarterly Journal of the Royal Meteorological Society, 128, 2821–2836.

(2006), Statistical Methods in the Atmospheric Sciences (2nd ed.), Amsterdam: Elsevier.

Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999), "A Strategy for Verification of Weather Element Forecasts From an Ensemble Prediction System," Monthly Weather Review, 127, 956–970.

Winkler, R. L. (1969), "Scoring Rules and the Evaluation of Probability Assessors," Journal of the American Statistical Association, 64, 1073–1078.

(1972), "A Decision-Theoretic Approach to Interval Estimation," Journal of the American Statistical Association, 67, 187–191.

(1994), "Evaluating Probabilities: Asymmetric Scoring Rules," Management Science, 40, 1395–1405.

(1996), "Scoring Rules and the Evaluation of Probabilities" (with discussion and reply), Test, 5, 1–60.

Winkler, R. L., and Murphy, A. H. (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.

(1979), "The Use of Probabilities in Forecasts of Maximum and Minimum Temperatures," Meteorological Magazine, 108, 317–329.

Zastavnyi, V. P. (1993), "Positive Definite Functions Depending on the Norm," Russian Journal of Mathematical Physics, 1, 511–522.

Zuo, Y., and Serfling, R. (2000), "General Notions of Statistical Depth Functions," The Annals of Statistics, 28, 461–482.
