

    Journal of Machine Learning Technologies

    ISSN: 2229-3981 & ISSN: 2229-399X, Volume 2, Issue 1, 2011, pp-37-63

    Available online at http://www.bioinfo.in/contents.php?id=51

Copyright 2011 Bioinfo Publications

EVALUATION: FROM PRECISION, RECALL AND F-MEASURE TO ROC, INFORMEDNESS, MARKEDNESS & CORRELATION

POWERS, David M. W.*
AI Lab, School of Computer Science, Engineering and Mathematics, Flinders University, South Australia, Australia
*Corresponding author. Email: [email protected]

    Received: February 18, 2011; Accepted: February 27, 2011

Abstract - Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance (Informedness), and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.

Keywords - Recall and Precision, F-Measure, Rand Accuracy, Kappa, Informedness and Markedness, DeltaP, Correlation, Significance.

    INTRODUCTION

A common but poorly motivated way of evaluating results of Machine Learning experiments is using Recall, Precision and F-measure. These measures are named for their origin in Information Retrieval and present specific biases, namely that they ignore performance in correctly handling negative examples, they propagate the underlying marginal prevalences and biases, and they fail to take account of chance level performance. In the Medical Sciences, Receiver Operating Characteristics (ROC) analysis has been borrowed from Signal Processing to become a standard for evaluation and standard setting, comparing True Positive Rate and False Positive Rate. In the Behavioural Sciences, Specificity and Sensitivity are commonly used. Alternate techniques, such as Rand Accuracy and Cohen Kappa, have some advantages but are nonetheless still biased measures. We will recapitulate some of the literature relating to the problems with these measures, as well as considering a number of other techniques that have been introduced and argued within each of these fields, aiming/claiming to address the problems with these simplistic measures.

This paper recapitulates and re-examines the relationships between these various measures, develops new insights into the problem of measuring the effectiveness of an empirical decision system or a scientific experiment, and analyzes and introduces new probabilistic and information theoretic measures that overcome the problems with Recall, Precision and their derivatives.

    THE BINARY CASE

It is common to introduce the various measures in the context of a dichotomous classification problem, where the labels are by convention + and - and the predictions of a classifier are summarized in a four-cell contingency table. This may be expressed using raw counts of the number of times each predicted label is associated with each real class, or may be expressed in relative terms. Cell and margin labels may be formal probability expressions, may derive cell expressions from margin labels or vice-versa, may use alphabetic constant labels a, b, c, d or A, B, C, D, or letter codes for the terms as True and False, Real and Predicted, Positives and Negatives.

Often UPPER CASE is used where the values are counts, and lower case letters where the values are probabilities or proportions relative to N or the marginal probabilities; we will adopt this convention throughout this paper (always written in typewriter font), and in addition will use Mixed Case (in the normal text font) for popular nomenclature that may or may not correspond directly to one of our formal systematic names. True and False Positives (TP/FP) refer to the number of Predicted Positives that were correct/incorrect, and similarly for True and False Negatives (TN/FN), and these four cells sum to N. On the other hand tp, fp, fn, tn and rp, rn and pp, pn refer to the joint and marginal probabilities, and the four contingency cells and the two pairs of marginal probabilities each sum to 1. We will attach other popular names to some of these probabilities in due course.


We thus make the specific assumptions that we are predicting and assessing a single condition that is either positive or negative (dichotomous), that we have one predicting model, and one gold standard labeling. Unless otherwise noted we will also for simplicity assume that the contingency is non-trivial in the sense that both positive and negative states of both predicted and real conditions occur, so that none of the marginal sums or probabilities is zero.

We illustrate in Table 1 the general form of a binary contingency table using both the traditional alphabetic notation and the directly interpretable systematic approach. Both definitions and derivations in this paper are made relative to these labellings, although English terms (e.g. from Information Retrieval) will also be introduced for various ratios and probabilities. The green positive diagonal represents correct predictions, and the pink negative diagonal incorrect predictions. The predictions of the contingency table may be the predictions of a theory, of some computational rule or system (e.g. an Expert System or a Neural Network), or may simply be a direct measurement, a calculated metric, or a latent condition, symptom or marker. We will refer generically to "the model" as the source of the predicted labels, and "the population" or "the world" as the source of the real conditions. We are interested in understanding to what extent the model "informs" predictions about the world/population, and the world/population "marks" conditions in the model.

Recall & Precision, Sensitivity & Specificity

Recall or Sensitivity (as it is called in Psychology) is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the Coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up. It tends not to be very highly valued in Information Retrieval (on the assumptions that there are many relevant documents, that it doesn't really matter which subset we find, and that we can't know anything about the relevance of documents that aren't returned). Recall tends to be neglected or averaged away in Machine Learning and Computational Linguistics (where the focus is on how confident we can be in the rule or classifier). However, in a Computational Linguistics/Machine Translation context Recall has been shown to have a major weight in predicting the success of Word Alignment [1]. In a Medical context Recall is moreover regarded as primary, as the aim is to identify all Real Positive cases, and it is also one of the legs on which ROC analysis stands. In this context it is referred to as True Positive Rate (tpr). Recall is defined, with its various common appellations, by equation (1):

Recall = Sensitivity = tpr = tp/rp = TP/RP = A/(A+C)    (1)

Conversely, Precision or Confidence (as it is called in Data Mining) denotes the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of accuracy of Predicted Positives in contrast with the rate of discovery of Real Positives (tpr). Precision is defined in (2):

Precision = Confidence = tpa = tp/pp = TP/PP = A/(A+B)    (2)

These two measures and their combinations focus only on the positive examples and predictions, although between them they capture some information about the rates and kinds of errors made. However, neither of them captures any information about how well the model handles negative cases. Recall relates only to the +R column and Precision only to the +P row. Neither of these takes into account the number of True Negatives. This also applies to their Arithmetic, Geometric and Harmonic Means: A, G and F = G²/A (the F-factor or F-measure). Note that the F1-measure effectively references the True Positives to the Arithmetic Mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value; expressed in this form it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class, so applied to the Positive Class it is PS+. It also corresponds to the set-theoretic Dice Coefficient. The Geometric Mean of Recall and Precision (G-measure) normalizes TP to the Geometric Mean of Predicted Positives and Real Positives, and its Information content corresponds to the Arithmetic Mean Information represented by Recall and Precision.

Table 1. Systematic and traditional notations in a binary contingency table. Shading indicates correct (light=green) and incorrect (dark=red) rates or counts in the contingency table.

        +R    -R                    +R     -R
  +P    tp    fp   | pp       +P    A      B     | A+B
  -P    fn    tn   | pn       -P    C      D     | C+D
        rp    rn   | 1              A+C    B+D   | N
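To make the notation concrete, here is a minimal Python sketch (the counts are hypothetical and not from the paper) that builds the cells, margins and the two measures just defined from a contingency table laid out as in Table 1.

```python
# Hypothetical counts for a binary contingency table (not from the paper).
TP, FP, FN, TN = 70, 10, 20, 100   # A, B, C, D in the traditional notation
N = TP + FP + FN + TN

# Joint probabilities (lower case), which sum to 1.
tp, fp, fn, tn = TP / N, FP / N, FN / N, TN / N

# Marginal probabilities: Real Positives/Negatives and Predicted Positives/Negatives.
rp, rn = tp + fn, fp + tn
pp, pn = tp + fp, fn + tn

recall = tp / rp       # eq. (1): TP / (TP + FN)
precision = tp / pp    # eq. (2): TP / (TP + FP)
print(recall, precision)
```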


In fact, there is in principle nothing special about the Positive case, and we can define Inverse statistics in terms of the Inverse problem in which we interchange positive and negative and are predicting the opposite case. Inverse Recall or Specificity is thus the proportion of Real Negative cases that are correctly Predicted Negative (3), and is also known as the True Negative Rate (tnr). Conversely, Inverse Precision is the proportion of Predicted Negative cases that are indeed Real Negatives (4), and can also be called True Negative Accuracy (tna):

Inverse Recall = tnr = tn/rn = TN/RN = D/(B+D)    (3)

Inverse Precision = tna = tn/pn = TN/PN = D/(C+D)    (4)

The inverse of F1 is not known in AI/ML/CL/IR but is just as well known as PS+ in statistics, being the Proportion of Specific Agreement for the class of negatives, PS-. Note that whereas F1 is advocated in AI/ML/CL/IR as a single measure to capture the effectiveness of a system, it still completely ignores TN, which can vary freely without affecting the statistic. In statistics, PS+ is used in conjunction with PS- to ensure the contingencies are completely captured, and similarly Specificity (Inverse Recall) is always recorded along with Sensitivity (Recall).

Rand Accuracy explicitly takes into account the classification of negatives, and is expressible (5) both as a weighted average of Precision and Inverse Precision and as a weighted average of Recall and Inverse Recall:

Accuracy = tca = tcr = tp + tn = rp·tpr + rn·tnr = (TP+TN)/N = pp·tpa + pn·tna = (A+D)/N    (5)

Dice = F1 = tp/(tp + (fn+fp)/2) = A/(A + (B+C)/2) = 1/(1 + mean(FN,FP)/TP)    (6)

Jaccard = tp/(tp+fn+fp) = TP/(N−TN) = A/(A+B+C) = A/(N−D) = 1/(1 + 2·mean(FN,FP)/TP) = F1/(2−F1)    (7)
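A short sketch (again with hypothetical counts, not taken from the paper) computes equations (3)-(7) and checks the Jaccard/F1 relationship stated in (7).

```python
# Hypothetical counts (not from the paper).
TP, FP, FN, TN = 70, 10, 20, 100
N = TP + FP + FN + TN

inverse_recall = TN / (TN + FP)        # Specificity, eq. (3)
inverse_precision = TN / (TN + FN)     # True Negative Accuracy, eq. (4)
accuracy = (TP + TN) / N               # Rand Accuracy, eq. (5)
f1 = TP / (TP + (FN + FP) / 2)         # Dice = F1, eq. (6)
jaccard = TP / (TP + FN + FP)          # Jaccard/Tanimoto, eq. (7)

# Consistency check: Jaccard = F1 / (2 - F1), as noted in eq. (7).
assert abs(jaccard - f1 / (2 - f1)) < 1e-6
print(inverse_recall, inverse_precision, accuracy, f1, jaccard)
```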

As shown in (5), Rand Accuracy is effectively a prevalence-weighted average of Recall and Inverse Recall, as well as a bias-weighted average of Precision and Inverse Precision. Whilst it does take into account TN in the numerator, the sensitivity to bias and prevalence is an issue, since these are independent variables, with prevalence varying as we apply to data sampled under different conditions, and bias being directly under the control of the system designer (e.g. as a threshold). Similarly, we can note that one of N, FP or FN is free to vary. Whilst it apparently takes TN into account (in the denominator), the Jaccard (or Tanimoto) similarity coefficient uses it to heuristically discount the correct classification of negatives, but it can be written (7) independently of TN and N in a way similar to the effectively equivalent Dice or PS+ or F1 (6), or in terms of them, and so is subject to bias as TN or N is free to vary, and they fail to capture contingencies fully without knowing inverse statistics too.

Each of the above also has a complementary form defining an error rate, of which some have specific names and importance: Fallout or False Positive Rate (fpr) is the proportion of Real Negatives that occur as Predicted Positive (ring-ins); Miss Rate or False Negative Rate (fnr) is the proportion of Real Positives that are Predicted Negatives (false-drops). False Positive Rate is the second of the legs on which ROC analysis is based.

Fallout = fpr = fp/rn = FP/RN = B/(B+D)    (8)

Miss Rate = fnr = fn/rp = FN/RP = C/(A+C)    (9)

Note that FP and FN are sometimes referred to as Type I and Type II Errors, and the rates fp and fn as alpha and beta, respectively referring to falsely rejecting or falsely accepting a hypothesis. More correctly, these terms apply specifically to the meta-level problem discussed later of whether the precise pattern of counts (not rates) in the contingency table fits the null hypothesis of random distribution rather than reflecting the effect of some alternative hypothesis (which is not in general the one represented by +P → +R or −P → −R or both). Note that all the measures discussed individually leave at least two degrees of freedom (plus N) unspecified and free to control, and this leaves the door open for bias, whilst N is needed too for estimating significance and power.

    Prevalence, Bias, Cost & Skew

We now turn our attention to the various forms of bias that detract from the utility of all of the above surface measures [2]. We will first note that rp represents the Prevalence of positive cases, RP/N, and is assumed to be a property of the population of interest; it may be constant, or it may vary across subpopulations, but is regarded here as not being under the control of the experimenter, and so we want a prevalence-independent measure. By contrast, pp represents the (label) Bias of the model [3], the tendency of the model to output positive labels, PP/N, and is directly under the control of the experimenter, who can change the model by changing the theory or algorithm, or some parameter or threshold, to better fit the world/population being modeled. As discussed earlier, F-factor (or Dice or Jaccard) effectively references tp (probability or proportion of True Positives) to the Arithmetic Mean of Bias and Prevalence (6-7). A common rule of thumb, or even a characteristic of some algorithms, is to parameterize a model so that Prevalence = Bias, viz. rp = pp. Corollaries of this setting are Recall = Precision (= Dice but not Jaccard), Inverse Recall = Inverse Precision and Fallout = Miss Rate.

Alternate characterizations of Prevalence are in terms of Odds [4] or Skew [5], being the Class Ratio cs = rn/rp, recalling that by definition rp + rn = 1 and RN + RP = N. If the distribution is highly skewed, typically there are many more negative cases than positive, and the number of errors due to poor Inverse Recall will be much greater than the number of errors due to poor Recall. Given that the cost of both False Positives and False Negatives is, individually, equal, the overall component of the total cost due to False Positives (Real Negatives predicted Positive) will be much greater at any significant level of chance performance, due to the higher Prevalence of Real Negatives.

Note that the normalized binary contingency table with unspecified margins has three degrees of freedom: setting any three non-redundant ratios determines the rest (setting any count supplies the remaining information to recover the original table of counts with its four degrees of freedom). In particular, Recall, Inverse Recall and Prevalence, or equivalently tpr, fpr and cs, suffice to determine all ratios and measures derivable from the normalized contingency table, but N is also required to determine significance. As another case of specific interest, Precision, Inverse Precision and Bias, in combination, suffice to determine all ratios or measures, although we will show later that an alternate characterization of Prevalence and Bias in terms of Evenness allows even simpler relationships to be exposed.

We can also take into account a differential value for positives (cp) and negatives (cn); this can be applied to errors as a cost (loss or debit) and/or to correct cases as a gain (profit or credit), and can be combined into a single Cost Ratio cv = cn/cp. Note that the value- and skew-determined costs have similar effects, and may be multiplied to produce a single skew-like cost factor c = cv·cs. Formulations of measures that are expressed using tpr, fpr and cs may be made cost-sensitive by using c = cv·cs in place of c = cs, or can be made skew/cost-insensitive by using c = 1 [5].

    ROC and PN Analyses

Flach [5] highlighted the utility of ROC analysis to the Machine Learning community, and characterized the skew sensitivity of many measures in that context, utilizing the ROC format to give geometric insights into the nature of the measures and their sensitivity to skew. [6] further elaborated this analysis, extending it to the unnormalized PN variant of ROC, and targeting their analysis specifically to rule learning. We will not examine the advantages of ROC analysis here, but will briefly explain the principles and recapitulate some of the results.

ROC analysis plots the rate tpr against the rate fpr, whilst PN plots the unnormalized TP against FP. This difference in normalization only changes the scales and gradients, and we will deal only with the normalized form of ROC analysis. A perfect classifier will score in the top left hand corner (fpr=0, tpr=100%). A worst case classifier will score in the bottom right hand corner (fpr=100%, tpr=0). A random classifier would be expected to score somewhere along the positive diagonal (tpr=fpr), since the model will throw up positive and negative examples at the same rate relative to their populations (these are Recall-like scales: tpr = Recall, 1-fpr = Inverse Recall). The negative diagonal (tpr + c·fpr = 1) corresponds to matching Bias to Prevalence for a skew of c.

The ROC plot allows us to compare classifiers (models and/or parameterizations) and choose the one that is closest to (0,1) and furthest from tpr=fpr in some sense. These conditions for choosing the optimal parameterization or model are not identical, and in fact the most common condition is to maximize the area under the curve (AUC), which for a single parameterization of a model is defined by a single point and the segments connecting it to (0,0) and (1,1).

Figure 1. Illustration of ROC Analysis. The main diagonal represents chance, with parallel isocost lines representing equal cost-performance. Points above the diagonal represent performance better than chance, those below worse than chance. For a single good (dotted=green) system, AUC is the area under the curve (the trapezoid between the green line and x=[0,1]). The perverse (dashed=red) system shown is the same (good) system with class labels reversed. [The plot shows tpr against fpr, with the best and worst corners, the good and perverse systems, and the error bars sse0 and sse1 marked.]


For a parameterized model the ROC curve will be a monotonic function consisting of a sequence of segments from (0,0) to (1,1). A particular cost model and/or accuracy measure defines an isocost gradient, which for a skew and cost insensitive model will be c=1, and hence another common approach is to choose a tangent point on the highest isocost line that touches the curve. The simple condition of choosing the point on the curve nearest the optimum point (0,1) is not commonly used, but this distance to (0,1) is given by √[fpr² + (1−tpr)²], and minimizing this amounts to minimizing the sum of squared normalized error, fpr² + fnr².

A ROC curve with concavities can also be locally interpolated to produce a smoothed model following the convex hull of the original ROC curve. It is even possible to locally invert across the convex hull to repair concavities, but this may overfit and thus not generalize to unseen data. Such repairs can lead to selecting an improved model, and the ROC curve can also be used to re-tune a model to changing Prevalence and costs. The area under such a multipoint curve is thus of some value, but the optimum in practice is the area under the simple trapezoid defined by the model:

AUC = (tpr − fpr + 1)/2 = (tpr + tnr)/2 = 1 − (fpr + fnr)/2    (10)

For the cost and skew insensitive case, with c=1, maximizing AUC is thus equivalent to maximizing tpr−fpr, or minimizing a sum of (absolute) normalized error, fpr+fnr. The chance line corresponds to tpr−fpr=0, and parallel isocost lines for c=1 have the form tpr−fpr=k. The highest isocost line also maximizes tpr−fpr and AUC, so that these two approaches are equivalent. Minimizing a sum of squared normalized error, fpr²+fnr², corresponds to a Euclidean distance minimization heuristic that is equivalent only under appropriate constraints, e.g. fpr=fnr, or equivalently, Bias=Prevalence, noting that all cells are non-negative by construction.
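The single-point relationships above are easy to check numerically; the following sketch uses a hypothetical operating point (not from the paper) to compare the trapezoidal AUC of equation (10) with its equivalent forms and with the distance-to-(0,1) heuristic.

```python
# Hypothetical ROC operating point (not from the paper).
tpr, fpr = 0.78, 0.15
tnr, fnr = 1 - fpr, 1 - tpr

auc = (tpr - fpr + 1) / 2                    # eq. (10): single-point trapezoid
assert abs(auc - (tpr + tnr) / 2) < 1e-6
assert abs(auc - (1 - (fpr + fnr) / 2)) < 1e-6

# Euclidean distance from the ideal corner (0, 1), the alternative heuristic.
distance_to_ideal = (fpr ** 2 + fnr ** 2) ** 0.5
print(auc, distance_to_ideal)
```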

We now summarize relationships between the various candidate accuracy measures as rewritten [5,6] in terms of tpr, fpr and the skew c, as well as in terms of Recall, Bias and Prevalence:

Accuracy = [tpr + c·(1−fpr)] / [1+c] = 2·Recall·Prev + 1 − Bias − Prev    (11)

Precision = tpr / [tpr + c·fpr] = Recall·Prev / Bias    (12)

F-Measure F1 = 2·tpr / [tpr + c·fpr + 1] = 2·Recall·Prev / [Bias + Prev]    (13)

WRAcc = 4c·[tpr − fpr] / [1+c]² = 4·[Recall − Bias]·Prev    (14)

The last measure, Weighted Relative Accuracy, was defined [7] to subtract off the component of the True Positive score that is attributable to chance and rescale to the range ±1. Note that maximizing WRAcc is equivalent to maximizing AUC or tpr−fpr = 2·AUC−1, as c is constant. Thus WRAcc is an unbiased accuracy measure, and the skew-insensitive form of WRAcc, with c=1, is precisely tpr−fpr. Each of the other measures (11-13) shows a bias in that it cannot be maximized independent of skew, although skew-insensitive versions can be defined by setting c=1. The recasting of Accuracy, Precision and F-Measure in terms of Recall makes clear how all of these vary only in terms of the way they are affected by Prevalence and Bias.

Prevalence is regarded as a constant of the target condition or data set (and c = [1−Prev]/Prev), whilst parameterizing or selecting a model can be viewed in terms of trading off tpr and fpr as in ROC analysis, or equivalently as controlling the relative number of positive and negative predictions, namely the Bias, in order to maximize a particular accuracy measure (Recall, Precision, F-Measure, Rand Accuracy and AUC). Note that for a given Recall level, the other measures (11-13) all decrease with increasing Bias towards positive predictions.
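The recast forms (11)-(14) can be checked directly against the count-based definitions; the sketch below does so for a hypothetical contingency table (the counts are illustrative, not from the paper).

```python
# Hypothetical counts (not from the paper); numerical check of eqs. (11)-(14).
TP, FP, FN, TN = 70, 10, 20, 100
N = TP + FP + FN + TN

rp, rn = (TP + FN) / N, (FP + TN) / N       # Prevalence, Inverse Prevalence
pp = (TP + FP) / N                          # Bias
tpr, fpr = TP / (TP + FN), FP / (FP + TN)
c = rn / rp                                 # skew (class ratio)
recall, prev, bias = tpr, rp, pp

accuracy = (tpr + c * (1 - fpr)) / (1 + c)              # eq. (11)
assert abs(accuracy - (TP + TN) / N) < 1e-6
assert abs(accuracy - (2 * recall * prev + 1 - bias - prev)) < 1e-6

precision = tpr / (tpr + c * fpr)                       # eq. (12)
assert abs(precision - recall * prev / bias) < 1e-6

f1 = 2 * tpr / (tpr + c * fpr + 1)                      # eq. (13)
assert abs(f1 - 2 * recall * prev / (bias + prev)) < 1e-6

wracc = 4 * c * (tpr - fpr) / (1 + c) ** 2              # eq. (14)
assert abs(wracc - 4 * (recall - bias) * prev) < 1e-6
```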

    DeltaP, Informedness and Markedness

Powers [4] also derived an unbiased accuracy measure to avoid the bias of Recall, Precision and Accuracy due to population Prevalence and label bias. The Bookmaker algorithm costs wins and losses in the same way a fair bookmaker would set prices based on the odds. Powers then defines the concept of Informedness, which represents the 'edge' a punter has in making his bet, as evidenced and quantified by his winnings. Fair pricing based on correct odds should be zero sum; that is, guessing will leave you with nothing in the long run, whilst a punter with certain knowledge will win every time. Informedness is the probability that a punter is making an informed bet, and is explained in terms of the proportion of the time the edge works out versus ends up being pure guesswork. Powers defined Bookmaker Informedness for the general, K-label, case, but we will defer discussion of the general case for now and present a simplified formulation of Informedness, as well as the complementary concept of Markedness.

Definition 1
Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition (versus chance).

Definition 2
Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor (versus chance).


These definitions are aligned with the psychological and linguistic uses of the terms condition and marker. The condition represents the experimental outcome we are trying to determine by indirect means. A marker or predictor (cf. biomarker or neuromarker) represents the indicator we are using to determine the outcome. There is no implication of causality; that is something we will address later. However, there are two possible directions of implication we will address now. Detection of the predictor may reliably predict the outcome, with or without the occurrence of a specific outcome condition reliably triggering the predictor.

For the binary case we have

Informedness = Recall + Inverse Recall − 1 = tpr − fpr = 1 − fnr − fpr    (15)
Markedness = Precision + Inverse Precision − 1 = tpa − fna = 1 − fpa − fna
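A minimal sketch of the dichotomous forms in (15), using hypothetical counts (not from the paper):

```python
# Hypothetical counts (not from the paper), illustrating eq. (15).
TP, FP, FN, TN = 70, 10, 20, 100

recall = TP / (TP + FN)
inverse_recall = TN / (TN + FP)
precision = TP / (TP + FP)
inverse_precision = TN / (TN + FN)

informedness = recall + inverse_recall - 1       # = tpr - fpr
markedness = precision + inverse_precision - 1   # = tpa - fna
print(informedness, markedness)
```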

We noted above that maximizing AUC or the unbiased WRAcc measure effectively maximized tpr−fpr, and indeed WRAcc reduced to this in the skew-independent case. This is not surprising given both Powers [4] and Flach [5-7] set out to produce an unbiased measure, and the linear definition of Informedness will define a unique linear form. Note that while Informedness is a deep measure of how consistently the Predictor predicts the Outcome, by combining surface measures about what proportion of Outcomes are correctly predicted, Markedness is a deep measure of how consistently the Outcome has the Predictor as a Marker, by combining surface measures about what proportion of Predictions are correct.

In the Psychology literature, Markedness is known as DeltaP and is empirically a good predictor of human associative judgements; that is, it seems we develop associative relationships between a predictor and an outcome when DeltaP is high, and this is true even when multiple predictors are in competition [8]. In the context of experiments on information use in syllable processing, [9] notes that Shanks [8] sees DeltaP as "the normative measure of contingency", but proposes a complementary, backward, additional measure of strength of association, DeltaP' aka dichotomous Informedness. Perruchet and Peereman [9] also note the analogy of DeltaP to the regression coefficient, and that the Geometric Mean of the two measures is a dichotomous form of the Pearson correlation coefficient, the Matthews Correlation Coefficient, which is appropriate unless a continuous scale is being measured dichotomously, in which case a Tetrachoric Correlation estimate would be appropriate [10,11].

    Causality, Correlation and Regression

In a linear regression of two variables, we seek to predict one variable, y, as a linear combination of the other, x, finding a line of best fit in the sense of minimizing the sum of squared error (in y). The equation of fit has the form

y = y0 + rx·x,  where  rx = [n·Σxy − Σx·Σy] / [n·Σx² − Σx·Σx]    (16)

Substituting in counts from the contingency table, for the regression of predicting +R (1) versus −R (0) given +P (1) versus −P (0), we obtain this gradient of best fit (minimizing the error in the real values R):

rP = [AD − BC] / [(A+B)(C+D)] = A/(A+B) − C/(C+D) = DeltaP = Markedness    (17)

Conversely, we can find the regression coefficient for predicting P from R (minimizing the error in the predictions P):

rR = [AD − BC] / [(A+C)(B+D)] = A/(A+C) − B/(B+D) = DeltaP' = Informedness    (18)

Finally we see that the Matthews correlation, a contingency matrix method of calculating the Pearson product-moment correlation coefficient, is defined by

rG = [AD − BC] / √[(A+C)(B+D)(A+B)(C+D)] = Correlation = √[Informedness·Markedness]    (19)
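These three coefficients are straightforward to compute from the traditional cell labels; the sketch below (hypothetical counts, not from the paper) also checks that the Matthews correlation of (19) is the geometric mean of the two regression coefficients.

```python
# Hypothetical counts A, B, C, D as in Table 1 (not from the paper); eqs. (17)-(19).
A, B, C, D = 70, 10, 20, 100

r_P = (A * D - B * C) / ((A + B) * (C + D))   # DeltaP  = Markedness,   eq. (17)
r_R = (A * D - B * C) / ((A + C) * (B + D))   # DeltaP' = Informedness, eq. (18)
r_G = (A * D - B * C) / ((A + C) * (B + D) * (A + B) * (C + D)) ** 0.5  # eq. (19)

# The correlation is the geometric mean of the two regression coefficients.
assert abs(r_G - (r_P * r_R) ** 0.5) < 1e-6
print(r_P, r_R, r_G)
```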

Given the regressions find the same line of best fit, these gradients should be reciprocal, defining a perfect Correlation of 1. However, both Informedness and Markedness are probabilities with an upper bound of 1, so perfect correlation requires perfect regression. The squared correlation is a coefficient of proportionality indicating the proportion of the variance in R that is explained by P, and is traditionally also interpreted as a probability. We can now interpret it either as the joint probability that P informs R and R marks P, given that the two directions of predictability are independent, or as the probability that the variance is (causally) explained reciprocally. The sign of the Correlation will be the same as the sign of Informedness and Markedness, and indicates whether a correct or perverse usage of the information has been made; take note of this in interpreting the final part of (19).

Psychologists traditionally explain DeltaP in terms of causal prediction, but it is important to note that the direction of stronger prediction is not necessarily the direction of causality, and the fallacy of abductive reasoning is that the truth of A → B does not in general have any bearing on the truth of B → A. If Pi is one of several independent possible causes of R, Pi → R is strong, but R → Pi is in general weak for any specific Pi. If Pi is one of several necessary contributing factors to R, Pi → R is weak for any single Pi, but R → Pi is strong.


The directions of the implication are thus not in general dependent.

In terms of the regression to fit R from P, since there are only two correct points and two error points, and errors are calculated in the vertical (R) direction only, all errors contribute equally to tilting the regression down from the ideal line of fit. This Markedness regression thus provides information about the consistency of the Outcome in terms of having the Predictor as a Marker; the errors measured from the Outcome R relate to the failure of the Marker P to be present.

We can gain further insight into the nature of these regression and correlation coefficients by reducing the top and bottom of each expression to probabilities (dividing by N², noting that the original contingency counts sum to N, and the joint probabilities after reduction sum to 1). The numerator is the determinant of the contingency matrix, and common across all three coefficients, reducing to dtp, whilst the reduced denominator of the regression coefficients depends only on the Prevalence or Bias of the base variates. The regression coefficients, Bookmaker Informedness (B) and Markedness (M), may thus be re-expressed in terms of Precision (Prec) or Recall, along with Bias and Prevalence (Prev) or their inverses (I-):

M = dtp / [Bias·(1−Bias)] = dtp / [pp·pn] = dtp / pg²
  = dtp / BiasG²
  = dtp / EvennessP
  = [Precision − Prevalence] / IBias    (20)

B = dtp / [Prevalence·(1−Prevalence)] = dtp / [rp·rn] = dtp / rg² = dtp / PrevG² = dtp / EvennessR
  = [Recall − Bias] / IPrev
  = Recall − Fallout
  = Recall + IRecall − 1
  = Sensitivity + Specificity − 1
  = (LR−1)·(1−Specificity) = (1−NLR)·Specificity
  = (LR−1)(1−NLR) / (LR−NLR)    (21)

In the medical and behavioural sciences, the Likelihood Ratio is LR = Sensitivity/[1−Specificity], and the Negative Likelihood Ratio is NLR = [1−Sensitivity]/Specificity. For non-negative B, LR > 1 > NLR, with 1 as the chance case. We also express Informedness in these terms in (21).

The Matthews/Pearson correlation is expressed in reduced form as the Geometric Mean of Bookmaker Informedness and Markedness, abbreviating their product as BookMark (BM) and recalling that it is BookMark that acts as a probability-like coefficient of determination, not its root, the Geometric Mean (BookMarkG or BMG):

BMG = dtp / √[Prev·(1−Prev)·Bias·(1−Bias)] = dtp / [PrevG·BiasG] = dtp / EvennessG
    = √{[(Recall−Bias)·(Prec−Prev)] / (IPrev·IBias)}    (22)

These equations clearly indicate how the Bookmaker coefficients of regression and correlation depend only on the proportion of True Positives and the Prevalence and Bias applicable to the respective variables. Furthermore, Prev·Bias represents the Expected proportion of True Positives (etp) relative to N, showing that the coefficients each represent the proportion of Delta True Positives (deviation from expectation, dtp = tp − etp) renormalized in different ways to give different probabilities. Equations (20-22) illustrate this, showing that these coefficients depend only on dtp and either Prevalence, Bias or their combination. Note that for a particular dtp these coefficients are minimized when the Prevalence and/or Bias are at the evenly biased 0.5 level; however, in a learning or parameterization context changing the Prevalence or Bias will in general change both tp and etp, and hence can change dtp.
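The Evenness-normalized forms (20)-(22) can be checked numerically; the sketch below uses hypothetical joint probabilities (not from the paper) and verifies the Recall−Fallout form of Informedness and the geometric-mean relation of the correlation.

```python
# Hypothetical joint probabilities (not from the paper); eqs. (20)-(22).
tp, fp, fn, tn = 0.35, 0.05, 0.10, 0.50

rp, rn = tp + fn, fp + tn        # Prevalence, Inverse Prevalence
pp, pn = tp + fp, fn + tn        # Bias, Inverse Bias
dtp = tp - rp * pp               # deviation of true positives from chance expectation

markedness = dtp / (pp * pn)                       # eq. (20): dtp / EvennessP
informedness = dtp / (rp * rn)                     # eq. (21): dtp / EvennessR
correlation = dtp / (rp * rn * pp * pn) ** 0.5     # eq. (22): dtp / EvennessG

assert abs(informedness - (tp / rp - fp / rn)) < 1e-6          # Recall - Fallout
assert abs(correlation - (informedness * markedness) ** 0.5) < 1e-6
print(informedness, markedness, correlation)
```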

It is also worth considering further the relationship of the denominators to the Geometric Means, PrevG of Prevalence and Inverse Prevalence (IPrev = 1−Prev is the Prevalence of Real Negatives) and BiasG of Bias and Inverse Bias (IBias = 1−Bias is the bias to Predicted Negatives). These Geometric Means represent the Evenness of Real classes (EvennessR = PrevG²) and Predicted labels (EvennessP = BiasG²). We also introduce the concept of Global Evenness as the Geometric Mean of these two natural kinds of Evenness, EvennessG. From this formulation we can see that for a given relative delta of true positive prediction above expectation (dtp), the correlation is at minimum when predictions and outcomes are both evenly distributed (Prev = Bias = 0.5, so that EvennessG = EvennessR = EvennessP = 0.25), and Markedness and Bookmaker are individually minimal when Bias resp. Prevalence are evenly distributed (viz. Bias resp. Prev = 0.5). This suggests that setting Learner Bias (and regularized, cost-weighted or subsampled Prevalence) to 0.5, as sometimes performed in Artificial Neural Network training, is in fact inappropriate on theoretical grounds, as has previously been shown both empirically and based on Bayesian principles; rather it is best to use Learner/Label Bias = Natural Prevalence, which is in general much less than 0.5 [12].

Note that in the above equations (20-22) the denominator is always strictly positive, since we have occurrences and predictions of both Positives and Negatives by earlier assumption, but we note that if in violation of this constraint we have a degenerate case in which there is nothing to predict or we make no effective prediction, then tp=etp and dtp=0, and all the above regression and correlation coefficients are defined in the limit approaching zero.


Thus the coefficients are zero if and only if dtp is zero, and they have the same sign as dtp otherwise. Assuming that we are using the model the right way round, then dtp, B and M are non-negative, and BMG is similarly non-negative as expected. If the model is the wrong way round, then dtp, B, M and BMG can indicate this by expressing below-chance performance, negative regressions and negative correlation, and we can reverse the sense of P to correct this.

The absolute value of the determinant of the contingency matrix, dp = |dtp|, in these probability formulae (20-22), also represents the sum of absolute deviations from the expectation represented by any individual cell, and hence 2dp = 2DP/N is the total absolute relative error versus the null hypothesis. Additionally it has a geometric interpretation as the area of a trapezoid in PN-space, the unnormalized variant of ROC [6].

We have already observed that in (normalized) ROC analysis, Informedness is twice the triangular area between a positively informed system and the chance line, and it thus corresponds to the area of the trapezoid defined by a system (assumed to perform no worse than chance), any of its perversions (interchanging prediction labels but not the real classes, or vice-versa, so as to derive a system that performs no better than chance), and the endpoints of the chance line (the trivial cases in which the system labels all cases true or conversely all are labelled false). Such a kite-shaped area is delimited by the dotted (system) and dashed (perversion) lines in Fig. 1 (interchanging class labels), but the alternate parallelogram (interchanging prediction labels) is not shown. The Informedness of a perverted system is the negation of the Informedness of the correctly polarized system.

We now also express the Informedness and Markedness forms of DeltaP in terms of deviations from expected values, along with the Harmonic mean of the marginal cardinalities of the Real classes or Predicted labels respectively, defining DP, DELTAP, RH, PH and related forms in terms of their N-relative probabilistic forms defined as follows:

etp = rp·pp;  etn = rn·pn    (23)
dp = tp − etp = dtp = −dfp = −(fp − efp)
deltap = dtp − dfp = 2dp    (24)
rh = 2·rp·rn / [rp+rn] = rg²/ra
ph = 2·pp·pn / [pp+pn] = pg²/pa    (25)

DeltaP' or Bookmaker Informedness may now be expressed in terms of deltap and rh, and DeltaP or Markedness similarly in terms of deltap and ph:

B = DeltaP' = [etp+dtp]/rp − [efp−dtp]/rn
  = etp/rp − efp/rn + 2·dtp/rh
  = 2dp/rh = deltap/rh    (26)

M = DeltaP = 2dp/ph = deltap/ph    (27)

These harmonic relationships connect directly with the previous geometric evenness terms by observing HarmonicMean = GeometricMean²/ArithmeticMean, as seen in (25) and used in the alternative expressions for normalization for Evenness in (26-27). The use of HarmonicMean makes the relationship with F-measure clearer, but use of GeometricMean is generally preferred as a consistent estimate of central tendency that more accurately estimates the mode for skewed (e.g. Poisson) data bounded below by 0 and unbounded above, and as the central limit of the family of Lp-based averages. Viz. the Geometric (L0) Mean is the Geometric Mean of the Harmonic (L−1) and Arithmetic (L+1) Means, with positive values of p being biased higher (toward L+∞ = Max) and negative values of p being biased lower (toward L−∞ = Min).
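As a numerical cross-check, the following sketch (hypothetical probabilities, not from the paper) confirms that the harmonic-mean forms (23)-(27) reproduce the earlier tpr−fpr and tpa−fna definitions.

```python
# Hypothetical joint probabilities (not from the paper); eqs. (23)-(27).
tp, fp, fn, tn = 0.35, 0.05, 0.10, 0.50
rp, rn = tp + fn, fp + tn
pp, pn = tp + fp, fn + tn

etp, efp = rp * pp, rn * pp            # chance-expected true/false positives, eq. (23)
dp = tp - etp                          # = dtp = -dfp
deltap = 2 * dp                        # eq. (24)
rh = 2 * rp * rn / (rp + rn)           # harmonic mean of real margins, eq. (25)
ph = 2 * pp * pn / (pp + pn)           # harmonic mean of predicted margins

informedness = deltap / rh             # eq. (26)
markedness = deltap / ph               # eq. (27)

assert abs(informedness - (tp / rp - fp / rn)) < 1e-6   # tpr - fpr
assert abs(markedness - (tp / pp - fn / pn)) < 1e-6     # tpa - fna
print(informedness, markedness)
```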

    Effect of Bias and Prev on Recall and Precision

The final form of the equations (26-27) cancels out the common Bias and Prevalence (Prev) terms that denormalize dtp to tpr (Recall) or tpa (Precision). We now recast the Bookmaker Informedness and Markedness equations to show Recall and Precision as subject (28-29), in order to explore the effect of Bias and Prevalence on Recall and Precision, as well as to clarify the relationship of Bookmaker and Markedness to these other ubiquitous but iniquitous measures.

Recall = Bookmaker·(1−Prevalence) + Bias
Bookmaker = (Recall − Bias) / (1−Prevalence)    (28)

Precision = Markedness·(1−Bias) + Prevalence
Markedness = (Precision − Prevalence) / (1−Bias)    (29)
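The rearrangements in (28)-(29) are purely algebraic, so they can be verified on any contingency table; the sketch below does this for hypothetical counts (not from the paper).

```python
# Hypothetical counts (not from the paper); checks the rearrangements in eqs. (28)-(29).
TP, FP, FN, TN = 70, 10, 20, 100
N = TP + FP + FN + TN

prev, bias = (TP + FN) / N, (TP + FP) / N
recall, precision = TP / (TP + FN), TP / (TP + FP)

bookmaker = (recall - bias) / (1 - prev)          # eq. (28) solved for Informedness
markedness = (precision - prev) / (1 - bias)      # eq. (29) solved for Markedness

assert abs(recall - (bookmaker * (1 - prev) + bias)) < 1e-6
assert abs(precision - (markedness * (1 - bias) + prev)) < 1e-6
# Cross-check against the tpr - fpr form of Informedness:
assert abs(bookmaker - (TP / (TP + FN) - FP / (FP + TN))) < 1e-6
```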

Bookmaker and Markedness are unbiased estimators of above-chance performance (relative to respectively the predicting conditions or the predicted markers). Equations (28-29) clearly show the nature of the bias introduced by both Label Bias and Class Prevalence. If operating at chance level, both Bookmaker and Markedness will be zero, and Recall, Precision, and derivatives such as the F-measure will be skewed by the biases. Note that increasing Bias or decreasing Prevalence increases Recall and decreases Precision, for a constant level of unbiased performance. We can more specifically see that the regression coefficient for the prediction of Recall from Prevalence is −Informedness, and from Bias is +1, and similarly the regression coefficient for the prediction of Precision from Bias is −Markedness, and from Prevalence is +1. Using the heuristic of setting Bias = Prevalence then sets Recall = Precision = F1 and Bookmaker Informedness = Markedness = Correlation.


Setting Bias close to 1 (or Prevalence close to 1) represents a substantial weighting up of the true unbiased performance in both these measures, and hence also in F1. High Bias drives Recall up strongly and Precision down according to the strength of Informedness; high Prevalence drives Precision up and Recall down according to the strength of Markedness.

Alternately, Informedness can be viewed (21) as a renormalization of Recall after subtracting off the chance level of Recall, Bias, and Markedness (20) can be seen as a renormalization of Precision after subtracting off the chance level of Precision, Prevalence (and Flach's WRAcc, the unbiased form being equivalent to Bookmaker Informedness, was defined in this way, as discussed under ROC and PN Analyses above). Informedness can also be seen (21) as a renormalization of LR or NLR after subtracting off their chance level performance. The Kappa measure [13-16] commonly used in assessor agreement evaluation was similarly defined as a renormalization of Accuracy after subtracting off an estimate of the expected Accuracy, for Cohen Kappa being the dot product of the Biases and Prevalences, and expressible as a normalization of the discriminant of contingency, dtp, by the mean error rate (cf. F1; viz. Kappa is dtp/[dtp + mean(fp,fn)]). All three measures are invariant in the sense that they are properties of the contingency tables that remain unchanged when we flip to the Inverse problem (interchange positive and negative for both conditions and predictions). That is, we observe:

    Inverse Informedness = Informedness,

    Inverse Markedness = Markedness,

    Inverse Kappa = Kappa.

The Dual problem (interchange antecedent and consequent) reverses which condition is the predictor and which the predicted condition, and hence interchanges Precision and Recall, Prevalence and Bias, as well as Markedness and Informedness. For cross-evaluator agreement, both Informedness and Markedness are meaningful although the polarity and orientation of the contingency is arbitrary. Similarly, when examining causal relationships (conventionally DeltaP vs DeltaP'), it is useful to evaluate both deductive and abductive directions in determining the strength of association. For example, the connection between cloud and rain involves cloud as one causal antecedent of rain (but sunshowers occur occasionally), and rain as one causal consequent of cloud (but cloudy days aren't always wet); only once we have identified the full causal chain can we reduce to equivalence, and lack of equivalence may be a result of unidentified causes, alternate outcomes or both.

The Perverse systems (interchanging the labels on either the predictions or the classes, but not both) have similar performance but occur below the chance line (since we have assumed strictly better than chance performance in assigning labels to the given contingency matrix).

Note that the effect of Prevalence on Accuracy, Recall and Precision has also been characterized above (ROC and PN Analyses) in terms of Flach's demonstration of how skew enters into their characterization in ROC analysis, and effectively assigns different costs to (False) Positives and (False) Negatives. This can be controlled for by setting the parameter c appropriately to reflect the desired skew and cost tradeoff, with c=1 defining skew- and cost-insensitive versions. However, only Informedness (or equivalents such as DeltaP' and skew-insensitive WRAcc) precisely characterizes the probability with which a model informs the condition, and conversely only Markedness (or DeltaP) precisely characterizes the probability that a condition marks (informs) the predictor. Similarly, only the Correlation (the Matthews Correlation Coefficient, whose square is also known as the Coefficient of Proportionality or Coefficient of Determination) precisely characterizes the probability that condition and predictor inform/mark each other, under our dichotomous assumptions. Note the Tetrachoric Correlation is another estimate of the Pearson Correlation made under the alternate assumption of an underlying continuous variable (assumed normally distributed), and is appropriate if we instead assume that we are dichotomizing a normal continuous variable [11]. But in this article we are making the explicit assumption that we are dealing with a right/wrong dichotomy that is intrinsically discontinuous.

Although Kappa does attempt to renormalize a debiased estimate of Accuracy, and is thus much more meaningful than Recall, Precision, Accuracy and their biased derivatives, it is intrinsically non-linear, doesn't account for error well, and retains an influence of bias, so there does not seem to be any situation where Kappa would be preferable to Correlation as a standard independent measure of agreement [16,13]. As we have seen, Bookmaker Informedness, Markedness and Correlation reflect the discriminant of relative contingency normalized according to different Evenness functions of the marginal Biases and Prevalences, and reflect probabilities relative to the corresponding marginal cases.


However, we have seen that Kappa scales the discriminant in a way that reflects the actual error without taking into account expected error due to chance, and in effect it is really just using the discriminant to scale the actual mean error: Kappa is dtp/[dtp + mean(fp,fn)], or equivalently it is 1/[1 + mean(fp,fn)/dtp], which approximates for small error to 1 − mean(fp,fn)/dtp.
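The equivalence between the conventional definition of Cohen Kappa and the dtp form quoted above is easy to confirm; the sketch below does so for hypothetical joint probabilities (not from the paper).

```python
# Hypothetical joint probabilities (not from the paper); Cohen Kappa vs the dtp form.
tp, fp, fn, tn = 0.35, 0.05, 0.10, 0.50
rp, rn = tp + fn, fp + tn
pp, pn = tp + fp, fn + tn

po = tp + tn                   # observed agreement (Rand Accuracy)
pe = rp * pp + rn * pn         # expected agreement (dot product of margins)
kappa = (po - pe) / (1 - pe)   # conventional Cohen Kappa

dtp = tp - rp * pp
assert abs(kappa - dtp / (dtp + (fp + fn) / 2)) < 1e-6
print(kappa)
```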

The relatively good fit of Kappa to Correlation and Informedness is illustrated in Fig. 2, along with the poor fit of the Rank Weighted Average and the Geometric and Harmonic (F-factor) means. The fit of the Evenness-weighted determinant is perfect and not easily distinguishable, but the separate components (Determinant and geometric means of Real Prevalences and Prediction Biases) are also shown (+1 for clarity).

    Significance and Information Gain

The ability to calculate various probabilities from a contingency table says nothing about the significance of those numbers: is the effect real, or is it within the expected range of variation around the values expected by chance? Usually this is explored by considering deviation from the expected values (ETP and its relatives) implied by the marginal counts (RP, PP and relatives), or from expected rates implied by the biases (Class Prevalence and Label Bias). In the case of Machine Learning, Data Mining, or other artificially derived models and rules, there is the further question of whether the training and parameterization of the model has set the 'correct' or 'best' Prevalence and Bias (or Cost) levels. Furthermore, should this determination be undertaken by reference to the model evaluation measures (Recall, Precision, Informedness, Markedness and their derivatives), or should the model be set to maximize the significance of the results?

This raises the question of how our measures of association and accuracy, Informedness, Markedness and Correlation, relate to standard measures of significance.

This article has been written in the context of a prevailing methodology in Computational Linguistics and Information Retrieval that concentrates on target positive cases and ignores the negative case for the purpose of both measures of association and significance. A classic example is saying 'water' can only be a noun because the system is inadequate to the task of Part of Speech identification and this boosts Recall and hence F-factor, or at least setting the Bias to nouns close to 1, and the Inverse Bias to verbs close to 0. Of course, Bookmaker will then be 0 and Markedness unstable (undefined, and very sensitive to any words that do actually get labelled verbs). We would hope that significance would also be 0 (or near zero given only a relatively small number of verb labels). We would also like to be able to calculate significance based on the positive case alone, as either the full negative information is unavailable, or it is not labelled.

Generally when dealing with contingency tables it is assumed that unused labels or unrepresented classes are dropped from the table, with corresponding reduction of degrees of freedom.

Figure 2. Accuracy of traditional measures. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red), with Bookmaker-estimated Informedness (red dot), Markedness (green dot) and Correlation (blue dot), and showing (dashed) Kappa versus the biased traditional measures Rank Weighted Average (Wav), Geometric Mean (Gav) and Harmonic Mean F1 (Fav). The Determinant (D) and Evenness k-th roots (gR=PrevG and gP=BiasG) are shown +1. K=4, N=128. (Online version has figures in colour.)


For simplicity we have assumed that the margins are all non-zero, but the freedoms are there whether they are used or not, so we will not reduce them or reduce the table.

There are several schools of thought about significance testing, but all agree on the utility of calculating a p-value [19], by specifying some statistic or exact test T(X) and setting p = Prob(T(X) ≥ T(Data)). In our case, the Observed Data is summarized in a contingency table, and there are a number of tests which can be used to evaluate the significance of the contingency table.

For example, Fisher's exact test calculates the proportion of contingency tables that are at least as favourable to the Prediction/Marking hypothesis, rather than the null hypothesis, and provides an accurate estimate of the significance of the entire contingency table without any constraints on the values or distribution. The log-likelihood-based G² test and Pearson's approximating χ² test are compared against a Chi-Squared Distribution of appropriate degree of freedom (r=1 for the binary contingency table given the marginal counts are known), depend on assumptions about the distribution, and may focus only on the Predicted Positives.
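For reference, all three tests are available in standard scientific Python; the sketch below (hypothetical counts, and assuming SciPy is installed) runs Fisher's exact test, Pearson's χ² and the log-likelihood G² on the same 2x2 table.

```python
# Hypothetical 2x2 counts (not from the paper); assumes SciPy is available.
from scipy.stats import chi2_contingency, fisher_exact

table = [[70, 10],    # [TP, FP]
         [20, 100]]   # [FN, TN]

odds_ratio, p_fisher = fisher_exact(table, alternative="greater")   # exact test
chi2_stat, p_chi2, dof, expected = chi2_contingency(table, correction=False)
g2_stat, p_g2, _, _ = chi2_contingency(table, correction=False,
                                       lambda_="log-likelihood")    # G-test
print(p_fisher, p_chi2, p_g2)
```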

χ² captures the Total Squared Deviation relative to expectation, and is here calculated only in relation to positive predictions, as often only the overt prediction is considered and the implicit prediction of the negative case is ignored [17-19], noting that it is sufficient to count r=1 cells to determine the table and make a significance estimate. However, χ² is valid only for reasonably sized contingencies (one rule of thumb is that the expectation for the smallest cell is at least 5, and the Yates and Williams corrections will be discussed in due course [18,19]):

χ²+P = (TP−ETP)²/ETP + (FP−EFP)²/EFP
     = DTP²/ETP + DFP²/EFP
     = 2DP²/EHP,   EHP = 2·ETP·EFP/[ETP+EFP]
     = 2N·dp²/ehp,   ehp = 2·etp·efp/[etp+efp]
     = 2N·dp²/[rh·pp] = N·dp²/PrevG²/Bias
     = N·B²·EvennessR/Bias = N·rR²·PrevG²/Bias
     ≈ (N+PN)·rR²·PrevG²    (Bias ≈ 1)
     = (N+PN)·B²·EvennessR    (30)
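The closed form in (30) can be checked against the cell-wise χ² sum for the positive prediction row; the sketch below does so with hypothetical counts (not from the paper).

```python
# Hypothetical counts (not from the paper); numerical check of eq. (30).
TP, FP, FN, TN = 70, 10, 20, 100
N = TP + FP + FN + TN

rp, rn = (TP + FN) / N, (FP + TN) / N      # Prevalence, Inverse Prevalence
pp = (TP + FP) / N                         # Bias
ETP, EFP = N * rp * pp, N * rn * pp        # expected cells given the margins
DTP, DFP = TP - ETP, FP - EFP

chi2_pos = DTP ** 2 / ETP + DFP ** 2 / EFP      # chi-squared over the +P row

informedness = TP / (TP + FN) - FP / (FP + TN)  # Bookmaker B
evenness_r = rp * rn                            # EvennessR = PrevG**2
assert abs(chi2_pos - N * informedness ** 2 * evenness_r / pp) < 1e-6
print(chi2_pos)
```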

G² captures Total Information Gain, being N times the Average Information Gain in nats, otherwise known as Mutual Information, which however is normally expressed in bits. We will discuss this separately under the General Case. We deal with G² for positive predictions in the case of small effect, that is dp close to zero, showing that G² is twice as sensitive as χ² in this range.

G²+P/2 = TP·ln(TP/ETP) + FP·ln(FP/EFP)
       = TP·ln(1 + DTP/ETP) + FP·ln(1 + DFP/EFP)
       ≈ TP·(DTP/ETP) + FP·(DFP/EFP) = 2N·dp²/ehp
       = 2N·dp²/[rh·pp]
       = N·dp²/PrevG²/Bias
       = N·B²·EvennessR/Bias
       = N·rR²·PrevG²/Bias
       ≈ (N+PN)·rR²·PrevG²    (Bias ≈ 1)
       = (N+PN)·B²·EvennessR    (31)

In fact χ² is notoriously unreliable for small N and small cell values, and G² is to be preferred. The Yates correction (applied only for cell values under 5) is to subtract 0.5 from the absolute dp value for that cell before squaring and completing the calculation [17-19].

Our result (30-31) shows that the χ² and G² significance of the Informedness effect increases with N as expected, but also with the square of Bookmaker, the Evenness of Prevalence (EvennessR = PrevG² = Prev·(1−Prev)) and the number of Predicted Negatives (viz. with Inverse Bias)! This is as expected. The more Informed the contingency regarding positives, the less data will be needed to reach significance. The more Biased the contingency towards positives, the less significant each positive is and the more data is needed to ensure significance. The Bias-weighted average over all Predictions (here for the K=2 case: Positive and Negative) is simply K·N·B²·PrevG², which gives us an estimate of the significance without focussing on either case in particular.

χ²KB = 2N·dtp²/PrevG²
     = 2N·rR²·PrevG²
     = 2N·rR²·EvennessR
     = 2N·B²·EvennessR    (32)

Analogous formulae can be derived for the significance of the Markedness effect for positive real classes, noting that EvennessP = BiasG²:

χ²KM = 2N·dtp²/BiasG²
     = 2N·rP²·BiasG²
     = 2N·M²·EvennessP    (33)

The Geometric Mean of these two overall estimates for the full contingency table is

χ²KBM = 2N·dtp²/[PrevG·BiasG]
      = 2N·rP·rR·PrevG·BiasG
      = 2N·rG²·EvennessG
      = 2N·BM·EvennessG    (34)

This is simply the total Sum of Squares Deviance (SSD) accounted for by the correlation coefficient BMG (22) over the N data points, discounted by the Global Evenness factor, being the squared Geometric Mean of all four Positive and Negative Bias and Prevalence terms (EvennessG = PrevG·BiasG).
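The overall estimates (32)-(34) follow directly from B, M and the Evenness terms; the sketch below (hypothetical values, not from the paper) computes them and checks that (34) is the geometric mean of (32) and (33).

```python
# Hypothetical joint probabilities and N (not from the paper); eqs. (32)-(34).
N = 128
tp, fp, fn, tn = 0.35, 0.05, 0.10, 0.50
rp, rn = tp + fn, fp + tn
pp, pn = tp + fp, fn + tn
dtp = tp - rp * pp

evenness_r, evenness_p = rp * rn, pp * pn
informedness = dtp / evenness_r           # B
markedness = dtp / evenness_p             # M

chi2_kb = 2 * N * informedness ** 2 * evenness_r                   # eq. (32)
chi2_km = 2 * N * markedness ** 2 * evenness_p                     # eq. (33)
chi2_kbm = (2 * N * informedness * markedness
            * (evenness_r * evenness_p) ** 0.5)                    # eq. (34)

assert abs(chi2_kbm - (chi2_kb * chi2_km) ** 0.5) < 1e-6
print(chi2_kb, chi2_km, chi2_kbm)
```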


The less even the Bias and Prevalence, the more data will be required to achieve significance, the maximum evenness value of 0.25 being achieved with both even Bias and even Prevalence. Note that for even Bias or Prevalence, the corresponding positive and negative significance estimates match the global estimate.

When χ²+P or G²+P is calculated for a specific label in a dichotomous contingency table, it has one degree of freedom for the purposes of assessment of significance. The full table also has one degree of freedom, and summing for goodness of fit over only the positive prediction label will clearly lead to a lower χ² estimate than summing across the full table, and while summing for only the negative label will often give a similar result it will in general be different. Thus the weighted arithmetic mean calculated by χ²KB is an expected value independent of the arbitrary choice of which predictive variate is investigated. This is used to see whether a hypothesized main effect (the alternate hypothesis, HA) is borne out by a significant difference from the usual distribution (the null hypothesis, H0). Summing over the entire table (rather than averaging over labels) is used for χ² or G² independence testing independent of any specific alternate hypothesis [21], and can be expected to achieve a χ² estimate approximately twice that achieved by the above estimates, effectively cancelling out the Evenness term, and is thus far less conservative (viz. it is more likely to satisfy p < alpha).


around X=2 as, given the central limit theorem applies and the distribution can be regarded as normal, a multiplier of 1.96 corresponds to a confidence of 95% that the true mean lies in the specified interval around the estimated mean, viz. the probability that the derived confidence interval will bound the true mean is 0.95, and the test thus corresponds approximately to a significance test with alpha=0.05 as the probability of rejecting a correct null hypothesis, or a power test with beta=0.05 as the probability of rejecting a true full or partial correlation hypothesis. A number of other distributions also approximate 95% confidence at 2SE.
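The multipliers quoted here and below can be recovered numerically; a minimal check, assuming SciPy is available:

from scipy.stats import norm

# Two-sided 95% multiplier under normality (about 1.96);
# the one-sided multiplier mentioned later is about 1.65.
print(norm.ppf(0.975))   # ~1.96
print(norm.ppf(0.95))    # ~1.645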

We specifically reject the more traditional approach which assumes that both Prevalence and Bias are fixed, defining margins which in turn define a specific chance case rather than an isocost line representing all chance cases: we cannot assume that any solution on an isocost line has greater error than any other, since all are by definition equivalent. The above approach is thus argued to be appropriate for Bookmaker and ROC statistics which are based on the isocost concept, and reflects the fact that most practical systems do not in fact preset the Bias or match it to Prevalence, and indeed Prevalences in early trials may be quite different from those in the field.

The specific estimate of sse that we present for alpha, the probability of the current estimate for B occurring if the true Informedness is B=0, is sseB0 = |1-B| = 1, which is appropriate for testing the null hypothesis, and thus for defining unconventional error bars on B=0. Conversely, sseB2 = |1-B| = 0 is appropriate for testing deviation from the full hypothesis in the absence of measurement error, whilst sseB2 = |B| = 1 conservatively allows for full-range measurement error, and thus defines unconventional error bars on B=M=C=1.

In view of the fact that there is confusion between the use of beta in relation to a specific full dependency hypothesis, B=1 as we have just considered, and the conventional definition of an arbitrary and unspecific alternate contingent hypothesis, B≠0, we designate the probability of incorrectly excluding the full hypothesis by gamma, and propose three possible related kinds of correction for the sse for beta: some kind of mean of |B| and |1-B| (the unweighted arithmetic mean is 1/2, the geometric mean is less conservative and the harmonic mean least conservative), the maximum or minimum (actually a special case of the last, the maximum being conservative and the minimum too low an underestimate in general), or an asymmetric interval that has one value on the null side and another on the full side (a parameterized special case of the last that corresponds to percentile-based usages like box plots, being more appropriate to distributions that cannot be assumed to be symmetric).

The sse means may be weighted or unweighted, and in particular a self-weighted arithmetic mean gives our recommended definition, sseB1 = 1 - 2|B| + 2B², whilst an unweighted geometric mean gives sseB1 = √(|B| - B²) and an unweighted harmonic mean gives sseB1 = |B| - B². All of these are symmetric, with the weighted arithmetic mean giving a minimum of 0.5 at B=0.5 and a maximum of 1 at both B=0 and B=1, contrasting maximally with sseB0 and sseB2 respectively in these neighbourhoods, whilst the unweighted harmonic and geometric means have their minimum of 0 at both B=0 and B=1, acting like sseB0 and sseB2 respectively in these neighbourhoods (where they evidence zero variance around their assumed true values). The maximum, at B=0.5, is 0.5 for the geometric mean and 0.25 for the harmonic mean.

For this probabilistic |B| range, the weighted arithmetic mean is never less than the arithmetic mean and the geometric mean is never more than the arithmetic mean. These relations demonstrate the complementary nature of the weighted arithmetic and unweighted geometric means. The maxima at the extremes are arguably more appropriate in relation to power, as intermediate results should calculate squared deviations from a strictly intermediate expectation based on the theoretical distribution, and will thus be smaller on average if the theoretical hypothesis holds, whilst providing emphasized differentiation when near the null or full hypothesis. The minima of 0 at the extremes are not very appropriate in relation to significance versus the null hypothesis, due to the expectation of a normal distribution, but the power dual versus the full hypothesis is appropriately a minimum, as perfect correlation admits no error distribution. Based on Monte Carlo simulations, we have observed that setting sseB1 = sseB2 = 1-|B| as per the usual convention is appropriately conservative on the upside but a little broad on the downside, whilst the weighted arithmetic mean, sseB1 = 1 - 2|B| + 2B², is sufficiently conservative on the downside, but unnecessarily conservative for high B.
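The behaviour of these candidate sse estimates is easy to tabulate; the following sketch (our own naming and code, purely illustrative) evaluates the conventional 1-|B| form and the three means discussed above across the probabilistic range of B:

# Candidate sse estimates for error bars on Informedness B (0 <= B <= 1).
def sse_beta_candidates(B):
    b = abs(B)
    return {
        "convention 1-|B|": 1 - b,
        "weighted arithmetic mean": 1 - 2 * b + 2 * b ** 2,
        "unweighted geometric mean": (b - b ** 2) ** 0.5,
        "harmonic-mean based": b - b ** 2,
    }

for B in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(B, sse_beta_candidates(B))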


Note that these two-tailed ranges are valid for Bookmaker Informedness and Markedness, which can go positive or negative, but a one-tailed test would be appropriate for unsigned statistics or where a particular direction of prediction is assumed, as we have for our contingency tables. In these cases a smaller multiplier of 1.65 would suffice; however, the convention is to use the overlapping of the confidence bars around the various hypotheses (although usually the null is not explicitly represented).

Thus for any two hypotheses (including the null hypothesis, or one from a different contingency table or other experiment deriving from a different theory or system) the traditional approach of checking that 1.96SE (or 2SE) error bars don't overlap is rather conservative (it is enough for the value to be outside the range for a two-sided test), whilst checking overlap of 1SE error bars is usually insufficiently conservative given that the upper represents beta


significant to the 0.05 level due to the low Informedness, Markedness and Correlation; however, doubling the performance of the system would suffice to achieve significance at N=100 given the Evenness specified by the Prevalences and/or Biases). Moreover, even at the current performance levels the Inverse (Negative) and Dual (Marking) Problems show higher χ² significance, approaching the 0.05 level in some instances (and far exceeding it for the Inverse Dual). The χ²KB variant gives a single conservative significance level for the entire table, sensitive only to the direction of proposed implication, and is thus to be preferred over the standard versions that depend on choice of condition.

Incidentally, the Fisher Exact Test shows significance to the 0.05 level for both the examples in Table 2. This corresponds to an assumption of a hypergeometric distribution rather than normality: viz. all assignments of events to cells are assumed to be equally likely given the marginal constraints (Bias and Prevalence). However, it is inappropriate given that the Bias and Prevalence are not specified by the experimenter in advance of the experiment, as is assumed by the conditions of this test. This has also been demonstrated empirically through Monte Carlo simulation, as discussed later. See [22] for a comprehensive discussion of issues with significance testing, as well as Monte Carlo simulations.
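For reference, the Fisher Exact Test itself is readily computed; a minimal sketch on a hypothetical 2x2 table (the counts of Table 2 are not reproduced here), assuming SciPy:

from scipy.stats import fisher_exact

# Hypothetical counts [[tp, fn], [fp, tn]]; the test conditions on fixed
# margins (hypergeometric), the assumption criticized in the text above.
oddsratio, p_value = fisher_exact([[30, 20], [10, 40]], alternative="two-sided")
print(oddsratio, p_value)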

    PRACTICAL CONSIDERATIONS

If we have a fixed size dataset, then it is arguably sufficient to maximize the determinant of the unnormalized contingency matrix, DT. However, this is not comparable across datasets of different sizes, and we thus need to normalize for N, and hence consider the determinant of the normalized contingency matrix, dt. However, this value is still influenced by both Bias and Prevalence.

In the case where two evaluators or systems are being compared with no a priori preference, the Correlation gives the correct normalization by their respective Biases, and is to be preferred to Kappa.

In the case where an unimpeachable Gold Standard is employed for evaluation of a system, the appropriate normalization is for Prevalence or Evenness of the real gold standard values, giving Informedness. Since this is constant, optimizing Informedness and optimizing dt are equivalent.
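A minimal sketch of this dichotomous relationship (our own code and naming): DT is not comparable across N, dt removes the dependence on N, and dividing by the Evenness of Prevalence of a fixed gold standard yields Informedness.

# DT (raw determinant), dt (normalized determinant) and Informedness B.
def determinant_measures(tp, fn, fp, tn):
    N = tp + fn + fp + tn
    DT = tp * tn - fp * fn            # determinant of the unnormalized matrix
    dt = DT / N ** 2                  # determinant of the normalized matrix
    prev = (tp + fn) / N
    B = dt / (prev * (1 - prev))      # normalization by Evenness of Prevalence
    return DT, dt, B

print(determinant_measures(30, 20, 10, 40))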

More generally, we can look not only at what proposed solution best solves a problem, by comparing Informedness, but at which problem is most usefully solved by a proposed system. In a medical context, for example, it is usual to come up with potentially useful medications or tests, and then explore their effectiveness across a wide range of complaints. In this case Markedness may be appropriate for the comparison of performance across different conditions.

Recall and Informedness, as biased and unbiased variants of the same measure, are appropriate for testing effectiveness relative to a set of conditions, and the importance of Recall is being increasingly recognized as having an important role in matching human performance, for example in Word Alignment for Machine Translation [1]. Precision and Markedness, as biased and unbiased variants of the same measure, are appropriate for testing effectiveness relative to a set of predictions. This is particularly appropriate where we do not have an appropriate gold standard giving correct labels for every case, and is the primary measure used in Information Retrieval for this reason, as we cannot know the full set of relevant documents for a query and thus cannot calculate Recall.

However, in this latter case of an incompletely characterized test set, we do not have a fully specified contingency matrix and cannot apply any of the other measures we have introduced. Rather, whether for Information Retrieval or Medical Trials, it is assumed that a test set is developed in which all real labels are reliably (but not necessarily perfectly) assigned. Note that in some domains, labels are assigned reflecting different levels of assurance, but this has led to further confusion in relation to possible measures and the effectiveness of the techniques evaluated [1]. In Information Retrieval, the labelling of a subset of relevant documents selected by an initial collection of systems can lead to relevant documents being labelled as irrelevant because they were missed by the first-generation systems: so, for example, systems are actually penalized for improvements that lead to discovery of relevant documents that do not contain all specified query words. Thus here too, it is important to develop test sets that are of appropriate size, fully labelled, and appropriate for the correct application of both Informedness and Markedness, as unbiased versions of Recall and Precision.

This Information Retrieval paradigm indeed provides a good example for the understanding of the Informedness and Markedness measures. Not only can documents retrieved be assessed in terms of prediction of relevance labels for a query using Informedness, but queries can be assessed in terms of their appropriateness for the desired documents using Markedness, and the different kinds of search tasks can be evaluated with the combination of the two measures. The standard Information Retrieval mantra that we do not need to find all relevant documents (so that Recall or Informedness is not so relevant) applies only where there are huge numbers of documents containing the required information and a small number can be expected to provide that information with confidence.


However, another kind of Document Retrieval task involves a specific and rather small set of documents for which we need to be confident that all or most of them have been found (and so Recall or Informedness are especially relevant). This is quite typical of literature review in a specialized area, and may be complicated by new developments being presented in quite different forms by researchers who are coming at it from different directions, if not different disciplinary backgrounds.

    THE GENERAL CASE

So far we have examined only the binary case with dichotomous Positive versus Negative classes and labels.

It is beyond the scope of this article to consider the continuous or multi-valued cases, although the Matthews Correlation is a discretization of the Pearson Correlation with its continuous-valued assumption, the Spearman Rank Correlation is an alternate form applicable to arbitrary discrete value (Likert) scales, and Tetrachoric Correlation is available to estimate the correlation of an underlying continuous scale [11]. If continuous measures corresponding to Informedness and Markedness are required due to the canonical nature of one of the scales, the corresponding Regression Coefficients are available.

It is, however, useful in concluding this article to consider briefly the generalization to the multi-class case, and we will assume that both real classes and predicted classes are categorized with K labels, and again we will assume that each class is non-empty unless explicitly allowed (this is because Precision is ill-defined where there are no predictions of a label, and Recall is ill-defined where there are no members of a class).

    Generalization of Association

Powers [4] derives Bookmaker Informedness (41) analogously to Mutual Information and Conditional Entropy (39-40) as a pointwise average across the contingency cells, expressed in terms of label probabilities PP(l), where PP(l) is the probability of Prediction l, and label-conditioned class probabilities PR(c|l), where PR(c|l) is the probability that the Prediction labelled l is actually of Real class c, and in particular PR(l|l) = Precision(l). We use delta functions as mathematical shorthands for Boolean expressions interpreted algorithmically as in C, with true expressions taking the value 1 and false expressions 0, so that δ|c-l| ≡ (c == l) represents a Dirac measure (limit as ε → 0), taking the value 1 if c = l and 0 otherwise, and |c-l| ≡ (c ≠ l) represents its logical complement (1 if c ≠ l and 0 if c = l).

MI(R||P) = Σl PP(l) Σc PR(c|l) [log(PR(c|l)/PR(c))]    (39)

H(R|P) = Σl PP(l) Σc PR(c|l) [-log(PR(c|l))]    (40)

B(R|P) = Σl PP(l) Σc PR(c|l) [PR(l)/(PR(l) - |c-l|)]    (41)

We now define a binary dichotomy for each label l, with l and the corresponding c as the Positive cases (and all other labels/classes grouped as the Negative case). We next denote its Prevalence Prev(l) and its dichotomous Bookmaker Informedness B(l), and so can simplify (41) to

B(R|P) = Σl Prev(l) B(l)    (42)

Analogously we define dichotomous Bias(c) and Markedness(c) and derive

M(P|R) = Σc Bias(c) M(c)    (43)
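A direct implementation of (42) and (43) from a K×K contingency table of counts (rows indexed by real class, columns by predicted label) might look as follows; this is a sketch with our own variable names, assuming all classes and labels are non-empty as stated above.

# Multi-class Bookmaker Informedness (42) and Markedness (43) as Prevalence-
# and Bias-weighted sums of the one-vs-rest dichotomous measures.
def multiclass_informedness_markedness(table):
    K = len(table)
    N = sum(sum(row) for row in table)
    B = M = 0.0
    for k in range(K):
        tp = table[k][k]
        rp = sum(table[k])                    # real positives for class k
        pp = sum(row[k] for row in table)     # predicted positives for label k
        tn = N - rp - pp + tp
        recall, irecall = tp / rp, tn / (N - rp)
        precision, iprecision = tp / pp, tn / (N - pp)
        B += (rp / N) * (recall + irecall - 1)        # Prev(l) * B(l)
        M += (pp / N) * (precision + iprecision - 1)  # Bias(c) * M(c)
    return B, M

print(multiclass_informedness_markedness([[40, 5, 5], [6, 18, 6], [6, 4, 10]]))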

These formulations remain consistent with the definition of Informedness as the probability of an informed decision versus chance, and Markedness as its dual. The Geometric Mean of multi-class Informedness and Markedness would appear to give us a new definition of Correlation, whose square provides a well defined Coefficient of Determination. Recall that the dichotomous forms of Markedness (20) and Informedness (21) have the determinant of the contingency matrix as common numerator, and have denominators that relate only to the margins, to Bias and Prevalence respectively. Correlation, Markedness and Informedness are thus equal when Prevalence = Bias. The dichotomous Correlation Coefficient would thus appear to have three factors: a common factor across Markedness and Informedness, representing their conditional dependence, and factors representing Evenness of Prevalence (cancelled in Markedness) and Evenness of Bias (cancelled in Informedness), each representing a marginal independence.

In fact, Bookmaker Informedness can be driven arbitrarily close to 0 whilst Markedness is driven arbitrarily close to 1, demonstrating their independence; in this case Recall and Precision will be driven to or close to 1. The "arbitrarily close" hedge relates to our assumption that all predicted and real classes are non-empty, although appropriate limits could be defined to deal with the divide-by-zero problems associated with these extreme cases. Technically, Informedness and Markedness are conditionally independent: once the determinant numerator is fixed, their values depend only on their respective marginal denominators, which can vary independently. To the extent that they are independent, the Coefficient of Determination acts as the joint probability of mutual determination, but to the extent that they are dependent, the Correlation Coefficient itself acts as the joint probability of mutual determination.

These conditions carry over to the definition of Correlation in the multi-class case as the Geometric Mean of Markedness and Informedness: once all numerators are fixed, the denominators demonstrate marginal independence.


We now reformulate the Informedness and Markedness measures in terms of the Determinant of the Contingency and Evenness, generalizing (20-22). In particular, we note that the definition of Evenness in terms of the Geometric Mean or product of Biases or Prevalences is consistent with the formulation in terms of the determinants DET and det (generalizing dichotomous DP=DTP and dp=dtp) and their geometric interpretation as the area of a parallelogram in PN-space and its normalization to ROC-space by the product of Prevalences, giving Informedness, or conversely normalization to Markedness by the product of Biases. The generalization of DET to a volume in high-dimensional PN-space, and det to its normalization by product of Prevalences or Biases, is sufficient to guarantee generalization of (20-22) to K classes by reducing from KD to SSD, so that BMG has the form of a coefficient of proportionality of variance:

M ≈ [det / BiasG^K]^(2/K) = det^(2/K) / EvennessP+    (44)

B ≈ [det / PrevG^K]^(2/K) = det^(2/K) / EvennessR+    (45)

BMG ≈ det^(2/K) / [PrevG·BiasG] = det^(2/K) / EvennessG+    (46)

We have marked the Evenness terms in these equations with a trailing plus to distinguish them from other usages, and their definitions are clear from comparison of the denominators. Note that the Evenness terms for the generalized regressions (44-45) are not Arithmetic Means but have the form of Geometric Means. Furthermore, the dichotomous case emerges for K=2 as expected. Empirically (Fig. 3), this generalization matches well near B=0 or B=1, but fares less well in between the extremes, suggesting a mismatched exponent in the heuristic conversion of K dimensions to 2. Here we set up the Monte Carlo simulation as follows: we define the diagonal of a random perfect-performance contingency table with expected N entries using a random uniform distribution; we define a random chance-level contingency table, setting margins independently using a random binormal distribution and then distributing randomly across cells around their expected values; we combine the two (perfect and chance) random contingency tables with respective weights I and (1-I); and finally we increment or decrement cells randomly to achieve cardinality N, which is the expected number but is not constrained by the process for generating the random (perfect and chance) matrices. This procedure was used to ensure that the Informedness and Markedness estimates retain a level of independence; otherwise they tend to correlate very highly with overly uniform margins for higher K and lower N (conditional independence is lost once the margins are specified), and in particular Informedness, Markedness, Correlation and Kappa would always agree perfectly for either I=1 or perfectly uniform margins. Note that this use of Informedness to define a target probability of an informed decision, followed by random inclusion or deletion of cases when there is a mismatch versus the expected number of instances N, means the preset Informedness level is not a fixed preset Informedness but a target level that permits jitter around that level.

Figure 3. Determinant-based estimates of correlation. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line), with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots), with significance (p+1) calculated using G², X², and Fisher estimates, and Correlation estimates calculated from the Determinant of Contingency using two different exponents, 2/K (DB & DM) and 1/[3K-2] (DBa and DMa). The difference between the estimates is also shown. Here K=4, N=128, X=1.96, α=β=0.05.


In particular, the achieved level will be an overestimate for the step I=1 (no negative counts are possible), which can be detected by excess deviation beyond the set Confidence Intervals for high Informedness steps.
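A sketch of this Monte Carlo procedure is given below. It is our own reconstruction (not the authors' code): for simplicity the margins of the chance table are drawn uniformly rather than from the binormal distribution mentioned above, and the chance cells are placed at their expected values rather than scattered around them.

import numpy as np

def random_contingency(K, N, I, rng):
    # "Perfect" table: all counts on the diagonal, split at random.
    diag = rng.uniform(size=K)
    perfect = np.diag(diag / diag.sum() * N)
    # "Chance" table: independently drawn margins, cells at expected values.
    row = rng.uniform(size=K); row /= row.sum()
    col = rng.uniform(size=K); col /= col.sum()
    chance = np.outer(row, col) * N
    # Combine with weights I and (1 - I), then round to integer counts.
    table = np.rint(I * perfect + (1 - I) * chance).astype(int)
    # Increment or decrement random cells until the total is exactly N.
    while table.sum() != N:
        i, j = rng.integers(K), rng.integers(K)
        if table.sum() < N:
            table[i, j] += 1
        elif table[i, j] > 0:
            table[i, j] -= 1
    return table

rng = np.random.default_rng(0)
print(random_contingency(K=4, N=128, I=0.5, rng=rng))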

In Fig. 3 we therefore show and compare an alternate exponent of 1/(3K-2) rather than the exponent of 2/K shown in (44-45). This also reduces to 1, and hence the expected exact correspondence, for K=2. This suggests that what is important is not just the number of dimensions, but also the number of marginal degrees of freedom: K+2(K-1). However, although it matches well for high degrees of association, it shows similar error at low Informedness. The precise relationship between Determinant and Correlation, Informedness and Markedness for the general case remains a matter for further investigation. We however continue with the use of the approximation based on 2/K.

The EvennessR (Prev·IPrev) concept corresponds to the concept of Odds (IPrev/Prev), where Prev+IPrev=1, and Powers [4] shows that (multi-class) Bookmaker Informedness corresponds to the expected return per bet made with a fair Bookmaker (hence the name). From the perspective of a given bet (prediction), the return increases as the probability of winning decreases, which means that an increase in the number of other winners can increase the return for a bet on a given horse (predicting a particular class) through changing the Prevalences and thus EvennessR and the Odds. The overall return can thus increase irrespective of the success of bets in relation to those new wins. In practice, we normally assume that we are making our predictions on the basis of fixed (but not necessarily known) Prevalences, which may be estimated a priori (from past data) or post hoc (from the experimental data itself), and for our purposes are assumed to be estimated from the contingency table.

    Generalization of Significance

In relation to Significance, the single-class χ²+P and G²+P definitions can both be formulated in terms of cell counts and a function of ratios, and would normally be summed over at least (K-1)² cells of a K-class contingency table with (K-1)² degrees of freedom to produce a statistic for the table as a whole. However, these statistics are not independent of which variables are selected for evaluation or summation, and the p-values obtained are thus quite misleading, and for highly skewed distributions (in terms of Bias or Prevalence) can be outlandishly incorrect. If we sum the log-likelihood (31) over all K² cells we get N·MI(R||P), which is invariant over Inverses and Duals.

The analogous Prevalence-weighted multi-class statistic generalized from the Bookmaker Informedness form of the Significance statistic, and the Bias-weighted statistic generalized from the Markedness form, extend Eqns 32-34 to the K>2 case by probability-weighted summation (this is a weighted Arithmetic Mean of the individual cases, targeted to r = K-1 degrees of freedom):

χ²KB = KNB²·EvennessR−    (47)

χ²KM = KNM²·EvennessP−    (48)

χ²KBM = KNBM·EvennessG−    (49)

For K=2 and r=1, the Evenness terms were the product of two complementary Prevalence or Bias terms in both the Bookmaker derivations and the Significance derivations, and (30) derived a single multiplicative Evenness factor from a squared Evenness factor in the numerator, deriving from dtp², and a single Evenness factor in the denominator. We will discuss both these Evenness terms in a later section. We have marked the Evenness terms in (47-49) with a trailing minus to distinguish them from the forms used in (20-22, 44-46).

One specific issue with the goodness-of-fit approach applied to K-class contingency tables relates to the up to (K-1)² degrees of freedom, which we focus on now. The assumption of independence of the counts in (K-1)² of the cells is appropriate for testing the null hypothesis, H0, and the calculation versus alpha, but is patently not the case when the cells are generated by K condition variables and K prediction variables that mirror them. Thus a correction is in order for the calculation of beta for some specific alternate hypothesis HA, or to examine the significance of the difference between two specific hypotheses HA and HB which may have some lesser degree of difference.

Whilst many corrections are possible, in this case correcting the degrees of freedom directly seems appropriate, and whilst using r = (K-1)² degrees of freedom is appropriate for alpha, using r = K-1 degrees of freedom is suggested for beta under the conditions where significance is worth testing, given the association (mirroring) between the variables is almost complete. In testing against beta, it acts as a threshold on the probability that a specific alternate hypothesis of the tested association being valid should be rejected. The difference in a χ² statistic between two systems (r = K-1) can thus be tested for significance as part of comparing two systems (the Correlation-based statistics are recommended in this case). The approach can also compare a system against a model with specified Informedness (or Markedness). Two special cases are relevant here: H0, the null hypothesis, corresponding to null Informedness (B = 0: testing alpha with r = (K-1)²), and H1, the full hypothesis, corresponding to full Informedness (B = 1: testing beta with r = K-1).
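The practical effect of this choice of r is easily checked (a sketch assuming SciPy; the χ² value is a placeholder, not one of the worked examples):

from scipy.stats import chi2

# p-values for the same statistic under the two suggested degrees of freedom:
# r = (K-1)^2 when testing against alpha (null), r = K-1 when testing beta.
K, stat = 4, 20.0                       # illustrative values only
print(chi2.sf(stat, df=(K - 1) ** 2))   # p versus H0
print(chi2.sf(stat, df=K - 1))          # p versus a specific alternate or full hypothesis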

Equations 47-49 are proposed for interpretation under r = K-1 degrees of freedom (plus noise) and are


hypothesized to be more accurate for investigating the probability of the alternate hypothesis in question, HA (beta).

Equations 50-52 are derived by summing over the (K-1) complements of each class and label before applying the Prevalence- or Bias-weighted sum across all predictions and conditions. These measures are thus applicable for interpretation under r = (K-1)² degrees of freedom (plus biases) and are theoretically more accurate for estimating the probability of the null hypothesis H0 (alpha). In practice, the difference should always be slight (as the cumulative density function of the χ² (gamma) distribution is locally near linear in r; see Fig. 4), reflecting the usual assumption that alpha and beta may be calculated from the same distribution. Note that there is no difference in either the formulae or r when K=2.

χ²XB = K(K-1)NB²·EvennessR    (50)

χ²XM = K(K-1)NM²·EvennessP    (51)

χ²XBM = K(K-1)NBM·EvennessG    (52)

Equations 53-55 are applicable to naive unweighted summation over the entire contingency table, but also correspond to the independence test with r = (K-1)² degrees of freedom, as well as slightly underestimating but asymptotically approximating the case where Evenness is maximal in (50-52) at 1/K². When the contingency table is uneven, the Evenness factors will be lower and a more conservative p-value will result from (50-52), whilst summing naively across all cells (53-55) can lead to inflated statistics and underestimated p-values. However, these are the equations that correspond to common usage of the χ² and G² statistics, as well as giving rise implicitly to Cramér's V = [χ²/(N(K-1))]^(1/2) as the corresponding estimate of the Pearson correlation coefficient, so that Cramér's V is thus also likely to be inflated as an estimate of association where Evenness is low. We note these, however, consistent with the usual conventions, as our definitions of the conventional forms of the χ² statistics applied to the multiclass generalizations of the Bookmaker accuracy/association measures:

χ²B = (K-1)NB²    (53)

χ²M = (K-1)NM²    (54)

χ²BM = (K-1)NBM    (55)
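The conventional forms (53)-(55) and the Cramér's V they imply can be written down directly; the sketch below (our own code) takes multi-class B and M, for instance as computed from (42)-(43), and returns the χ² statistics together with the implied Cramér's V, which here reduces to the geometric mean of B and M.

from scipy.stats import chi2

# Conventional full-table chi-squared forms (53)-(55) given multi-class
# Informedness B and Markedness M for a K-class table of N instances.
def conventional_chi2(B, M, N, K):
    chi2_b = (K - 1) * N * B * B      # (53)
    chi2_m = (K - 1) * N * M * M      # (54)
    chi2_bm = (K - 1) * N * B * M     # (55)
    cramers_v = (chi2_bm / (N * (K - 1))) ** 0.5   # = sqrt(B*M) = BMG
    p_value = chi2.sf(chi2_bm, df=(K - 1) ** 2)    # independence-test convention
    return chi2_b, chi2_m, chi2_bm, cramers_v, p_value

print(conventional_chi2(B=0.5, M=0.6, N=128, K=4))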

Figure 4. Chi-squared against degrees of freedom: cumulative density isocontours (relative to α = 0.05: cyan/yellow boundary of p/α = 1 = 1E0).


Note that Cramér's V calculated from standard full-contingency χ² and G² estimates tends to vastly overestimate the level of association as measured by Bookmaker and Markedness or constructed empirically. It is also important to note that the full-matrix significance estimates (and hence Cramér's V and similar estimates from these χ² statistics) are independent of the permutations of predicted labels (or real classes) assigned to the contingency tables, and that in order to give such an independent estimate using the above family of Bookmaker statistics, it is essential that the optimal assignment of labels is made: perverse solutions with suboptimal allocations of labels will underestimate the significance of the contingency table, as they clearly do take into account what one is trying to demonstrate and how well we are achieving that goal.

The empirical observation concerning Cramér's V suggests that the strict probabilistic interpretation of the multiclass generalized Informedness and Markedness measures (probability of an informed or marked decision) is not reflected by the traditional correlation measures, the squared correlation being a coefficient of proportionate determination of variance, and that outside of the 2D case where they match up with BMG, we do not know how to interpret them as a probability. However, we also note that Informedness and Markedness tend to correlate and are at most conditionally independent (given any one cell, e.g. given tp), so that their product cannot necessarily be interpreted as a joint probability (they are conditionally dependent given a margin, viz. prevalence rp or bias pp: specifying one of B or M now constrains the other; setting bias = prevalence, as a common heuristic learning constraint, maximizes correlation at BMG=B=M).

We note further that we have not considered a tetrachoric correlation, which estimates the regression of assumed underlying continuous variables to allow calculation of their Pearson Correlation.

    Sketch Proof of General Chi-squared Test

The traditional χ² statistic sums over a number of terms specified by r degrees of freedom, stopping once dependency emerges. The G² statistic derives from a log-likelihood analysis which is also approximated, but less reliably, by the χ² statistic. In both cases, the variates are assumed to be asymptotically normal and are expected to be normalized to mean μ=0 and standard deviation σ=1, and both the Pearson and Matthews correlations and the χ² and G² significance statistics implicitly perform such a normalization. However, this leads to significance statistics that vary according to which term is in focus if we sum over r rather than K² cells. In the binary dichotomous case, it makes sense to sum over only the condition of primary focus, but in the general case it involves leaving out one case (label and class). By the Central Limit Theorem, summing over (K-1)² such independent z-scores gives us a normal distribution with σ = (K-1).

We define a single case χ²+lP from the χ²+P (30) calculated for label l = class c as the positive dichotomous case. We next sum over