copies of this poster, paper, spreadsheets, scripts may be obtained from:...

http://www.csem.flinders.edu.au/people/pages/powers_david/

Copies of this poster, paper, spreadsheets, scripts may be obtained from: [email protected] or [email protected] or

http://david.wardpowers.info/BM

[1] Powers, David M. W. (2003), Recall and Precision versus the Bookmaker, Proceedings of the International Conference on Cognitive Science (ICSC-2003), Sydney Australia, 2003, pp. 529-534. http://david.wardpowers.info/BM/index.htm

[2] Powers, David M. W. (2007) Evaluation, Flinders InfoEng Tech Report SIE07001 http://www.infoeng.flinders.edu.au/research/techreps/SIE07001.pdf

Contact Details References

European Conference on Artificial Intelligence, July 2008, Patras Greece

David M. W. Powers is Associate Professor of Computer Science and Head of the AILab at Flinders University. He specializesin applications of unsupervised learning tolanguage and speech processing. Dr Powers undertook his PhD in this area, as well asco-founding ACL’s SIGNLL and CoNLL. He isalso a trader and has a Diploma in TechnicalAnalysis, being the study of how to find and exploit ‘edges’ in the financial markets.

+R −R +R −R

+P tp fp pp +P A B A+B

−P fn tn pn −P C D C+D

rp rn 1 A+C B+D N

Recall = tpr = tp/rp

Precision = tpa = tp/pp

Fallout = fpr = tp/rp

Imprecision = fna = fn/pnThe traditional measure for Machine Learning/Natural Language experiments have been

borrowed from Information Retrieval, where Precision reflects the accuracy of positive predictions, and Recall reflects the rate of return of real positives. These measures are totally independent of the number of negative cases that are truly predicted as negative (TN=D). The F-factor is a harmonic mean of Recall and Precision and thus also ignores these cases. The Rand Accuracy can be regarded as a weighted average of Recall and Inverse Recall, or of Precision and Inverse Precision (where the Inverse problem reverses positive and negative), and thus does reflect both. Often however, we do no know all the positive or negative cases, so we cannot calculate anything but Precision and derivatives of Recall based on the known subsets.

A second problem with all of these measures is that they are heavily influenced by two kinds of bias – Prevalence, the proportion of real positives, rp=A+C/N; and prediction Bias, the proportion of positive predictions, pp=A+B/N. A third issue is the confidence we have in these values, or the significance of the purported results.

Definition 1 Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition.Informedness = Recall + Inverse Recall – 1 = tpr-fpr = 1-fnr-fpr = (Recall − Bias) / (1−Prevalence)Confidence Interval CI = X.(1-|Informedness|/√[N-1]) where X= -Φ–1(α/sides)

Definition 2 Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor.Markedness = Precision + Inverse Precision – 1 = tpa-fna = 1-fpa-fna = (Precision − Prevalence) / (1− Bias)Confidence Interval CM = CM = X.(1-|Markedness|/√[N-1]) where X= -Φ–1(α/sides)

Powers [1] derived an unbiased measure, Informedness, based on the idea of an “edge” in gambling, given knowledge of the underlying base probabilities and rewarding success and penalizing failures according fair odds (Bookmaker) for an arbitrary number of classes.

Dichotomous cases – Informedness, Markedness & CorrelationIn the case of just two classes (+ and –), Informedness can be understood in terms of ROC analysis either as the distance from a specific prediction system to the chance line, as the Unbiased Weighted Relative Accuracy (WRAcc for Skew=0), as twice the area between the curve and the chance line, or in terms of the Area Under the Curve (AUC) as 2AUC-1. In Psychology, Dichotomous Informedness corresponds to DeltaP'. DeltaP is identified empirically as the normative predictor of human associative judgements, as one concept primes or marks another. Whereas Informedness corresponds and DeltaP' and is the unbiased form Recall, DeltaP corresponds to Markedness, how much the actual class influences or marks the selected predictor, and relates to Precision, being dual regression coefficients whose produce is the Matthews Correlation, and associating these with Cramer’s V gives us the corresponding χ2 significance estimates. Similarly we can directly set confidence intervals [2].

Illustration of ROC Analysis. The main diagonal represents chance. Points above the diagonal represent performance better than chance, those below worse than chance.

besttpr

fprworst

good

perverse

sse0

B

sse1

Rand Accuracy = tp+tn

p=0

0.2

I =

Chi2, G2 & Fisher significance (α=0.05)

(α=0.05)

(γ=0.05)

(β+=0.05)

(β+=0.05)

(β*=0.05)

- Monte Carlo Model Informedness, I• Bookmaker Informedness, B• Bookmaker Markedness, M• Bookmaker Correlation, C

Illustration of Significance and Confidence. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line) with Bookmaker-estimated Informedness (red dots), Markedness (green dot) and Correlation (blue dot), with significance (p+1) calculated using G2, χ2, and Fisher estimates, and confidence bands shown for both the

theoretical Informedness and the B=0 and B=1 levels (parallel at B=0.18,0.82). The lower theoretical band is calculated twice, using both CIB1 and CIB2. Here K=5, N=128, X=1.96 for two-sided α=β=0.05.

David M. W. PowersAILAB, Schhol of Computer Science,

Mathematics and Engineering

Flinders University of South AustraliaG.P.O. Box 2100, ADELAIDE 5001

SOUTH AUSTRALIA

David M. W. Powers AILAB, School of Computer Science,


Flinders University of South Australia

G.P.O. Box 2100, ADELAIDE 5001

SOUTH AUSTRALIA

Evaluation in NLP is biased…

• Consider a POS tagging task

– Is this word “water” a Noun in this context?

• For chance level performance

– Precision reflects Prevalence (is_Noun 90%)

– Recall reflects Bias (tagged_Noun 100%)

– Accuracy/F-factor reflect both (90%/95%)

High Accuracy/F-factor can arise from high bias!

Proportion of outcomes +veProportion of +ve predictions correct

Proportion of predictions +veProportion of +ve outcomes correct

Proportion of outcomes or predictions correct

Proportion of +ve outcomes and predictions correct

Medicine, Psychology & Gambling

• Receiver Operating Characteristics (ROC)– Recall vs Fallout tradeoff (tpr vs fpr)

– Unbiased default or can allow for cost / skew

• Normative Measure of Contingency (DeltaP)– Predictor of Human Associative Judgements

• Bookmaker Gambling Edge (Informedness)– Expected profit betting with fair odds

What is the probability of an informed choice?

Correlation and Significance• Informedness or DeltaP′ = Recall + invRecall – 1 (0)

– Probability predictions are informed vs chance outcome

• Markedness or DeltaP = Precision + InvPrecision – 1 (0)

– Probability outcomes are marked vs chance prediction

• Correlation: ρ2 = Markedness∙Informedness (0)

– Probability variance is explained reciprocally

• Significance: χ2 = N∙Evenness∙Markedness∙Informedness (0)

– Probability rejecting true null (Type I Error): Q(r/2,χ2/2)

http://david.wardpowers.info/BM [email protected]

Proportion of time

Proportion of time

Proportion of time

Proportion of time

vs Predictions are correct (+ve: Precision; all: Accuracy)

vs Real outcomes are correct (+ve: Recall; all: Accuracy)

Evenness = 2∙ (Prev∙invPrev∙Bias∙invBias)1/2

Recall = Informedness∙invPrev+Bias

Precision = Markedness∙invBias+Prev

Superstitious contingency


© 2008 David M. W. Powers AILab, School of Computer Science,


Flinders University of South Australia

G.P.O. Box 2100, ADELAIDE 5001

SOUTH AUSTRALIA

http://www.csem.flinders.edu.au/people/pages/powers_david/

Copies of this poster, paper, spreadsheets, scripts may be obtained from: [email protected] or [email protected] or


[1] Powers, David M. W. (2003), Recall and Precision versus the Bookmaker, Proceedings of the International Conference on Cognitive Science (ICSC-2003), Sydney Australia, 2003, pp. 529-534. http://david.wardpowers.info/BM/index.htm

[2] Powers, David M. W. (2007) Evaluation, Flinders InfoEng Tech Report SIE07001 http://www.infoeng.flinders.edu.au/research/techreps/SIE07001.pdf

Contact Details

European Conference on Artificial Intelligence, July 2008, Patras Greece

David M. W. Powers is Associate Professor of Computer Science and Head of the AILab at Flinders University. He specializesin applications of unsupervised learning tolanguage and speech processing. Dr Powers undertook his PhD in this area, as well asco-founding ACL’s SIGNLL and CoNLL. He isalso a trader and has a Diploma in TechnicalAnalysis, being the study of how to find and exploit ‘edges’ in the financial markets.

+R −R +R −R

+P tp fp pp +P A B A+B

−P fn tn pn −P C D C+D

rp rn 1 A+C B+D N

Recall = tpr = tp/rp

Precision = tpa = tp/pp

Fallout = fpr = tp/rp

Imprecision = fna = fn/pnThe traditional measure for Machine Learning/Natural Language experiments have been

borrowed from Information Retrieval, where Precision reflects the accuracy of positive predictions, and Recall reflects the rate of return of real positives. These measures are totally independent of the number of negative cases that are truly predicted as negative (TN=D). The F-factor is a harmonic mean of Recall and Precision and thus also ignores these cases. The Rand Accuracy can be regarded as a weighted average of Recall and Inverse Recall, or of Precision and Inverse Precision (where the Inverse problem reverses positive and negative), and thus does reflect both. Often however, we do no know all the positive or negative cases, so we cannot calculate anything but Precision and derivatives of Recall based on the known subsets.

A second problem with all of these measures is that they are heavily influenced by two kinds of bias – Prevalence, the proportion of real positives, rp=A+C/N; and prediction Bias, the proportion of positive predictions, pp=A+B/N. A third issue is the confidence we have in these values, or the significance of the purported results.

Definition 1 Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition.Informedness = Recall + Inverse Recall – 1 = tpr-fpr = 1-fnr-fpr = (Recall − Bias) / (1−Prevalence)Confidence Interval CI = X.(1-|Informedness|/√[N-1]) where X= -Φ–1(α/sides)

Definition 2 Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor.Markedness = Precision + Inverse Precision – 1 = tpa-fna = 1-fpa-fna = (Precision − Prevalence) / (1− Bias)Confidence Interval CM = CM = X.(1-|Markedness|/√[N-1]) where X= -Φ–1(α/sides)

Powers [1] derived an unbiased measure, Informedness, based on the idea of an “edge” in gambling, given knowledge of the underlying base probabilities and rewarding success and penalizing failures according fair odds (Bookmaker) for an arbitrary number of classes.

Dichotomous cases – Informedness, Markedness & CorrelationIn the case of just two classes (+ and –), Informedness can be understood in terms of ROC analysis either as the distance from a specific prediction system to the chance line, as the Unbiased Weighted Relative Accuracy (WRAcc for Skew=0), as twice the area between the curve and the chance line, or in terms of the Area Under the Curve (AUC) as 2AUC-1. In Psychology, Dichotomous Informedness corresponds to DeltaP'. DeltaP is identified empirically as the normative predictor of human associative judgements, as one concept primes or marks another. Whereas Informedness corresponds and DeltaP' and is the unbiased form Recall, DeltaP corresponds to Markedness, how much the actual class influences or marks the selected predictor, and relates to Precision, being dual regression coefficients whose produce is the Matthews Correlation, and associating these with Cramer’s V gives us the corresponding χ2 significance estimates. Similarly we can directly set confidence intervals [2].

Rand Accuracy = tp+tn

References

Illustration of ROC Analysis. The main diagonal represents chance. Points above the diagonal represent performance better than chance, those below worse than chance.

best

tpr

fpr fpr

good

perverse

sse0

B

sse1

p=0

0.2

I =

Chi2, G2 & Fisher significance (α=0.05)

(α=0.05)

(γ=0.05)

(β+=0.05)

(β+=0.05)

(β*=0.05)

- Monte Carlo Model Informedness, I• Bookmaker Informedness, B• Bookmaker Markedness, M• Bookmaker Correlation, C

Illustration of Significance and Confidence. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line) with Bookmaker-estimated Informedness (red dots), Markedness (green dot) and

Correlation (blue dot), with significance (p+1) calculated using G2, χ2, and Fisher estimates, and confidence bands shown for both the theoretical Informedness and the B=0 and B=1 levels (parallel at B=0.18,0.82). The lower

theoretical band is calculated twice, using both CIB1 and CIB2. Here K=5, N=128, X=1.96 for two-sided α=β=0.05.

copies of this poster, paper, spreadsheets, scripts may be obtained from:...

Documents

recall inverse recall

precision inverse precision

recall bias

derivatives of recall

precision prevalence

harmonic mean of recall

weighted average of

inverse problem