
Page 1: Measuring Outcomes

Measuring Outcomes

Scott Harris October 2009

Comparing a measurement to a ‘true’ value or ‘Gold standard’

Page 2: Measuring Outcomes

Setting up your “Zapper”

• Press “GO” “4” “1” “GO” on your handset to match the unique frequency for this session.

• You have set it up correctly if an orange-green light comes on briefly; otherwise a red light will show and you will have to try again.

• You will be using the handset to answer some questions during this session.

• All handsets need to be collected at the end of the session or when you leave the lecture room.

Page 3: Measuring Outcomes

Learning outcomes

By the end of this session you should be aware of techniques for assessing the accuracy/precision of measuring the following ‘true’ outcome types with the following measures:

– Continuous vs. continuous

– Binary vs. binary

– Binary vs. continuous

Page 4: Measuring Outcomes

Contents

• Introduction

• Continuous vs. continuous

– The most common mistake

– Features of interest (Precision and accuracy)

– Setting up a Bland-Altman plot (with examples)

• Binary vs. binary

– Sensitivity / Specificity

– Positive predictive value / Negative predictive value

• Binary vs. continuous

– Diagnostic tests

– Receiver operating characteristic (ROC) curves

Page 5: Measuring Outcomes

Statistics

Page 6: Measuring Outcomes

Statistical analysis

• The full set of statistical analysis techniques is like a cookbook: depending on the ingredients available (the data), only certain analyses will be appropriate.

How you record your measurements directly affects the analysis options open to you, so this should be considered before data collection.

• The statistics section of the RSB course will go through the different tests that are used in different situations with different ‘types’ of data. These sessions will look at associations, relationships and group differences.

Page 7: Measuring Outcomes

Statistical Support

Statistical support for SoM PhD students is available from:

– Dr Ruth Pickering ([email protected] )

– Brian Yuen ([email protected] )

– Scott Harris ([email protected] )

Up to 3 (1hr) consultations per student.

Page 8: Measuring Outcomes

Types of data

• Quantitative – a measured quantity.

– Continuous – Measurements from a continuous scale: Height, weight, age.

– Discrete – Count data: Children in a family, number of days in hospital.

• Qualitative – Assessing a quality.

– Ordinal – An order to the data: Likert scale (much worse, worse, the same, better, much better), age group (18-25, 26-30…).

– Categorical / Nominal – Simple categories: Blood group (O, A, B, AB). A special case is binary data (two levels): Status (Alive, dead), Infection (yes, no).

Page 9: Measuring Outcomes

Measurement comparison studies

• Also known as method comparison studies or measurement error studies.

• Can involve either the same measurement taken a number of times or a number of alternative measures of the same outcome.

• Techniques discussed here will focus on only two measurements at a time. More advanced techniques exist for assessing repeatability when more than two measurements are taken (see a statistician for these).

Page 10: Measuring Outcomes

Measures of a ‘true’ or gold standard value

Aim:

To determine the ability of a new test to replicate the score of a ‘true’ or gold standard value accurately and precisely, or to distinguish those who have the feature of interest from those who do not.

Procedure:

• Recruit a sample of patients, or take a sample of measures, in your population of interest. If the outcome is binary, the sample needs to include patients both with and without the outcome of interest.

• All recipients undergo the new test (index test).

• Test result is compared against knowledge of true status (reference test or ‘gold standard’ test).

Page 11: Measuring Outcomes

Reference test

Reference test = ‘Gold standard’

Often imperfect:

• Often invasive or unpleasant (e.g. liver or other biopsy).

• May require long-term follow-up (results may take too long to arrive).

• May be resource intensive (time, money etc.)

• As close to the truth as possible (it is assumed to be the actual truth).

Page 12: Measuring Outcomes

Continuous vs. continuous

Accuracy and precision.

Bland-Altman plot.

Page 13: Measuring Outcomes

Most common mistake

M. Goyal et al. Comparison of Wright scale and European scale peak flow meters with digital spirometer. The Internet Journal of Pulmonary Medicine 2008; 9(2).

Comparison of European scale peak flow meters with digital spirometer.

The most common mistake is that researchers draw a scatter plot of the two measures against each other and calculate either a correlation or a linear regression. If these results are significant they then conclude agreement between the measures. This is INCORRECT: correlation and regression assess association, not agreement, and two measures can be perfectly correlated while systematically disagreeing.

The solid red line on the plot shows an approximate line of ‘best fit’. The true line of equivalence is shown with the dotted red line.
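To see numerically why correlation is the wrong tool here, a minimal sketch in Python with made-up numbers (the array names and values are illustrative, not data from the paper above):

    # Correlation does not measure agreement: a sketch with simulated numbers.
    import numpy as np

    rng = np.random.default_rng(0)
    method_a = rng.normal(100, 10, size=50)  # hypothetical 'gold standard' readings
    method_b = 2 * method_a - 20             # new method: perfectly correlated, systematically off

    r = np.corrcoef(method_a, method_b)[0, 1]
    print(f"correlation r = {r:.3f}")        # 1.000, yet the methods do not agree
    print(f"mean difference = {np.mean(method_b - method_a):.1f}")  # large systematic bias

A significant correlation here says only that the two methods move together, not that they produce the same values.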

Page 14: Measuring Outcomes

[Figure: target diagrams arranged along two axes, precision and accuracy.]

These images illustrate two features of random and systematic errors:

• Random error (measurement error) affects precision, whereas systematic error (bias) affects accuracy.

• A test score can be highly precise (reliable), without being accurate.

An ideal test score is both accurate and precise.

Page 15: Measuring Outcomes

Producing a Bland-Altman plot

• Calculate the mean of your two measures for each observation.

• Calculate the difference between your two measures for each observation.

• Calculate the mean difference across all observations.

• Calculate the standard deviation of the differences and multiply it by 1.96.

• Draw a scatter plot as shown on the next slide:

Page 16: Measuring Outcomes

Bland-Altman plot

Add reference lines at the mean difference and at the mean difference ± 1.96×SD. These are your limits of agreement. If this interval is narrow enough you can conclude agreement between the two measures.

The mean value is plotted on the x-axis, with the difference on the y-axis.
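Following the recipe above, a minimal sketch in Python (numpy and matplotlib assumed available; the simulated measurements are illustrative, not data from the session):

    # Bland-Altman plot following the steps above.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    method_a = rng.normal(100, 10, size=50)          # hypothetical paired measurements
    method_b = method_a + rng.normal(1, 4, size=50)  # second method with bias and noise

    pair_mean = (method_a + method_b) / 2   # mean of the two measures per observation
    pair_diff = method_a - method_b         # difference per observation

    mean_diff = pair_diff.mean()                # mean difference across observations
    half_width = 1.96 * pair_diff.std(ddof=1)   # 1.96 x SD of the differences

    plt.scatter(pair_mean, pair_diff)
    plt.axhline(mean_diff, color="red")                               # mean difference
    plt.axhline(mean_diff + half_width, color="red", linestyle="--")  # upper limit of agreement
    plt.axhline(mean_diff - half_width, color="red", linestyle="--")  # lower limit of agreement
    plt.xlabel("Mean of the two measures")
    plt.ylabel("Difference between the two measures")
    plt.title("Bland-Altman plot")
    plt.show()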

Page 17: Measuring Outcomes

The Bland-Altman plot for CUM15, the index that was used to express cough sensitivity in hypertonic saline challenge.

Koskela et al. Cough 2008; 4:8.

Page 18: Measuring Outcomes

“Bias between the techniques was not observed, as indicated by a non-significant p value (p = 0.648, p = 0.408 and p = 0.665,

respectively).”

Minderico et al. Nutrition & Metabolism 2006; 3:32.

[Figure: Bland-Altman plots at baseline, after weight loss, and for the difference.]

Page 19: Measuring Outcomes

Binary vs. binary

Sensitivity, specificity, positive and negative predictive values.

Page 20: Measuring Outcomes

Classification of results

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

Page 21: Measuring Outcomes

Theory: Sensitivity and specificity

Sensitivity:

Of those with the disease, the proportion with a positive test result.

Specificity:

Of those without the disease, the proportion with a negative test result.

Sensitivity and specificity are inversely related to one another: if the level used to distinguish between a positive and a negative test is shifted, sensitivity and specificity will move in opposite directions.

Page 22: Measuring Outcomes

Theory: Sensitivity

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

Sensitivity = a / (a + c)

Page 23: Measuring Outcomes

Theory: Specificity

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

Specificity = d / (b + d)

Page 24: Measuring Outcomes

Sensitivity & Specificity: Example

                        Disease    No Disease
Positive test result    62         5
Negative test result    7          34

Sensitivity = a / (a + c) = 62/69 = 89.9%

Specificity = d / (b + d) = 34/39 = 87.2%
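As a minimal check, the same arithmetic in Python (the variable names follow the table letters above):

    # Sensitivity and specificity from the 2x2 counts above.
    a, b, c, d = 62, 5, 7, 34   # TP, FP, FN, TN

    sensitivity = a / (a + c)   # of those with the disease, proportion testing positive
    specificity = d / (b + d)   # of those without the disease, proportion testing negative

    print(f"Sensitivity = {sensitivity:.1%}")   # 89.9%
    print(f"Specificity = {specificity:.1%}")   # 87.2%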

Page 25: Measuring Outcomes

What does this mean for an individual?

• The problem with sensitivity and specificity is that they are not useful for looking at results for individual patients.

• If a person has a positive test for some outcome, what we really want to know is how likely it is that the person actually has the outcome.

• This is where an alternative pair of measures, the positive and negative predictive values, comes in:

Positive predictive value (PPV):

Of those with a positive test result, the proportion who truly have the disease.

Negative predictive value (NPV):

Of those with a negative test result, the proportion who are truly without the disease.

Page 26: Measuring Outcomes

Theory: Positive predictive value (PPV)

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

PPV = a / (a + b)

Page 27: Measuring Outcomes

Theory: Negative predictive value (NPV)

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

NPV = d / (c + d)

Page 28: Measuring Outcomes

PPV & NPV: Example

                        Disease    No Disease
Positive test result    62         5
Negative test result    7          34

PPV = a / (a + b) = 62/67 = 92.5%

NPV = d / (c + d) = 34/41 = 82.9%
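And the corresponding sketch for PPV and NPV, reusing the same counts:

    # PPV and NPV from the same 2x2 counts.
    a, b, c, d = 62, 5, 7, 34   # TP, FP, FN, TN

    ppv = a / (a + b)   # of positive test results, proportion truly with the disease
    npv = d / (c + d)   # of negative test results, proportion truly without the disease

    print(f"PPV = {ppv:.1%}")   # 92.5%
    print(f"NPV = {npv:.1%}")   # 82.9%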

Page 29: Measuring Outcomes

Binary vs. continuous

Receiver operating characteristic (ROC) curve, sensitivity, specificity, positive and negative predictive values.

Page 30: Measuring Outcomes

Perfect diagnostic test performance

[Figure: two completely separated score distributions, ‘Normal’ and ‘Diseased’, along the test score axis, with a cut-off value between them; scores below the cut-off are predicted as normal, scores above as diseased.]

Page 31: Measuring Outcomes

Perfect diagnostic test performance

Page 32: Measuring Outcomes

Imperfect separation? – Real world!

[Figure: overlapping ‘Normal’ and ‘Diseased’ score distributions along the test value axis.]

Imperfect test. Is it good enough?

Page 33: Measuring Outcomes

Imperfect separation?

[Figure: the same overlapping distributions, highlighting the true negatives (‘Normal’ side of the cut point) and true positives (‘Diseased’ side).]

Page 34: Measuring Outcomes

Imperfect separation?

[Figure: the same overlapping distributions, highlighting the false negatives and false positives around the cut point.]

Page 35: Measuring Outcomes

Classification of results

                        Disease                No Disease
Positive test result    True Positives (a)     False Positives (b)
Negative test result    False Negatives (c)    True Negatives (d)

Page 36: Measuring Outcomes

Cut point: Maximising Sensitivity

[Figure: overlapping ‘Normal’ and ‘Diseased’ distributions with the cut point placed to maximise sensitivity.]

Page 37: Measuring Outcomes

Cut point: Maximising Specificity

[Figure: overlapping ‘Normal’ and ‘Diseased’ distributions with the cut point placed to maximise specificity.]

Page 38: Measuring Outcomes

More ‘Normal’ than ‘Diseased’

[Figure: overlapping distributions in which the ‘Normal’ group is much larger than the ‘Diseased’ group, with false negatives and false positives marked.]

Prevalence = (all with disease) / (total population)

Page 39: Measuring Outcomes

Theory: PPV & NPV

Whereas sensitivity and specificity look separately at the group truly with the outcome (sensitivity) and the group truly without the outcome (specificity), PPV and NPV each combine parts of both groups.

This means that PPV and NPV are linked with prevalence, whereas sensitivity and specificity are not. A sketch of this dependence follows below.

Low prevalence – the number of negative test results will be much larger than the number of false negatives: NPV will be very high but not very informative.

High prevalence – the number of positive test results will be much larger than the number of false positives: PPV will be very high but not very informative.
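A minimal sketch of this dependence, computing PPV and NPV from sensitivity, specificity and prevalence via Bayes' theorem (the helper functions and prevalence values are illustrative, not from the slides):

    # PPV and NPV as functions of prevalence, given fixed sensitivity and specificity.
    def ppv(sens, spec, prev):
        # P(disease | positive test), by Bayes' theorem
        return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

    def npv(sens, spec, prev):
        # P(no disease | negative test), by Bayes' theorem
        return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

    # Using the earlier example test (sensitivity 89.9%, specificity 87.2%):
    for prev in (0.02, 0.20, 0.80):
        print(f"prevalence {prev:.0%}: PPV = {ppv(0.899, 0.872, prev):.1%}, "
              f"NPV = {npv(0.899, 0.872, prev):.1%}")
    # At 2% prevalence PPV falls to about 12.5%, despite good sensitivity and specificity.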

Page 40: Measuring Outcomes

Receiver Operating Characteristic (ROC) curves

Quite often we have a continuous value for our new test and could choose various ‘cuts’ to create a binary prediction of status. A ROC curve demonstrates the effect of varying the cut point on sensitivity and specificity (a minimal construction sketch follows after this list):

• Plots a curve of sensitivity vs. (1 - specificity) for all cut points that would alter at least one classification.

• The top left corner of the plot signifies perfect performance.

• Often includes a diagonal line, indicating an uninformative test.

• The larger the area under the curve, the better the test across the range of cut points.

• The nearer a single point gets to perfect performance, the better the test performs at its best cut point.
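A minimal sketch of constructing a ROC curve by hand, assuming simulated scores (the distributions and sample sizes are illustrative, not data from the session):

    # Building a ROC curve by sweeping the cut point over every observed score.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    scores = np.concatenate([rng.normal(55, 10, 100),    # hypothetical 'diseased' scores
                             rng.normal(40, 10, 100)])   # hypothetical 'normal' scores
    truth = np.concatenate([np.ones(100), np.zeros(100)])

    cuts = np.sort(np.unique(scores))[::-1]   # each cut changes at least one classification
    sens = [((scores >= c) & (truth == 1)).sum() / (truth == 1).sum() for c in cuts]
    fpr = [((scores >= c) & (truth == 0)).sum() / (truth == 0).sum() for c in cuts]

    plt.plot(fpr, sens)
    plt.plot([0, 1], [0, 1], linestyle="--")   # diagonal: an uninformative test
    plt.xlabel("1 - Specificity (false positive rate)")
    plt.ylabel("Sensitivity")
    plt.title("ROC curve")
    plt.show()

Each cut point yields one (1 - specificity, sensitivity) pair; sweeping the cut from high to low traces the curve from (0, 0) to (1, 1).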

Page 41: Measuring Outcomes

Receiver Operating Characteristic (ROC) curves

[Figure: ROC plot with ‘Better performance’ marked toward the top left and ‘Perfect performance’ at the top left corner.]

Page 42: Measuring Outcomes

ROC curves: SPSS - Customised output

Page 43: Measuring Outcomes

Practical Questions

Page 44: Measuring Outcomes

Practical Questions

A new test for a disease was evaluated on 50 affected individuals and 50 who were known to be unaffected. The test correctly identified 40 of the affected subjects, but incorrectly classified 5 of the unaffected ones.

I. What is the sensitivity of the test?

II. What is the specificity of the test?

III. What is the positive predictive value of the test?

IV. What is the negative predictive value of the test?

V. If the real disease prevalence is 2%, how likely is it that someone with a positive test has the disease?

Page 45: Measuring Outcomes

Practical Questions (I to IV)

                        Disease    No Disease    Total
Positive test result
Negative test result
Total

Sensitivity =
Specificity =
PPV =
NPV =

Page 46: Measuring Outcomes

Practical Questions (V)

                        Disease    No Disease    Total
Positive test result
Negative test result
Total

Sensitivity =
Specificity =
PPV =
NPV =

Page 47: Measuring Outcomes

Summary

You should now be aware of techniques for assessing the accuracy/precision of measuring the following ‘true’ outcome types with the following measures:

– Continuous vs. continuous

• Bland-Altman plot

– Binary vs. binary

• Sensitivity/specificity/PPV/NPV

– Binary vs. continuous

• ROC curves and sensitivity/specificity/PPV/NPV

Page 48: Measuring Outcomes

References

• Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i:307-310.

• http://statpages.org/ctab2x2.html

• Swets JA, Pickett RM. Evaluation of diagnostic systems. New York: Academic Press, 1982.

• Langlotz CP. Fundamental measures of diagnostic examination performance: usefulness for clinical decision making and research. Radiology 2003; 228:3-9.

• Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29-36.

• Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148:839-843.