Post on 12-Jun-2020
Introduction to medical statistics
With thanks to the following people for the use of their PowerPoint presentations:
Sarah Butler, Library & Information Skills Trainer, Brighton and Sussex NHS Library and Knowledge Service
Mark Kerr, Clinical Librarian, East Kent Hospitals University NHS Foundation Trust
Learning objectives
By the end of this session you will:
• understand how statistics are used in medical research
• interpret statistical tables in research papers
• describe common medical statistical concepts
• identify statistical inadequacies in research
The different types of statistics
Descriptive statistics - summarise the population and the results
Statistics to demonstrate difference (statistics for probability) - describe the results as comparisons between groups under study
Statistics for validity – describe the reliability of the study and how the results are applicable to others
Descriptive statistics
Summarise the population and the results
1) Numerical – where a value can fall at any point in a range (e.g. weight)
2) Categorical – where a value is selected from specific options (e.g. gender) – can be ‘nominal’ or ‘ordinal’
Some measurements can fall into either – BMI (e.g. 28, or ‘overweight’)
Different techniques are used to summarise each type of data.
Data distributions
Normal vs skewed data
The type of data distribution matters when it comes to summarising and (later) statistical testing.
‘Averaging’ values - mean
Used to calculate the average where the data are ‘normally distributed’, i.e. a point is equally likely to appear above or below the mean.
To calculate the mean:
• Add up all the values
• Divide by the total number of values
1 + 1 + 2 + 3 + 4 + 5 + 5 + 6 + 7 + 9 = 43
43 / 10 = 4.3 = Mean
1 1 2 3 4 5 5 6 7 9
‘Averaging’ data - median
1 1 2 2 2 2 2 2 3 3 3 3 4 4 4 5 6 6 7 9 15
Median is used for skewed data where values are not evenly distributed around a central value.
To calculate the median, line up all the values and find the centre value. If there is an even number of values, take the mean of the 2 centre values.
‘Averaging’ data - mode
1 1 2 2 2 2 2 2 3 3 3 3 4 4 4 5 6 6 7 9 15
Mode is often used for categorical data where values cannot be added up. You identify the most frequent value. Here it would be 2.
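The three averages described above can be checked in a few lines. This is a minimal sketch using Python's standard-library `statistics` module on the 21 example values from the slides; the variable names are illustrative only.

```python
import statistics

# The 21 example values shown on the median and mode slides
values = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 9, 15]

mean = statistics.mean(values)      # add up all the values, divide by how many there are
median = statistics.median(values)  # centre value once the values are lined up in order
mode = statistics.mode(values)      # most frequent value

print(f"mean={mean:.1f}  median={median}  mode={mode}")
```

The single large value (15) pulls the mean above the median, which is why the median is preferred for skewed data.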
What is the difference?
[Figure: histogram of Length of Stay (x-axis: 1 to 15) against Frequency (y-axis: 0 to 7), with the three averages marked]
Mode = 2
Median = 3
Mean = 4.1
Summarising numerical results
We summarise numerical results by reporting:
AVERAGE: Mean or Median
SPREAD: Standard Deviation or Inter-Quartile Range
Two sets of values can have a similar average but a different spread:
1 1 2 3 4 5 5 6 7 9
3 3 4 4 4 5 5 5 5 7
Inter-quartile range
1 1 2 2 2 2 2 2 3 3 3 3 4 4 4 5 6 6 7 9 15
The inter-quartile range (IQR) is the middle 50% of the values.
2 2 2 3 3 3 3 4 4 4 5
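The quartiles that bound the middle 50% can be sketched as follows. Several conventions exist for locating quartiles; the `method="inclusive"` option of `statistics.quantiles` is an assumption chosen here because it reproduces the 2-to-5 range shown on the slide, and other software may give slightly different cut points.

```python
import statistics

values = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 9, 15]

# Cut the sorted data into four equal parts: lower quartile, median, upper quartile
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # width of the middle 50% of the values

print(f"Q1={q1}  median={q2}  Q3={q3}  IQR={iqr}")
```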
Standard deviation
The standard deviation measures how widely the set of values is spread around the mean.
[Figure: two histograms of weight (x-axis: 60 to 100 kg; y-axis: frequency, 0 to 12) with the same mean but different spread]
Mean (SD): 80 kg (10 kg)
Mean (SD): 80 kg (5 kg)
Standard deviation
[Figure: histogram of weight (x-axis: 60 to 100 kg; y-axis: frequency, 0 to 12); Mean (SD) 80 kg (5 kg), with the central 68.2% shaded]
For normally distributed data:
68.2% of results lie within 1 standard deviation either side of the mean
95.4% of results lie within 2 standard deviations of the mean
99.7% of results lie within 3 standard deviations of the mean
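The percentages above follow from the shape of the normal distribution and can be reproduced with the error function. A minimal sketch, assuming normally distributed weights with the Mean (SD) of 80 kg (5 kg) from the slide; the function name is hypothetical. The exact one-SD figure is 68.27%, which the slide shows as 68.2%.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

mean_kg, sd_kg = 80, 5  # Mean (SD) from the slide
for k in (1, 2, 3):
    low, high = mean_kg - k * sd_kg, mean_kg + k * sd_kg
    print(f"{low} to {high} kg covers {share_within(k):.1%} of results")
```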
Statistics for probability
Common terms used in comparing groups under study
• Absolute risk
• Relative risk
• Risk ratio
• Hazard ratio
• Odds ratio
All of these compare how often the event that happens in Group X happens in Group Y
Counting the number of events
When measuring an event rate we count how many people experience the event and divide that number by the total number of people in the group. The result may be called a:
Ratio
Proportion
(Event) Rate
Percentage
Prevalence
So...
If we wanted to compare two groups for the number of people who fell over in a group, we would simply count the number of people who fell over in Group A and count the number of people who fell over in Group B. The number of people falling over could be expressed as a simple count, but to make comparison easier it is usually expressed as a %.
Absolute or relative difference
Difference can be Absolute or Relative
Absolute Difference: X – Y, where X and Y are averages or proportions
or
Relative Difference: X ÷ Y, where X and Y are proportions
Absolute or relative difference
If 60 out of 100 people in Group B suffer a fall, and 20 out of 100 people in Group A suffer a fall:
the absolute difference = 60 - 20 = 40 more people per 100 who fall (40 percentage points)
the relative difference = 60 ÷ 20 = 3 (you are 3 times as likely to suffer a fall in Group B)
Event rates (proportions)
2x2 table
| | Disease/outcome: Yes | Disease/outcome: No | Total |
| Risk factor / Exposure | a | b | a + b |
| No risk factor / Control | c | d | c + d |
• Exposure Event Rate = a ÷ (a + b)
• Control Event Rate = c ÷ (c + d)
Event rates (proportions)
2x2 table
| | Falls: Yes | Falls: No | Total |
| Vitamin D (Group A) | 20 | 80 | 100 |
| No Vitamin D (Group B) | 60 | 40 | 100 |
• Exposure Event Rate = 20 ÷ (20 + 80) = 20%
• Control Event Rate = 60 ÷ (60 + 40) = 60%
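The event-rate arithmetic from the 2x2 table can be sketched as below, using the illustrative falls counts from the slide. The function and variable names are assumptions made for this example.

```python
def event_rate(events, non_events):
    """Proportion of a group experiencing the event: events / group total."""
    return events / (events + non_events)

eer = event_rate(20, 80)  # Vitamin D (Group A): a / (a + b)
cer = event_rate(60, 40)  # No vitamin D (Group B): c / (c + d)

absolute_difference = cer - eer  # X - Y
relative_difference = cer / eer  # X / Y

print(f"EER={eer:.0%}  CER={cer:.0%}")
print(f"absolute difference={absolute_difference:.0%}  relative difference={relative_difference:.1f}")
```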
Relative risk
Pfeifer M, Begerow B, Minne HW, et al. Effects of a short-term vitamin D and calcium supplementation on body sway and secondary hyperparathyroidism in elderly women. J Bone Miner Res 2000;15:1113-8.
| | Vitamin D and calcium | Calcium alone | Risk ratio |
| % of people who fell | 16% | 28% | 0.57 |
Relative risk
| | Vitamin D and calcium | Calcium alone |
| % of people who fell | 16% | 28% |
Relative Risk (RR)
= Exposure Event Rate ÷ Control Event Rate
= 16% ÷ 28%
= 0.57 or 57%
Relative risk reduction
Relative Risk Reduction (RRR)
= 1 – Relative Risk
= 1 – 0.57
= 0.43 or 43%
| | Vitamin D and calcium | Calcium alone |
| % of people who fell | 16% | 28% |
Odds ratios
Odds are worked out differently to risks.
Odds = number of people who experience the outcome ÷ number of people who don’t experience the outcome
An odds ratio compares the odds of Group A experiencing an event with the odds of Group B experiencing the same event.
Odds ratios
So, using the same falls example:
If 11 out of 70 people fell in Group A, the odds of falling in that group are 11 ÷ 59 = 0.19
If 19 out of 67 people fell in Group B, the odds of falling in that group are 19 ÷ 48 = 0.40
Odds ratio = 0.19 ÷ 0.40 = 0.48
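The same odds ratio can be computed directly from the counts. A minimal sketch with hypothetical function names; note that dividing the unrounded odds gives roughly 0.47, slightly lower than the 0.48 obtained from the rounded odds on the slide.

```python
def odds(events, total):
    """Odds = people with the outcome / people without the outcome."""
    return events / (total - events)

odds_a = odds(11, 70)  # 11 / 59
odds_b = odds(19, 67)  # 19 / 48
odds_ratio = odds_a / odds_b

print(f"odds A={odds_a:.2f}  odds B={odds_b:.2f}  odds ratio={odds_ratio:.2f}")
```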
Odds ratio
OR is particularly useful because as an effect-size statistic, it gives clear and direct information to clinicians about which treatment approach has the best odds of benefiting the patient.
Also used in cross-sectional studies and case-control studies, where exposure or no exposure replaces treatment and control, and the outcome is presence or absence of disease.
Odds versus risk
• If 50 in every 100 children are boys then:
  – Risk of having a boy = 50/100 = 0.5
  – Odds of having a boy = 50/50 = 1
• If 1 in 100 patients suffers a side-effect then:
  – Risk of having a side-effect = 1/100 = 0.01
  – Odds of having a side-effect = 1/99 = 0.01
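Risk and odds are two ways of expressing the same information, so each can be converted into the other. The sketch below (function names are assumptions) shows why the two are almost identical for rare events but diverge for common ones.

```python
def risk_to_odds(risk):
    """Odds = risk / (1 - risk)."""
    return risk / (1 - risk)

def odds_to_risk(odds):
    """Risk = odds / (1 + odds)."""
    return odds / (1 + odds)

for risk in (0.01, 0.5):
    print(f"risk={risk}  odds={risk_to_odds(risk):.3f}")
```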
Odds versus risk
Risk can be stated as “6 people die out of every 10 who are exposed”
Risk describes how often an exposure leads to an outcome among everyone exposed, whereas odds compare two groups (those with the outcome and those without) and can be reversed
Odds can be stated as “for every 4 people who recover, 6 people do not” (or for 6 who don’t, 4 do).
Absolute risk reduction and Number needed to treat
Absolute Risk Reduction (ARR) or Risk Difference
= Control Event Rate (CER) – Experimental Event Rate (EER)
= 28% – 16% = 12%
Or
= Relative Risk Reduction (RRR) x Control Event Rate (CER)
RRR = 1 – (0.16 / 0.28) = 0.43
ARR = 0.43 x 0.28
= 0.12 (12%)
| | Vitamin D and calcium | Calcium alone |
| % of people who fell | 16% | 28% |
Number needed to treat
Absolute Risk Reduction: CER – EER
or
Absolute Risk Reduction: RRR x CER
Number Needed to Treat: 1 ÷ ARR (or 100 ÷ ARR, if ARR expressed as a percentage)
[Number of people to treat with an intervention to prevent one outcome]
Number needed to treat
Risk of falls when on vitamin D and calcium
| | Pfeifer, 2000 | Bischoff, 2003 | Prince, 2008 |
| Exposure Event Rate (EER) | 11/70 = 16% or 0.16 | 14/62 = 23% or 0.23 | 80/151 = 53% or 0.53 |
| Control Event Rate (CER) | 19/67 = 28% or 0.28 | 18/60 = 30% or 0.3 | 95/151 = 63% or 0.63 |
| Relative Risk (EER ÷ CER) | 57% or 0.57 | 77% or 0.77 | 84% or 0.84 |
| Relative Risk Reduction (1 – RR) | 43% or 0.43 | 23% or 0.23 | 16% or 0.16 |
| Absolute Risk Reduction (CER – EER) | 12% or 0.12 | 7% or 0.07 | 10% or 0.1 |
| Number Needed to Treat (1 ÷ ARR, or 100 ÷ ARR if ARR is expressed as a percentage) | 8 | 14 | 10 |
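The table rows can be reproduced from the rounded event rates. This is a sketch under two assumptions: it starts from the rounded EER and CER shown in the table (not the raw counts), and it rounds the NNT to the nearest whole number to match the table, whereas many authors round the NNT up. The function and dictionary names are illustrative.

```python
def summarise(eer, cer):
    """Return (RR, RRR, ARR, NNT) for an experimental and a control event rate."""
    rr = eer / cer         # relative risk
    rrr = 1 - rr           # relative risk reduction
    arr = cer - eer        # absolute risk reduction
    nnt = round(1 / arr)   # nearest whole number, as in the table
    return rr, rrr, arr, nnt

# Rounded (EER, CER) pairs taken from the table
trials = {
    "Pfeifer, 2000": (0.16, 0.28),
    "Bischoff, 2003": (0.23, 0.30),
    "Prince, 2008": (0.53, 0.63),
}

for name, (eer, cer) in trials.items():
    rr, rrr, arr, nnt = summarise(eer, cer)
    print(f"{name}: RR={rr:.2f}  RRR={rrr:.2f}  ARR={arr:.2f}  NNT={nnt}")
```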
Definitions
• Risk: the number of participants having the event in a group divided by the total number of participants
• Odds: the number of participants having the event divided by the number of participants not having the event
• Risk ratio (relative risk): the risk of the event in the intervention group divided by the risk of the event in the control group
• Odds ratio: the odds of the event in the intervention group divided by the odds of the event in the control group
• Risk difference: the absolute change in risk that is attributable to the experimental intervention
• Number needed to treat (NNT): the number of people you would have to treat with the experimental intervention (compared with the control) to prevent one event (in a specific time period).
(EER = Experimental Event Rate, CER = Control Event Rate)
Statistical Validity
“Validity – the extent to which a test measures what it is supposed to measure.” (Gosall 2009)
Statistical Validity
• The degree to which an observed result, such as a difference between two measurements, can be relied upon and not attributed to random error in sampling and measurement
• Sample Size – enough to detect true difference
• Power – ability to detect a true difference
• P – probability of results if null hypothesis is true
• CI – the degree of uncertainty around an estimate
• To calculate the sample size, you need to know:
  • The minimum clinically important difference
  • The frequency (prevalence) and spread of data we might expect - usually from previous studies
  • Type of study design (superiority, non-inferiority, equivalence)
  • Type of primary outcome (dichotomous/continuous)
• General aim is to achieve valid outcome with smallest possible sample, for cost and practicality
Sample Size
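To show how these inputs combine, here is a rough sketch of one common textbook approximation for comparing two proportions in a superiority design. The slides do not give a formula, so the formula, the function name and the defaults (two-sided alpha of 0.05, 80% power) are all assumptions for illustration; a real trial would use a statistician's calculation.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate n per group to detect p_control vs p_treatment (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    difference = p_control - p_treatment           # minimum important difference
    return math.ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)

# Illustration only: detecting a fall in event rate from 28% to 16%
print(sample_size_per_group(0.28, 0.16))
```

Halving the difference to be detected roughly quadruples the required sample, which is why the minimum clinically important difference matters so much.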
• The evidence: a statement on sample size calculation and the expected sample, and the proof in the results that this was achieved:
CLOTS Trial: Lancet. 2009 June 6; 373(9679): 1958–1965.
Sample Size – the evidence
• The p value is the probability of seeing a difference between control and experimental groups at least as large as the one observed if chance alone were at work (i.e. if the null hypothesis were true). P values range from 0 to 1: the closer to 0, the less compatible the result is with chance alone.
• p=0.001: a result this extreme would arise by chance about 1 time in 1000
  • Strong evidence
• p=0.05: a result this extreme would arise by chance about 1 time in 20
  • Weak evidence, within a whisker of non-significance
• p=0.5: a result this extreme would arise by chance about 1 time in 2
  • Still some indication of benefit?
• p=0.75: a result this extreme would arise by chance about 3 times in 4
  • No useful result?
Results where p is less than 0.05 are said to be “significant”. This is just an arbitrary threshold: when there is no real effect, about 1 in 20 studies will still produce a “significant” result by chance.
P (Probability) Value
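As an illustration of where a p value comes from, the sketch below applies a two-proportion z-test to the fall counts quoted earlier in the slides (11/70 vs 19/67). The choice of test is an assumption made for this example; the slides do not say which test the original trial used, and the function name is hypothetical.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value for a difference between two proportions (z-test)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)  # event rate if the groups do not differ
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))       # area in both tails of the normal curve

p = two_proportion_p_value(11, 70, 19, 67)
print(f"p = {p:.3f}")
```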
P Values – just a first step
From: http://theconversation.com/the-problem-with-p-values-how-significant-are-they-really-20029
• Type I error = concluding that a relationship exists between two variables when in fact there is none, i.e. rejecting the null hypothesis when it is actually true
• Using P<0.05 as the threshold limits the risk of a Type I error to 5%; it does not rule one out
Type I or α Error
Type II or β Error
• Type II error = concluding that a relationship doesn’t exist between two variables when in fact there is one, i.e. a high (non-significant) P value when the null hypothesis is false
• The risk of a Type II error is conventionally considered acceptable if Power is at least 80%
• Used in the same way as p values in assessing the effects of chance, but gives more information.
• Any result obtained in a sample of patients only gives an estimate of the result which would be obtained in the whole population.
• The real value will not be known, but the confidence interval shows the range within which the true figure is likely to lie.
• A 95% CI is calculated so that, if the study were repeated many times, about 95% of the intervals would contain the ‘true’ result; loosely, we can be 95% confident that the true result lies within the range specified. (Corresponds to a p value threshold of 0.05.)
• The larger the trial, the narrower the confidence interval, and therefore the more likely the result is to be definitive.
• If the CI includes the point of no effect (i.e. 0 for a difference, 1 for a ratio) it can mean either that there is no significant difference between the treatments and/or that the sample size was too small to allow us to be confident where the true result lies.
95% CI (Confidence Interval)
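A confidence interval for a risk difference can be sketched with the simple normal-approximation (Wald) method. The method, the function name and the use of the 11/70 vs 19/67 fall counts from earlier slides are assumptions for illustration; published trials often use other interval methods. The raw counts give a risk difference of about 13%, slightly above the 12% obtained from the rounded rates.

```python
import math

def risk_difference_ci(events_exp, n_exp, events_ctl, n_ctl, z=1.96):
    """Return (lower, estimate, upper) for CER - EER using a 95% Wald interval."""
    eer, cer = events_exp / n_exp, events_ctl / n_ctl
    arr = cer - eer
    se = math.sqrt(eer * (1 - eer) / n_exp + cer * (1 - cer) / n_ctl)
    return arr - z * se, arr, arr + z * se

low, arr, high = risk_difference_ci(11, 70, 19, 67)
print(f"risk difference = {arr:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Here the interval spans 0, the point of no effect for a difference, so a sample of this size cannot rule out chance.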
• Not about you recalculating statistics
• Not about you accessing raw research data
• Look for evidence in the study that potential errors have been considered and managed
• Achieving the sample, good power, adequate P & CI values are just an indication that SOME errors have been avoided.
• P-value cannot compensate for systematic error (bias) in a trial. If the bias is large, the p-value is likely invalid and irrelevant.
It’s all about evidence...
Reading Statistical Diagrams
Forest Plots, Survival Curves, Life Expectancy Curves and ROC Curves
Reading Statistical Diagrams
• You’re appraising, not recalculating
• First test significance, then what or how much
• Read the words & numbers, not just the pictures!
Use of weaning protocols for reducing duration of mechanical ventilation in critically ill adult patients: Cochrane systematic review and meta-analysis
BMJ 2011;342:c7237
Odds Ratio Diagram – Forest Plot
Not all Forest Plots are Odds Ratios
BMJ 2011;342:c7237
Heterogeneity
• Χ² = variation in results above that expected by chance; relates to DF (n of studies - 1): a value close to DF is ‘perfect’, a much higher value suggests heterogeneity
• Low P value for Χ² may indicate heterogeneity
• If Z Statistic > 2.2, then heterogeneity is present; Z should have an associated P value
Occurs where the results of different studies vary from each other more than might be expected by chance. Visually, on a Forest Plot, where the CI lines do NOT overlap. Significant heterogeneity would rule out meta-analysis; alternatives would include sub-group or sensitivity analysis.
Line width shows the CI, box size reflects the size of the group
3 sub-group analyses, each pair adds up to ‘All’ figures
Summary diamond shows overall total
Line of zero effect or unity
Effectiveness of thigh-length graduated compression stockings to reduce the risk of deep vein thrombosis after stroke (CLOTS trial 1): a multicentre, randomised controlled trial
CLOTS Trial: Lancet. 2009 June 6; 373(9679): 1958–1965.
Odds Ratio Diagram – Forest Plot
• The survival curve is a graphical display of the Kaplan-Meier estimate of the probability that an event has not yet occurred at each point in time
• Does not presume normal distribution
• Log Rank test compares rates in 2 groups
• Measures time to an event following treatment (‘survival’), but the event may be non-mortality (revision of arthroscopy, time in remission before relapse) or positive (pregnancy, discharge)
• If sample large enough, the estimate approaches the true survival function for the population
• Allows inclusion of patients starting & leaving studies at different time intervals
Survival Curves
• Dropouts or deaths NOT due to the target cause: lost to follow-up, withdrawal from study
• Marked on curve but doesn’t affect analysis
• Assumes loss to follow-up is independent of prognosis
• For each event the survival curve drops; for each censored observation the denominator changes but the plot stays level, marked by ticks
Censored Data
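The way censoring changes the denominator without moving the curve can be seen in a small Kaplan-Meier calculation. This is a minimal sketch on made-up follow-up data (the times, the event flags and the function name are all hypothetical); real analyses would use a dedicated survival package.

```python
from collections import Counter

def kaplan_meier(times, observed):
    """Return [(time, survival probability)] at each time where an event occurred.

    observed[i] is True for an event and False for a censored observation.
    """
    events = Counter(t for t, seen in zip(times, observed) if seen)
    curve, survival = [], 1.0
    for t in sorted(events):
        at_risk = sum(1 for u in times if u >= t)  # still under follow-up just before t
        survival *= 1 - events[t] / at_risk        # the curve drops only at event times
        curve.append((t, survival))
    return curve

# Hypothetical follow-up times (days); False = censored (lost to follow-up or withdrawn)
times = [2, 3, 3, 5, 8, 8, 9, 12]
observed = [True, True, False, True, False, True, True, False]

for t, s in kaplan_meier(times, observed):
    print(f"day {t}: estimated survival {s:.3f}")
```

Censored patients never cause a drop themselves, but they leave the at-risk group, so later events produce larger steps.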
Gijbels Irène. Censored data. WIREs Comp Stat 2010, 2: 178-188
A gap in the horizontal direction = “the median (50%) survival time is much larger (about 200 days larger) in the patients without cachexia”.
A gap in the vertical direction = “at 500 days, the probability of survival is about 45% in the patients without cachexia and only 25% in the patients with cachexia”.
Comparing/Describing Survival
• Vertical axis = estimated probability of survival for a hypothetical cohort, not actual % surviving.
• Precision depends on the number of observations: estimates at left-hand side are more precise than right-hand side (because of smaller numbers due to deaths and dropouts).
• Curves may give the impression that a given event occurs more frequently early than late, because of high survival rate and large number of people at beginning.
• Rule of thumb is to truncate the x axis at the point where you only have 10 survivors, or 10% of the original cohort, whichever is higher, as reliability of curve diminishes as population survival reduces
Survival Curves
Cumulative morbidity plots are often better than survival plots when overall survival is high
• Hazard is a measure of how rapidly the event occurs. The hazard ratio compares the hazards in two groups.
• If a hazard ratio is, say, 4.17, the estimated relative risk of the event in group 2 is 4.17 times that in group 1.
• The hazard ratio is significant if the confidence interval does not include the value 1.
• Note: calculating the hazard ratio assumes the ratio is consistent over time - if the survival curves cross, the hazard ratio should be ignored.
Hazard Ratios
Relative survival for Merkel cell carcinoma by extent of disease at time of diagnosis. Percent relative survival was calculated for cases in the National Cancer Database using age- and sex-matched control data from the Centers for Disease Control and Prevention
http://hematology.wustl.edu/conferences/presentations/Rokkam20091211.ppt
Survival Curves
Studenski S, Perera S, Patel K, Rosano C, Faulkner K, Inzitari M, et al. Gait speed and survival in older adults. JAMA 2011;305:50-8.
Life Expectancy Curve
Diagnostic Test Study Statistics
How good is the screening/diagnostic test at predicting/confirming the outcome of the Gold Standard test?
Test & Disease probability
[Figure: probability-of-disease scale running from 0% chance of disease to 100% chance of disease, divided by the test-discharge threshold and the test-treatment threshold into a discharge zone (GS -ve), a zone of uncertainty and a treatment zone (GS +ve)]
Before doing the test, the probability of disease (pre-test probability) is in the zone of uncertainty.
After doing the test, we want the probability of disease (post-test probability) to be in either the discharge zone or the treatment zone.
Key Screening Questions
• Is the test useful?
  • Was it researched in a population relevant to the individual or population in which it will be used?
• Is the test reliable?
  • Can it be repeated and the effects reproduced using the same or different observers?
• Is the test valid?
  • Does it measure what it sets out to measure and is the result true, when compared with the gold standard?
Biases to avoid – or identify
• Spectrum bias
  • Tested on ‘healthy’ as well as ‘ill’ subjects
• Verification/Ascertainment bias
  • ALL patients get BOTH tests
• Review bias
  • Proper blinding to avoid influencing test results
• Lead time bias
  • Earlier test without change in outcome
Lead time bias
http://en.wikipedia.org/wiki/Lead_time_bias
Where an earlier test implies longer survival but there is actually no difference in clinical outcome, so what seems like an effective early test (breast screening, genetic test for Huntington’s) brings no real benefit, and may cause harm (anxiety etc).
Gold Standard
For the purposes of testing the screening test, the Gold Standard test is assumed to have 100% accuracy.
It reports the prevalence of the condition, or the baseline risk, or the ACTUAL rate of the condition in the study group.
Generally speaking, it is the ‘definitive’ test – a sputum culture for TB (vs. Blood or breath test), a blood test for diabetes (vs. dipstick), biopsy for breast cancer (vs. Mammogram)
Sensitivity
| | Gold Standard: Disease | Gold Standard: No Disease |
| Test Result: Positive | TP | FP |
| Test Result: Negative | FN | TN |
Sensitivity = TP / (TP + FN)
Sensitivity: The capacity of the test to correctly identify diseased individuals in a population; “TRUE POSITIVES”.
Specificity
| | Gold Standard: Disease | Gold Standard: No Disease |
| Test Result: Positive | TP | FP |
| Test Result: Negative | FN | TN |
Specificity = TN / (FP + TN)
Specificity: The capacity of the test to correctly exclude individuals who are free of the disease; “TRUE NEGATIVES”.
Example
| | Gold Standard: Disease | Gold Standard: No Disease | Total |
| Test Result: Positive | 75 | 20 | 95 |
| Test Result: Negative | 25 | 180 | 205 |
| Total | 100 | 200 | 300 |
Sensitivity = 75/100 = 75%
Specificity = 180/200 = 90%
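The example figures can be checked directly from the four cells of the 2x2 table. A minimal sketch; the variable names are illustrative, and the accuracy line applies the (a+d)/(a+b+c+d) formula from the following slide to the same counts.

```python
# Cells of the example 2x2 table (test result vs gold standard)
tp, fp = 75, 20   # test positive: with disease, without disease
fn, tn = 25, 180  # test negative: with disease, without disease

sensitivity = tp / (tp + fn)                # diseased people correctly identified
specificity = tn / (fp + tn)                # disease-free people correctly excluded
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct results / everyone tested

print(f"sensitivity={sensitivity:.0%}  specificity={specificity:.0%}  accuracy={accuracy:.0%}")
```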
Accuracy of the test
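Accuracy = (a + d) / (a + b + c + d)
| | Gold Standard: Disease | Gold Standard: No Disease | Total |
| Test Result: Positive | a | b | a + b |
| Test Result: Negative | c | d | c + d |
| Total | a + c | b + d | |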
Likelihood ratios
• Reflects the degree of confidence that a person who scores in the positive range does have the disorder, or in the negative range does not have the disorder
• LR+ = sensitivity / (1 - specificity)
• LR- = (1 - sensitivity) / specificity
• The higher the LR+, the more useful the indicator for identifying people with the disorder
• The lower the LR- (the closer to 0), the more useful the indicator for identifying people without the disorder
Worked example:
Prevalence of 30%, Sensitivity of 50%, Specificity of 90%
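[Figure: 100 people split into 30 disease +ve and 70 disease -ve, showing who tests +ve]
Sensitivity = 50%: 15 of the 30 people with the disease test positive
False positive rate = 10%: 7 of the 70 people without the disease test positive
22 people test positive, of whom 15 have the disease
So, the chance of disease after a positive test is 15/22, about 70%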
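The same worked example can be written as a short calculation, which makes it easy to see how the answer changes with prevalence. The function name is an assumption for this sketch.

```python
def probability_of_disease_if_positive(prevalence, sensitivity, specificity):
    """Share of positive tests that are true positives."""
    true_positives = prevalence * sensitivity               # 30% x 50% = 15 per 100
    false_positives = (1 - prevalence) * (1 - specificity)  # 70% x 10% = 7 per 100
    return true_positives / (true_positives + false_positives)

ppv = probability_of_disease_if_positive(0.30, 0.50, 0.90)
print(f"chance of disease after a positive test: {ppv:.0%}")
```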
Positive Predictive Value
| | Gold Standard: Disease | Gold Standard: No Disease |
| Test Result: Positive | TP | FP |
| Test Result: Negative | FN | TN |
PPV = TP / (TP + FP)
PPV: the probability of the disease being present, among those with positive diagnostic test results
Negative Predictive Value
| | Gold Standard: Disease | Gold Standard: No Disease |
| Test Result: Positive | TP | FP |
| Test Result: Negative | FN | TN |
NPV = TN / (TN + FN)
NPV: the probability of the disease being absent, among those with negative diagnostic test results
Example
5000 pregnant women underwent a test for blood glucose at 24 weeks, following a glucose load. 243 women were found to have a blood glucose greater than 6.8 mmol/L and were referred for an OGTT. 186 were found to have gestational diabetes. Four women who initially had tested negative were diagnosed as having diabetes later in their pregnancy.
The 2x2 Table
Diabetes No diabetes Total
Positive 186 57 243
Negative 4 4753 4757
Total 190 4810 5000
Diagnostic calculator: http://ktclearinghouse.ca/cebm/toolbox/statscalc
The Sums
| Measure | Calculation | Result |
| Prevalence | 190/5000 | 3.8% |
| Sensitivity | 186/190 | 97.9% |
| Specificity | 4753/4810 | 98.8% |
| Positive predictive value | 186/243 | 76.5% |
| Negative predictive value | 4753/4757 | 99.9% |
| Likelihood ratio + test | (186/190)/(57/4810) | 82.6 |
| Likelihood ratio - test | (4/190)/(4753/4810) | 0.02 |
| Accuracy | (186+4753)/5000 | 98.8% |
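All of these sums follow from the four cells of the gestational diabetes 2x2 table. A minimal sketch, with illustrative variable names, that reproduces each figure:

```python
# Cells of the gestational diabetes 2x2 table
tp, fp = 186, 57   # screen positive: diabetes, no diabetes
fn, tn = 4, 4753   # screen negative: diabetes, no diabetes
total = tp + fp + fn + tn

prevalence = (tp + fn) / total
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity
accuracy = (tp + tn) / total

print(f"prevalence={prevalence:.1%}  sensitivity={sensitivity:.1%}  specificity={specificity:.1%}")
print(f"PPV={ppv:.1%}  NPV={npv:.1%}  accuracy={accuracy:.1%}")
print(f"LR+={lr_positive:.1f}  LR-={lr_negative:.2f}")
```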
The Fagan Nomogram:
If you know 2 of the 3 elements, then you can calculate the third, and see the results of changes
i.e. for a known prevalence, you can adjust the likelihood ratio to see how it affects the post-test probability
Prev = 3.8% (0.038)
LR+ = 82.6
LR- = 0.02
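The nomogram is a graphical shortcut for a calculation on odds: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. A sketch using the prevalence and likelihood ratios above (the function name is hypothetical):

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Pre-test probability -> odds -> multiply by LR -> post-test probability."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

prevalence = 0.038
after_positive = post_test_probability(prevalence, 82.6)
after_negative = post_test_probability(prevalence, 0.02)

print(f"after a positive test: {after_positive:.1%}")
print(f"after a negative test: {after_negative:.2%}")
```

The probability after a positive test matches the positive predictive value (76.5%) from the same table, as it should when the pre-test probability is the study prevalence.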
ROC Curves:Breath Test for Biomarkers of TB
Sensitivity: 71/2%
Specificity: 72%
Accuracy: 80%
Prevalence: 5%
• Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
• The area under the ROC curve is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).
• Not just ‘diagnosis’ but also ‘prediction’
ROC Curves
• Represents the trade-off between sensitivity (the true positive rate) and the false positive rate (1 - specificity) at every possible decision threshold.
• If the ROC curve rises rapidly towards the upper left-hand corner of the graph, or if the value of area under the curve is large, we can say the test performs well.
• Area = 1.0 = an ideal test, because it achieves both 100% sensitivity and 100% specificity (i.e. the curve hits the top left corner, where both are 100%). Area = 0.5 = ‘bad test’, as it discriminates no better than chance.
ROC Curves
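To make the construction concrete, the sketch below builds ROC points and the area under the curve from a handful of made-up test scores. The scores, labels and function names are all hypothetical, and the area is computed with the simple trapezoidal rule.

```python
def roc_points(scores, labels):
    """(false positive rate, sensitivity) at each threshold, strictest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical test scores; label 1 = diseased, 0 = normal
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(f"AUC = {area_under_curve(points):.2f}")
```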
• LR is the likelihood that a given test result would be expected in a patient with the target disorder compared to the likelihood that that same result would be expected in a patient without the target disorder.
• More useful than sensitivity/specificity:
  • less likely to change with prevalence of disorder
  • can calculate for several levels of symptom/sign/test
  • can be used to combine results of multiple tests
  • can be used to calculate post-test probability
Likelihood Ratios
• A good test should have a LR+ of at least 2.0 and a LR- of 0.5 or less. This would correspond to an AUC of roughly 0.75. A better test would have likelihood ratios of 5 and 0.2, respectively, and this corresponds to an AUC of around 0.92.
• 0.50 to 0.75 = fair
• 0.75 to 0.92 = good
• 0.92 to 0.97 = very good
• 0.97 to 1.00 = excellent
ROC Curves & Likelihood ratios
Clinical interpretation: “maximum proportional reduction in expected regret”
Measures the optimal cut-off point, the ‘best’ trade-off between sensitivity and specificity
Calculated as sensitivity + specificity - 1. For a test to be useful, sensitivity + specificity must be greater than 1 (Youden Index > 0)
Statistics in Medicine 1996; 15: 969–86.
Youden Index
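Choosing a cut-off with the Youden Index can be sketched as follows. The three candidate cut-offs and their sensitivity/specificity pairs are made up for illustration, and the function name is an assumption.

```python
def youden_index(sensitivity, specificity):
    """J = sensitivity + specificity - 1; 0 means no better than chance, 1 is perfect."""
    return sensitivity + specificity - 1

# Hypothetical (sensitivity, specificity) pairs at three candidate cut-offs
candidates = {
    "cut-off A": (0.95, 0.60),
    "cut-off B": (0.75, 0.90),
    "cut-off C": (0.50, 0.98),
}

best = max(candidates, key=lambda name: youden_index(*candidates[name]))
for name, (sens, spec) in candidates.items():
    print(f"{name}: J = {youden_index(sens, spec):.2f}")
print(f"best trade-off: {best}")
```

The index weights sensitivity and specificity equally; where a missed diagnosis is much worse than a false alarm (or vice versa), a different cut-off may be preferable.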