grade toni tan, centre for clinical practice. grade the grading of recommendations assessment,...

GRADE

Toni Tan, Centre for Clinical Practice

GRADE

The Grading of Recommendations Assessment, Development and Evaluation

GRADE

“A systematic and explicit approach to making judgements about the quality of evidence, and the strength of recommendations can help to prevent errors, facilitate critical appraisal of these judgements, and can help to improve communication of this information.”

Organisations that have adopted GRADE methodology

• Agency for Healthcare Research and Quality (USA)
• Agenzia Sanitaria Regionale (Italy)
• American College of Chest Physicians (USA)
• American College of Physicians (USA)
• American Thoracic Society (USA)
• Ärztliches Zentrum für Qualität in der Medizin (Germany)
• British Medical Journal (United Kingdom)
• BMJ Clinical Evidence (United Kingdom)
• COMPUS at The Canadian Agency for Drugs and Technologies in Health (Canada)
• The Cochrane Collaboration (International)
• EBM Guidelines (Finland/International)
• The Endocrine Society (USA)
• European Respiratory Society (Europe)
• European Society of Thoracic Surgeons (International)
• Evidence-based Nursing Südtirol (Italy)
• German Center for Evidence-based Nursing “sapere aude” (Germany)
• Infectious Diseases Society of America (USA)
• Japanese Society for Temporomandibular Joint (Japan)
• Journal of Infection in Developing Countries (International)
• Kidney Disease: Improving Global Outcome (International)
• Ministry for Health and Long-Term Care, Ontario (Canada)
• National Board of Health and Welfare (Sweden)
• National Institute for Health and Care Excellence (United Kingdom)
• Norwegian Knowledge Centre for the Health Services (Norway)
• Polish Institute for EBM (Poland)
• SIGN (UK, Scotland)
• Society for Critical Care Medicine (USA)
• Society for Vascular Surgery (USA)
• Spanish Society for Family and Community Medicine (Spain)
• Surviving Sepsis Campaign (International)
• University of Pennsylvania Health System Center for Evidence-Based Practice (USA)
• UpToDate (USA)
• World Health Organization (International)

‘Traditional’ approach

Checklist system

• Selection bias: randomisation, concealment of allocation, comparable at baseline

• Performance bias: blinding (patients & care providers), the comparison groups received the same care apart from the intervention studied.

• Attrition bias: systematic differences between the comparison groups with respect to participants lost

• Detection bias: appropriate length of follow-up, definition of outcome, blinding (investigators)

++ All or most of the criteria have been fulfilled. Where they have not been fulfilled the conclusions of the study or review are thought very unlikely to alter.

+ Some of the criteria have been fulfilled. Those criteria that have not been fulfilled or not adequately described are thought unlikely to alter the conclusions.

− Few or no criteria fulfilled. The conclusions of the study are thought likely or very likely to alter.

‘Traditional’ approach

Narrative summary

For example, AIP guideline

Mortality rates

One cluster RCT from the UK investigated the effectiveness of CCOT on hospital mortality using PAR score … found a significant reduction in hospital mortality in patients in the intervention wards at cluster level (OR = 0.523, 95% CI 0.322 to 0.849). The cluster RCT from Australia found no difference in unexpected death (without do-not-resuscitate order) (secondary outcome) between control group and intervention group (per 1000 admissions: control = 1.18, intervention = 1.06, difference = −0.093, 95% CI: −0.423 to 0.237; adjusted p = 0.752, adjusted OR = 1.03, 95% CI 0.84 to 1.28).

Evidence statement: (1+) There were conflicting findings in the two included studies on mortality rates: the Priestley and coworkers study found a significant reduction in mortality (but failed to report do-not-resuscitate orders), but MERIT found no difference between the two arms of the study for this outcome.

GRADE

• Interventional studies of effectiveness

• Currently in development for diagnostic accuracy studies, prognostic and qualitative studies

• Makes sequential appraisal about:

– The quality of evidence across studies for each critical/important outcome (instead of individual study)

– Which outcomes are critical to a decision

– The overall quality of evidence across these critical outcomes

– The balance between benefits and harms

• Result is an assessment of:

– quality of the evidence for an outcome

– strength of the recommendations

• Perspective of guideline developers

GRADE profile

| Outcome | No of studies | Design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No of patients: SMBG | No of patients: SMUG | Relative effect (95% CI) | Absolute effect | Quality | Importance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Change in HbA1c (%; better indicated by lower values) | 3 (Allen 1990, Lu 2011, Fontbonne 1989*) | RCTs | S1 | N | N | S2 | none | 61 | 63 | - | MD 0.15 higher (0.37 lower to 0.67 higher). See figure 14 | Low | CRITICAL |
| Change in fasting blood glucose (FBG mmol/L; better indicated by lower values) | 2 (Allen 1990, Lu 2011) | RCTs | N | N | N | S2 | none | 61 | 63 | - | MD 0.35 lower (1.45 lower to 0.74 higher). See figure 15 | Moderate | CRITICAL |
| Change in weight (kg; better indicated by lower values) | 1 (Allen 1990) | RCT | S1 | NA | N | S2 | none | 27 | 27 | - | MD 2 higher (0.3 to 3.7 higher) | Low | IMPORTANT |

1 Downgrade by one level: studies conducted before 1995 when the management of diabetes and other related conditions may have differed compared with current practice.

2 Downgrade by one level: the 95% confidence interval passes through the minimal important difference (MID), which is 0.5% for change in HbA1c levels, 1 mmol/L for fasting blood glucose, 1 mmol/L for postprandial blood glucose, 5 kg for body weight, 3 BMI points and 3 cm for waist circumference. For all other outcomes a change of 0.5 for continuous outcomes or a relative risk reduction or increase of 25% or more for binary outcomes was considered clinically important.

Why do we use GRADE in NICE clinical guidelines?

• Concerns about the sometimes inappropriate direct link between study design and recommendation strength

• Anecdotal evidence that recommendations not based on evidence from trials were being ignored

• WHO evaluation of the NICE clinical guidelines programme

• Just being explicit about what we had been doing anyway!

How GRADE works?

• Formulate question (PICO)

• Select outcomes

• Rate importance of each outcome: critical, important or not important

• Create evidence profile with GRADEpro

– Outcomes across studies

– Summary of findings & estimate of effect for each outcome

• Rate quality of evidence for each outcome

– RCT start high, obs. data start low

– Grade down: 1. Risk of bias 2. Inconsistency 3. Indirectness 4. Imprecision 5. Other considerations

– Grade up: 1. Large effect 2. Dose response 3. Confounders

– Result: High, Moderate, Low or Very low

• Present evidence profile(s) to GDG

• Develop recommendations:

– For or against (direction)

– Strong or weak (strength)

• By considering:

– Relative value of different outcomes

– Quality of evidence

– Trade off - benefits/harms

– Health economics

– Other considerations

• Recommendation wording: “Offer xyz…”, “Consider xyz…”, “Do not use xyz…”

GRADE concept of quality of evidence

– The quality of evidence reflects the extent to which our confidence (certainty) in an estimate of the effect is adequate to support a particular recommendation.

– Guideline panels must make judgements about the quality of evidence relative to the specific context for which they are using the evidence.

How is this achieved?

• Transparent framework to consider confidence (certainty) of an effect estimate through assessing

o Systematic errors (bias)

o Chance errors (random errors)

• Using criteria

– Systematic errors (bias)

o Limitations, Indirectness, Inconsistency

– Chance errors (random error)

o Imprecision

o Other considerations (any other factors)

GRADE Definitions

High: Further research is very unlikely to change our confidence in the estimate of effect.

Moderate: Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.

Low: Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.

Very low: Any estimate of effect is very uncertain.

GRADE diagram

Grading quality of evidence

What is the methodology of the best available evidence?

• RCT: assume “high quality”

– Factors lowering the quality present?

o If NO: stay “high quality”

o If YES: downgrades, become “moderate, low or very low quality”

• Observational study: assume “low quality”

– Factors lowering the quality present?

o If YES: downgrade to “very low quality”

o If NO: factors increasing the quality?

– Factors increasing the quality?

o If YES: upgrades, become “moderate or high quality”

o If NO: stay “low quality”

• Uncontrolled studies: “very low quality”

Determining the quality of evidence

• Limitations

• Inconsistent results

• Indirectness

• Imprecision

• Other considerations

– Large or very large effect

– Plausible biases underestimate true effect

– Dose-response gradient

– All of above can upgrade 1 level (2 for large magnitude of effect)
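The rating logic on these slides can be sketched in a few lines of Python. This is an illustration only: the function name `rate_quality` and its parameters are hypothetical, and the branching follows the GRADE diagram slide as read here (RCTs start high and are only downgraded; observational studies start low, fall to very low if any downgrading factor is present, and are otherwise eligible for upgrading). It is not an official GRADE algorithm, and real ratings remain a matter of judgement.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def rate_quality(design, downgrades=0, upgrades=0):
    """Return a quality level for one outcome (hypothetical helper).

    design:     "rct", "observational" or "uncontrolled"
    downgrades: total levels lost across risk of bias, inconsistency,
                indirectness, imprecision and other considerations
    upgrades:   total levels gained for a large effect, a dose-response
                gradient or plausible biases underestimating the effect
    """
    if design == "rct":
        # RCTs start at "high" and can only move down.
        index = max(0, 3 - downgrades)
    elif design == "observational":
        if downgrades > 0:
            # Any downgrading factor takes observational data to "very low".
            index = 0
        else:
            # Otherwise start at "low" and apply upgrades, capped at "high".
            index = min(3, 1 + upgrades)
    elif design == "uncontrolled":
        index = 0
    else:
        raise ValueError(f"unknown design: {design!r}")
    return LEVELS[index]


# HbA1c outcome in the SMBG profile: RCTs, downgraded once for risk of
# bias and once for imprecision.
print(rate_quality("rct", downgrades=2))
```

With two downgrades the sketch returns "low", which matches the rating shown for that outcome in the GRADE profile.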

Limitations or ‘risk of bias’ - RCTs

| Limitation | Explanation |
| --- | --- |
| Lack of allocation concealment | Those enrolling patients are aware of the group to which the next enrolled patient will be allocated (major problem in “pseudo” or “quasi” randomized trials with allocation by day of week, birth date, chart number etc.) |
| Lack of blinding | Patient, caregivers, those recording outcomes, those adjudicating outcomes, or data analysts are aware of the arm to which patients are allocated |
| Incomplete accounting of patients and outcome events | Loss to follow-up and failure to adhere to the intention to treat principle when indicated |
| Selective outcome reporting | Reporting of some outcomes and not others on the basis of the results |
| Other limitations | For example: stopping early for benefit observed in randomized trials, in particular in the absence of adequate stopping rules; use of unvalidated patient-reported outcomes; carry-over effects in cross-over trials; recruitment bias in cluster-randomized trials |

Risk of bias – observational studies

| Limitation | Explanation |
| --- | --- |
| Failure to develop and apply appropriate eligibility criteria (inclusion of control population) | Under- or over-matching in case-control studies; selection of exposed and unexposed in cohort studies from different populations |
| Flawed measurement of both exposure and outcome | Differences in measurement of exposure (e.g. recall bias in case-control studies); differential surveillance for outcome in exposed and unexposed in cohort studies |
| Failure to adequately control confounding | Failure of accurate measurement of all known prognostic factors; failure to match for prognostic factors and/or adjustment in statistical analysis |
| Incomplete or inadequately short follow-up | |

Inconsistency

• When heterogeneity exists, but no plausible explanation is identified, the quality of evidence should be downgraded by one or two levels, depending on the magnitude of the inconsistency in the results.

• Inconsistency may arise from differences in:

– populations (e.g. drugs may have larger relative effects in sicker populations)

– interventions (e.g. larger effects with higher drug doses)

– outcomes (e.g. diminishing treatment effect with time).

• Account for this where possible

Indirectness

| Indirect | Question of interest | Source of indirectness |
| --- | --- | --- |
| Comparison | Relative effectiveness of alendronate and risedronate in osteoporosis. | Randomized trials compared alendronate to placebo and risedronate to placebo, but trials comparing alendronate to risedronate are unavailable. |
| Population | Oseltamivir for prophylaxis of avian flu caused by influenza A (H5N1) virus. | Randomized trials of oseltamivir are available for patients with seasonal influenza, but not for avian influenza. |
| Intervention | Sigmoidoscopic screening for prevention of colon cancer mortality. | Randomized trials of fecal occult blood screening provide indirect evidence bearing on the potential effectiveness of sigmoidoscopy. |
| Comparator | Choice of medication for schizophrenia. | A series of trials comparing newer generation neuroleptic agents to fixed doses of 20 mg of haloperidol provide indirect evidence of how the newer agents would compare to the lower, flexible doses of haloperidol clinicians typically use. |

Indirectness

| Condition | Patient-important outcome(s) | Surrogate outcome(s) |
| --- | --- | --- |
| Diabetes | Diabetic symptoms, admission, complications (cardiovascular, eye, renal, neuropathic etc.) | Glucose, HbA1C |
| Dementia | Patient function, behaviour, caregiver burden | Cognitive function |
| Osteoporosis | Fractures | Bone density |
| ARDS | Mortality | Oxygenation |
| End-stage renal disease | Quality of life, mortality | Hemoglobin |
| Venous thrombosis | Symptomatic venous thrombosis | Asymptomatic venous thrombosis |
| Chronic respiratory disease | Quality of life, exacerbations, mortality | Pulmonary function, exercise capacity |
| Cardiovascular disease/risk | Vascular events, mortality | Serum lipids |

Imprecision

• Our estimates of the population value are uncertain/imprecise because we use samples

• GRADE extended the term uncertainty in the context of whether the effect estimate reaches the ‘clinical minimal important difference’ (MID)

Example of MID:

Drug X compared to placebo to reduce severe migraine.

Pain on migraine: measured on a 10-point scale

Mean baseline = 9.5; Mean reduction from baseline = 1.7 (95%CI: 1.2 to 2.3)

But a survey of migraine patients found that a pain reduction of less than 3 points is meaningless because it does not improve their overall QoL and daily function.
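The comparison in this example can be written down as a small check. The sketch below is illustrative: the function name `compare_with_mid` and the wording of its three return values are assumptions made here, not GRADE terminology. It assumes that larger values mean more benefit (here, points of pain reduction).

```python
def compare_with_mid(lower, upper, mid):
    """Compare a 95% CI for a beneficial effect with the MID.

    lower, upper: confidence limits, where larger values mean more benefit
    mid:          minimal important difference on the same scale
    """
    if lower >= mid:
        # The whole interval is at or beyond the MID.
        return "clinically important and precise"
    if upper < mid:
        # The whole interval falls short of the MID.
        return "not clinically important and precise"
    # The interval contains the MID, so either conclusion is possible.
    return "imprecise: interval crosses the MID"


# Drug X: mean reduction 1.7 (95% CI 1.2 to 2.3), MID = 3 points.
result = compare_with_mid(1.2, 2.3, mid=3.0)
print(result)
```

For the migraine data the whole interval lies below the 3-point MID, so the estimate is precise but the effect is not one that patients would value.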

Confidence intervals - summary

• Easiest way to approach effect of random error on evidence quality

• In frequentist approach, 95% CI represents

– A range constructed so that in repeated experiments 95% would include the population value

– Usually interpreted as p=0.95 that the population value is in the CI

Confidence interval width

• Wide confidence intervals imply uncertainty over whether our observed effect is close to or far away from the real effect

• Examples

– An RCT of supervised exercise for patellofemoral pain

– Self reported recovery at 12 months

– T: 9/500 vs SC: 2/500 RR=4.50 (1.00 to 20.77)

– We’d probably agree that’s imprecise

– An RCT of drug A for patellofemoral pain

– Self reported recovery at 12 months

– T: 350/500 vs PC: 150/500 RR=2.33 (2.20 to 2.72)

– We’d probably agree that’s precise
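To see where intervals like these come from, here is a minimal sketch of a risk ratio with a 95% CI calculated on the log scale. The slides do not say which method produced their intervals, so this formula is an assumption. It gives limits of roughly 0.98 to 20.7 for the exercise trial and 2.02 to 2.70 for the drug trial, close to but not identical with the figures quoted. The function name `risk_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def risk_ratio_ci(events_t, n_t, events_c, n_c, level=0.95):
    """Risk ratio with a confidence interval from the log-scale standard error."""
    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of ln(RR) for two independent binomial samples.
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Supervised exercise: 9/500 vs 2/500 (few events, wide interval).
exercise = risk_ratio_ci(9, 500, 2, 500)
# Drug A: 350/500 vs 150/500 (many events, narrow interval).
drug_a = risk_ratio_ci(350, 500, 150, 500)

for name, (rr, lo, hi) in [("exercise", exercise), ("drug A", drug_a)]:
    print(f"{name}: RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Both trials enrol 1000 people, yet the first has 11 events and the second 500. The sketch shows that the number of events, not the number of participants alone, drives the width of the interval.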

What affects imprecision?

• Having larger samples, but particularly where there is more ‘information’

– Complex relationship between sample size, numbers of events

• Easiest to play with an example

| Control event rate | Treatment event rate | RR, % | RRR, % | Calculated 95% CI |
| --- | --- | --- | --- | --- |
| 2/4 | 1/4 | 50 | 50 | -174 to 92 |
| 10/20 | 5/20 | 50 | 50 | -14 to 79.5 |
| 20/40 | 10/40 | 50 | 50 | 9.5 to 73.4 |
| 50/100 | 25/100 | 50 | 50 | 26.8 to 66.4 |
| 500/1000 | 250/1000 | 50 | 50 | 43.5 to 55.9 |
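The pattern in the table can be reproduced with a short script. This is a sketch under an assumption: it uses the usual log risk ratio approximation with z = 1.96, whereas the slide does not state how its intervals were calculated. The results agree closely for the larger samples (about 43.4 to 55.8 for 500/1000 vs 250/1000) and diverge for the smallest ones, where any approximation is unreliable. The function name `rrr_ci` is hypothetical.

```python
import math


def rrr_ci(events_c, n_c, events_t, n_t, z=1.96):
    """95% CI for the relative risk reduction (%), via the log risk ratio."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    rr_low = math.exp(math.log(rr) - z * se)
    rr_high = math.exp(math.log(rr) + z * se)
    # RRR = 1 - RR, so the upper RR limit gives the lower RRR limit.
    return 100 * (1 - rr_high), 100 * (1 - rr_low)


# (control events, control n, treatment events, treatment n) per table row.
rows = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
        (50, 100, 25, 100), (500, 1000, 250, 1000)]

widths = []
for events_c, n_c, events_t, n_t in rows:
    low, high = rrr_ci(events_c, n_c, events_t, n_t)
    widths.append(high - low)
    print(f"{events_c}/{n_c} vs {events_t}/{n_t}: "
          f"RRR 50%, 95% CI {low:.1f} to {high:.1f}")
```

The point estimate is 50% in every row. Only the interval changes, and it narrows as the number of events grows.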

Remember CIs can mislead

• True values will be outside a 95%CI 5/100 times

• CI based on small numbers of events are unstable

• Early trials tend to be more positive

• Trials stopped early likely to be biased

• So, if you have small trials with a positive effect and apparently narrow CI, be sceptical

• It would be helpful to have an objective idea of when we have ‘enough’ information

[Figure 4: Optimal information size given alpha of 0.05 and beta of 0.2 for varying control event rates and RRR of 20%, 25% and 30%. x-axis: control group event rate; y-axis: total sample size required; one line each for RRR=30%, RRR=25% and RRR=20%. For any chosen line, evidence meets optimal information size criterion if above the line.]

Optimal information size (OIS)

• We want at least as many observations in a trial as we would calculate in a sample size calculation

Warning – ‘Power-based’ sample size calculation is for ‘hypothesis testing’ using p-value, not for estimation of true effect
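As an illustration of the sample size calculation referred to here, the sketch below uses the conventional formula for comparing two proportions with a two-sided alpha and power of 1 − beta. The slides do not state which formula underlies their figures, so both the formula and the function name `optimal_information_size` are assumptions.

```python
import math
from statistics import NormalDist


def optimal_information_size(control_rate, rrr, alpha=0.05, beta=0.2):
    """Total sample size (both arms) to detect a given RRR.

    control_rate: event rate in the control group (between 0 and 1)
    rrr:          relative risk reduction to detect, e.g. 0.25 for 25%
    """
    p1 = control_rate
    p2 = control_rate * (1 - rrr)  # event rate implied in the treated arm
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    per_group = ((z_alpha + z_beta) ** 2
                 * (p1 * (1 - p1) + p2 * (1 - p2))
                 / (p1 - p2) ** 2)
    return 2 * math.ceil(per_group)


# Control event rate of 20% and the three RRR values used in the figures.
for rrr in (0.30, 0.25, 0.20):
    total = optimal_information_size(0.2, rrr)
    print(f"RRR={rrr:.0%}: total sample size {total}")
```

Under these assumptions a control event rate of 0.2 and an RRR of 25% call for 1806 participants in total. Smaller effects or rarer events push the requirement up quickly.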

[Figure 5: Optimal information size presented as number of events given alpha of 0.05 and beta of 0.2 for varying control event rates and RRR of 20%, 25% and 30%. x-axis: control group event rate; y-axis: total number of events required; one line each for RRR=30%, RRR=25% and RRR=20%; a reference mark at 300 events. For any chosen line, evidence meets optimal information size criterion if above the line.]

OIS continued

• Thinking of numbers of events may be easier, and could just use an arbitrary number if we don’t have the resources to calculate OIS

Summary of suggested approach to imprecision

Red: mean -1 as MID

1 = ‘no effect’ and precise; 2 = ‘no effect’ but not precise; 3 = ‘effective’ and precise

Green: mean -2 as MID

1 = ‘no effect’ and precise; 2 = ‘no effect’ and precise; 3 = ‘effective’ and precise

Blue: mean -3 as MID

1 = ‘no effect’ and precise; 2 = ‘no effect’ and precise; 3 = ‘effective’ and not precise

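[Figure: three confidence intervals, labelled 1, 2 and 3, plotted on a scale from -4 to +4, where negative values are mean pain reduction and positive values are mean pain increase.]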

What if we don’t know a threshold?

• Can use an arbitrary threshold

– For example, GRADE suggests RRR or RRI of 25%

– Often used in NICE guidelines
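One way to apply such a default threshold is to ask whether the confidence interval for a risk ratio includes 0.75 (an RRR of 25%) or 1.25 (an RRI of 25%). The sketch below encodes that reading. Treating a crossing as a prompt to consider downgrading is an assumption made here for illustration, and the function name `crosses_default_threshold` is hypothetical.

```python
def crosses_default_threshold(rr_lower, rr_upper, threshold=0.25):
    """True if a risk ratio CI contains 1 - threshold or 1 + threshold."""
    low_mark = 1 - threshold    # RR equivalent to a 25% relative risk reduction
    high_mark = 1 + threshold   # RR equivalent to a 25% relative risk increase
    return (rr_lower < low_mark < rr_upper
            or rr_lower < high_mark < rr_upper)


# The two patellofemoral pain examples from the confidence interval slide.
exercise_imprecise = crosses_default_threshold(1.00, 20.77)
drug_a_imprecise = crosses_default_threshold(2.20, 2.72)
print(exercise_imprecise, drug_a_imprecise)
```

The exercise trial's interval (1.00 to 20.77) contains 1.25 and is flagged, while the drug A interval (2.20 to 2.72) is not. That agrees with the informal judgement made on that slide.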

Two things to remember about GRADE

• Many judgements are made in appraising evidence, and there will always be disagreement. The important thing is to make the areas of disagreement transparent.

• The concepts we are judging, e.g. imprecision, are continuous, and dichotomising them (downgrade or not) can be a close call. Where it is, the evidence to recommendations section should discuss it.

PDE-5 inhibitor vs. placebo

| Outcome | No of studies | Design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other considerations | No of patients: intervention | No of patients: placebo | Effect/outcome | Quality | Importance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Erectile function: International Index of Erectile Function [IIEF] mean score on EF domain (better efficacy is indicated by higher values) | 9 (Goldstein 2003, Ishii 2006, Ziegler 2006, Boulton 2001, Rendell 1999, Safarinejad 2004, Stuckey 2003, Hatzichristou 2008, Saenz 2002) | RCTs | S5 | N* | S1,2 | S3 | none | 1855 | 1006 | Pooled MD 5.82 higher at endpoint (95% CI 4.75 to 6.89) | Very low | Critical |
| Adverse events (headache) | 8 (Boulton 2001, Goldstein 2003, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006) | RCTs | S5 | S4 | S2 | N | none | 160/1763 (9.1%) | 41/948 (4.3%) | Pooled RR 2.70 (1.16 to 6.28) | Very low | Important |
| Discontinuation for AE | 8 (Goldstein 2003, Hatzichristou 2008, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006) | RCTs | S5 | N | S1,2 | N | none | 42/1753 (2.4%) | 14/1037 (1.4%) | Pooled RR 1.59 (0.84 to 3.02) | Very low | Important |

Abbreviations: 95%CI, 95% confidence interval; IIEF, International Index of Erectile Function questionnaire; EF, Erectile function domain of IIEF; SEP, Sexual Encounter Profile (diary questions regarding sexual encounter); GEQ, Global Efficacy Question; QoL, Quality of Life; RR, risk ratio

1 Downgrade by 1 level: 1 study (Hatzichristou 2008) used low doses (2.5mg and 5mg) of tadalafil, which are licensed for use but are recommended in people who anticipate frequent use of the drug. 10mg is generally recommended (but not for continuous daily use). The other study examining tadalafil (Saenz 2002) used 10mg and 20mg, therefore these arms combined represent a wide range of different doses.

2 Downgrade by 1 level: 2 studies (Stuckey 2003, Ziegler 2006) were conducted solely in men with type 1 diabetes and the mean age in these studies was generally lower in comparison to the other included studies. One study (Ishii 2006) did not report the proportion of men with type 2 diabetes.

3 Downgrade by 1 level: SDs were not reported in the paper and were calculated using p-values.

4 Downgrade by 1 level: pairwise comparisons of the included studies (direct comparisons) showed an I² of 75% (p=0.0002) for headaches, 68% (p=0.009) for upper respiratory tract infection and 58% (p<0.00001) for any adverse event. These values indicate substantial heterogeneity which cannot be fully accounted for.

5 Downgrade by 1 level: 2 studies (Saenz 2002, Ishii 2006) do not report allocation concealment to determine if performance bias was present.

* Pairwise comparisons of the included studies (direct comparisons) showed an I² of 46%. Although this may indicate moderate heterogeneity, this inconsistency was not considered to be important as overall the effect estimates and the confidence intervals were favouring the PDE-5 group.


Evidence to recommendations

• Structured discussion of

– Relative value placed on outcomes

– Trade off between clinical benefits and harms

– Trade off between net health benefits and resource use

– Quality of the evidence

– Other considerations

• Place within pathway of care

• Equalities issues

• Practicalities of implementation e.g. need for training

Strength of recommendation

• Stronger: ‘the GDG is confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects’

‘Should do ...’

• Weaker: ‘the GDG concludes that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but is not confident’

‘Should consider ...’

Further information

• http://www.gradeworkinggroup.org/

• Ongoing series of papers in Journal of Clinical Epidemiology addressing all of these issues
