developing and maintaining an assessment system

Postgraduate Medical Education and Training Board

Developing and maintaining an assessment system - a PMETB guide to good practice

Guidance

January 2007


Foreword

Developing and maintaining an assessment system - a PMETB guide to good practice completes the guidance for medical

Royal Colleges and Faculties who are developing assessment systems based on curricula approved by PMETB.

As the title implies, this is a good practice guide rather than a cook book providing recipes for assessment systems.

This guide covers the assessment Principles 3, 4 and 6. As indicated by PMETB, the Colleges and Faculties developing

assessment systems have until August 2010 to comply with all nine Principles of assessment produced by PMETB (1).

These Principles highlight the issues that need to be addressed for transparency and fairness to trainees and to encourage

curriculum designers not to forget their duty of care to trainers and trainees alike.

The work was undertaken by the Assessment Working Group, which consisted of people from different disciplines of

medicine who are considered experts in designing assessments.

This good practice guide explains some of the challenges faced by anyone who is devising an assessment system. As far as

possible we have developed this guidance in the context of practicality, feasibility in respect of quality management, utility,

sources of evidence required for competency progression, standard setting and integrating various assessments, bearing

in mind the fine balance between the training and service requirements. We have avoided being prescriptive and have not

produced a toolbox of PMETB approved assessment tools. Instead, we recommend that the Colleges and Faculties should

consult the guidance produced by the Academy of Medical Royal Colleges (AoMRC) as well as Modernising Medical

Careers (MMC) to choose assessment tools which comply with PMETB’s Principles for an assessment system for postgraduate

medical training (1).

PMETB has been fortunate in securing the services of highly skilled and enthusiastic experts who worked in their own

time on the Assessment Working Group to produce this document. I would like to extend my grateful thanks to all these

dedicated people on behalf of PMETB. I would particularly like to mention the dedication and effort of the work stream

leaders, Dr Gareth Holsgrove, Dr Helena Davies and Professor David Rowley, who have worked through all hours in

developing this guide.

Dr Has Joshi FRCGP

Chair of the Assessment Committee and the Assessment Working Group

PMETB

January 2007


The Assessment Working Group

Dr Has Joshi FRCGP: Chair of the Assessment Committee and the Assessment Working Group

Jonathan D Beard: Consultant Vascular Surgeon and Education Tutor, Royal College of Surgeons of England

Dr Nav Chana: General Practitioner, RCGP Assessor

Helena Davies: Senior Lecturer in Late Effects/Medical Education, University of Sheffield

John C Ferguson: Consultant Surgeon, Southern General Hospital, Glasgow

Professor Tony Freemont: Chair of the Examiners in Histopathology, Royal College of Pathologists

Miss L A Hawksworth: Director of Certification, PMETB

Dr Gareth Holsgrove: Medical Education Adviser, Royal College of Psychiatrists

Dr Namita Kumar: Consultant Rheumatologist and Physician

Dr Tom Lissauer FRCPCH: Officer for Exams, Royal College of Paediatrics and Child Health, and Consultant Neonatologist,

St Mary’s Hospital, London

Dr Andrew Long: Consultant Paediatrician and Director of Medical Education, Princess Royal University Hospital

Dr Amit Malik: Specialist Registrar in Psychiatry, Nottinghamshire Healthcare NHS Trust

Dr Keith Myerson: Member of Council, Royal College of Anaesthetists

Mr Chris Oliver: Co-Convener Examinations, Royal College of Surgeons of Edinburgh and Consultant Trauma Orthopaedic

Surgeon, Edinburgh Orthopaedic Trauma Unit

Jan Quirke: Board Secretary, PMETB

Professor David Rowley: Director of Education, Royal College of Surgeons Edinburgh

Dr David Sales: RCGP Assessment Fellow

Professor Dame Lesley Southgate: PMETB Board member to November 2006 and Chair Assessment Committee to January

2006

Dr Allister Vale: Medical Director, MRCP (UK) Examination

Winnie Wade: Director of Education, Royal College of Physicians

Val Wass: Professor of Community Based Medical Education, Division of Primary Care, University of Manchester

Laurence Wood: RCOG representative to the Academy of Royal Colleges, Associate Postgraduate Dean West Midlands

Editors

PMETB would like to thank the following individuals for their editorial assistance in the production of this guide to good

practice:

Helena Davies - Senior Lecturer in Late Effects/Medical Education, University of Sheffield

Dr Has Joshi, FRCGP - Chair of the Assessment Committee and the Assessment Working Group

Dr Gareth Holsgrove - Medical Education Adviser, Royal College of Psychiatrists

Professor David Rowley - Director of Education, Royal College of Surgeons Edinburgh


Table of Contents

Introduction

Chapter 1: An assessment system based on principles
  Introduction
  Utility
    What is meant by utility?
    Why is the utility index important?
  Defining the components
    Reliability
    Validity
    Educational impact
    Cost, acceptability and feasibility

Chapter 2: Transparent standard setting in professional assessments
  Introduction
  Types of standard
  Prelude to standard setting
    Standard setting methods
    Test based methods
    Trainee or performance based methods
    Combined and hybrid methods - a compromise
  Hybrid standard setting in performance assessment
  Standard setting for skills assessments
  A proposed method for standard setting in skills assessments using a hybrid method
  Standard setting for workplace based assessment
  Anchored rating scales
  Decisions about borderline trainees
    Making decisions about borderline trainees
  Summary
  Conclusion

Chapter 3: Meeting PMETB approved Principles of assessment
  Introduction
  Quality assurance and workplace based assessment
  The learning agreement
  Appraisal
  The annual assessment
  Additional information
    Quality assuring summative exams
  Summary

Chapter 4: Selection, training and evaluation of assessors
  Introduction
  Selection of assessors
  Assessor training
  Feedback for assessors

Chapter 5: Integrating assessment into the curriculum - a practical guide
  Introduction
  Blueprinting
  Sampling
  The assessment burden (feasibility)
  Feedback

Chapter 6: Constructing the assessment system
  Introduction
  Purposes
  One classification of assessments

References
  Further reading

Appendices
  Appendix 1: Reliability and measurement error
    Reliability
    Measurement error
  Appendix 2: Procedures for using some common methods of standard setting
    Test based methods
    Trainee based methods
    Combined and compromise methods
  Appendix 3: AoMRC, PMETB and MMC categorisation of assessments
    Purpose
    The categories
  Appendix 4: Assessment good practice plotted against GMP
  Glossary of terms


Introduction

This guide explains some of the challenges which face anyone devising an assessment system in response to the Principles

for an assessment system for postgraduate medical training laid out by PMETB (1).

The original principles have not needed any fundamental changes since they were written and serve to highlight the

issues which should be addressed to ensure transparency and fairness for trainees, and encourage curricular designers to

ensure a proper duty of care to trainers and trainees alike. Most importantly, the principles assure the general public that

trainees who undergo professional accreditation will be assessed properly and only those who have achieved the required

level of competence are allowed to progress. In producing this document we wish to acknowledge that most colleges

and many committed individuals, usually as volunteers, have considerable expertise in the area of assessment. Where

possible we have drawn on that expertise and we hope this is reflected in the text. What we wish to do is to provide a long

term framework for continuing to improve assessments for all parties, from trainees to patients. Educational trends change

but principles do not and by providing this document PMETB wishes to set a benchmark against which a programme of

continuing quality improvement can progress.

A full glossary of terms related to assessment in the context of medical education can be found at the end of this guide. In particular,

to ensure consistency with other PMETB guidance, we largely refer to ‘assessment systems’ rather than ‘assessment

programmes’ and ‘quality management’ rather than ‘quality control’.

The term ‘assessor’ should be assumed to encompass examiners for formal exams as well as those undertaking assessments

in other contexts, including the workplace. ‘Assessment instrument’ is used throughout to refer to individual assessment

methods.

The guide is a reference document rather than a narrative and it is anticipated that it will help organisations collate all the

information likely to be asked of them by PMETB in any quality assurance activity.


Chapter 1: An assessment system based on principles

Introduction

This chapter aims to:

• define what is meant by utility in relation to assessment;

• explain the importance of utility and its evolution from the original concept;

• provide clear workable definitions of the components of utility that enable the reader to understand their

relevance and importance to their own assessment context;

• summarise the existing evidence base in relation to utility both to provide a guide for the interested reader

and to reassure those responsible for assessment that there is a body of research that can be referred to;

• highlight gaps where further research is needed.

Utility

The ‘utility index’ described by Cees van der Vleuten (2) in 1996 serves as an excellent framework for assessment design

and evaluation (3).

What is meant by utility?

Figure 1: Utility

The original utility index described by Cees van der Vleuten consisted of five components:

• Reliability

• Validity

• Educational impact

• Cost efficiency

• Acceptability

Given the massive change in postgraduate training in the UK and the significantly increased assessment burden which has

occurred, it is important that feasibility is explicitly acknowledged as an additional sixth component (although it is implicit

in cost efficiency and acceptability).

The framework acknowledges that optimising any assessment tool or programme is about balancing the six components of the utility

index. Choice of assessment instruments and aspirations for high validity and reliability are limited by the constraints of

feasibility, e.g. resources to deliver the tests and acceptability to the trainees. The relative importance of the components of

the utility index for a given assessment will depend on both the purpose and nature of the assessment system.

For example, a high stakes examination on which progression into higher specialist training is dependent will need high

reliability and validity, and may focus on this at the expense of educational impact. In contrast, an assessment which

focuses largely on providing a trainee with feedback to inform their own personal development planning would focus

on educational impact, with less of an emphasis on reliability. Figure 2 illustrates the relative importance of

reliability vs educational impact, depending on the purpose of the assessment.

Utility = educational impact x validity x reliability x cost x acceptability x feasibility*

*Not in van der Vleuten’s original utility index but included explicitly here because of its importance
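The multiplicative form of the index can be illustrated with a short sketch. It is not part of van der Vleuten's description or of PMETB guidance: the 0 to 1 scoring scale, the use of exponents as weights, and every name and number below are assumptions chosen only to show how the same instrument can suit one purpose better than another.

```python
def utility(scores, weights):
    """Weighted multiplicative utility (illustrative only).

    scores  -- component name -> score between 0 and 1 (hypothetical scale)
    weights -- component name -> non-negative exponent; 0 ignores a component,
               and a component with no entry gets a weight of 1
    """
    result = 1.0
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
        # Exponent weighting is an assumption made for this sketch.
        result *= score ** weights.get(name, 1.0)
    return result


# Hypothetical scores for one workplace based instrument.
instrument = {"reliability": 0.6, "validity": 0.9, "educational_impact": 0.9}

# Hypothetical weightings for two purposes of assessment.
high_stakes = {"reliability": 1.0, "validity": 1.0, "educational_impact": 0.25}
formative = {"reliability": 0.25, "validity": 1.0, "educational_impact": 1.0}

u_high_stakes = utility(instrument, high_stakes)
u_formative = utility(instrument, formative)
```

Because the components are multiplied, a score of zero on any weighted component gives a utility of zero, however strong the others are. With these invented figures the instrument scores higher under the formative weighting than under the high stakes weighting.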


Why is the utility index important?

There is an increasing recognition that no single assessment instrument can adequately assess clinical performance and that assessment planning should focus on assessment systems with triangulation of data in order to build up a complete picture of a doctor’s performance (4, 5). The purpose of both the overall assessment system and its individual assessment instruments must be clearly defined for PMETB. The rationale for developing/implementing the individual components within the assessment system must be transparent, justifiable and based on supportive evidence from the literature, where possible, whilst at the same time recognising the constraints of the ‘real world’. The utility index is an important component of the framework presented to PMETB and provides a series of questions that should be asked of each assessment instrument - and of the assessment system as a whole.

An understanding of these individual components of the utility index will help when planning or reviewing assessment

programmes in order to ensure that, where possible, all of its components have been addressed adequately.

Defining the components

Reliability

What is the quality of the results? Are they consistent and reproducible? Reliability is of central importance in assessment

because trainees, assessors, regulatory bodies and the public alike want to be reassured that assessments used to ensure

that doctors are competent would reach the same conclusions if it were possible to administer the same test again on the

same doctor in the same circumstances.

Reliability is a quantifiable measure which can be expressed as a coefficient and is most commonly approached using

classical test theory or generalisability analysis (3, 6-9). A perfectly reproducible test would have a coefficient of 1.0; that

is, 100% of the trainees would achieve the same rank order on retesting. In reality, tests are affected by many sources of

potential error such as examiner judgments, cases used, trainee nervousness and test conditions.
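As an illustration of a reliability coefficient, the sketch below computes Cronbach's alpha, one common classical test theory estimate based on internal consistency. The guide does not prescribe this statistic, and it is not a measure of test-retest agreement; the function name and the marks are invented for the example.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of trainees, each a list of item scores."""
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two trainees and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every trainee needs a score for every item")
    # Sample variance of each item across trainees.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the trainees' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical marks for five trainees on four items (invented data).
marks = [
    [4, 5, 4, 5],
    [2, 3, 3, 2],
    [5, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
]
alpha = cronbach_alpha(marks)
```

Alpha approaches 1.0 when trainees are ranked in much the same order by every item. Generalisability analysis, which separates several sources of error at once, needs specialist software and is not attempted here.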

Traditionally, a reliability coefficient of greater than 0.8 has been considered an appropriate cut-off for high stakes

assessments. It is recognised, however, that reliability coefficients at this level will not be achievable for some assessment

tools but they may nevertheless be a valuable part of an assessment programme, both to provide additional evidence for

triangulation and/or because of their effect on learning. Estimation of reliability as part of the overall quality management

(QM) of an assessment programme will require specialist expertise, over and above that to be found in those simply

involved as assessors. This means evidence of the application of appropriate psychometric and statistical support for the

evaluation of the programme should be provided.

Exploration of sources of bias is essential as part of the overall evaluation of the programme and collection of, for example,

demographic data to allow exploration of effects, such as age, gender, race, etc, must be planned at the outset.

Intrinsic to the validity of any assessment is analysis of the scores to quantify their reproducibility. An assessment cannot

be viewed as valid unless it is reliable. PMETB will require evidence of the reliability of each component appropriate to the

weight given to that component in the utility equation.

Figure 2: Utility function. U = R x V x E, where U = utility, R = reliability, V = validity and E = educational impact (C van der Vleuten). Panels: in-training formative assessment; high stakes assessment. Axis scale: 0% to 100%.


Remember that sufficient testing time is essential in order to achieve adequate reliability. It is becoming increasingly clear

that whatever the format, total testing time, ensuring breadth of content sampling and sufficient individual assessments by

individual assessors, is critical to the reliability of any clinical competence test (10) (Box 1).

Box 1: Reliability as a function of testing time

Testing time  MCQ¹   Case based    PMP¹   Oral    Long    OSCE⁵   mini-CEX⁶  Practice video  Incognito
in hours             short essay²         exam³   case⁴                      assessment⁷     SPs⁸
1             0.62   0.68          0.36   0.50    0.60    0.47    0.73       0.62            0.61
2             0.76   0.73          0.53   0.69    0.75    0.64    0.84       0.76            0.76
4             0.93   0.84          0.69   0.82    0.86    0.78    0.92       0.93            0.92
8             0.93   0.82          0.82   0.90    0.90    0.88    0.96       0.93            0.93
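One standard way to project how reliability changes with testing time is the Spearman-Brown prediction formula, sketched below. This is offered as an illustration only: how the figures in Box 1 were derived is not described here, and the formula assumes that any extra testing time is filled with material of comparable quality. The function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when testing time is multiplied by length_factor."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which testing time must grow to reach the target reliability."""
    if not (0.0 < reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both values must lie strictly between 0 and 1")
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Starting from a one hour reliability of 0.62 (the MCQ column of Box 1).
one_hour = 0.62
projected = {hours: round(spearman_brown(one_hour, hours), 2) for hours in (1, 2, 4, 8)}
hours_for_0_8 = length_factor_needed(one_hour, 0.8)
```

On this projection a test with a reliability of 0.62 after one hour would need between two and three hours to reach 0.8. The projected values need not match the published figures in Box 1.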

A significant current challenge is to introduce sample frameworks into workplace based assessments of performance

which sample sufficiently to address issues of content specificity. Because content specificity (differences in performance

across different clinical problem areas) and assessor variability consistently represent the two greatest threats to reliability,

sampling of both clinical content and assessors is essential and this should be reflected in assessment system planning.

PMETB will require an explanation of the weight placed on each assessment tool within the modified utility index. For

example, a workplace based assessment aimed at testing performance has a higher weighting for validity (at the apex of

Miller’s Pyramid - Figure 3) but may not achieve a high stakes reliability coefficient of >0.8 as it is difficult to standardise

content. On the other hand the inclusion of a defensible clinical competency assessment in an artificial summative

examination environment to decide on progression may be justified on the grounds of high reliability, but only achieved at

the expense of high face validity.

Validity

Reliability is a measure of how reproducible a test is. If you administered the same assessment again on the same person

would you get the same outcome? Validity is a measure of how completely an assessment tests what it is designed to test.

There is usually a trade off between validity and reliability as the assessment with perfect validity and perfect reliability

does not exist. However, it is important to recognise that if a test is not reliable it cannot be valid.

Validity is a conceptual term which should be approached as a hypothesis and cannot be expressed as a simple coefficient

(11). It is evaluated against the various facets of clinical competency. Traditionally, a number of facets of validity have been

defined separately (12) (Box 2), acknowledging that evaluating the validity of an assessment requires multiple sources

of evidence. An alternative approach arguing that validity is a unitary concept which requires these multiple sources of

evidence to evaluate and interpret the outcomes of an assessment has also been proposed (11). Box 2 summarises the

traditional facets of validity and can provide a useful framework for evaluating validity. Predictive and consequential

validity are important but poorly explored aspects of assessment, particularly workplace based assessment. Consequential

validity is integral to evaluation of educational impact. It may not be possible to evaluate predictive validity for many years

but plans to determine predictive validity should be described and this will be facilitated by high quality centralised data

management.

Sources for Box 1: ¹ Norcini et al., 1985; ² Stalenhoef-Halling et al., 1990; ³ Swanson, 1987; ⁴ Wass et al., 2001; ⁵ Petrusa, 2002; ⁶ Norcini et al., 1999; ⁷ Ram et al., 1999; ⁸ Gorter, 2002


Box 2: Traditional facets of validity

Face validity
  Test facet being measured: Compatibility with the curriculum’s educational philosophy
  Questions being asked: What is the test’s face value? Does it match up with the educational intentions?

Content validity (may also be referred to as direct validity)
  Test facet being measured: The content of the curriculum
  Questions being asked: Does the test include a representative sample of the subject matter?

Construct validity (may also be referred to as indirect validity)
  Test facet being measured: Does the evidence in relation to assessment support a sensible underpinning construct or constructs?
  Questions being asked: What is the construct, does the evidence support it? E.g. differentiation between novice and expert on a test of overall clinical assessments - do two assessments designed to test different things have a low correlation?

Predictive validity
  Test facet being measured: The ability to predict an outcome in the future, e.g. professional success after graduation
  Questions being asked: Does the test predict future performance and level of competency?

Consequential validity
  Test facet being measured: The educational consequence or impact of the test
  Questions being asked: Does the test produce the desired educational outcome?
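The construct validity question in Box 2, whether two assessments designed to test different things have a low correlation, can be explored with a simple correlation coefficient. The sketch below is a minimal illustration using Pearson's correlation; the assessment names and scores are invented, and a real analysis would need far more trainees and specialist statistical advice.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariance / sqrt(spread_x * spread_y)


# Invented scores for six trainees on three hypothetical assessments.
knowledge_paper_a = [55, 62, 70, 48, 81, 66]
knowledge_paper_b = [58, 60, 73, 50, 79, 64]
communication_skills = [70, 52, 61, 66, 58, 73]

r_same_construct = pearson_correlation(knowledge_paper_a, knowledge_paper_b)
r_different_construct = pearson_correlation(knowledge_paper_a, communication_skills)
```

In this invented data the two knowledge papers correlate strongly with each other and only weakly with the communication skills assessment, which is the pattern that would support treating them as measures of different constructs.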

Educational impact

Assessment must have clarity of educational purpose and be designed to maximise learning in areas relevant to the

curriculum. Based on the assumption that assessment drives learning that underpins training, it should be used strategically

to promote desirable learning strategies in contrast to some of the learning behaviours that have been promoted by

traditional approaches to assessment within medicine. Careful planning is essential. Agreement on how to maximise

educational impact must be an integral part of planning assessment and the rationale and thinking underpinning this must

be evident to those reviewing the assessment programme. Several different factors contribute to overall educational impact

and a number of questions will therefore need to be considered.

a) What is the educational intent or purpose of the assessment?

In the past a clear distinction between summative and formative assessment has been made. However, in line with modern

assessment theory, the PMETB principles emphasise the importance of giving students feedback on all assessments,

encouraging reflection and deeper learning (PMETB Principle 5 (1)). The purpose of assessment should be clearly

described. For example, is it for final certification, is it to determine progress from one stage to another, is it to determine

whether to exclude an individual from their training programme, etc?

Feedback should be provided that is relevant to the purpose as well as the content of the assessment in order that personal

development planning in relation to the relevant curriculum can take place effectively. If assessment focuses only on

certification and exclusion, the all important potential for a beneficial influence on the learning process will be lost. All

those designing and delivering assessment should explore ways of enabling feedback to be provided at all stages and

make their intentions transparent to trainees.

A quality enhanced assessment system cannot be effective without high quality feedback. In order to plan appropriate

feedback it is essential for there to be clarity of purpose for the assessment system. For example, what aptitudes are you

aiming to assess, at what level of expertise, and how was the content of the assessment defined relative to the curriculum

(a process known as blueprinting (13))?

b) What level of competence are you trying to assess? Is it knowledge, competence or performance?

A helpful and widely utilised framework for describing levels of competence is provided by Miller’s Pyramid (14)

(Figure 3).

The base represents the knowledge components of competence: ‘Knows’ (basic facts) followed by ‘Knows How’ (applied

knowledge). The progression to ‘Knows How’ highlights that there is more to clinical competency than knowledge alone.


‘Shows How’ represents a behavioural rather than a cognitive function, i.e. it is ‘hands on’ and not ‘in the head’. Assessment at this level requires an ability to demonstrate a clinical competency in a controlled environment, e.g. an objective structured clinical examination (OSCE). The ultimate goal for a highly valid assessment of clinical ability is to test performance - the ‘Does’ of Miller’s Pyramid, i.e. what the doctor actually does in the workplace. It must be clear at which level you are aiming to test.

c) At what level of expertise is the assessment set?

Any assessment design must accommodate the progression

from novice through competency to expertise. It must be

clear against what level the trainee is being assessed. A number of developmental progressions have been described for

knowledge, including those in Bloom’s taxonomy (15) (Figure 4).

Frameworks are also being developed for the clinical competency model

(16). Most of the Royal Colleges are working on assessment frameworks

that describe a progression in terms of level of expertise as trainees

move through specialty training. When designing an assessment system

it is important to identify the level of expertise anticipated at that point in

training. The question ‘Is the assessment and standard appropriate for the

particular level of training under scrutiny?’ must always be asked. It is not

uncommon to find questions in postgraduate examinations assessing basic

factual knowledge at undergraduate level rather than applied knowledge

reflective of the trainees’ postgraduate experience.

d) Is the clinical content clearly defined?

Once the purpose of the assessment is agreed, test content must be

carefully planned against the curriculum and intended learning outcomes,

a process known as blueprinting (12, 17, 18) (see also page 28).

The aim of an assessment blueprint is to ensure that sampling within the assessment system ensures adequate coverage of:

i) A conceptual framework - a framework against which to map assessment is essential. PMETB recommends Good Medical

Practice (GMP) (19) as the broad framework for all UK postgraduate assessments.

ii) Content specificity - blueprinting must also ensure that the contextual content of the curriculum is covered. Content

needs careful planning to ensure trainees are comprehensively and fairly assessed across their entire training period.

Professionals do not perform consistently from task to task or across the range of clinical content (20). Wide sampling

of content is essential (13). Schuwirth and van der Vleuten (3) highlight the importance of considering the context

as well as content of assessment. Context-rich methods test application of knowledge, whereas context-free questions

test only the underpinning knowledge base. Sampling broadly to cover the full range of the curriculum is of paramount

importance if fair and reliable assessments are to be guaranteed. Blueprinting is essential to the appropriate selection of

assessment methods; it is not until the purpose and the content of the assessments have been decided that the assessment

methods should be chosen.

iii) Selection of assessment methods once blueprinting has been undertaken should take account of the likely educational

impact; the nature of assessment methods will influence approaches to learning as well as the stated content coverage.

e) Triangulation - how do the different components relate to each other to ensure educational impact is achieved?

It is important to develop an assessment system which builds up evidence of performance in the workplace and avoids

reliance on examinations alone. Observed, contextualised performance tasks at the ‘Does’ level can be triangulated with

high stakes competency based tests of ‘Shows How’ and knowledge tests where appropriate (Figure 5).

Individual assessment instruments should be chosen in the light of the content and purpose of that component of the

assessment system. PMETB will require evidence of how the different methods used relate to each other to ensure an

appropriate educational balance has been achieved.

Figure 3: Miller’s Pyramid (levels, top to bottom: Does, Shows How, Knows How, Knows; methods shown alongside: work based assessment, OSCE, CCS, MCQ; labels: Performance/action, Competence)

Figure 4: Bloom’s taxonomy (levels, top to bottom: Evaluate (appraise, discriminate); Synthesis (integrate, design); Application (demonstrate); Analyse (order); Comprehension (interpret, discuss); Knowledge (define, describe); axis label: Expertise)

Cost, acceptability and feasibility

PMETB recognises that assessment takes place in

a real world where pragmatism must be balanced

against assessment idealism. It is essential that

consideration is given to issues of feasibility, cost and

acceptability, recognising that assessments will need

to be undertaken in a variety of contexts with a wide

range of assessors and trainees. Explicit consideration

of feasibility is an essential part of evaluation of any

assessment programme.

a) Feasibility and cost

Assessment is inevitably constrained by feasibility

and cost. Trainee numbers, venues for structured

examinations, the use of real patients, timing of exit assessments and the availability of assessors in the workplace all place

constraints on assessment design. These factors should be part of your explanation to justify the design of your assessment

package. Management of the overall assessment system including infrastructure to support it is an important contributor to

feasibility. In general, centralisation is likely to increase cost effectiveness. All assessments incur costs and these must be

acknowledged and quantified.

b) Acceptability

Both the trainees’ and assessors’ perspective must be taken into account. At all levels of education, trainees naturally

tend to feel overloaded by work and prioritise those aspects of the curriculum which are assessed. To overcome this, the

assessment package must be designed to mirror and drive the educational intent. It must be acceptable to the learner.

The balance is a fine one. Creating too many burdensome, time consuming assessment ‘hurdles’ can detract from the

educational opportunities of the curriculum itself (21, 22).

Consideration of the acceptability of assessment programmes to assessors is also important. All assessment programmes

are dependent on the goodwill of assessors who are usually balancing participation in assessment against many other

conflicting commitments. Formal evaluation of acceptability is an important component of QM and approaches to this

should be documented.

The high stakes of professional assessments need to be acknowledged not only for trainees and the potential colleagues of

those being assessed, but also for the general public on whom professionals practise. It is therefore essential that any

assessment system is transparent, understandable and demonstrably comprehensible to the general public as well as

other stakeholders.

Figure 5: Triangulation (evidence from exams and work based assessment is triangulated)


Chapter 2: Transparent standard setting in professional assessments

Introduction

Standard setting is the process used to establish the level of performance required by an examining body for an individual

trainee to be judged as competent. This might be simply in the recall or (preferably) the application of factual knowledge;

competence in specific skills or technical procedures; performance, day in day out, in the workplace, or a combination of

some or all of these. Whatever the aspect and level of performance, the standard is the answer to the question, ‘How much is

enough?’ (23), and is the point that separates those trainees who pass the assessment from those who do not. In other words,

it is the pass mark or, in North America, the cut or cutting score. It should be noted, too, that there will almost inevitably be

a group of trainees with marks close to the pass mark that the assessment cannot reliably place on one side or the other.

Having considered some methods for standard setting, this section will discuss reliability and measurement error, and how

to identify these borderline trainees.

However, even before describing the processes and outcomes of standard setting, it must be recognised that although

the concept of standard setting might seem straightforward, its methods and the debate surrounding them are not. In fact,

there is a cohort of educational academics (such as Gene Glass, 1978 (24)) who are highly critical of the whole concept

of standard setting. Certainly, they are not without a case. It is widely known that the standard set for a given test can vary

according to the methods used. Experience also shows that different assessors set different standards for the same test

using the same method. The aim of this guide, however, is to be practical rather than philosophical and, therefore, PMETB

agrees with Cizek’s conclusion that ‘the particular approach to standard setting selected may not be as critical to the

success of the endeavour as the fidelity and care with which it is conducted’ (25). Since PMETB’s priority is to ensure that

passing standards are set with due diligence and at sufficiently robust levels as to ensure patient safety, assessors should

choose methods that they are happy with. It is essential that the people using those methods are appropriately trained and

approach the task in a fair and professional manner.

The literature describes a wide variety of methods (26) and the procedures for many are set out very clearly (27).

Nevertheless, it quickly becomes plain that there is no single best standard setting method for all tests, although there is

often a particularly appropriate method for each assessment. However, there are three main requirements in the choice of

method. It must be:

• defensible, to the extent that it can assure the stakeholders about its validity;

• explicable, through the rationale behind the decisions made;

• stable, as it is not defensible if the standards vary over time (28).

Simply selecting the most appropriate method of standard setting for each element in an examination is not enough. As

mentioned above, the selection and training of the judges or subject experts who set the standard for passing assessments

is as important as the chosen methodology (29).

Types of standard

There are two different kinds of standard - relative and absolute. Relative standards are based on a comparison between

trainees and they pass or fail according to how well they perform in relation to the other trainees. An exam in which there

is a fixed pass rate (for example, the top 80% or the top 200 trainees pass) uses a relative standard. By contrast, when an

absolute standard is applied trainees pass or fail according to their own performance, irrespective of how any of the other

trainees perform. It is generally accepted that unless there is a particularly good reason to pass or fail a predetermined

number of trainees, an absolute standard (based on individual trainee performance) should be used. The methods

described in this chapter are for setting absolute standards. This is because absolute or criterion-referenced standards are

preferred for any assessment used to inform licensing decisions.

The use of relative standards might result in passing or failing trainees with little regard to their ability. For example, if all the trainees

in a cohort were exceptionally skilled, the use of norm-referenced standards (passing the top n% of the trainees) would

result in failing (i.e. misclassifying) a certain proportion who, in fact, possess adequate ability. This is certainly unfair and

at variance with the purpose of a test of competence. Moreover, since relative standards will vary over time with the ability

of the trainees being assessed, the reliability of any competence based classifications could be questionable. Therefore,

if valid measures of competence are desired, it is essential to set standards with reference to some absolute and defined

performance criterion (30). In other words, standards should be set using absolute methods.
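
The contrast between the two kinds of standard can be illustrated with a minimal sketch. The cohort, the 80% pass fraction and the pass mark of 70 are hypothetical values chosen for illustration only, and the function names are not taken from any PMETB guidance.

```python
def relative_decisions(scores, pass_fraction):
    """Norm-referenced: pass a fixed fraction of the cohort, ranked by score."""
    n_pass = round(len(scores) * pass_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    passed = set(ranked[:n_pass])
    return [i in passed for i in range(len(scores))]


def absolute_decisions(scores, pass_mark):
    """Criterion-referenced: each trainee is judged against the same fixed pass mark."""
    return [score >= pass_mark for score in scores]


# Hypothetical cohort in which every trainee is exceptionally skilled.
strong_cohort = [82, 85, 88, 90, 93]

relative = relative_decisions(strong_cohort, pass_fraction=0.8)
absolute = absolute_decisions(strong_cohort, pass_mark=70)
print(relative)  # the lowest ranked trainee fails despite a score of 82
print(absolute)  # every trainee passes on their own performance
```

Under the relative standard one trainee fails simply because of rank, which is the misclassification described above; under the absolute standard the outcome depends only on each trainee's own score.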


Absolute standard setting methods can be broadly classified as either test (assessment) centred or individual trainee

(performance) centred. In test centred methods, theoretical decisions based on test content are used to derive a standard,

whereas in trainee centred methods, judgments regarding actual trainee performance are used to determine the passing

score. It is implicit that each of these methods will ensure that all performance test material will be subjected to standard

setting as an integral part of the test development.

In order to set the standard, three things need to be established:

• the purpose of the assessment;

• the domains to be assessed;

• the level of the trainee at the time of that particular assessment.

Based on these characteristics, there are various established methods of standard setting that can be used. However, before

doing so it is necessary to point out that there has been something of a tradition in UK medical education for standards to be

set without giving proper consideration to these points. For example, there are still examinations in which the pass mark has

been set quite arbitrarily, often long before the exams themselves have even been written, and has subsequently been enshrined

in the regulations, making change particularly difficult. There are ways around this, but it would still be far better if pass

marks were not predetermined in this way.

Moreover, the examination methods are often also stipulated, but not the content of the exam. Correctly, of course, what is

to be assessed should be established before selecting the methods. Clearly, such procedures are quite unacceptable in

contemporary postgraduate medical education, particularly when the consequences of passing or failing assessments can

be so important. Indeed, despite all the talk about maintaining (or, contemporarily, ‘driving up’) standards, the process of

determining what the standards are has been an extraordinarily lax affair (31).

PMETB requirements are proving to be a powerful incentive in bringing about long overdue improvements, and examining

bodies are increasingly striving to ensure that all aspects of their assessments are conducted properly, defensibly and

transparently. Standard setting is an important element in this.

Prelude to standard setting

Before the standard can be set there must be agreement about the purpose of the assessment, what will be assessed and

how, and the level of expertise that trainees might be expected to demonstrate. For example, the assessment might be made

to check that progress through the curriculum is satisfactory, or to identify problems and difficulties at an early stage when

they will probably be easier to resolve. On the other hand, it might be to confirm completion of a major stage of professional

development, such as graduation or Royal College membership. This will also enable the standard setters to agree on

the trainees’ expected level of expertise. Experience has shown that the level of expertise is very often overestimated,

particularly by content experts, and is one of several good reasons why assessors should undertake the assessment

themselves before they set the standards.

This overestimate of trainees’ ability can lead to the pass mark being set unrealistically high, or to the assessment

containing an excessive proportion of difficult items. Contrary to the belief held by many assessors that difficult exams

sort out the best trainees, the most effective assessment items are generally found to be those that are moderately difficult

and are good discriminators, covering a wide sample of the prescribed curriculum. Even trainee centred methods of

standard setting, such as the borderline group method and the contrasting groups method (described below and expanded

upon in Appendix 2), depend on the discriminatory power of the items - individually in the borderline group method and

across the exam as a whole in contrasting groups.

The content of the assessment will be determined by the curriculum, in accordance with Principle 2 (1). In the case of

workplace based assessment, these might be described in terms of competencies and other observable behaviours and

rated against descriptions of levels of performance. In formal assessments as part of set piece examinations, the content

should reflect the relative importance of aspects of the curriculum, so that essential and important elements predominate.

The standard in workplace based assessment is therefore usually determined by specific levels of performance for items

often prescribed on a checklist and this is discussed below. In formal examination style assessments the standard will take

account of the importance and difficulty of the individual items of assessment and a method is described which includes

this consideration. The methods described below are principally used in assessments that form part of formal examinations, but

this guide also discusses some issues of standard setting in workplace based assessment.


Standard setting methods

There are several methods for standard setting. They can be seen as falling into four categories:

• relative methods;

• absolute methods based on judgments about the trainees;

• absolute methods based on judgments about the test items;

• combined and compromise methods.

As indicated above, this guide will not consider relative methods in any more detail.

This leaves methods based on judgment about the trainees, individual assessment items and combined and compromise

methods. In test centred methods, theoretical decisions based on test content are used to derive a standard, whereas in trainee

centred methods judgments regarding actual trainee performance are used to determine the appropriate passing score.

Test based and compromise methods provide the three main methods of standard setting in formal knowledge based

exams, though there are several variations of each method. Trainee based methods are currently gaining in popularity.

The simplest test based method is Angoff’s (32). Ebel’s (33) method is slightly more complicated, yet probably leads to a

better examination design. Both are based on judgments about the assessment items. This guide also describes two trainee

based methods and the Hofstee method, a combined/compromise method which is more complex and best used with large

cohorts of trainees.

Test based methods

In general, test based methods require assessors to act as subject experts to make judgments regarding the anticipated

performance of ‘just passing’ trainees (i.e. a ‘borderline pass’) on defined content or skills. The Angoff procedure is

probably the best known example of an assessment centred method and has subsequently undergone various modifications.

1) Angoff’s method

Originally developed for standard setting in multiple choice examinations, this method has also been used to set standards

on the history taking and physical examination checklist items that are often used for scoring cases in skills assessments.

Here, the assessors are required to make judgments as subject experts as to the probability of a ‘just passing’ trainee

answering the particular question or performing (correctly) the indicated task. The assessors’ mean scores are used to

calculate a standard for the case. However, this method is better suited to standard setting in knowledge tests as it has some

significant disadvantages for performance testing.

First, it is very time consuming and labour intensive, especially when there are multiple checklists. Secondly, it may

yield standards that are too stringent. Thirdly, and more importantly, since the resulting standard is a mean assessment across items

and/or tasks, the use of this method makes the implicit assumption that ratings on tasks are independent. However, this

assumption is often untenable with performance assessments because of the phenomenon of case specificity. This means

that essentially, individual checklist items are often interrelated within a task. As a result, the assessors’ judgments are not

totally independent, potentially invalidating the use of this method for setting standards (34).

An alternative method is to instruct the subject experts to make assessments regarding the number of checklist items

that a ‘just passing’ trainee would be expected to obtain credit for. While this may reduce the problem of checklist item

dependencies and substantially shorten the time taken to set standards, the task of deciding how many items constitute a

borderline pass remains challenging with regard to rules of combination and compensation. As a result, the precision of the

standard derived may be compromised. See Appendix 2 for further details.
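
The basic Angoff calculation can be sketched as follows. This is an illustrative outline only, assuming each assessor supplies one probability per item and that the standard is the sum of the per-item means; the ratings are hypothetical and the function name is not part of any published procedure.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return (cut score in marks, per-item means).

    ratings[j][i] is assessor j's estimate of the probability that a
    'just passing' trainee answers item i correctly.
    """
    n_items = len(ratings[0])
    item_means = [mean([judge[i] for judge in ratings]) for i in range(n_items)]
    return sum(item_means), item_means


# Hypothetical ratings: three assessors, four items.
ratings = [
    [0.5, 0.75, 0.25, 1.0],
    [0.5, 0.5, 0.25, 0.75],
    [0.5, 0.25, 0.25, 0.5],
]
cut_score, item_means = angoff_cut_score(ratings)
print(cut_score, item_means)
```

Summing the item means treats the items as independent, which is the assumption questioned above for performance assessments.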

2) Ebel’s method

Only slightly more complicated than Angoff’s method, Ebel’s method can be considerably more useful in practice,

especially when building and managing question banks. Holsgrove and Kauser Ali’s (35) modification of Ebel’s method

was developed for a large group of postgraduate medical exams, namely the Membership and Fellowship exams of the

College of Physicians and Surgeons, Pakistan. This modification has the advantage of not only helping examiners to set a

passing standard, but also to produce an examination with appropriate coverage of essential, important and supplementary

material, with a balance of difficult, moderate and easy items. These three factors are important in improving the stability of

examinations when repeated many times.

In its original form, Ebel’s method was suitable for simple right/wrong items, such as multiple choice questions (MCQs) of the

‘one best answer’ type. The Holsgrove and Kauser Ali modification allows it to be used for more complex items such as OSCE

stations and work is underway to explore its potential for workplace based assessment. See Appendix 2 for more details.
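
As a rough illustration of the classic form of Ebel's method (not the Holsgrove and Kauser Ali modification, whose details are not given here), items are grouped by relevance and difficulty, and assessors estimate the proportion of items in each group that a 'just passing' trainee would answer correctly. The grid, item counts and proportions below are all hypothetical.

```python
def ebel_cut_score(item_counts, expected_correct):
    """Return (expected marks, fraction of total) for a 'just passing' trainee.

    item_counts[cell] is the number of items in a (relevance, difficulty) cell.
    expected_correct[cell] is the assessors' agreed proportion of those items
    that a 'just passing' trainee should answer correctly.
    """
    total_items = sum(item_counts.values())
    expected_marks = sum(n * expected_correct[cell] for cell, n in item_counts.items())
    return expected_marks, expected_marks / total_items


# Hypothetical 20 item paper.
item_counts = {
    ("essential", "easy"): 8,
    ("essential", "moderate"): 4,
    ("important", "moderate"): 4,
    ("supplementary", "difficult"): 4,
}
expected_correct = {
    ("essential", "easy"): 0.75,
    ("essential", "moderate"): 0.5,
    ("important", "moderate"): 0.5,
    ("supplementary", "difficult"): 0.25,
}
marks, fraction = ebel_cut_score(item_counts, expected_correct)
print(marks, fraction)
```

Because the item counts per cell are explicit, the same table also shows at a glance whether essential material predominates in the paper.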


Trainee or performance based methods

Trainee or performance based methods have been used increasingly as the standard setting method of choice in clinical

skills and performance assessments. These methods are more intuitively appealing to assessors as they afford greater ease

when making judgments about specific performances. Additionally, assessors find the process and results more credible

because the standard is derived from judgments based on the actual test performances (30).

Instead of providing judgments based on test materials, the panel of subject experts is invited to review a series of trainee

performances and make judgments about the demonstrated level of proficiency.

1) Borderline group method

Described by Livingston and Zieky in 1982 (23), this method requires expert judges to observe multiple trainees on a single

station or case (rather than following a single trainee around the circuit) and give a global rating for each on a three point

scale:

• pass;

• borderline;

• fail.

The performance is also scored, either by the same assessor or another, on a checklist. Trained simulated patients might

be considered sufficiently expert to serve as assessors, especially in communication and interpersonal skills. The global

ratings using the three point scale are used to establish the checklist ‘score’ that will be used for the passing standard.

A variety of modifications of this method exist (36), but it is important in all of them that the examiners must be able to determine

a borderline performance level of skills in the domain sampled. See Appendix 2 for more detail.
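
One common way of turning the global ratings into a passing score is to take the mean checklist score of the trainees rated 'borderline'. The sketch below assumes that convention, which is not spelled out in this guide, and uses hypothetical ratings and scores for a single station.

```python
from statistics import mean


def borderline_group_pass_mark(results):
    """Return the mean checklist score of trainees given a 'borderline' global rating.

    results is a list of (global_rating, checklist_score) pairs for one station.
    """
    borderline_scores = [score for rating, score in results if rating == "borderline"]
    if not borderline_scores:
        raise ValueError("no trainee was rated borderline on this station")
    return mean(borderline_scores)


# Hypothetical station results: global rating and checklist score out of 20.
station_results = [
    ("pass", 18),
    ("borderline", 12),
    ("fail", 7),
    ("borderline", 14),
    ("pass", 16),
    ("borderline", 13),
]
pass_mark = borderline_group_pass_mark(station_results)
print(pass_mark)
```

With few borderline trainees the resulting pass mark rests on very little data, so the size of the borderline group is worth reporting alongside the standard.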

2) Contrasting groups method

Procedures, such as the contrasting groups method (37) and associated modifications, have focused on the actual

performance of contrasting groups of trainees identified by a variety of methods. This method requires that the trainees are

divided into two groups which can be variously labelled as pass/fail; satisfactory/unsatisfactory; competent/not competent,

etc. There are various ways of doing this; for example, using external criteria or specific competencies set out in the curriculum

and specified on the multiple item score sheet. However, the group into which they are placed depends on their global

rating across the performance criteria.

Assessors rate each trainee’s performance at each station or case, using a specific score sheet for each. After the

assessment, scores from each of the two contrasting groups are expressed graphically and the passing standard

is provisionally set where the two groups intersect. In practice this almost always produces an overlap in the score

distributions of the contrasting groups.

However, this method allows for further scrutiny and adjustment so that if the point of intersection is found to allow trainees

to pass who should have rightly failed (or vice versa) the pass mark can be adjusted appropriately. See Appendix 2 for

more detail.
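
The provisional pass mark can also be located numerically. The sketch below is a stand-in for reading the intersection off a graph: it assumes the provisional standard is the observed score that misclassifies the fewest trainees, and the two groups of scores are hypothetical.

```python
def contrasting_groups_cut(competent_scores, not_competent_scores):
    """Return (cut score, misclassified count) minimising total misclassification.

    A trainee passes when their score is at or above the cut score. Only scores
    that were actually observed are tried as candidate cut scores.
    """
    candidates = sorted(set(competent_scores) | set(not_competent_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fails = sum(s < cut for s in competent_scores)
        false_passes = sum(s >= cut for s in not_competent_scores)
        errors = false_fails + false_passes
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical scores; the groups overlap (a competent 10, a not competent 11).
competent = [10, 13, 14, 15, 16, 18]
not_competent = [6, 8, 9, 10, 11]
cut, errors = contrasting_groups_cut(competent, not_competent)
print(cut, errors)
```

If false passes are judged more serious than false fails, the two error counts can be weighted differently, which corresponds to the adjustment of the pass mark described above.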

Combined and hybrid methods - a compromise

There are various approaches to standard setting that combine aspects of other methods or, in the case of the Hofstee

method described below, both relative and absolute methods. A proposed hybrid method is described later in this section.

Hofstee’s method

This is probably the best known of the compromise methods, which combines aspects of both relative and absolute

standard setting. It takes account of both the difficulty of the individual assessment items and of the maximum and minimum

acceptable failure rate for the exam and was designed for use in professional assessments with a large number of trainees.

See Appendix 2 for more detail.
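
In outline, the Hofstee calculation asks assessors for the minimum and maximum acceptable pass mark and the minimum and maximum acceptable failure rate, then finds where the line joining (minimum pass mark, maximum failure rate) to (maximum pass mark, minimum failure rate) meets the observed failure rate curve. The sketch below assumes whole number pass marks; all values are hypothetical.

```python
def hofstee_cut_score(scores, c_min, c_max, f_min, f_max):
    """Return the whole number pass mark closest to the Hofstee intersection.

    c_min, c_max: lowest and highest acceptable pass marks.
    f_min, f_max: lowest and highest acceptable failure rates (0 to 1).
    """
    n = len(scores)
    best_cut, best_gap = None, None
    for cut in range(c_min, c_max + 1):
        # Failure rate that this pass mark would actually produce.
        observed_fail = sum(s < cut for s in scores) / n
        # Failure rate on the assessors' line at this pass mark.
        line_fail = f_max - (f_max - f_min) * (cut - c_min) / (c_max - c_min)
        gap = abs(observed_fail - line_fail)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical cohort and assessor limits.
scores = [42, 48, 51, 55, 58, 60, 63, 67, 72, 80]
cut = hofstee_cut_score(scores, c_min=50, c_max=60, f_min=0.1, f_max=0.5)
print(cut)
```

With a small cohort the observed failure rate moves in large steps, which is one reason the method is best suited to large numbers of trainees.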

Hybrid standard setting in performance assessment

Unlike knowledge assessments, which have been extensively researched to guide their standard setting methods and

whose standards can be determined by a defined group of modest size, performance assessments have not been so well

informed by a standard setting evidence base.

Neither test nor trainee based approaches can readily be applied at the case level to a high fidelity clinical assessment in

which standards are implicit in the grading of the individual cases.


A hybrid approach has been proposed (38) and is described below. Assessors are required to identify a passing standard of

performance on each individual assessment that they observe, using generic and case specific guidance. Each case would

thus be passed or failed. The standard setting issue then relates to how many individual assessments need to be passed

(and/or not failed) in order to pass the assessment as a whole.

In order to address this, the panel of assessors as subject experts would collectively agree on the standard to pass,

converting the individual assessment grades to pass/fail overall, and it will do this by agreeing a decision algorithm. The

assessors could carefully review the pass/fail algorithm and collectively support it. Methods such as the Delphi technique,

which helps avoid over influencing of decisions by powerful or dominant characters, could facilitate this process.

The standard may then be verified and refined by the application of either performance based approach to examples of

individual assessments and/or the trainee’s overall performance in the full battery of assessments.

Standard setting for skills assessments

Numerous standard setting methods have been proposed for knowledge tests, such as multiple choice examinations (32).

However, whilst the underlying principles, such as identification of the borderline or ‘just passing’ trainee, are common,

they are not necessarily appropriate for standard setting in performance assessments. Although a range of standard setting

methods have been used in skills assessments, this has been shown to yield different (30) and inconsistent results (39),

particularly if different sets of assessors are also used (30). This is the point at which idealism meets reality and it presents

us with a problem.

Performance or clinical skills assessments are playing an increasingly important role in making certification or licensing

decisions, for example, using standardised patient assessments in simulated encounters or OSCEs. In order for the pass/

fail decisions to be fair and valid, justifiable standards must be set. However, the methodology is not as well developed

for performance standard setting as it is for knowledge based assessments and the influence that the assessor panel has

on the process appears to be a greater factor in this form of assessment. Thus, on the one hand there is a requirement

for the standards to be defensible, explicable and stable (28), yet reported problems with both the methods and their

implementation (30, 39).

Therefore, assessments of complex and integrated skills, which need to include assessments on the performance of whole

tasks, pose considerable standard setting challenges - in particular ensuring that the standard is stable over time (‘linear

test equating’). In order to set acceptable standards, it is a prerequisite that due care is taken to ensure that the assessments

are standardised, the scores are accurate and reliable, and the resulting decisions regarding competence are realistic, fair

and defensible.

The borderline group and contrasting groups methods of standard setting are well suited to standard setting in controlled

assessment systems testing skills and performance. A further method is reviewed below.

A proposed method for standard setting in skills assessments using a hybrid method

Assessors must observe trainees’ actual performance - for example using a DVD, VCR or an authentic simulation

which includes a suitable breadth of the trainee’s performance - and then make a direct assessment concerning competence

based on such observations.

It is imperative to emphasise the need to concentrate the assessor’s attention on the pass/fail decision to ensure that they

are properly informed as to the definition of ‘just passing’ behaviour.

For clinician standard setters, this task is intuitively appealing, as it articulates with their clinical experience. However, one

potential shortcoming of trainee centred methods, usually attributable to insufficient training of experts as assessors, is the

tendency to attribute performance based on skills or factors that are not directly targeted by the assessment. The attribution

of positive ratings based on irrelevant factors (halo effects - ‘he or she was very kind and polite’) is one such phenomenon.

This, and other potential sources of assessor bias, can be addressed by offering adequate training to assessors about

making judgments, including the provision of suitable performance descriptors. In addition, it is imperative that the

assessor’s task is clear and unambiguous and that any misinterpretation of the task is rectified.

Working as a group, assessors can discuss and establish a collective and defensible recommendation for what constitutes

a passing standard. The standard may then be modified by the application of additional criteria (see below) or appropriate

statistical management. The passing standard for each individual assessment, a clinical case, for example, is set by the

assessors as a result of their expertise, training and insights into the performance of ‘just passing’ trainees in real life.


1) Grading system for individual assessment of clinical cases

This section is based on the paper by Wakeford and Patterson (38) mentioned above, to whose work David Sales

contributed.

For pass/fail licensing decisions, designed to confirm that a doctor is sufficiently safe and proficient to undertake

unsupervised independent practice, the trainee’s performance on each of the skills assessments should be assessed in a

specified number of domains (such as history taking, examination, communication or practical skills, etc) and each of these

graded on a scale; for illustration, a four point scale might be as follows:

• clear pass;

• bare (marginal) pass;

• bare (marginal) fail;

• clear fail.

There will also be a global, overarching judgment for each individual assessment, which will be the overall grade for that

particular assessment. This overall individual assessment grade is not determined by the simple aggregation of the domain

grades, as that would imply equal weight to each. Although the individual domain grades may be taken into account, the

fact that one domain was weakly represented in that individual assessment will also need to be accounted for. For example

in resuscitation assessment, communication with the patient would be far less important than ensuring a clear airway.

The overall assessment grade for a particular case or scenario will use the same four grades but, since there is no

borderline grade, marginal fails and marginal passes can be seen as fails and passes respectively, by the assessor.

The essence of such a grading system is that it is based on expert assessments of what is acceptable behaviour in the

overall passing criteria of the assessment. The essential focus of the assessment is upon the trainees’ global performance

during a particular overall unit of assessment and not on their performance on the necessarily artificial constructs of the

domains within that individual assessment. An accurate overall assessment grade is the key to producing a

credible overall result.

2) Scoring methods

Such an assessment would produce a number of scores based on the four point system for any individual trainee.

Converting a number of scores into an overall result is not straightforward.

It raises the following issues:

• psychometric - especially of compensation vs combination;

• stakeholder - including what constitutes competence;

• institutional - including financial issues of pass rates;

• trainee - such as ‘case blackballing’.

In view of the complexity of these issues, it is not surprising that there is no easy answer to the problem of converting a

series of individual assessments scores into overall assessment results, particularly where there are a number of ‘marginal’

grades involved.

3) Is turning the results of assessments into a series of numbers likely to help?

It is clearly necessary to combine the total number of assessment scores in some way so as to produce an overall pass/fail

standard. This process must be fair and must ensure that unacceptable combinations of individual assessment ‘grades’ do

not lead to a pass.

The end can be achieved in two ways:

• A set of rules can be devised, which might say something like ‘to pass, a trainee must have at least n clear passes and no

clear fails’ or ‘allowing compensation between clear passes and clear fails, and between marginal passes and marginal

fails, the trainee shall pass if they have a neutral ‘score’ or above’, with possible codicils such as ‘… n clear fails will fail’.

This might be termed a categorical approach.

• Alternatively, a scoring system could be devised with different marks being given for each grade (such as 0, 1, 2, 3, or 4, 8,

10, 12) and a pass mark set.

The main difficulty with the former is that it does not produce a ‘score’ that could subsequently be processed statistically.

The difficulties with the latter are that, being non-specific, it may well not prevent the unacceptable combinations, and there

will be argument about the relative scores attached to individual assessments.


In practice, if the passing ‘score’ approaches the maximum score, the numerical approach can possess all the advantages

of the categorical approach. In this situation the examiners could consider the scoring approach; otherwise a categorical

combination algorithm may be preferable.
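
The contrast between the two approaches can be illustrated with a minimal sketch. The rule thresholds, the marks attached to each grade (0, 1, 2, 3) and the pass marks are assumptions chosen for illustration only, and the function names are hypothetical rather than part of any PMETB method.

```python
from collections import Counter

# The four grades described in the text, ordered from worst to best.
GRADES = ("clear_fail", "marginal_fail", "marginal_pass", "clear_pass")


def categorical_pass(grades, min_clear_passes=2, max_clear_fails=0):
    """Categorical approach: apply explicit rules to the counts of each grade.

    The thresholds are hypothetical; an assessment board would set its own.
    """
    counts = Counter(grades)
    return (counts["clear_pass"] >= min_clear_passes
            and counts["clear_fail"] <= max_clear_fails)


def numerical_pass(grades, marks=(0, 1, 2, 3), pass_mark=8):
    """Numerical approach: give each grade a mark, add them up, compare with a pass mark."""
    mark_for = dict(zip(GRADES, marks))
    return sum(mark_for[g] for g in grades) >= pass_mark


# Hypothetical trainee: three clear passes and one clear fail over four cases.
cases = ["clear_pass", "clear_pass", "clear_pass", "clear_fail"]

print(categorical_pass(cases))              # False: the clear fail is not allowed
print(numerical_pass(cases))                # True: 3 + 3 + 3 + 0 = 9, which meets 8
print(numerical_pass(cases, pass_mark=11))  # False: a pass mark near the maximum of 12
```

In this sketch the same set of grades fails under the categorical rule but passes under the numerical one, which is the kind of unacceptable combination the text warns about. Raising the pass mark towards the maximum score removes the discrepancy.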

Standard setting for workplace based assessment

Assessment in the workplace

Assessment of a trainee’s performance in the workplace is extremely important in ensuring competent practice and good

patient care, and in monitoring their progress and attainment. However, it is a relative newcomer to the assessment scene,

particularly in the UK, and is probably not particularly well suited to the kind of standard setting methods described above

and conventionally applied in well controlled ‘examining’ environments.

As mentioned earlier, work is underway to evaluate the contribution that the Holsgrove and Kauser Ali modification to Ebel’s

method might make in this area, but at present the most promising approach is probably to use anchored rating scales

with performance descriptors. This is particularly appropriate with competency based curricula where intended learning

outcomes are described in terms of observable behaviours, both negative and positive.

Anchored rating scales

An anchored rating scale is essentially a Likert-type scale with descriptors at various points - typically at each end and

at some point around the middle. For example, in the rating scales used in most of the assessment forms in the Foundation

Programme, descriptors of performance for each item at point 1 (the poorest rating), 6 (the highest) and 4 (the standard for completion)

might be deemed sufficient. To take an example from the assessment programme in the specialty curriculum of the

Royal College of Psychiatrists (http://www.rcpsych.ac.uk/training/workplace-basedassessment/wbadownloads.aspx)

the descriptors for performance at ST1 level in the mini-CEX (which RCPsych have modified for psychiatry training and

renamed mini-Assessed Clinical Encounter [mini-ACE]) describe performance in history taking at points 1, 4 and 6 on the

rating scale:

1) Very poor, incomplete and inadequate history taking.

4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.

6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.

However, to help assessors to achieve accuracy and consistency, the performance descriptors used also describe the other

three points on the scale:

1) Very poor, incomplete and inadequate history taking.

2) Poor history taking, badly structured and missing some important details.

3) Fails to reach the required standard; history taking is probably structured and fairly methodical, but might be

incomplete though without major oversights.

4) Structured, methodical, sensitive and allowing the patient to tell their story; no important omissions.

5) A good demonstration of structured, methodical and sensitive history taking, facilitating the patient in telling

their story.

6) Excellent history taking with some aspects demonstrated to a very high level of expertise and no flaws at all.
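
For those building electronic assessment forms, the fully anchored scale above could be held as a simple lookup. This is only a sketch: the variable and function names are hypothetical, and treating point 4 as the required standard follows the descriptors quoted above rather than any published software specification.

```python
# Descriptors for history taking, copied from the six-point scale above.
HISTORY_TAKING_DESCRIPTORS = {
    1: "Very poor, incomplete and inadequate history taking.",
    2: "Poor history taking, badly structured and missing some important details.",
    3: ("Fails to reach the required standard; history taking is probably structured "
        "and fairly methodical, but might be incomplete though without major oversights."),
    4: ("Structured, methodical, sensitive and allowing the patient to tell their story; "
        "no important omissions."),
    5: ("A good demonstration of structured, methodical and sensitive history taking, "
        "facilitating the patient in telling their story."),
    6: ("Excellent history taking with some aspects demonstrated to a very high level "
        "of expertise and no flaws at all."),
}

# Assumption: point 4 is the lowest rating that reaches the required standard.
REQUIRED_STANDARD = 4


def descriptor_for(rating):
    """Return the anchor text an assessor should compare the observed performance with."""
    if rating not in HISTORY_TAKING_DESCRIPTORS:
        raise ValueError("rating must be a whole number from 1 to 6")
    return HISTORY_TAKING_DESCRIPTORS[rating]


def meets_standard(rating):
    """True when the rating is at or above the required standard."""
    descriptor_for(rating)  # validates the rating
    return rating >= REQUIRED_STANDARD


print(descriptor_for(3))
print(meets_standard(3), meets_standard(4))
```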

It seems inevitable that standard setting in workplace based assessment will be an area of considerable research and

development activity over the next few years.

Decisions about borderline trainees

It is very important that there is a proper policy, agreed in advance, regarding the identification of borderline trainees and,

having identified them, what to do about them. The traditional practices will no longer suffice.

Making decisions about borderline trainees

Some common current practices for making decisions about borderline trainees are unacceptable and indefensible. For

example, the only groups of borderline trainees usually considered at present are those within, say, a couple of percentage

points of the pass mark (the range is typically arbitrary rather than evidence based). Borderline ‘passes’ are usually ignored

- they are treated as clear passes even though mathematically there is always a group who happen to fall on the ‘pass’ side

of the cutting point who, in terms of confidence intervals, are interchangeable with a similar group on the ‘fail’ side.

Decisions about what happens to borderline trainees once they have been correctly identified should rest with individual

assessment boards. However, they must be fair, transparent and defensible, and it is essential that borderline trainees on

both sides of the pass mark are treated in exactly the same way.
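
One way of identifying borderline trainees symmetrically can be sketched as follows. The sketch assumes the classical test theory estimate of the standard error of measurement (SEM = SD × √(1 − reliability)) and a 95% band around the pass mark; the figures used are hypothetical, and an assessment board may prefer a different calculation or band width.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def classify(score, pass_mark, sem, z=1.96):
    """Label a score as clear pass, clear fail or borderline.

    The band is symmetrical, so trainees just above and just below the
    pass mark are identified in exactly the same way.
    """
    margin = z * sem
    if abs(score - pass_mark) <= margin:
        return "borderline"
    return "clear pass" if score > pass_mark else "clear fail"


# Hypothetical examination: SD of 8 marks, reliability 0.84, pass mark 60.
sem = standard_error_of_measurement(sd=8.0, reliability=0.84)
for score in (50, 55, 64, 70):
    print(score, classify(score, pass_mark=60, sem=sem))
```

With these assumed figures the SEM is about 3.2 marks, so the borderline band extends a little over six marks on each side of the pass mark, far wider than the 'couple of percentage points' often used.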


The criteria of fairness, transparency and defensibility would probably exclude one of the most common ways of making

decisions about borderline trainees, which is to conduct a viva voce examination. The vast majority of vivas are plagued

with problems, such as inconsistency within and between assessors, variability in material covered (not infrequently asking

about things that are not in the curriculum) and, above all, extremely poor reliability. It is clearly inappropriate to use

perhaps the least reliable assessment method to make pass/fail decisions that even the most reliable methods have been

unable to make.

Summary

This chapter is concerned with issues arising from the requirement for assessments to comply with the PMETB Principles

(1), and, in particular, the two questions that must be addressed in meeting Principle 4:

i) What is the measurement error around the agreed level of proficiency?

ii) What steps are taken to account for measurement error, particularly in relation to borderline performance?

The first step, identified in the title itself, is to establish what the level of proficiency actually should be - in other words, how

is the standard agreed?

This chapter has outlined some of the principles of standard setting and described, in Appendices 1 and 2, some methods

for establishing the pass mark for assessments. However, PMETB’s Principles require more than having a standard that has

been properly set. Clear and defensible procedures are needed for identifying and making decisions about borderline

trainees. This chapter also described how, even when the pass mark has been correctly set, there will almost certainly be a

group of trainees with marks on either side of it who cannot be confidently declared to have either passed or failed. This is

because all assessments, like all other measurement systems, inevitably have an element of measurement error.

Appendices 1 and 2 describe how this measurement error can be calculated and note that it is often surprisingly large. The

appendices also illustrate how measurement error can be reduced by improving the reliability of the assessment. Having

calculated the measurement error for their assessment, assessors can identify the borderline trainees. The chapter also pointed out

that they will need to have agreed in advance how decisions about the borderline trainees will be made.

By breaking the requirements of Principle 4 down into three elements:

• How is the standard set?

• How is the measurement error calculated?

• How are borderline trainees identified and treated?

PMETB hopes that this chapter will be helpful in assisting those responsible for assessing doctors, both in the examination

hall and in the workplace, to ensure that their assessments meet the required standards.

Conclusion

The majority of methods of standard setting have been developed for knowledge (MCQ-type) tests and address the need

for setting a passing score within a distribution of marks in which there is no notional pre-existing standard.

Much of the currently published evidence relating to standard setting in performance assessments relates to undergraduate

medical examinations, which produce checklist scores that may have little other than conceptual relevance to skills tests

which assess global performance relevant to professional practice.

Regardless of the method used to set standards in performance assessments, it is imperative that data are collected both to

support the assessment system that was used and to establish the credibility of the standard. Generalisability theory (7, 8,

40, 41) can be used to inform standard setting decisions by determining conditions (e.g. number of assessors, number of

assessments and types of assessments) that would minimise sources of measurement error and result in a more defensible

pass/fail standard.
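
As a rough illustration of how generalisability theory can inform such decisions, the sketch below projects the generalisability coefficient for different numbers of assessments. It assumes the simplest single-facet design (trainees crossed with assessments) and uses hypothetical variance components; a real study would estimate separate components for assessors, cases and other facets.

```python
def g_coefficient(var_trainee, var_error, n_assessments):
    """Relative G coefficient for a single-facet design.

    var_trainee: variance due to true differences between trainees.
    var_error:   residual variance for a single assessment.
    Averaging over n assessments divides the error variance by n.
    """
    if n_assessments < 1:
        raise ValueError("n_assessments must be at least 1")
    return var_trainee / (var_trainee + var_error / n_assessments)


def assessments_for_target(var_trainee, var_error, target=0.8, max_n=100):
    """Smallest number of assessments whose projected coefficient reaches the target."""
    for n in range(1, max_n + 1):
        if g_coefficient(var_trainee, var_error, n) >= target:
            return n
    return None  # target not reachable within max_n assessments


# Hypothetical variance components, for illustration only.
for n in (1, 4, 8, 12):
    print(n, round(g_coefficient(0.3, 0.7, n), 2))

print(assessments_for_target(0.3, 0.7, target=0.8))
```

The same calculation can be repeated with the number of assessors or the mix of assessment types as the varying condition, once the relevant variance components have been estimated from real data.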

Where performance assessments are used for licensing decisions, the responsible organisation must ensure that passing

standards achieve the intended purposes (e.g. public protection) and avoid any serious negative consequences.


Chapter 3: Meeting the PMETB approved Principles of assessment

Introduction

PMETB’s Principles of assessment (1) are now well established. This chapter provides some of the background source

material which will permit interested groups to begin to understand some of the thinking behind the Principles. PMETB

is at pains to explain why this amount of work is demanded of already hard pressed groups who are trying, based on a

background of established practice, to provide robust evidence which supports what they do.

For the purposes of this section ‘quality management’ (QM) replaces the traditional term ‘quality control’. Given the

complexities of training and education as part of the delivery of medical services, our ability to directly control quality is

inevitably challenging. Quality and the risks of falling short of achieving the required standards inherent in the Principles,

however, must be managed. It is the role of postgraduate deaneries with training programme directors to manage quality. It

is the role of PMETB to assure QM takes place at the highest level achievable in the circumstances.

Quality assurance and workplace based assessment

Workplace based assessment should comply with universal principles of QM and quality assurance that one might expect

in any assessment. However, the position of workplace based assessment is somewhat different. This is not to say it should

not be quality managed but because of its nature there are issues concerning its best use. This might be summarised

as being an assessment methodology that has demonstrably high validity, whilst being more challenging in terms of its

reliability. However, in practice the evidence is that some workplace based assessments have very reasonable reliability.

Norcini’s work has demonstrated that mini-CEX can have acceptable reliability with six to ten separate but similar

assessments (based on 95% CIs using generalisability) (42). He emphasises the need to re-examine measurement

characteristics in different settings and the need for sampling across assessors and clinical problems on which the

assessments are based. The use of the 95% CI emphasises the need for more interactions where performance is borderline

in order to establish whether the trainee is performing safely or not. It is very important that the assessors are trained;

Holmboe described a training method which was in practice very simple in that it consisted of less than one day of intensive

training (43).
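
The link between the 95% CI and the number of encounters can be sketched with a simple normal approximation. This is not the generalisability analysis used in the cited work; the standard error for a single encounter (0.8 scale points), the standard (point 4) and the mean ratings are all hypothetical.

```python
import math


def ci_half_width(sem_single_encounter, n_encounters, z=1.96):
    """Half-width of the 95% CI around a trainee's mean rating over n encounters."""
    return z * sem_single_encounter / math.sqrt(n_encounters)


def encounters_needed(mean_rating, standard, sem_single_encounter, z=1.96, max_n=50):
    """Smallest n at which the CI around the mean rating no longer includes the standard.

    Returns None if this is not achieved within max_n encounters.
    """
    gap = abs(mean_rating - standard)
    for n in range(1, max_n + 1):
        if ci_half_width(sem_single_encounter, n, z) < gap:
            return n
    return None


# Hypothetical trainees rated on a six-point scale where 4 is the standard.
print(encounters_needed(5.0, 4, 0.8))  # 3: well clear of the standard
print(encounters_needed(4.5, 4, 0.8))  # 10: closer to the standard
print(encounters_needed(4.2, 4, 0.8))  # None: still unresolved after 50 encounters
```

Under these assumptions a trainee performing well above the standard is resolved after a handful of encounters, whereas a trainee close to the standard needs many more, which is the point made above about borderline performance.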

A number of groups have demonstrated that multi-source feedback (MSF) from both colleagues and patients can also be

defensibly reliable, although larger numbers of patient assessors than colleagues are needed in these assessments (44-49).

In the case of borderline trainees, however, it makes sense that more assessments are required to distinguish between

trainees who are in fact safe and those where doubts remain. Extensive sampling for borderline trainees may be needed

to precisely identify the problems behind their difficulties so that a plan can be formed to find remedial solutions where

possible.

The main value of workplace based assessments is that they provide immediate feedback. The information acquired

during a workplace based assessment can also provide evidence of progression of a trainee and therefore contribute

evidence suitable for recording in their learning portfolio. This can then be compared to the agreed outcomes set by trainer

and trainee in the educational agreement. It is essential that both trainer and trainee are aware that both feedback and

assessment of performance that contributes to their learning portfolio of evidence are simultaneously taking place during

workplace based assessment.

Although PMETB acknowledges that this dual role of a workplace based assessment is in some ways not ideal because

it may inhibit the learning opportunity from short loop feedback, it is necessary to be pragmatic, particularly because

the number of assessors and the time available for assessment is precious. Therefore, educational supervisors will

sometimes be tutors or mentors and on other occasions will actually be assessors. The agreement reached within the

PMETB Assessment Working Group is that this is acceptable provided it is entirely transparent to trainer and trainee in what

circumstances they are meeting on a particular occasion.


The learning agreement

Workplace based assessments taken in isolation are of limited value. They should be contextualised to a learning

agreement that refers to the written curriculum and sets the agenda for a particular training episode. A series of learning

agreements must ensure that the whole of the curriculum is covered by the end of training.

There will inevitably be small gaps and more overlaps, but by and large what trainer and trainee are creating is an

agreement on a direction of travel which is usefully thought of as an educational trajectory. The aim and objective of

the trajectory is to cover the whole of the curriculum to a level of competence defined by a series of outcomes, in this

case, based on GMP (19). Assessments will be used to provide evidence that the direction and pace of travel is timely,

appropriate and valid. They will inform the educational appraisal which in turn informs the specialty trainee assessment

process (STrAP) - referred to throughout this document as the annual assessment - which determines whether or not a

trainee is able to proceed to the next stage of their training.

Appraisal

After a suitable period of training, usually every four or six months, depending on the structure of training rotations, a

formative, low stakes, educational appraisal must take place based on the evidence provided by learning agreements and

a large number of workplace based assessments. PMETB recognises there will be different ways of achieving the

educational appraisal process that precedes submission of evidence to the annual assessment, which is a high stakes event

determining whether progression in training can take place, or remediation is required. Normally, assessments will occur

annually but in the case of remedial action being required for a trainee they may occur more frequently.

It is important, therefore, to have an opportunity for the trainee and the group of trainers a trainee has worked with during

the specified training period to review the evidence during an appraisal which is distinct and unequivocally separate from

the annual assessment described below. This, therefore, has to take place at school or programme level and is best carried

out locally where training during that period has actually taken place. This is an indisputably formative review with the

specific objective of ensuring there have been no immediate problems, such as the inability to have sufficient training or

assessment opportunities and to provide timely feedback when difficulties of any kind have arisen. The aim is to resolve

problems as soon as they are identified, rather than presenting difficulties which were otherwise remediable to the annual

assessment.

The reason for separation of appraisal from review is to ensure the subsequent annual assessment, which has external

members relative to the training process, does not simply consider the future of a trainee on the basis of the raw data of a

series of scores. It has been made clear elsewhere in this document that assessment must not simply be a ‘summation of the

alphas’ (referring to Cronbach’s alpha).

The way in which appraisal is separated from review will vary from programme to programme and from discipline to

discipline. For example, in anaesthesia, where there is a big pool of trainers with trainees frequently rotating amongst

them, the educational supervisor may not act as assessor for the whole or any of the training period. In many ways, this is

ideal as it keeps the mentoring role and the assessing activity completely divorced to the particular advantage of effective

mentorship. In trauma and orthopaedic surgery, on the other hand, trainees spend six months with a trainer and the vast

majority of the time the trainer will also be the workplace based assessor. Provided it is clear when the trainer/assessor is in

which role, then a practical way forward can be envisaged.

Some assessments may be carried out by assessors other than the principal designated trainer during a particular

training period and over the year a trainee will have worked for at least two trainers who will also be acting as assessors.

A structured trainer’s report of the whole period of work ahead of the evidence being submitted to the annual assessment

must be made by the designated educational supervisor. This must contain evidence about the development of the trainee

in the round and not just a list of completed assessments; the latter are really presented as supporting evidence. This means

that the trainee is aware that the individual workplace based assessments contribute to the whole judgment and they would

not be ‘hung out to dry’ over one less than satisfactory assessment event. PMETB hopes this encourages the trainees and the

trainers to develop an adult-to-adult learning and teaching style, so maximising the learning opportunities inherent in work

based assessment.


The annual assessment

PMETB expects deans and programme directors to ensure that the

annual assessment (currently encompassed in RITA processes (50))

is a consistently well structured and conducted process. The annual

review process should have stakeholders from the deaneries, training

programmes, external assessors and lay membership. An example of

good practice is when a member of the appropriate SAC from outwith

the training programme would attend the annual review to give the

process externality. Clearly, the postgraduate dean gives educational

externality from a particular programme and internally the training

programme director has an overview of a particular trainee.

The exact composition of annual assessment panels is laid out in the shortly

to be published Gold Guide which replaces the current Orange Guide.

The annual assessment is a high stakes event for a trainee, but should

contain no surprises if the educational appraisal process has worked

effectively. The annual assessment should be a quality assuring exercise

ensuring that the conclusions that the Head of Training and the trainers

have reached about a particular trainee during the training interval

are reasonable and the trainee has achieved the standards expected,

as described in the structured trainer’s report. The panel would look

at the evidence and would either confirm or might differ from the decision reached by the training programme director and

their committee about a particular trainee. Certainly, the annual assessment panel would have to assure themselves that the

evidence provided was appropriate. This will be very high stakes because the exercise might result in a trainee being removed

from training, being asked to repeat training or to have focused training. For the most part, this will be a paper or virtual exercise,

although the Gold Guide suggests the option of reviewing borderline trainees should always be applied.

The second part of the annual review will be a facilitatory event where the training programme director, in conjunction with

members of his or her training committee and the trainee, would agree on the content of the next period of training, based

on an overview of the trainee’s educational trajectory with the intention of completing more parts of the written curriculum.

This must be based on a face-to-face discussion with at least one designated trainer or mentor.

The role of PMETB in this process is to be assured that the QM mechanisms and the decision making are appropriate. It is

important that there is a demonstrably transparent and fair process for trainees and one that assures the general public that

those treating them are fit for their purpose.

Additional information

The annual assessment for trainees needs to take into account issues around health and probity not dealt with elsewhere.

In doing this, the review in effect provides all the required evidence for NHS appraisal processes. MSF may provide

information about health and probity.

Quality assuring summative exams

Colleges set exit examinations as a quality managing and assuring process, which triangulates with evidence from a

learning portfolio which includes workplace assessments and accumulated experience such as may be found in log books.

Inevitably, the artificially created environment of a summative college exam has less intrinsic validity. The corollary is

that there is considerably more control over reliability. This means that correlating exam results with workplace based

assessment is one indicator that the process is working well (an example of a process described as triangulation). Note,

however, exams and workplace based assessments are not at all comparable assessments. It would be wrong therefore

to assume one is the control or check on the other; they provide complementary information invaluable as part of a

triangulation exercise.
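
A simple way to look at this indicator is to calculate the correlation between examination marks and workplace based assessment ratings for the same trainees. The sketch below uses a hand-written Pearson coefficient and entirely hypothetical data; it illustrates the calculation only and does not imply that either assessment is a check on the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data for five trainees: exam percentage and mean workplace rating.
exam_marks = [52, 61, 58, 70, 66]
workplace_ratings = [3.8, 4.2, 4.5, 5.1, 4.4]

r = pearson_r(exam_marks, workplace_ratings)
print(round(r, 2))
```

A moderate positive correlation would be consistent with the two sources giving complementary information; a very low or negative value would be a prompt to examine both assessments more closely.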

The exam must be as reliable as possible. What many formal exams are best at doing is testing knowledge and its

application. Psychometric analysis demonstrates clearly that MCQs of various types do this most reliably. With structuring,

clinicals and orals can contribute to formal examinations. Assessment experts such as Geoff Norman are not intrinsically

against orals and clinicals, but simply point out that they require considerable work to make them robust and they need

prolonged examination time to be sure they are reliable. This is one reason why there is a trend to move to much longer

assessments of knowledge, application of knowledge and clinical decision making.

[Figure: diagram with the labels Appraisal, Assessment, Annual review and Evidence]


In order to quality assure an assessment system based on formal examinations, it is necessary to address the following:

• Purpose of exam - should be explicit to examiners and trainees, and available in comprehensible form to the general

public.

• Content of exam - should simply match the agreed syllabus and look to test at the level of ‘competent’, but might also

encourage excellence.

• Selection of assessment instruments used in the exam - should meet the utility criteria laid out in Chapter 1.

• Questions, answers and marking scheme - it should be clear whether the marking is normative or criterion based. The

standard should also be related to methodology and purpose. Most professional exams are criterion based.

• Standard setting procedures - should be selected and applied as explained above in Chapter 2.

• Examination materials - need to be of proper quality and available to all examiners. Materials and props should be

approved by examination boards and the introduction of new materials by individual examiners put through the same

standard setting processes as any other exam material or question. For example, unshared computer images of which an

individual examiner may be fond should only be permitted if the image meets criteria of standard, quality and viability

agreed by all assessors.

• Running the exam - should be reasonable and of equal quality for all trainees so that they can perform to the best of their

ability. Where clinical environments are used, the standards must not be compromised by everyday service activities

going on around the exam. It is better to have reserved or purpose designed facilities and not impose an exam, for

example, on a busy ward or clinic where unexpected events or schedules such as mealtimes or visiting impinge on the

exam environment.

• Conduct of assessors - should be scrutinised regularly in terms of behaviour and performance. An example of good practice

is to appoint examiner/exam assessors. These individuals should be experienced examiners who are appointed in open

competition and with the approval of their peers. Their role would include:

• making multiple visits to inspect the venue or observe trainee assessments;

• sitting unobtrusively as observers and not interfering;

• evaluating assessors against transparent criteria.

Useful feedback comments might concern:

• quotes and examples of positive and negative behaviour;

• interpersonal skills of assessors with trainee;

• level and appropriateness of questions;

• assessment technique.

They would be expected to prepare a report on the whole process and on individual assessors, which can be fed

back to conveners and other assessors to assure consistency of performance. Assessors of exams must, of course,

be trained to be fit for purpose.

The examining body should also undertake a review of the written policies, available to trainees, assessors and as far as possible

the general public, which determine:

• marking and analysis of results;

• provision of feedback;

• selection of examiners and officials;

• test development.

It is also necessary to have policies on the following:

• examination security;

• data protection;

• documents, computers, firewalls and buildings;

• checking and distribution of results;

• plagiarism and cheating;

• malpractice by college or trainees;

• mobile phones and electronic devices;

• examiner training.


Summary

PMETB recognises there is no perfect solution. Assessing professionals in the exacting working environment of healthcare

means making some compromises, which PMETB accepts make some educators uncomfortable.

PMETB assessment principles are predicated on the overriding value that, provided assessment instruments are valid

and reliable, they need to assess people holistically and not just represent them as a set of results based on a battery of

assessments. The intellectual thrust of the Assessment Working Group in PMETB is to respect the role of peer assessment

from genuine experts. Provided these experts acknowledge they also need to learn how to be expert assessors and trainers

as well as expert clinicians, there is a genuine way forward.


Chapter 4: Selection, training and evaluation of assessors

Introduction

The role of assessor is an important and responsible one for which individuals should be properly selected, trained and

evaluated. Very importantly, individuals undertaking assessment should recognise that they are professionally accountable

for the decisions they make. All assessments, including work based assessments, must be taken seriously and their

importance for the trainee and in terms of patient safety fully acknowledged. Submission of assessment judgments which

are not actually based on direct observation/discussion by the assessor with the trainee (e.g. handing the form to the

trainee to fill in themselves, filling in a form on the basis of ‘I know you are OK’) is a probity issue with respect to GMP.

Honest and reliable assessment is also essential in enabling assessors to fulfil their responsibilities in relation to GMP.

Given that assessors are often trainers in the same environment, it is important that it is made clear to trainees when they are

acting as an assessor rather than a trainer.

Selection of assessors

Selection of assessors should be undertaken against a transparent set of criteria in the public domain and therefore

available to both assessors and trainees. Particularly in relation to work based assessment this may include guidance in

relation to assessor characteristics such as grade or occupational group.

Criteria for selection of assessors may include:

• commitment to the assessment process they are participating in;

• willingness to undergo training;

• willingness to have their performance as an assessor evaluated and to respond to feedback from this;

• up-to-date, both in their field and in relation to assessment processes;

• non-discriminatory and able to provide evidence of diversity training;

• understanding of assessment principles;

• willing and able to contribute to standard setting processes;

• willing and able to deliver feedback effectively;

• willing and able to undertake assessments in a consistent manner.

Assessor training

There is evidence that assessor training enhances assessor performance in all types of assessment (43, 51). All assessment

systems should include a programme of training for assessors. It is recognised that for some types of assessment - in

particular, large scale work based assessment - delivery of face-to-face training for all assessors is likely to take some time

but as a minimum, written guidance and an explicit plan to deliver any necessary additional training should be provided.

All assessor training should be seen as a natural part of Continuing Professional Development (CPD) and based on

evidence (52). Evaluation of assessor training should be integral to the training programme and where concerns/gaps

are raised in relation to training, these should be responded to and training modified if necessary. Cascade models for

training of assessors where centralised training is provided and then cascaded out at a more local level are attractive and

cost efficient, but ensuring standardisation of training is more difficult in this context. Provision of written/visual training

materials and observation of local training will help achieve as much consistency as possible.

Assessor training should include:

• an overview of the assessment system and specifics in relation to the particular area that is the focus of the training;

• clarification of their responsibilities in relation to assessment, both specifically and more generally in terms of

professional accountability;

• principles of assessment, particularly with reference to the assessment process they are participating in, e.g. assessors

participating in a standard setting group will need training specifically in standard setting methodologies. Assessors for

work based assessment will need to understand the principles behind work based assessment;

• diversity training to ensure that judgments are non-discriminatory (or a requirement for this in another context);

• where assessors have a role which requires them to give face to face feedback to trainees, the importance of the quality of

this feedback should be emphasised and provision for training in feedback skills made.

Ongoing training for assessors should be provided to ensure that they are up-to-date and CPD approval should be sought

for this.


Feedback for assessors

Evaluation of assessor performance and provision of feedback for assessors should be planned within the development

of the assessment system. This should include feedback both on their own performance as an assessor and feedback on

the QM of the assessment process they are involved in. Feedback for assessors should include formal recognition of their

contribution (i.e. a ‘thank you’). Assessors are largely unpaid and give their time in the context of many other conflicting

pressures.

Planning for evaluation of the assessors should include mechanisms for dealing with assessors about whom concerns are

raised. In the first instance, this would usually involve the offer of additional training targeted at addressing the area(s) of

concern.


Chapter 5: Integrating assessment into the curriculum - a practical guide

Introduction

Assessment is a necessary process to assure the profession, public and regulatory authorities that practitioners are capable

of offering the highest quality of healthcare. It is essential in this process that assessments are valid and specialty relevant.

This chapter addresses these issues from a practical point of view. There are a number of general points that PMETB would

encourage assessing organisations to take into account when employing assessment tools.

• The assessment tools that any assessing organisation might choose are not prescribed by PMETB. All PMETB asks is that

the use of any one tool can be justified on the grounds of its ‘fitness for purpose’ and validity as discussed in Chapter 1.

• PMETB recognises that there is no ‘perfect’ assessment instrument and any single assessing organisation will need to use

a range of instruments, each with their own reliability/validity, to ensure the curriculum is adequately assessed.

• In order to cover all aspects of the curriculum within the framework of the GMC’s GMP, the range of assessment instruments

must include those applicable to both exam based and workplace based assessments to ensure that the full scope of

knowledge, skills and attitudes is assessed.

• PMETB recognises that in any specific assessment setting it will not be possible to assess the entire curriculum and that

sampling will be a concern to assessment organisations and trainees alike. It is worth recalling that assessments are

as much tools for driving learning as assessing knowledge, skills or understanding. Sampling is entirely appropriate,

providing the weighting of that sampling process can be justified.

• PMETB encourages the use of assessment instruments which fit in naturally with normal clinical practice in the workplace.

• PMETB would also like to suggest that a good assessment instrument should be able to assess a doctor’s ability to support

self care.

Blueprinting

Blueprinting of assessments against the curriculum and GMP is important for any organisation, as it ensures all aspects of

the curriculum and GMP are covered over a period of time defined (and justified) by that organisation; this is the process

of sampling. GMP is the chosen anchor of PMETB and it provides a baseline or benchmark against which everything else

can be planned and evaluated. There are alternatives to GMP such as CanMEDS, which are also worthy. However, after wide

consultation the GMC has set its own standards down as GMP, which has evolved and matured over the years and which is

familiar to every doctor and clinical medical student in the UK.

When blueprinting, PMETB expects assessment organisations to choose an appropriate assessment system that overall

ensures each attribute of GMP is being tested. For example, MCQs may be appropriate for testing knowledge and practical

tests for testing skills. Overlap between methods is inevitable and may even be desirable in providing confirmation of

performance through triangulation.

PMETB requires that any assessment organisation setting or overseeing an assessment should be able to show, if called

upon, that its assessments conform to specific criteria. Specifically, they should be able to show that every assessment has

been designed to test a particular aspect of the curriculum or an appropriate element of GMP and that assessments are

weighted in respect of clinical importance.

PMETB also requires that the assessment organisation provides evidence that outcome measures are appropriate.

The outcome for an individual trainee or group of trainees should be justified and can be benchmarked against the

performance of an optimised trainee group.

Sampling

PMETB has realistic expectations and fully accepts that, even over the full training life of an individual trainee, it will not

usually be possible to assess all aspects of the curriculum, even by using a range of methodologies. It is therefore important

that in sampling aspects of a trainee’s skills, knowledge and understanding, there is an appropriate balance of subjects from

the curriculum being assessed.

Although PMETB does not expect any one trainee to be assessed on the whole curriculum, there is a requirement that over a

period of time an organisation can provide evidence that all aspects of the curriculum have been sampled. This is important

because PMETB is anxious to instil in trainees the appreciation of having a breadth of knowledge and skills, and recognises

that evidence of the possibility that any aspect of the curriculum/GMP might be assessed drives relevant learning.
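
The requirement to show that all aspects of the curriculum have been sampled over a period of time can be checked with a simple record of what each assessment sitting covered. The following is a minimal sketch only: the curriculum areas, sitting names and the function `sampling_summary` are hypothetical and are not taken from any PMETB or College blueprint.

```python
from collections import Counter

# Hypothetical curriculum areas; a real blueprint would list the organisation's own.
CURRICULUM = {"cardiology", "respiratory", "renal", "communication", "ethics"}

# Hypothetical record of which areas each assessment sitting sampled.
sittings = {
    "2007 spring": {"cardiology", "respiratory", "communication"},
    "2007 autumn": {"renal", "cardiology"},
    "2008 spring": {"respiratory", "communication"},
}


def sampling_summary(curriculum, sittings):
    """Return (number of times each area was sampled, areas never sampled)."""
    counts = Counter({area: 0 for area in curriculum})
    for sampled in sittings.values():
        unknown = sampled - curriculum
        if unknown:
            # An area outside the curriculum suggests a blueprinting error.
            raise ValueError(f"areas not in curriculum: {sorted(unknown)}")
        counts.update(sampled)  # adds one for each area in this sitting
    never_sampled = sorted(area for area, n in counts.items() if n == 0)
    return dict(counts), never_sampled


counts, never_sampled = sampling_summary(CURRICULUM, sittings)
print("Times sampled:", counts)
print("Never sampled:", never_sampled)
```

A non-empty "never sampled" list over the period the organisation has defined would indicate a gap in the evidence described above. How often each area should appear remains a weighting decision that the organisation must justify.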


The assessment burden (feasibility)

PMETB recognises that the frequency of assessments should not be so excessive as to overburden the trainee, or to exhaust

the assessors or the assessment system. As Dame Onora O’Neill stated in her 2002 BBC Reith Lecture: “Plants don’t flourish

when we pull them up too often to check how their roots are growing.”

There is a requirement to be able to balance a number of issues in this context and PMETB will expect assessment

organisations to be able to show that they understand the need to recognise:

• There is a tension between having a large number of tests in which the organisation is confident and their overuse to the

point that the trainee is overburdened.

• There is always to some degree a conflict between validity and reliability. It is important to avoid the danger of focusing

too much on reliability at the expense of important attributes that cannot easily be assessed using traditional examination

based methods.

• Whilst it is important to summate good performances, assessment systems are required to assess global professional

judgments and expose dangerous weaknesses, even if they represent a minority of the decision making outcomes.

• Sufficient time needs to be given between assessments for the trainee to reflect on his/her performance and to allow this

to be reinforced through its application in further clinical practice.

Feedback

PMETB sees appropriate feedback as being at the heart of assessment. Feedback must be provided from all assessments

and the assessment organisation must be able to demonstrate that the feedback that has occurred is appropriate and that it

has been given in a timely and useful manner.

‘Timeliness’ will vary depending on the nature of the assessment, but as a rule feedback should always be given as soon as

possible after the assessment.

Feedback should also be such as to demonstrate that the trainee has provided evidence of competence if this is the case or,

if not, define a framework suitable for a trainee to use as the basis of acquiring the necessary competence.

Wherever possible, feedback to the trainee should include identifying those areas of the assessment in which the trainee

has shown mastery or excellence to provide them with an understanding not only of their weaknesses, but also their

strengths.

Whenever feedback is given it should be done in such a way that an external observer would be reassured that the

outcome of the assessment had reflected the trainee’s skills within a framework built on the principle that public well-being

was the prime driver of medical education and practice.


Chapter 6: Constructing the assessment system

Introduction

This chapter shows:

• how application of the previous five chapters might lead to a ‘blueprint’ of assessment methodologies, which would collectively address the needs of the training programme.

Purposes

The assessment system will begin by defining the purposes for which assessment is needed. There will be a range of

purposes at every level in the training programme, including: developing skills, developing insight, testing knowledge,

being certain of minimum competence, checking actual day-to-day performance, etc. Assessment may even have the

purpose of enhancing the organisation, rather than just the individual (e.g. the desire to embed the desired attributes in

normal practice).

The definition of the purposes will be firmly based upon the GMC’s GMP, which dictates the desired attributes of all UK

doctors (and thus of the systems to which they belong).

The main areas (‘domains’) of GMP relevant to assessment are:

Good clinical care

• clinical assessment;

• treatment;

• insight;

• record keeping;

• fairness.

Keeping up-to-date

• specialty based knowledge;

• procedural and technical skills;

• understanding evidence based medicine.

Maintaining and improving performance

• involvement in audit and quality improvement;

• promoting the patient’s self care;

• promoting patient safety.

Teaching, training, appraising and assessing

• commitment to educational activities;

• understanding medical education principles.

Starting with the purposes of the training programme, and keeping these constantly in mind, methods of assessment must

then be chosen. In selecting methods, it is useful to consider the various modalities of assessment which exist. It is worth

checking what is already known before launching into the creation of a new assessment instrument. It is emphasised that

the reason for such background work is not simply that existing tools can be ‘borrowed’. It is vital to understand that an

assessment which has been validated in another setting cannot be assumed to be valid or reliable in a different setting. It

is rather that taking account of the experience of others in what does and what does not work can save having to make the

same mistakes twice.

Under the auspices of the Academy of Medical Royal Colleges, individual Colleges have been invited to share the work they

have done and are doing on assessment methodologies. This is being collated on a website within Modernising Medical

Careers (MMC) - A compendium of assessments. As well as pointing to areas of good practice, this site will allow Colleges to

post their ‘work in progress’ so as to encourage collaboration and learning, both about the assessment instruments and their

evaluation.

Assessment tools are of a number of qualitatively different types, each of which tends to be used in a different setting. It is

useful to consider these different categories, as such consideration will make it more likely that the complete assessment

system is adequately comprehensive.

One classification of assessments

One classification of assessments is to consider how they are conducted in the life of the trainee. There may be assessment of:

1) A real (medical) patient encounter

This may be done in the work setting (e.g. mini-CEX, mini-ACE), or it may be a video of such an encounter reviewed

later. The feature of this type of assessment is that it is real (and therefore difficult to standardise) and that typically an

Educational Supervisor might conduct the assessment on a one-to-one basis with the trainee. However, other assessors may

sometimes be used, such as other professionals in the team or the patients themselves.


2) Direct observation of a skill

This category is again assessment of real life activities, where the focus of the assessment is the skill with which the activity

was performed, e.g. technical skills (DOPS, OSATS), teaching skills and presentation skills. The consistent feature is that one

or more assessors, who are trained in the assessment of that skill, make a judgment about a real life performance.

3) Behaviour over time

These assessments ask multiple observers to assess behaviour, typically with regard to generic attributes such as team

working, verbal communication and diligence. They share the feature that they are a collection of retrospective and

subjective opinions of key professionals based on observation over a period of time. These are usually referred to as MSF

(e.g. mini-PAT, TAB).

4) Behaviour in a real situation or environment

The focus of this type of assessment would not be an individual patient, but rather the trainee’s management of the whole

situation. These types of assessments tend to look at behaviours and skills such as teamwork, clinical prioritisation,

‘situational awareness’, clinical leadership, etc. Typically, such assessments are performed in busy acute settings such as a

labour ward or an emergency room.

5) Discussion of clinical materials

These assessments are usually performed in the absence of patients on a one-to-one basis, often outside the clinical setting.

There is a large variety of materials which may be used for such discussion, e.g. case notes (CBD), charts (CSR), videos

(VSR), etc. Because this type of assessment is based on actual materials, it is easier to standardise.

6) Simulation

Sometimes the system requires that skills will be developed which need to be tested more often than the clinical realities

will allow, or which need to be checked before the trainee is allowed to practise them. In such circumstances, the trainee

might be assessed in a simulated setting. Because simulation allows more standardisation, this modality is appropriate

for a variety of areas, such as practical procedures, using models or manikins; communication assessment using standard

patients; teamwork assessment using simulated situations, etc.

Simulations, although having the advantage of reproducibility, have the disadvantage of being less real than real life.

Assessments by computer simulation in some areas are beginning to address these needs.

7) Cognitive assessments

These share the feature that a group of trainees may be assessed simultaneously, typically by means of a written test such as

MCQ, EMQ, CRQ, etc.

8) Reflective practice

These assessments are not assessments of actual performance, skill or knowledge, but instead they assess the trainee’s

insight when reflecting on these things. Materials on which the trainee might reflect include portfolios, clinical letter writing

(SAIL), CAEs and other case events, etc.

It is possible to use this classification to produce a ‘matrix’, where these categories appear along the x-axis of a grid, the

y-axis being formed by the domains of GMP (see Appendix 4). This allows a quick visual check to ensure that all the domains

of GMP have been appropriately addressed within the assessment system.
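
The grid described above can also be held as a simple data structure so that the check for unaddressed domains is automatic. This is an illustrative sketch only: the category and domain names follow this chapter, but the placement of individual tools in cells is a hypothetical example and does not reproduce Appendix 4. The function names are likewise assumptions.

```python
# Assessment categories from this chapter (x-axis of the grid).
CATEGORIES = [
    "real patient encounter",
    "direct observation of a skill",
    "behaviour over time",
    "behaviour in a real situation",
    "discussion of clinical materials",
    "simulation",
    "cognitive assessment",
    "reflective practice",
]

# GMP domains listed in this chapter (y-axis of the grid).
GMP_DOMAINS = [
    "good clinical care",
    "keeping up-to-date",
    "maintaining and improving performance",
    "teaching, training, appraising and assessing",
]

# Hypothetical placements of tools: (domain, category) -> tools used.
placements = {
    ("good clinical care", "real patient encounter"): ["mini-CEX"],
    ("good clinical care", "discussion of clinical materials"): ["CBD"],
    ("keeping up-to-date", "cognitive assessment"): ["MCQ"],
    ("keeping up-to-date", "direct observation of a skill"): ["DOPS"],
    ("maintaining and improving performance", "reflective practice"): ["portfolio"],
}


def build_matrix(placements):
    """Build a domain x category grid; each cell lists the tools placed there."""
    matrix = {d: {c: [] for c in CATEGORIES} for d in GMP_DOMAINS}
    for (domain, category), tools in placements.items():
        # A KeyError here means the placement names an unknown domain or category.
        matrix[domain][category].extend(tools)
    return matrix


def uncovered_domains(matrix):
    """Return the domains for which no cell contains any tool."""
    return [d for d, row in matrix.items() if not any(row.values())]


matrix = build_matrix(placements)
print("Domains with no assessment:", uncovered_domains(matrix))
```

An empty cell is not a fault in itself, since the system need not contain every type of assessment. Only a domain with no tool in any cell signals a gap.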

This classification is simply intended to ‘pigeonhole’ assessments into various types so as to make it easier to share practice

and to compare what is being done. It does not in any way intend to dictate that an assessment system must contain all of

these types. Very importantly, this categorisation does not attempt to define the purposes to which the assessment is put.

In fact, it is the purpose which must come first and the choice of assessment method second. PMETB regards this as the first

principle of constructing assessment systems.

Once methods have been selected, consideration then has to be given to:

• ensuring that they fit the principles of utility (Chapter 1);

• standard setting (Chapter 2);

• quality assurance (Chapter 3);

• selection and training of assessors (Chapter 4);

• integration of assessment into the curriculum (Chapter 5).


References

1. PMETB. Principles for an assessment system for postgraduate medical training. 2004. Available from: www.pmetb.org.uk/pmetb/publications/

2. van der Vleuten C. The assessment of professional competence: developments, research and practical implications.

Advances in Health Sciences Education. 1996; 1: 41-67.

3. Schuwirth L, van der Vleuten C. How to design a useful test: the principles of assessment. Edinburgh: ASME. 2006.

4. Schuwirth LW, Southgate L, Page GG, Paget NS, Lescop JM, Lew SR, et al. When enough is enough: a conceptual basis for

fair and defensible practice performance assessment. Medical Education. 2002 Oct; 36(10): 925-30.

5. van der Vleuten CP, Schuwirth LW. Assessing professional competence: from methods to programmes. Medical

Education. 2005 Mar; 39(3): 309-17.

6. Downing SM. Reliability: on the reproducibility of assessment data. Medical Education. 2004 Sep; 38(9): 1006-12.

7. Crossley J, Davies H, Humphris G, Jolly B. Generalisability: a key to unlock professional assessment. Medical Education.

2002 Oct; 36(10): 972-8.

8. Cronbach L, Shavelson R.J. My current thoughts on coefficient alpha and successor procedures. Educational and

Psychological Measurement. 2004 June; 64(3): 391-418.

9. Streiner D, Norman G. Health Measurement Scales: A Practical Guide to their Development and Use. 2nd ed. New York:

Oxford University Press. 1995.

10. Newble D, Jolly B, Wakeford R, editors. The Certification and Recertification of Doctors: Issues in the Assessment of

Clinical Competence. Cambridge University Press. 1994.

11. Downing SM. Validity: on meaningful interpretation of assessment data. Medical Education. 2003 Sep; 37(9): 830-7.

12. Crossley J, Humphris GM, Jolly B. Assessing health professionals: introduction to a series on methods of professional

assessment. Medical Education. 2002; In press.

13. Dauphinee D, Fabb W, Jolly B, Langsley D, Wealthall S, Procopis P. Determining the content of certifying examinations.

In: Newble D, Jolly B, Wakeford R, editors. The certification and recertification of Doctors: Issues in the assessment of

clinical competence: Cambridge University Press. 1994; 92-104.

14. Miller G. The assessment of clinical skills/competence/performance. Academic Medicine. 1990; 65(Suppl): S63-S7.

15. Bloom B. Taxonomy of educational objectives. London: Longman. 1965.

16. Eraut M. Developing Professional Knowledge and Competence. London: Falmer Press. 1994.

17. Bridge PD, Musial J, Frank R, Roe T, Sawilowsky S. Measurement practices: methods for developing content-valid

student examinations. Medical Teacher. 2003 Jul; 25(4): 414-21.

18. Roberts C, Newble D, Jolly B, Reed M, Hampton K. Assuring the quality of high-stakes undergraduate assessments of

clinical competence. Medical Teacher. 2006 Sep; 28(6): 535-43.

19. GMC. Good Medical Practice. 2006 [cited 2002 October]. Available from: http://www.gmc-uk.org/standards/

20. Swanson D, Norman G, Linn R. Performance-based assessment: Lessons learnt from the health professions. Educational

Researcher. 1995; 24(5): 5-11.

21. Swanwick T, Chana N. Workplace assessment for licensing in general practice. British Journal of General Practice. 2005;

55: 461-7.

22. Dixon H. Trainees’ views of the MRCGP examination and its effects upon approaches to learning: a questionnaire study

in the Northern Deanery. Education for Primary Care. 2003; 14: 146-57.

23. Livingston SA, Zieky MJ. Passing scores: a manual for setting standards of performance on educational and occupational

tests. Princeton: Educational Testing Service. 1982.

24. Glass G. Standards and criteria. Journal of Educational Measurement. 1978; 15(4): 237-61.

25. Cizek G. Standard Setting. In: Downing S, Haladyna T, editors. Handbook of Test Development. Mahwah, NJ: Lawrence

Erlbaum; 2006; 225-57.


26. Cusimano M. Standard setting in medical education. Academic Medicine. 1996; 71(Suppl.): S112-20.

27. Downing S, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance

examinations in health professional education. Teaching and Learning in Medicine. 2006; 18(1): 50-7.

28. Norcini JJ. Setting standards on educational tests. Medical Education. 2003 May; 37(5): 464-9.

29. Berk R. Standard setting The next generation: Where few psychometricians have gone before. Applied Measurement in

Education. 1993; 9(3): 215-35.

30. Boulet JR, De Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized

patient examinations. Medical Teacher. 2003 May; 25(3): 245-9.

31. Holsgrove G. Principles of assessment. In: Whitehouse C, Roland M, Campion P, editors. Teaching medicine in the

community. Oxford: Oxford University Press. 1997; 183-5.

32. Angoff W. Scales, norms and equivalent scores. In: Thorndike R, editor. Educational Measurement. Washington DC:

American Council on Education. 1971; 508-600.

33. Ebel R. Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall. 1972.

34. Boulet J, de Champlain A, McKinley D. Setting defensible performance standards on OSCEs and standardized patient

examinations. Medical Teacher. 2003; 25(3): 245-9.

35. Holsgrove G, Kauser Ali S. Quality assurance, standard-setting and item banking in professional examinations. College

of Physicians and Surgeons: Pakistan. 2004.

36. Rothman A, Cohen R. A comparison of empirically and rationally defined standards for clinical skills checklist.

Academic Medicine. 1996; 71(Suppl): S1-30.

37. Clauser B, Clyman S. A contrasting groups approach to standard setting for performance assessments of clinical skills.

Academic Medicine. 1994; 69(10 Suppl): S42-4.

38. Wakeford R, Patterson F. The MRCGP Clinical Skills Assessment Standard setting and related quality issue. 2006.

39. Kaufman D, Mann K, Muijtjens A, van der Vleuten C. A comparison of standard setting procedures for an OSCE in

undergraduate medical education. Academic Medicine. 2000; 75: 267-71.

40. Streiner D, Norman G. Generalizability. Health Measurement Scales: A Practical Guide to their Development and Use.

3rd ed. New York: Oxford University Press. 2003; 128-43.

41. Brennan R. Generalizability Theory. New York: Springer Verlag. 2001.

42. Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Annals of Internal

Medicine. 2003 Mar 18; 138(6): 476-81.

43. Holmboe ES, Hawkins RE, Huot S. Effects of training in direct observation of medical residents’ clinical competence: a

randomized trial. Annals of Internal Medicine. 2004; 140: 874-81.

44. Ramsey PG, Wenrich MD. Peer ratings. An assessment tool whose time has come. Journal of General Internal Medicine.

1999 Sep; 14(9): 581-2.

45. Lockyer JM, Violato C, Fidler H. A multi source feedback program for anaesthesiologists. Canadian Journal of

Anaesthesia. 2006 Jan; 53(1): 33-9.

46. Lockyer J. Multisource feedback in the assessment of physician competencies. The Journal of Continuing Education in

the Health Professions. 2003 Winter; 23(1): 4-12.

47. Crossley J, Davies H, Eiser C. The measurement characteristics of children’s and parents’ ratings of the doctor-patient

interaction: measuring what matters well. Archives of Disease in Childhood. 2003; 88(Suppl 1): A50.

48. Archer JC, Norcini J, Davies HA. Use of SPRAT for peer review of paediatricians in training. British Medical Journal. 2005

May 28 ; 330(7502): 1251-3.

49. Archer J, Norcini J, Southgate L, Heard S, Davies H. mini-PAT (Peer Assessment Tool): A Valid Component of a National

Assessment Programme in the UK? Advances in Health Sciences Education: Theory and Practice. 2006 Oct 12.

50. NHSE. A Guide to Specialist Registrar Training. 1998.

51. Khera N, Davies H, Lissauer T, Skuse D, Wakeford R, Stroobant J. How should paediatric examiners be trained? Archives

of Disease in Childhood. 2005 Jan; 90(1): 43-7.


52. Davis D. Does CME work? An analysis of the effect of educational activities on physician performance or health care

outcomes. International Journal of Psychiatry in Medicine. 1998; 28(1): 21-39.

53. Wood R. Assessment and Testing: a survey of research. Cambridge: University of Cambridge Local Examination

Syndicate. 1991.

54. Streiner D, Norman G. Health Measurement Scales: A Practical Guide to their Development and Use. 3rd ed. New York:

Oxford University Press. 2003.

Further reading

Angoff WH. Scales, norms and equivalent scores in Educational Measurement, Ed. Thorndike RL. Washington DC: American

Council on Education. 1971; 508-600.

Berk RA. Standard setting, The next generation: Where few psychometricians have gone before. Applied Measurement in

Education. 1993; 9(3); 215-235.

Boulet JR, de Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized patient

examinations. Medical Teacher. 2003; 25: 245-249.

Brailovsky CA, Grand’maison P, Lescop J. A large-scale multicenter objective structured clinical examination for licensure.

Academic Medicine. 1992; 67(10 Suppl); S37-S39.

Brennan RL. Generalizability Theory (New York: Springer-Verlag). 2001.

Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. Philadelphia, PA: National

Board of Medical Examiners. 1996.

De Champlain A. Ensuring that the competent are truly competent: an overview of common methods and procedures

used to set standards on high stakes examinations. Journal of Veterinary Medical Education. 2004; 31(1).

Cizek G. Standard Setting (2006) in Downing S and Haladyna TM (Eds) Handbook of Test Development. Mahwah, NJ:

Lawrence Erlbaum. 2006; Chapter 10.

Chinn RN, Hertz NR. Alternative approaches to standard setting for licensing and certification examinations. Applied

Measurement in Education. 2002; 15(1); 1-14.

Clauser BE, Clyman SG. A contrasting groups approach to standard setting for performance assessments of clinical skills.

Academic Medicine. 1994 Oct; 69(10 Suppl): S42-4.

Cusimano MD. Standard setting in medical education. Academic Medicine. 1996; 71(Suppl.): S112-120.

Downing S, Lieska N, Raible M. Establishing Passing Standards for Classroom Achievement Tests in Medical Education: A

Comparative Study of Four Methods. Academic Medicine. Research in Medical Education: Proceedings of the Forty-second

Annual Conference. 2003 Oct; 78(10) Suppl: S85-S87.

Downing S, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance

examinations in health professional education. Teaching and Learning in Medicine. 2006; 18, 1: 50-57.

Ebel RL. Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall. 1972.

Glass GV. Standards and criteria. Journal of Educational Measurement. 1978; 15, 4; 237-261.

Holsgrove G. Principles of assessment; in Teaching medicine in the community (Editors: Whitehouse C, Roland M, Campion

P). Oxford: Oxford University Press. 1997a; Chapter 28; 183-185.

Holsgrove G. Assessing knowledge, in Teaching medicine in the community (Editors: Whitehouse C, Roland M, Campion P).

Oxford University Press. 1997b; Chapter 29; 186-194.

Holsgrove G, Kauser Ali S. Quality assurance, standard-setting and item banking in professional examinations. Internal

document for the College of Physicians and Surgeons Pakistan. 2004.

Jolly B, Grant J (Eds). The good assessment guide. Joint Centre for Education in Medicine. 1997.

Kaufman DM, Mann KV, Muijtjens AM, van der Vleuten CP. A comparison of standard setting procedures for an OSCE in

undergraduate medical education. Academic Medicine. 2000; 75: 267-271.


Kramer A, Muijtjens A, Jansen K, Dusman H, Tan L, van der Vleuten C. Comparison of a rational and an empirical standard

setting procedure for an OSCE. Medical Education. 2003; 37: 132-139.

Livingston S A, Zieky M J. Passing scores: a manual for setting standards of performance on educational and occupational

tests. Princeton, NJ: Educational Testing Service. 1982.

Norcini J. The metric of medical education: setting standards on educational tests. Medical Education. 2003; 37; 464-9.

PMETB. Principles for an assessment system for postgraduate medical training. 2004. Available from: www.pmetb.org.uk/pmetb/publications/

Rothman AI, Cohen R. A comparison of empirically and rationally defined standards for clinical skills checklists. Academic

Medicine. 1996; 71(Suppl): S1-30.

Royal College of Psychiatrists. Workplace based assessment materials (2006). Available from: http://www.rcpsych.ac.uk/training/workplace-basedassessment/wbadownloads.aspx

Schuwirth LW, van der Vleuten CP (in print). How to design a useful test: the principles of assessment. Understanding

Medical Education. Association for the Study of Medical Education.

Streiner DL, Norman GR. Health Measurement Scales (3rd edition). Oxford University Press. 2003.

Swanson DB, Norcini JJ. Factors influencing reproducibility of tests using standardized patients. Teaching and Learning in

Medicine. 1989; 1(3): 158-166.

Wakeford R, Patterson F. The nMRCGP Clinical Skills Assessment Standard Setting and Related Quality Issues. Paper to the

RCGP/COGPED Assessment Group. 2006.

Wilkinson TF, Newble DI, Frampton CM. Standard setting in an objective structured clinical examination: use of global

ratings of borderline performance to determine the passing score. Medical Education. 2001; 35: 1043-1049.

Wood R. Assessment and Testing: a survey of research. Cambridge, University of Cambridge Local Examination Syndicate.

1991.


Appendices

Appendix 1: Reliability and measurement error

Having set a passing standard, the measurement error associated with the assessment (which is another PMETB

requirement) can be established and a protocol to identify and appropriately manage borderline trainees formulated. This

involves calculating the reliability and Standard Error of Measurement of the assessment.

All measurement methods have a margin of error. In some instances quite a large measurement error can be acceptable. For

example, our bathroom scales might quite reasonably have a measurement error of up to 1 kg and still be fit for routine use.

Such a margin would be quite unacceptable for weighing babies, however, even though both types of scales are making

measurements in the same domain - mass.

Since both formal and workplace based assessments are measuring something (clinical competence, for example) they are,

therefore, not exempt from the universal rule that all measurement methods have an associated margin of error. Assessors

often overestimate the accuracy of marks awarded in their assessments, but in fact, the measurement error of many exams

is uncomfortably large. Historically, this has not bothered assessors unduly because the measurement error of UK exams is

rarely calculated, or even acknowledged.

This situation will change. In order to comply with PMETB requirements on quality assurance, quality control and the

assessment system (Principle 4), assessing bodies must address two specific questions:

i) What is the measurement error around the agreed level of proficiency?

ii) What steps are taken to account for measurement error, particularly in relation to borderline performance?

Moreover, the answers to these questions must be transparent and in the public domain.

In order to provide satisfactory answers to these questions it is necessary to do three things:

a) calculate the measurement error of the assessment (either as a whole, or for each component part if each is considered

separately);

b) agree a policy for determining how the borderline trainees will be identified and how you will be making pass/fail

decisions about them;

c) implement strategies for reducing measurement error, where this is necessary.

Reliability

The term reliability has a specific and rather complicated meaning in relation to the mathematical performance of

assessment instruments. In this context reliability is concerned with the accuracy with which the trainee’s performance is

determined and reported. Readers seeking additional information would be interested in the excellent coverage by Wood

(53), Streiner and Norman (54) and in various superb publications by Lee Cronbach.

Reliability has two components and is expressed as:

\[ \text{Reliability} = \frac{\text{Subject Variability}}{\text{Subject Variability} + \text{Measurement Error}} \]

Reliability is typically reported as a coefficient called coefficient alpha, or Cronbach’s alpha in honour of its inventor.

In essence, an alpha value expresses the amount of variance between trainees that is genuinely due to true differences

between them and, therefore, also shows us how much of the variability in the marks is not actually due to differences

between the trainees but to other sources of variance such as inconsistencies between assessors and random error. Thus,

an alpha of 0.6 would indicate that 60% of the measured variance was due to genuine differences between trainees (and,

therefore, that 40% was not). Traditionally, the accepted minimum value for alpha in an examination has been 0.8. This

remains the benchmark below which an exam or elements within it should not fall. However, there is a consensus among

medical educationalists that high stakes assessments, such as most of the Royal College examinations, should have a

reliability of at least 0.9. That said, it is not necessary to go much beyond 0.9 because of the increasing likelihood that this

would mean that the exam was testing more or less the same thing in slightly different ways.
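As an illustration only, the Python sketch below shows one common way of calculating coefficient alpha from a table of item marks. The formula used, alpha = k/(k - 1) × (1 - sum of item variances / variance of total marks), is the standard textbook form and is not set out in this guide; the function name and the sample marks are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a marks table.

    scores: one row per trainee, one column per item (all rows the same length).
    """
    k = len(scores[0])                                   # number of items
    item_columns = list(zip(*scores))                    # transpose: one tuple per item
    sum_item_variances = sum(pvariance(col) for col in item_columns)
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)

# Hypothetical marks: 5 trainees x 4 items, each item marked out of 5.
example_marks = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 4, 5, 4],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
]
print(round(cronbach_alpha(example_marks), 2))
```

The calculation fails if every trainee has the same total mark (the total variance is then zero), which is itself a sign that the assessment is not discriminating between trainees.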


Measurement error

The reliability coefficient involves two components: the true variance between trainees (which is what is really required

- at least to the extent that there is confidence that the right trainees have passed or failed) and additional variance due to

measurement error, which needs to be measured so that it can be compensated for. These two components also need to be

identified for test development and quality assurance purposes.

In exam analysis, measurement error is calculated and reported as the standard error of measurement (SEM). This is done

using the simple formula:

\[ \text{SEM} = \text{Standard Deviation} \times \sqrt{1 - \text{Cronbach's alpha}} \]

In terms of assessment development, the SEM can help in identifying individual assessments that need to be improved,

though the reliability coefficient is more important in this regard. The main use of the SEM, however, is to enable the proper

identification of the borderline trainees - those whom the examination has not been able to confidently place on one side or

the other of the pass mark.

a) The standard error of measurement and confidence intervals

The SEM provides the basis for calculating the range of marks that defines the borderline group, a group that poses a similar problem in every examination. This is because the SEM can be used to set a confidence interval around each mark. For example, ± 1 SEM represents a confidence interval of about 68%. In other words, 68% of the time a trainee’s

‘true’ mark would be within ± 1 SEM of the mark they obtained in the test - or, to put it the other way, there is about a 1 in

3 chance that their exam mark was not even within 1 SEM of their ‘true’ mark. However, since the passing standard itself is

also associated with errors (for example, as discussed above, different methods and different assessors arriving at different

passing scores), a 68% confidence interval is probably adequate for determining borderline trainees.
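The 68% figure comes from treating measurement error as approximately normally distributed, which is an assumption rather than something stated explicitly here. Under that assumption, the following Python sketch shows the confidence level that corresponds to a band of any number of SEMs. The function name is hypothetical.

```python
from statistics import NormalDist

def confidence_for_sems(n_sems):
    """Probability that a normally distributed error lies within +/- n_sems SEMs."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(n_sems) - standard_normal.cdf(-n_sems)

print(round(confidence_for_sems(1.0), 2))   # 0.68
print(round(confidence_for_sems(1.96), 2))  # 0.95
```

A wider band gives more confidence about the trainees outside it, but places more trainees inside the borderline group.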

b) Identifying borderline trainees

The use of SEMs in determining passing, failing and borderline trainees can be illustrated by taking a hypothetical

assessment where the pass mark (determined, of course, by one of the methods described above) is 50% and the standard

deviation is 10.

If the reliability of the assessment was 0.8, the SEM would be 4.47. Based on a confidence interval of 68% (i.e. using 1 SEM),

the borderline trainees would be those with marks of 50% ± 4.47 - this means, between 45.53% and 54.47%. Confidence

can be better than 68% that trainees with marks above 54.47% really had passed and those below 45.53% really had failed

(more confidence could be felt the further above or below these two points). However, the hypothetical assessment would

not have been able to place trainees with marks in the 45.53% to 54.47% range on the correct side of the pass mark, even at

only a very modest 68% confidence level.
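The worked example above can be reproduced with a few lines of Python. This is a minimal sketch using the SEM formula given earlier; the function names and the three-way labels are illustrative assumptions, not a prescribed procedure.

```python
import math

def standard_error_of_measurement(sd, alpha):
    """SEM = standard deviation x square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - alpha)

def borderline_zone(pass_mark, sd, alpha, n_sems=1.0):
    """Lower and upper bounds of the borderline zone around the pass mark."""
    sem = standard_error_of_measurement(sd, alpha)
    return pass_mark - n_sems * sem, pass_mark + n_sems * sem

def classify(mark, pass_mark, sd, alpha, n_sems=1.0):
    """Label a mark as 'fail', 'borderline' or 'pass' (labels are illustrative)."""
    low, high = borderline_zone(pass_mark, sd, alpha, n_sems)
    if mark < low:
        return "fail"
    if mark > high:
        return "pass"
    return "borderline"

low, high = borderline_zone(pass_mark=50, sd=10, alpha=0.8)
print(round(low, 2), round(high, 2))  # 45.53 54.47
print(classify(52, 50, 10, 0.8))      # borderline
```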

Since the SEM is partly dependent on reliability, it is obvious that the SEM will be smaller in a reliable assessment than in an

unreliable one with the same standard deviation. Consequently, one way of reducing the SEM (and, thus, confidently placing

a greater proportion of trainees on the correct side of the pass/fail cutting point) is to improve the reliability. There are

several ways of doing this, of which the main methods are:

• increase testing time/number of items;

• produce better items (questions);

• improve examiner training;

• improve marking schedules;

• use optical mark reading or computer based testing to minimise marking errors;

• reject badly performing items from the examination before calculating final marks.


The box below illustrates the effect of improving reliability on the SEM by extending the example given above, which was based on a reliability of 0.8. Below are comparative figures for the same scenario in our hypothetical exam, using the same 50% pass mark and standard deviation of 10, with reliabilities from 0.6 to 0.9:

Reliability (Cronbach’s alpha)    Standard Error of Measurement    Borderline zone (based on 1 SEM)

0.6                               6.32                             43.68% to 56.32%

0.7                               5.48                             44.52% to 55.48%

0.8                               4.47                             45.53% to 54.47%

0.9                               3.16                             46.84% to 53.16%

Borderline trainees will have marks within the zone shown.

It is clear that the more reliable version of the hypothetical exam is likely to have considerably fewer trainees in the

borderline zone.
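To make the point concrete, the sketch below counts how many trainees in an invented cohort fall inside the borderline zone at each level of reliability. The marks are hypothetical, and the standard deviation is simply taken as 10 to match the scenario above rather than calculated from the list.

```python
import math

# Hypothetical cohort marks (percentages); invented for illustration only.
marks = [38, 42, 44, 46, 47, 49, 50, 51, 52, 53, 55, 56, 58, 61, 67, 72]

def count_borderline(marks, pass_mark, sd, alpha):
    """Number of marks within +/- 1 SEM of the pass mark."""
    sem = sd * math.sqrt(1.0 - alpha)
    return sum(1 for m in marks if pass_mark - sem <= m <= pass_mark + sem)

for alpha in (0.6, 0.7, 0.8, 0.9):
    print(alpha, count_borderline(marks, pass_mark=50, sd=10, alpha=alpha))
# 0.6 10
# 0.7 8
# 0.8 7
# 0.9 6
```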


Appendix 2: Procedures for using some common methods of standard setting

Test based methods

a) Angoff’s method (32)

1) A group of assessors is assembled and briefed. If necessary, this group can be subdivided later into working groups

that should each have at least five members and preferably fewer than ten.

2) In open discussion, the group outlines the characteristics of an imaginary group of borderline trainees - i.e. those with

about a 50/50 chance of passing.

3) Each assessor looks at the first exam item and, independently, estimates the proportion of borderline trainees who

would get the correct answer, for MCQs, etc, or, if the individual assessment is an OSCE station, how many of the

available marks a borderline trainee would get. For example, an assessor might estimate that about 40% of borderline

trainees might get a particular MCQ correct, or that borderline trainees would probably gain about 4 marks out of the

10 available on a particular OSCE station of assessment.

4) The estimates are then discussed and assessors can subsequently change their own estimate if they wish.

5) Assessors’ ‘final’ estimates for the item are collected and averaged to give the ‘provisional’ standard for that item of

assessment.

6) The process is then repeated for each of the remaining individual assessments in the exam.

7) Finally, the sum of the ‘provisional’ standards for each item is calculated and divided by the number of items. This

becomes the pass mark (or standard) for the whole exam, which can subsequently be revised if it becomes clear that it

is too high or too low - assessors usually tend towards setting it too high.
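The arithmetic in steps 5 to 7 can be sketched in Python as follows. This is an illustration under stated assumptions: each assessor's final estimate is recorded as a proportion between 0 and 1 (for an OSCE station, the estimated marks divided by the marks available), and the function name and figures are hypothetical.

```python
from statistics import mean

def angoff_pass_mark(estimates):
    """Angoff standard from assessors' final estimates.

    estimates[a][i] is assessor a's estimate for item i, as a proportion (0 to 1).
    Returns (overall pass mark, list of provisional item standards).
    """
    n_items = len(estimates[0])
    # Step 5: average the assessors' estimates for each item.
    item_standards = [mean(assessor[i] for assessor in estimates)
                      for i in range(n_items)]
    # Step 7: sum the provisional standards and divide by the number of items.
    return sum(item_standards) / n_items, item_standards

# Three hypothetical assessors, three items.
final_estimates = [
    [0.4, 0.6, 0.5],
    [0.5, 0.7, 0.6],
    [0.3, 0.5, 0.7],
]
pass_mark, per_item = angoff_pass_mark(final_estimates)
print(round(pass_mark * 100, 1))  # 53.3
```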

b) Ebel’s method (33)

The process begins in the same way as Angoff’s method, with a discussion about the characteristics of a borderline trainee.

However, it then moves in a different direction. Assessors enter each item onto a matrix according to their individual assessments

about relevance (essential, important, supplementary and questionable) and difficulty (difficult, moderate, and easy).

Matrix for Ebel’s method:

                  Difficult    Moderate    Easy

Essential

Important

Supplementary

Questionable

For example, question 1 might be assessed as essential and difficult; question 2 important and moderate, etc:

                  Difficult    Moderate    Easy

Essential         Q1           -           -

Important         -            Q2          -

Supplementary     -            -           -

Questionable      -            -           -

In the classical Ebel method, when all the items have been classified in this way, assessors will make an initial estimate of

the number of items in each cell that a borderline trainee would get right.

However, in the Holsgrove and Kauser Ali (2004) modification, having made their individual assessments about the difficulty

and importance of each item of assessment, the assessors first do three extra things:

1) Agree about the classification of the items of assessment in the matrix (using a majority decision if necessary).

2) Scrutinise any assessment items classified as ‘questionable’ and either reclassify them (which would almost invariably

be as ‘supplementary’) or, more frequently, remove them from the exam and replace them with more assessment items

testing more important material.


3) Look at the overall matrix to ensure that:

• the majority of assessment items are of moderate difficulty;

• the majority of assessment items test essential material.

A reasonable final distribution of items would be along the lines of:

                  Difficult    Moderate    Easy

Essential         10%          35%         10%

Important         5%           20%         5%

Supplementary     5%           5%          5%

Questionable      None

The rationale for aiming for such a pattern is threefold:

• an assessment item that is not clearly essential, important or supplementary (i.e. one that is trivial or irrelevant) should not be in the assessment system as a whole;

• the focus of assessment should be on essential material;

• the most effective items of assessment are good discriminators of moderate difficulty.

It might be necessary to substitute assessment items in order to get this kind of distribution. A well designed and

maintained assessment item bank should have psychometric data on assessments that have been used on previous

occasions. This will include measures of reliability, difficulty and discrimination indices, as well as mapping to the area of

the curriculum that they are assessing.

In the modified version, only after the three steps above have been completed will the assessors move on to estimate the

number of assessment items in each cell that a borderline trainee would get right. Also in the modified version, if the items

of assessment are OSCE stations, the assessors will estimate how many marks a borderline trainee would get for each

station. So if, for example, an assessor estimated that a borderline trainee would score 4 out of 10 on Station 1, and 7 out of

12 on Station 2, this would be recorded on the grid in the following manner:

Difficult Moderate Easy

Essential Station 1 (4/10)

Important Station 2 (7/12)

Supplementary

Questionable None

From this point onwards, the two versions of the Ebel method are the same.

Having made their individual estimates, the assessors discuss the marks for each cell, led by those giving the highest and

lowest estimates in each case. As with the Angoff method, assessors are free to change their estimates as a result of these

discussions. Following the discussions, the proportions (or, in the case of OSCE stations, the marks) assigned by each

assessor are averaged for each of the nine active cells. These averages are then summated to produce the overall standard

(pass mark).
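The final calculation can be sketched in Python as follows. One assumption goes beyond the wording above: each assessor's estimate is recorded as the proportion of the items in a cell that a borderline trainee would get right, so each cell average is weighted by the number of items in that cell before the cells are added together. The function name, cell labels and figures are hypothetical.

```python
from statistics import mean

def ebel_pass_mark(cells):
    """Ebel standard as a proportion of the total number of items.

    cells maps (relevance, difficulty) to a pair:
    (number of items in the cell, list of assessors' estimated proportions correct).
    """
    total_items = sum(n_items for n_items, _ in cells.values())
    # Average the assessors' estimates for each cell, weight by the number of
    # items in the cell, then add the cells together.
    expected_correct = sum(n_items * mean(estimates)
                           for n_items, estimates in cells.values())
    return expected_correct / total_items

# Hypothetical 50-item paper with three active cells and two assessors.
example_cells = {
    ("essential", "easy"): (10, [0.9, 0.8]),
    ("essential", "moderate"): (30, [0.6, 0.7]),
    ("important", "difficult"): (10, [0.3, 0.4]),
}
print(round(ebel_pass_mark(example_cells) * 100, 1))  # 63.0
```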

Trainee based methods

Two trainee based methods are described by Downing et al (27) and elsewhere, and are summarised here. Both are based

on assessments about the performance of individual trainees, rather than the content or difficulty of the items themselves.

However, one is based on performance at each individual unit of assessment, the other over the assessment as a whole.

a) Borderline group method

1) Assessors are orientated and briefed about the station or individual assessment item they will be assessing and the checklist and three point rating scale they will be using.

2) Assessors observe each trainee’s performance on their allocated unit of assessment.

3) A global rating (pass/borderline/fail) is made of each trainee, together with a detailed rating on a multiple item

score sheet.

4) The mean score on the multiple item score sheet for the trainees receiving a global ‘borderline’ rating for the individual unit of assessment is taken as the passing standard for that unit of assessment.
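Step 4 amounts to a single average, as the Python sketch below shows. The data layout (a checklist score paired with a global rating for each trainee), the function name and the scores are assumptions made for illustration.

```python
from statistics import mean

def borderline_group_standard(results):
    """Passing standard for one unit of assessment.

    results: list of (checklist_score, global_rating) pairs, where
    global_rating is 'pass', 'borderline' or 'fail'.
    """
    borderline_scores = [score for score, rating in results
                         if rating == "borderline"]
    if not borderline_scores:
        raise ValueError("no trainees were rated borderline on this unit")
    return mean(borderline_scores)

# Hypothetical results for one OSCE station marked out of 20.
station_results = [
    (14, "pass"),
    (9, "borderline"),
    (11, "borderline"),
    (5, "fail"),
    (10, "borderline"),
]
print(borderline_group_standard(station_results))  # 10
```

With only a handful of borderline ratings the resulting standard will itself be unstable, so the number of trainees in the borderline group is worth reporting alongside the pass mark.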

b) Contrasting groups method


1) Assessors are orientated and briefed about the individual item of assessment they will be examining and the rating

scales they will be using.

2) Assessors observe each trainee’s performance on their allocated unit of assessment.

3) Each trainee is placed into one of the two contrasting groups using a global rating based on external criteria,

performance descriptors and their overall performance in the assessment.

4) The rating scale results for both contrasting groups are represented graphically as curves. The pass mark is

provisionally set where the two curves intersect.

5) The pass mark can subsequently be adjusted in either direction if the provisional mark appears to be unjustly passing

or failing certain trainees.
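Reading the intersection off a graph can be approximated numerically. The Python sketch below does not fit or intersect curves; it uses a simpler stand-in, choosing the cut score that misclassifies the fewest trainees relative to their global rating, which plays the same role as the provisional pass mark in step 4. The function name and scores are hypothetical.

```python
def contrasting_groups_cut(fail_scores, pass_scores):
    """Cut score that minimises disagreement with the global ratings.

    A trainee is treated as passing if their score is at or above the cut.
    Returns (cut score, number of misclassified trainees); ties go to the
    lowest candidate cut score.
    """
    candidates = sorted(set(fail_scores) | set(pass_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        wrongly_passed = sum(1 for s in fail_scores if s >= cut)
        wrongly_failed = sum(1 for s in pass_scores if s < cut)
        errors = wrongly_passed + wrongly_failed
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors

# Hypothetical rating scale scores for the two contrasting groups.
rated_fail = [3, 4, 5, 5, 6, 6]
rated_pass = [6, 7, 8, 8, 9, 10]
print(contrasting_groups_cut(rated_fail, rated_pass))  # (7, 1)
```

As in step 5, the result is provisional and can be adjusted if it appears to be unjustly passing or failing certain trainees.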

Combined and compromise methods

Hofstee’s method

This is based on item content and difficulty, but also takes account of agreed parameters regarding the proportions of

passing and failing trainees, and the highest and lowest acceptable pass mark. The description below is based on that of

Case and Swanson (1996).

Based on the assessments about item difficulty, the standard setters estimate the highest score for a trainee to fail and

the lowest score that would allow someone to pass. Once agreed (this is often done by taking median values between

the highest and lowest estimates) these two values are plotted on a graph. For example, the standard setters might agree

that trainees with a mark below 50% should not pass (i.e. the lowest score that would allow someone to pass) and that the

highest acceptable pass mark would be 60%. Therefore, points are entered on the graph along the ‘score’ axis at 50% and

60%.

The standard setters then agree on the highest and lowest acceptable percentages of failing trainees. They might agree, for

example, that a zero failure rate would be acceptable, but that no more than 20% of trainees should be allowed to fail. These,

too, are plotted on the graph, this time along the other axis ‘% Fail’.

The graph now contains a rectangle based on the four agreed values established above - zero and 20% on the ‘% Fail’ axis,

and 50% and 60% on the ‘% Correct score’ axis.

After the examination and calculation of final marks, the trainees’ scores are plotted as a graph of fail rate as a function of scores obtained.

Finally, a line is drawn from the upper left to lower right corner of the rectangle. Where this line intersects the graph determines the standard (pass mark).

[Figure: Hofstee plot. Horizontal axis: % Correct score (from below 45 to 85); vertical axis: % Fail (0 to 100).]
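The Hofstee procedure can also be carried out numerically rather than graphically. The Python sketch below is one possible reading: it steps through candidate pass marks between the lowest and highest acceptable values and returns the first one at which the observed fail rate reaches the diagonal line. The function name, the step size, the treatment of cases where the curve misses the rectangle (the result is clamped to the nearest limit) and the scores are all assumptions.

```python
def hofstee_pass_mark(scores, min_pass, max_pass, min_fail, max_fail, step=1.0):
    """Pass mark where the fail-rate curve meets the Hofstee diagonal.

    scores: final marks (percentages) for all trainees.
    min_pass, max_pass: lowest and highest acceptable pass marks.
    min_fail, max_fail: lowest and highest acceptable fail rates (percentages).
    """
    n = len(scores)
    n_steps = int(round((max_pass - min_pass) / step))
    for i in range(n_steps + 1):
        cut = min_pass + i * step
        # Observed fail rate if the pass mark were set at this cut score.
        fail_rate = 100.0 * sum(1 for s in scores if s < cut) / n
        # Diagonal from (min_pass, max_fail) down to (max_pass, min_fail).
        line = max_fail - (max_fail - min_fail) * (cut - min_pass) / (max_pass - min_pass)
        if fail_rate >= line:
            return cut
    return max_pass  # curve stays below the line: clamp to the upper limit

# Hypothetical cohort of 20 trainees, with the limits used in the example above.
cohort = [45, 48, 52, 54, 56, 58, 59, 61, 62, 63,
          64, 65, 66, 68, 70, 72, 74, 76, 80, 85]
print(hofstee_pass_mark(cohort, min_pass=50, max_pass=60, min_fail=0, max_fail=20))  # 53.0
```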


Appendix 3: AoMRC, PMETB and MMC categorisation of assessments

Purpose

This is a system to enable sharing of best practice. Grouping assessments into categories facilitates being able to see what

sort of assessments (and validations of assessments) have already been developed in each category. This may potentially

lead to the development of common competencies across the specialties. However, the principal purpose is to prevent

unnecessary duplication. Categorisation makes it easier to find out what exists, what works and to compare like with like.

The categories

There is a spectrum of methodologies in assessment, from assessing what actually happens in the workplace, through

simulation and OSCEs, to assessment in exam halls. There is also a spectrum of sophistication of the level at which an

assessment may test. Miller described these levels as ‘Knows’, ‘Knows How’, ‘Shows How’ and ‘Does’. This spectrum might

also be labelled ‘Knowledge’, ‘Competence’ and ‘Performance’.

This categorisation attempts to reflect these spectra. Inevitably, this may mean a degree of overlap between categories, but

this should not interfere with the purpose.

This is intended to be a living, evolving system. Constructive comment for future iterations should always be welcomed.

1) A real (medical) patient encounter

• an actual individual patient encounter - e.g. by mini-CEX, mini-ACE;

• a video of patient encounter(s) in the workplace - e.g. in general practice;

• feedback from patients - e.g. ‘patient record’ in general practice.

This may be done in the work setting (e.g. mini-CEX, mini-ACE) or it may be a video of such an encounter, reviewed later.

The feature of this type of assessment is that it is real (and therefore difficult to standardise) and that typically an educational

supervisor might conduct the assessment on a one-to-one basis with the trainee.

However, other assessors may sometimes be used, such as other professionals in the team, or the patients themselves.

2) Direct observation of a skill

• direct observation of a skill - e.g. DOPS, OSATS;

• teaching skills;

• presentation skills.

This category is again assessment of real life activities, where the focus of the assessment is the skill with which the activity

was performed, e.g. technical skills (DOPS, OSATS), teaching skills and presentation skills. The consistent feature is that one

or more assessors, who are trained in the assessment of that skill, make a judgment about a real life performance.

3) Behaviour over time

• MSF - e.g. TAB, mini-PAT.

These assessments ask multiple observers to assess behaviour, typically with regard to generic attributes such as team

working, verbal communication and diligence. They share the feature that they are a collection of retrospective and

subjective opinions of key professionals, based on observation over a period of time. These are usually referred to as MSF

(e.g. mini-PAT, TAB).

4) Behaviour in a real situation or environment

• observation of teamwork - e.g. in psychiatry;

• (simultaneous) multiple actual patient encounter - e.g. in emergency room, labour ward.

The focus of this type of assessment would not be an individual patient, but rather the trainee’s management of the whole

situation. These types of assessments tend to look at behaviours and skills such as teamwork, clinical prioritisation,

‘situational awareness’, clinical leadership, etc. Typically, such assessments are performed in busy acute settings such as a

labour ward or an emergency room.


5) Discussion of clinical materials

• review of a documented incident or of medical records - e.g. case note review;

• discussion of clinical material - e.g. case based discussion, chart stimulated recall, video stimulated recall;

• critical thinking/understanding/evaluation of evidence.

These assessments are usually performed in the absence of patients, on a one-to-one basis - often outside the clinical

setting. There is a large variety of materials which may be used for such discussion, e.g. case notes (CBD), charts (CSR),

videos (VSR), etc. Because this type of assessment is based on actual materials, it is easier to standardise.

6) Simulation

• consultation skills - e.g. with ‘standard patient’ or other role player;

• simulated practical procedure - e.g. on a manikin or a model;

• simulated teamwork exercise - e.g. simulated group discussion;

• simulated situation management - e.g. CPR, A&E, moulage;

• computer simulation.

Sometimes the system requires that skills will be developed which need to be tested more often than the clinical realities

will allow, or which need to be checked before the trainee is allowed to practise them. In such circumstances, the trainee

might be assessed in a simulated setting. Because simulation allows more standardisation, this modality is appropriate

for a variety of areas, such as practical procedures, using models or manikins; communication assessment using standard

patients; teamwork assessment using simulated situations, etc.

Simulations, although having the advantage of reproducibility, have the disadvantage of being less real than real life.

Assessments by computer simulation in some areas are beginning to address these needs.

7) Cognitive assessments

• knowledge - e.g. by invigilated test such as MCQ, EMQ;

• problem solving/higher cognitive assessment/application of knowledge - e.g. CRQ;

• other written assessments.

These share the feature that a group of trainees may be assessed simultaneously, typically by means of a written test such as

MCQ, EMQ, CRQ, etc.

8) Reflective practice

• review of outcomes of care, or of processes undertaken;

• review of trainee-held materials - e.g. file (‘portfolio’) of achievements;

• reflective practice - e.g. reflective diary, written up case, topic or event.

These assessments are not assessments of actual performance, skill or knowledge, but instead they assess the trainee’s

insight when reflecting on these things. Materials on which the trainee might reflect include portfolios, clinical letter writing

(SAIL), CAEs and other case events, etc.


Appendix 4: Assessment good practice plotted against GMP

[Table: assessment categories (columns) plotted against GMP headings (rows). Only the headings are recoverable.]

Column headings:

1) A real (medical) patient encounter - e.g. mini-CEX

2) Direct observation of skill - e.g. DOPS, OSATS

3) Behaviour over time, multi-source feedback - e.g. TAB, mini-PAT

4) Behaviour in a real situation or environment - e.g. A&E, labour ward

5) Discussion of clinical materials - e.g. CBD, VSR, CBR

6) Simulation - e.g. role play, manikin, computer

7) Cognitive assessments - e.g. MCQ, CRQ

8) Reflective practice - e.g. portfolio, reflective diary

Row headings, in the order in which they appear:

Good clinical care; Clinical assessment; Treatment; Insight; Record keeping; Fairness; Keeping up-to-date; Specialty based knowledge; Procedural and technical skills; Understanding evidence based medicine; Maintaining and improving performance; Folder; Audit & quality; Patient safety; Teaching, training, appraising and assessing; Commitment; Understanding medical education principles; Practical skills in teaching, appraising and assessment; Relationship with patients; Respect; Communication with patients; Child protection; Responding to problems; Informed consent; Confidentiality


Glossary of terms

A working paper used by the Workplace Based Assessment Subcommittee of the Postgraduate Medical Education and

Training Board to define their work and documents.

This glossary is based on material from the Tehran University Medical School website (http://www.tums.ac.ir/edc/Glossary.htm), the MRCP (UK) Glossary of testing terms and contributions from members of the PMETB Workplace Based Assessment Subcommittee.

Ability The level of successful performance of the objects of measurement on the variable.

Accommodation A change in standard examination conditions which aims to lessen the impact of a trainee’s disability on their performance.

It should not alter the purpose or nature of the examination or provide an unfair advantage to the disabled trainee.

Accreditation A self-regulatory process by which governmental, non-governmental, voluntary associations and other statutory bodies grant

formal recognition to educational institutions or programmes that meet or exceed stated criteria of educational quality.

Achievement test A test designed to measure and quantify a person’s knowledge and/or skill.

Adaptive testing A sequential form of individual testing in which successive items in the test are based primarily on the participant’s

response to previous items.

Anchor item An item with known performance characteristics, which is included in more than one test in order to provide comparative

information about the items in the new version of the test and also about the test takers attempting it.

Angoff method A method of standard setting (discussed in more detail in the PMETB Standard Setting document) based on group

judgments about the performance of hypothetical borderline (‘just passing’) trainees.

Appeal Formal request to the awarding body for reconsideration of a decision (commonly the pass/fail decision).

Appraisal An individual and private planned review of progress focusing on achievements and future activities.

Assessment The process of measuring an individual’s progress and accomplishments against defined standards and criteria, which

often includes an attempt at measurement. The purpose of assessment in an educational context is to make a judgment

about mastery of skills or knowledge; to measure improvement over time; to arrive at some definitions of strengths and

weaknesses; to rank people for selection or exclusion, or perhaps to motivate them. Assessment should be as objective

and reproducible as possible. A reliable test should produce the same or similar score on two occasions or if given by

two assessors. The validity of a test is determined by the extent to which it measures what it sets out to measure. There are

different kinds of assessment, though the distinction between the first two - formative and summative - is now becoming

less important because of the development of assessment programmes, the use of evidence from assessments for multiple

purposes and the increasingly common practice of providing feedback following all assessments.

• Formative assessment is used as part of a developmental or ongoing teaching/learning process. It is a check on progress

that does not contribute to pass/fail decisions, but informs teachers and learners about strengths, weaknesses and any

problem areas. It is best used when accompanied by feedback to the student.

• Summative assessment traditionally takes the form of tests and often occurs at the end of a term or a course. However,

especially in postgraduate medical education, other sources of evidence are increasingly contributing to summative

assessment. Summative assessment is used primarily to provide information about whether or not the student has reached

the required standard and it can form the basis of pass/fail decisions.

• Criterion referenced assessment refers to an absolute standard, i.e. the individual's performance against a benchmark.

Unless there is a particular reason not to do so (such as a limited number of places for students who pass) all summative

assessments should be criterion referenced.


• Norm referenced assessment ranks each student’s performance against all the others in the same cohort, with a (usually)

predetermined number of the top students passing. Norm referenced assessment is inherently unfair because a student

may pass or fail simply because of the company they keep. Unless there is an exceptional reason, such as a limited

number of places for successful students to progress into, norm referenced assessment should not be used.

Assessment: 360-degree Can be used to assess interpersonal and communication skills, professional behaviours and many aspects of patient

care and systems based practice. Assessors completing rating forms in a 360-degree evaluation are usually a mixture

of superiors, peers, subordinates, other team members, patients and their families. Most 360-degree assessments use a

structured questionnaire to gather information about an individual’s performance in several domains such as teamwork,

communication, leadership and management skills, decision making, etc. This is a useful instrument for both formative and

summative assessment, and an excellent source of evidence on which to give feedback to the person who was assessed.

Assessment programmes Contemporary best practice favours assessment strategies that are multi-faceted and assess an appropriate spectrum of

knowledge, skills, competencies and personal attributes in an adequately reliable way. Such a programme of assessment is

based upon and determined by the curriculum (see PMETB Principles of assessment).

Bias Systemic variance that skews the accurate reporting of data in favour of, or against, a particular individual or group.

Blueprint A template used to define the content of a given test. In medical education, it is often designed as a matrix or a series of

matrices.

CanMEDS Canadian Medical Education Directives for Specialists - an innovative framework for medical education produced by the

Royal College of Physicians and Surgeons of Canada, organised around seven key roles: Medical Expert (the central role),

Communicator, Collaborator, Health Advocate, Manager, Scholar and Professional.

Competency The knowledge, skill, attitude or combination of these, that enables one to effectively perform the activities of a particular

occupation or role to the standards expected.

Certification The process by which governmental, non-governmental or professional organisations or other statutory bodies grant

recognition to an individual who has met certain predetermined standards specified by the organisation and who

voluntarily seeks such recognition.

Chart stimulated recall oral examination (CSR) A measurement tool which permits the assessment of clinical decision making and the application of medical knowledge

with real patients using a standardised oral examination. A trained and experienced physician examiner questions

the examinee about the provided care, probing for reasons behind the differential diagnoses, interpretation of clinical

findings and management plans. The examiners rate the examinee using an established protocol and scoring procedure.

CSR can be used in a formal situation where trainees discuss a number of cases, but it is increasingly proving to be a

valuable instrument in workplace based assessment, where it tends to be carried out on a case-by-case basis, or even

opportunistically. Research is also being carried out into using CSR in multi-disciplinary learning (contact Dr Gareth

Holsgrove).

Clinical competence A student’s ability to do what is expected at a satisfactory level of facility, at a certain point in time, e.g. at graduation. It is

the acquisition of a body of relevant knowledge and of a range of relevant skills which includes personal, interpersonal,

clinical and technical components. In the case of clinical education, which is primarily based on an apprenticeship model,

teachers define what the student is expected to do and then test their ability to do it. However, because of the complex

reality of what doctors actually do on a day-to-day basis, ‘clinical competence’ gives us a rather limited view of their work,

professional experience and expertise. Most clinical actions are concerned with problems for which there is no clear

answer or no single solution and where no two patients are the same, even if they have the same condition. An experienced

doctor searches his or her mind and sifts through a wide range of options and in some cases the solution will be something

he or she has never come up with before. Therefore, competence itself is best seen as a prerequisite for performance in the

real clinical setting where it would be expected that a doctor operated at a higher level in many areas and demonstrated

mastery in some.


Clinical supervisor A term used in UK postgraduate medical education to describe an individual, almost invariably a senior doctor, responsible

for overseeing a trainee’s clinical work and providing guidance and feedback.

Communication skills These skills lead to proficiency in communication - an essential skill for clinical practitioners because of the large and

varied number of people doctors must communicate with every day and the range of circumstances, some of which

might be very distressing, in which they must communicate. The idea that doctors automatically learn communication

through experience or that doctors are inherently either good or bad communicators is long abandoned. It is now widely

acknowledged that both students and postgraduate doctors can be educated in communication skills and their proficiency

can develop to extremely high levels of expertise.

Competence The possession of requisite or adequate ability, having acquired the knowledge and skills necessary to perform those

tasks that reflect the scope of professional practices. It may be different from performance, which denotes what someone is

actually doing in a real life situation.

Competencies A set of professional abilities that includes elements of knowledge, skill, attitudes and experience.

Construct A specific professional concept. See ‘Construct validity’ below, under ‘Validity’.

Correlation coefficient Describes the strength of the relationship between two variables. Correlations range from -1.0 to +1.0 in value. A correlation

coefficient of 1.0 indicates a perfect positive relationship, a correlation coefficient of 0.0 indicates no relationship between

the two variables and a correlation coefficient of -1.0 indicates a perfect negative relationship.

If a correlation coefficient is squared, the resulting number indicates the ‘percentage of the variation’ in the two variables

that is in common. For example, a correlation of 0.7 between two variables indicates that 49% (0.7 x 0.7 = 0.49) of the

variation in one variable can be predicted from the other.

In a test, items should correlate moderately positively with each other. Negative correlations indicate that the items are

either testing material from different domains, or that at least one of them is flawed. Strongly positive correlations indicate

that they are testing more or less the same thing (because most of the variation in one can be predicted from variation in the

other, as explained in the preceding paragraph).
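
The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the sample marks are invented for the example and are not part of the PMETB guidance.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)

def shared_variation(r):
    """Proportion of variation in one variable predictable from the other."""
    return r ** 2

# Hypothetical marks for five trainees on two items (illustrative data only)
item_a = [1, 2, 3, 4, 5]
item_b = [2, 4, 6, 8, 10]
r = pearson_r(item_a, item_b)  # item_b is exactly double item_a
print(round(r, 2), round(shared_variation(0.7), 2))
```

In practice, item correlations would normally be computed with a statistics package; the sketch is intended only to make the relationship between r and r squared concrete.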

CPD Continuing Professional Development - a key aspect of life-long learning, CPD refers to the learning activities that doctors

undertake after their formal specialist training is complete.

Criterion referencing Criterion referenced assessment measures performance against an absolute standard. In other words, each trainee’s

performance is judged against a benchmark (usually the pass mark).

Cronbach’s alpha A measure of internal consistency, the most commonly measured aspect of the reliability of a test. It is an average of all possible split half

reliability measurements. The generally accepted minimum value of Cronbach’s alpha for a test is 0.8, but for high stakes

examinations it should be at least 0.9.
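
As a hedged illustration, coefficient alpha can be computed from a table of item scores using the standard textbook formula (the number of items, the variance of each item and the variance of trainees' total scores). The formula is not given in this guide, and the function name and data below are assumptions made for the example.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a score table.

    scores: one row per trainee, each row a list of item scores
    (every row must contain the same number of items, at least two).
    """
    k = len(scores[0])
    # Variance of each item across trainees
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the trainees' total scores
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical results: four trainees, three items marked 0 or 1
marks = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(marks)
print(round(alpha, 2))
```

A test this short would be expected to fall well below the 0.8 benchmark; real examinations have far more items and trainees.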

CRQ Critical Reading Question - an assessment based on responses to questions regarding an article (often a research article)

from a book or (more commonly) journal. The questions might ask about research methodology, robustness of conclusions,

clinical implications, etc.

Curriculum A curriculum is a statement of the aims and intended learning outcomes of an educational programme. It states the

rationale, content, organisation, processes and methods of teaching, learning, assessment, supervision and feedback. If

appropriate, it will also stipulate the entry criteria and duration of the programme.

Discriminator An item that discriminates well between weaker and stronger test takers, stronger trainees performing statistically better

than weaker ones.


Distractor A term, becoming obsolete, for incorrect options in a multiple choice question.

Domain The scope of knowledge, skills, competencies and professional characteristics that can be combined for practical reasons

into one cluster.

Educational agreement A mutually acceptable educational development plan drawn up jointly by the trainee and their educational supervisor.

Educational supervisor The person who is responsible for the overall supervision and management of an individual student or trainee’s educational

programme.

Evaluation In the UK curriculum, evaluation refers to the process of determining the quality and value of an educational programme. In

US usage, evaluation includes both the quality of the programme and the assessment of individuals on the programme. In

the UK, ‘assessment’ is used of individuals and ‘evaluation’ of programmes.

Evidence based medical education (EBME) An education that is based on the best evidence available. It should take into account such factors as how reliable the

available evidence is, what its utility, extent and strength are, and how valid and relevant it is.

Examination A formal, controlled method or procedure to assess an individual’s knowledge, skills and abilities. Examinations might

involve written or oral responses, or observation of the trainee performing practical tasks.

Examiner A person appropriately skilled, experienced and trained to conduct examinations.

Experience Exposure to a range of medical practice and clinical activity.

Extended matching questions (EMQs) A more detailed form of multiple choice question (MCQ) having a lead-in statement such as a clinical vignette, followed by

a homologous list of at least five options from which the trainee selects one or more, as instructed.

Facility A statistical property indicating the level of difficulty of a question (between 0.0 and 1.0) obtained from the average score for

the question divided by the maximum achievable score, based only on the cohort of trainees attempting that particular item.
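
The calculation described above can be sketched as follows. The function name, and the use of None to mark a trainee who did not attempt the item, are assumptions made for this example.

```python
def facility(item_scores, max_score):
    """Facility of one item: average score divided by the maximum achievable score.

    item_scores: one score per trainee; None means the trainee did not
    attempt the item and is excluded from the calculation.
    """
    attempted = [s for s in item_scores if s is not None]
    average = sum(attempted) / len(attempted)
    return average / max_score

# Hypothetical item marked out of 4; the third trainee did not attempt it
print(facility([2, 4, None, 3], max_score=4))
```

A value near 1.0 indicates an item that most trainees answered well, and a value near 0.0 an item that most found difficult.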

Fail Awarded a score below the pass mark.

Formative assessment Assessment carried out for the purpose of improvement rather than pass/fail decision making. The distinction between

formative and summative - decision making - assessment is becoming less important as evidence from assessment is

increasingly being used for multiple purposes.

Generalisability theory An extension of classical reliability theory and methodology that is now becoming the preferred option. A multiple analysis

of variance is used to indicate the magnitude of errors from various specified sources, such as the number of items in the

test, whether marking is carried out by one examiner or more, etc. The analysis is used both to indicate the reliability of the

test and to evaluate the generalisability of scores beyond the specific sample of items, persons and observational conditions

that were studied.

Goal A general aim towards which to strive.

Hofstee method A ‘compromise’ method of standard setting which combines aspects of both relative and absolute methods. It takes account of

both the difficulty of the test items and of the maximum and minimum acceptable failure rate for the exam, and was designed

for use in high stakes examinations with a large number of trainees. Discussed in more detail in the PMETB Principles for an

assessment system for postgraduate medical training (1).


Intended learning outcome The contemporary replacement for learning objectives (see below) which describes (typically in observable terms) the

knowledge, skills, attitudes and competencies that should be demonstrable on completion of a learning episode.

In-training An adjective used in UK medical education to describe ongoing processes that occur in the workplace - for example, in-

training assessment would refer to collecting evidence of progress and attainment over an extended period of time, usually

with regular staging reviews.

Item An individual question or task in an assessment or examination.

Item bank A collection of stored, classified examination items.

Item response theory (IRT) A set of mathematical models for relating an individual’s performance in a test to that individual’s level of ability. These

models are based on the fundamental theory that an individual’s expected performance on a particular test item is a

function of both the level of difficulty of the item and the individual’s level of ability. IRT also examines individual items

in relation to each other, and to the test as a whole, quantifying such characteristics as item difficulty and their ability to

discriminate between good and poor trainees.

Knowledge The acquisition or awareness of facts, data, information, ideas or principles to which one has access through formal or

individual study, research, observation, experience or intuition.

Learning objective A term that is now becoming obsolete, learning objectives describe the specific knowledge or skills which learners are

expected to be able to demonstrate. Rules governing the writing of learning objectives make them difficult to produce and,

as a result, comparatively few items described as learning objectives actually are proper learning objectives. The preferred

alternative is intended learning outcome (see above).

Life-long learning Continuous personal educational development over the course of a professional career. Because medical science changes

so rapidly, it is vital that its practitioners are committed to and engage in life-long learning. In certain countries this

commitment is a statutory requirement.

Measurement error The difference between the ‘true’ score and the score obtained in an assessment. Measurement error is present in all

assessments, but can be minimised by good item design and, up to a point, by increasing the number of test items. It is

usually calculated as the Standard Error of Measurement.

Medical education The ongoing integration of knowledge, experience, skills, qualities, responsibility and values. It has traditionally been

divided into undergraduate, postgraduate and continuing medical education, but increasingly there is a focus on the ‘life-

long’ developmental and integrated nature of medical education.

Medical educator A professional who focuses on the educational process necessary to transform non-physicians into physicians and to keep

them current over their years of practice. Some medical educators are physicians, but many have backgrounds in education,

behavioural science or other health sciences.

Medical Informatics “Medical informatics is a rapidly developing scientific field that deals with the storage, retrieval and optimal use

of biomedical information, data and knowledge for problem solving and decision making” (E. H. Shortliffe). Rapid

development is due to advances in computing, communication technology and an increasing awareness that the knowledge

base of medicine is essentially unmanageable by traditional paper based methods.

MSF Multi-Source Feedback - feedback on a doctor’s performance from a number of co-workers such as other team members,

administrative staff, etc. This feedback is typically given by completing a questionnaire. Replies are collated and an

anonymised summary is produced for feedback.


Multiple choice An item where the trainee selects what they consider to be the correct answer from a list of options. Commonly used in

MCQs (multiple choice questions) and EMQs (extended matching questions).

Multiple choice questions (MCQs) A lead-in statement (typically a short clinical description) followed by a homologous list of options (five is generally

considered the optimum) from which the trainee selects the best answer.

Multiple response questions Apart from some types of extended matching questions (EMQs), this is an obsolete question format where trainees select

various combinations of the proffered options as their correct answer.

Norm referencing A method of establishing passing and failing trainees based on their performance in relation to each other, rather than to

an established standard (criterion referencing). So for example, only the top n number or x% of trainees pass, irrespective

of how strong or weak the cohort is as a whole. Norm referencing should be used only in certain special circumstances,

for example where there is a limited number of posts available for successful trainees to move on to. See also above under

‘Assessment’.
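
The difference between the two approaches can be illustrated with a short Python sketch. The function names, trainee labels and marks are hypothetical, and ties at the cut-off are ignored for simplicity.

```python
def criterion_referenced(scores, pass_mark):
    """Everyone who reaches the pass mark passes, however many that is."""
    return {name for name, score in scores.items() if score >= pass_mark}

def norm_referenced(scores, n_places):
    """Only the top n trainees pass, whatever their actual scores.

    Ties at the cut-off are not handled specially in this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_places])

# A hypothetical weak cohort: nobody reaches a pass mark of 60
weak_cohort = {"A": 50, "B": 45, "C": 30}
print(criterion_referenced(weak_cohort, pass_mark=60))  # nobody passes
print(norm_referenced(weak_cohort, n_places=2))         # the top two still pass
```

The example shows why norm referencing lets a trainee pass or fail because of the rest of the cohort rather than their own performance.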

OSCE Objective Structured Clinical Examination - a multi-station clinical examination (typically having 15 to 25 stations).

Candidates spend a designated time (usually 5 to 10 minutes) at each station demonstrating a clinical skill or competency.

Stations frequently feature real or (more often) simulated patients. Artefacts such as radiographs, lab reports and

photographs are also commonly used.

Outcomes An expression reflecting all possible results that may stem from exposure to a causal factor or activity. In education,

outcomes are part of the training model and this is usually a new skill, knowledge or stimulus to change (improve) practice.

Educational models work from the premise that the outcomes cannot wholly be predicted. Recently, there is a growing

tendency to use the expression ‘intended learning outcomes’, particularly in curriculum design.

Pass To achieve a score (mark) that allows progress in training or successful completion of an examination.

Pass mark The score that allows a trainee to pass an assessment.

Peer review This is an important tool in obtaining evidence about professional attitudes and behaviour. It can be carried out by trainees

to assess each other and is also used by supervisors, nurses and patients to evaluate trainees. It is an important component

of 360-degree assessment.

Performance The application of competence in real life. In the case of medicine, it denotes what a student or doctor actually does in

his/her encounter with patients, their relatives and carers, colleagues, team members and other members of staff, etc.

Performance is not the same as needing to ‘know’ everything. On the contrary, it may well be about knowing what you don’t

or even cannot know - in other words, knowing your own limits.

Performance based assessment Assessment of clinical performance is of the greatest importance but is difficult to measure. However, because of its

importance, considerable efforts and resources are now being brought to bear on the assessment of doctors’ performance

and, since this performance is carried out and observed in the workplace, instruments to assess it are predominantly

workplace based. Performance based assessments, carried out in the workplace, are likely to be among the most important

developments in medical education over the next few years.

Personal development plan (PDP) A prioritised list of educational needs, development goals, actions and processes, compiled by learners and used in

systematic management and periodic reviews of learning. The PDP is an integral part of reflective practice and self-directed

learning for professionals.

Portfolio based learning or portfolios This refers to a collection of evidence documenting learning and achievements. In the UK, portfolios are used routinely in

postgraduate medical education where they are known as RITAs (Records of in-training assessment).


In essence, portfolios contain material collected by the learner over a period of time, which is the learner’s practical and

intellectual property relating to their professional development. It is usually done within some agreed objectives or a

negotiated set of learning activities. Some portfolios are developed in order to demonstrate the progression of learning,

while others are assessed against specific targets of achievement. The learner takes responsibility for the portfolio’s

creation and maintenance, and, if appropriate, its presentation for assessment. As it is based on the learner’s real

experience, it links theory and practice and might also usefully include a reflective element. It assembles evidence

of performance from different sources and enables assessment within a framework of clearly established criteria and

learning outcomes.

Professionalism Adherence to a set of values comprising statutory professional obligations, formally agreed codes of conduct and the

informal expectations of patients and colleagues. Key values include acting in the patients’ best interest and maintaining

the standards of competence and knowledge expected of members of highly trained professions. These standards will

include ethical elements such as integrity, probity, accountability, duty and honour. In addition to medical knowledge and

skills, medical professionals should present psychosocial and humanistic qualities such as caring, empathy, humility and

compassion, social responsibility and sensitivity to people’s culture and beliefs.

Programme director The person with overall day-to-day responsibility for a regional (usually deanery level) postgraduate training programme.

Quality assurance This encompasses all the policies, standards, systems and processes directed to ensuring maintenance and enhancement of

the quality of postgraduate medical education in the UK. PMETB will undertake planned and systematic activities to provide

public and patient confidence that postgraduate medical education satisfies given requirements for quality within the

principles of better regulation.

Quality management This refers to the arrangements by which the Postgraduate Deanery discharges its responsibility for the standards and

quality of postgraduate medical education. It satisfies itself that local education and training providers are meeting the

PMETB standards through robust reporting and monitoring mechanisms.

Quality control This relates to the arrangements (procedures, organisation) within local education providers (Health Boards, NHS Trusts,

Independent sectors) that ensure postgraduate medical trainees receive education and training that meets local, national

and professional standards.

Raw score A test mark that has not been modified (for example, in the light of reliability calculations).

Reliability Expresses a trust in the accuracy or provision of the correct results. In the case of tests, it is an expression of the consistency

and reproducibility (precision) of measurements. This quality is usually calculated statistically and reported as coefficient

alpha (also known as Cronbach’s alpha in recognition of its developer, Lee Cronbach), which is a measure of a test’s internal

consistency. If a single measure of the reliability of an assessment instrument is made, it should be this one. However,

generalisability theory (see above) is becoming the preferred alternative because, although it is considerably more

complicated to calculate, it provides much richer information - especially for test development purposes.

The lowest acceptable value of Cronbach’s alpha in summative assessments is generally agreed to be 0.8. High stakes

assessments must have a higher alpha than this and there is general consensus among test developers that the benchmark

in high stakes examinations should be 0.9.

There are some other important dimensions of reliability. Ideally, measurements should yield the same results when

repeated by the same person or made by different assessors. Among the factors contributing to reliability are the

consistency of marking, the quality of the test and test items themselves, and the type and size of the sample. On top

of this, there will also be a component of random error. Other measures of reliability include stability, equivalence and

homogeneity. The main dimensions of reliability, apart from internal consistency, are as follows:

• Equivalence or alternate form reliability is the degree to which alternate forms of the same measurement instrument

produce congruent results.

• Homogeneity is the extent to which various items legitimately team together to measure a single characteristic.

• Inter-rater reliability refers to the extent to which different raters give similar ratings for similar performances.

• Intra-rater reliability is concerned with the extent to which a single assessor would give similar marks for almost identical


performance, or would be consistent if re-marking a test item.

• Parallel forms reliability refers to the consistency of results between two or more forms of the same assessment, testing

in the same domain. This, along with test-retest reliability to which it is closely associated, is highly significant in

examinations such as Royal College membership and Fellowship examinations, and involves such concepts as producing

matching items, standard setting and examiner training.

• Test-retest reliability (or stability) is the degree to which the same test produces the same results when repeated under

the same conditions.

Result The outcome of a test.

Review Consideration of past events, achievements and performance. This may be either a formal or informal process and can be

an integral part of appraisal, assessment, and feedback.

RITA Record of in-training assessments. A portfolio of assessments that are carried out during training, which is used throughout

UK postgraduate medical education. It is important to note that RITA is not an assessment in its own right, nor is it a review of

progress although it is likely to be used as a source of evidence, gained through assessment, that informs reviews.

SAIL Sheffield Assessment Instrument for Letters - a structured rating form for the assessment of outpatient letters between

hospital and GP.

Score The mark obtained in a test.

Self assessment A process of evaluation of one’s own achievements, behaviour, professional performance and competencies. Self

assessment is an important part of self directed and life-long learning.

Simulated patients Individuals who are not ill but adopt a patient’s history and role for learning or assessment in medical education. Sometimes

programmes use actors to accomplish this goal.

Skill The ability to perform a task well, usually gained by training or experience, or a systematic and co-ordinated pattern of

mental and/or physical activity.

Spearman-Brown formula A calculation derived from classical test theory that predicts the reliability of shortened or (usually) lengthened versions of

a test, based on the reliability calculated from a version of that same test of a specific length.
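
As an illustration, the standard form of the Spearman-Brown prediction can be written as below. The formula itself is not reproduced in this guide, so the expression, the function names and the figures are supplied here as assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened or shortened.

    length_factor: new length divided by original length
    (2.0 doubles the test, 0.5 halves it).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach a target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))

# Hypothetical test with a reliability of 0.6
print(round(spearman_brown(0.6, 2.0), 2))         # predicted reliability if doubled
print(round(length_factor_needed(0.6, 0.75), 2))  # lengthening needed to reach 0.75
```

The prediction assumes that any added items are of similar quality to the existing ones.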

Standard Refers to a model, example or rule for the measure of quantity, weight, extent, value or quality, established by authority,

custom or general consent. It is also defined as a ‘criterion, gauge, yardstick and touchstone’ by which judgments or

decisions may be made. Thus the word ‘standard’ refers simultaneously to both ‘model and example’ and ‘criterion or

yardstick’ for determining how well one’s performance approximates the designed model. Meaningful standards should

offer a realistic prospect of assessing whether or not they are met.

A standard may be mandatory (required by law), voluntary (established by private and professional organisations and

available for use), or de facto (generally accepted by custom or convention, such as the standard of dress, manners or

behaviour).

Standard deviation The square root of the variance, used to indicate the spread of group scores, and a component of the equation used to calculate the

Standard Error of Measurement (SEM).

Standard error of measurement (SEM) Calculated from Cronbach’s alpha and the standard deviation of a test (SEM = SD × √(1 - alpha)), the SEM gives the

confidence intervals for marks awarded to trainees (1 SEM = a confidence interval of 68%; 2 SEMs = 95%; 3 SEMs = 99%).

This is important in identifying borderline trainees. In high stakes examinations, borderline trainees would be those within 2

or even 3 SEMs of the pass mark.


Standard setting The process of establishing the pass mark.

Standards In medical education standards may be defined as ‘a model design or formulations related to different aspects of medical

education and presented in such a way to make possible assessment of graduates’ performance in compliance with

generally accepted professional requirements’. Thus a standard is both a goal (what should be done) and a measure of

progress towards that goal (how well it was done). Medical education standards are set up by consent of experts or by

decisions of an educational authority. Three types of interrelated educational standards might be envisaged:

• Curriculum standards - these describe skills, knowledge, attitudes and values, what teachers are supposed to teach and

what students are expected to learn. There might also be ‘essential (core) requirements’ that the medical curriculum must

meet to equip physicians with the knowledge, skills and attitudes required at the time of graduation.

• Performance or assessment standards - these standards define degrees of attainment of content standards and level of

competencies in compliance with the professional requirements. They describe how well the curriculum standards have

been attained.

• Process (or opportunity-to-learn) standards - these define availability of staff and other resources necessary for students

to be able to meet the content and performance standards.

Summative assessment Assessment carried out for the purpose of (usually pass/fail) decision making. The distinction between formative

assessment (to aid improvement) and summative assessment (for decision making) is becoming less important as evidence

from assessment is increasingly being used for multiple purposes. See also ‘Assessment’ above.

Syllabus A list, or some other kind of summary description, of course contents or topics that might be tested in examinations. In

modern medical education, a detailed curriculum is the document of choice and the syllabus would not be regarded as an

adequate substitute, although one might usefully be included as an appendix.

Trainer An individual providing direct educational support for a doctor in training.

Training The ongoing, workplace based process by which educational experience is provided and competencies obtained.

Triangulation The principle - particularly important in workplace based assessment - that whenever possible, evidence of progress,

attainment or difficulties should be obtained from more than one source, on more than one occasion and, if possible, using

more than one assessment method.

True score A trainee’s score on a test without measurement error - true score is the observed score minus the error.

Utility Utility refers to an evaluation, often in cost benefit form, of the relative value of using a test, as opposed to not using it; or of

using a test in one manner compared with another; or of using one test as opposed to another.

Validity In the case of assessment, validity refers to the degree to which a measurement instrument truly measures what it is

supposed to measure. It is concerned with whether the right things are being assessed, in the right way, and with a positive

influence on learning.

• Content validity

This is concerned with sampling what the student is expected to achieve and demonstrate. The assessment must be

representative and should, for example, cover several categories of competence, a range of patient problems and a

number of technical skills. This aspect of validity is the one of greatest concern to the teachers, though they should also

pay serious attention to consequential validity.

• Face validity

Related to content validity, face validity can be described from the perspective of an interested lay observer. If they feel

that the right things are being assessed in the right way, then the assessment has good face validity.

• Construct validity

The extent to which the assessment, and the individual components of the assessment, tests the professional constructs on


which they are based; the extent to which inferences can be made on the basis of a particular assessment of professional

concepts.

• Criterion-related validity

This is concerned with the overall criteria of the assessment and how it relates to a ‘gold standard’. It is usually sub-

divided into concurrent validity and predictive validity.

• Concurrent validity

This is the degree to which a measurement instrument produces the same results as another accepted or proven

instrument that measures the same parameters.

• Predictive validity

This refers to the degree to which a measure accurately predicts expected outcomes, so, for example, a measure of

attitudes towards preventive care should correlate significantly with preventive care behaviours.

• Consequential validity

This is an important, though often neglected, aspect of the validity of assessment. It refers to the effect that assessment has

on learning and in particular on what students learn and how they learn it. For example, they might omit certain aspects

of the curriculum because they do not expect to be assessed on them, or they might commit large bodies of factual

knowledge to memory without really understanding it in order to pass a test of factual recall and then forget it soon

afterwards. Both these behaviours would indicate that the assessment has poor consequential validity because both lead

to bad learning practices.

Values This is a sociological term referring to what we believe in and what we hold dear about the way we live. Our values

influence our behaviour as persons, groups, communities and cultures - perhaps as a species. Values, therefore, are an

important determinant of individual and community health, but they are difficult to measure objectively.

Weighting Assigning different values to different items, reflecting, for example, their importance or difficulty in order to increase the

effectiveness of a test.

Workplace based assessment The assessment of working practices based on what trainees actually do in the workplace and predominantly carried out in the

workplace itself.

Z-score The z-score for an item indicates how far, and in what direction, an individual trainee’s score deviates from the mean

of the score distribution for that item. It is expressed in units of that distribution’s standard deviation. The z-score transformation is useful for comparing the

relative standings of scores from distributions with different means and/or different standard deviations.
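
The transformation can be sketched as follows. The function name and the marks are invented for illustration, and the population standard deviation is used as a simplifying assumption.

```python
from statistics import mean, pstdev

def z_score(score, all_scores):
    """Distance of one score from the cohort mean, in standard deviations."""
    return (score - mean(all_scores)) / pstdev(all_scores)

# Hypothetical marks for eight trainees on one item (mean 5, SD 2)
cohort = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, cohort))  # positive: above the mean
print(z_score(2, cohort))  # negative: below the mean
```

Because z-scores are expressed in standard deviation units, a trainee's standing on this item can be compared with their standing on an item marked on a different scale.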

Postgraduate Medical Education

and Training Board

Hercules House

Hercules Road

London SE1 7DU

Tel +44 (0)20 7160 6100

Fax +44 (0)20 7160 6102

www.pmetb.org.uk
