validity and reliability in assessment

Validity and Reliability in Assessment

This work is the summarizations Of the previous efforts done by great educatorsA humble presentation by Dr Tarek Tawfik Amin

Measurement experts (and many educators) believe that every measurement device should possess certain qualities

The two most common technical concepts in measurement are reliability and validity

Reliability Definition (Consistency)

The degree of consistency between two measures of the same thing (Mehrens and Lehman 1987(

The measure of how stable dependable trustworthy and consistent a test is in measuring the same thing each time (Worthen

et al 1993)

Validity definition (Accuracy)

Truthfulness Does the test measure what it purports to measure the extent to which certain inferences can be made from test scores or other measurement (Mehrens and Lehman 1987)

The degree to which they accomplish the purpose for which they are being used (Worthen et al 1993(

The usual concepts of validity

The term ldquovalidityrdquo refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are ldquowell-grounded or justifiable being at once relevant and meaningfulrdquo (Messick S 1995) Contentrdquo related to objectives and their sampling

ldquoConstructrdquo referring to the theory underlying the target ldquoCriterionrdquo related to concrete criteria in the real world It can be

concurrent or predictive ldquoConcurrentrdquo correlating high with another measure already

validated ldquoPredictiverdquo Capable of anticipating some later measure ldquoFacerdquo related to the test overall appearance

Old concept

Sources of validity in assessment

Usual concepts of validity


All assessments in medical education require evidence of validity to be interpreted meaningfully In contemporary usage all validity is construct validity which requires multiple sources of evidence construct validity is the whole of validity but has multiple facets (Downing S 2003)

Construct (Concepts ideas and notions) - Nearly all assessments in medical education deal with constructs intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory- Educational achievement is a construct inferred from performance on assessments written tests over domain of knowledge oral examinations over specific problems or cases in medicine or OSCE history-taking or communication skills- Educational ability or aptitude is another example of construct ndash a construct that may be even more intangible and abstract than achievement (Downing 2003)

Downing 2003 Cook S 2007

Content do instrument items completely represent the construct

Response process the relationship between the intended construct and the thought processes of subjects or observers

Internal structure acceptable reliability and factor structure

Relations to other variables correlation with scores from another instrument assessing the same construct

Consequences do scores really make a difference



Content Response process

Internal structure

Relationship to other variables

Consequences

- Examination blueprint

- Representativeness of test blueprint to achievement

domain

- Test specification

- Match of item content to test specifications

- Representativeness of items to domain

- Logicalempirical

relationship of content tested domain

- Quality of test questions

- Item writer qualifications

- Sensitivity review

- Student format familiarity

- Quality control of electronic

scanningscoring

- Key validation of preliminary scores

- Accuracy in combining different formats scores

- Quality controlaccuracy of final scoresmarksgrades

- Subscoresubscale analyses

1-Accuracy of applying pass-fail decision rules to scores

2-Quality control of score reporting

bull Item analysis data

1 Item difficultydiscrimination

2 Itemtest characteristic curves

3 Inter-item correlations

4 Item-total correlations (PBS)

bull Score scale reliability

bull Standard errors of

measurement (SEM)

bull Generalizability

bull Item factor analysis

bull Differential Item

Functioning (DIF)

bull Correlation with other relevant variables (exams)

bull Convergent correlations -

internalexternal

- Similar tests

bull Divergent correlations internalexternal

- Dissimilar measures

bull Test-criterion correlations

bull Generalizability of evidence

bull Impact of test scoresresults

on studentssociety

bull Consequences on learnersfuture

learning

bull Reasonableness of method of establishing pass-fail (cut) score

bull Pass-fail consequences

1 PF Decision reliability-accuracy

2 Conditional standard error of measurement

bull False +ve-ve

Sources of validity 1-Internal Structure

Statistical evidence of the hypothesized relationship between test item scores and

the construct1 -Reliability (internal consistency)

1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability

2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations

3- Scale factor structure4 -Dimensionality studies

5 -Differential item functioning (DIF) studies

Sources of validity

2-Relationship to other variables

Statistical evidence of the hypothesized relationship between test scores and the

construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies

Keys of reliability assessment

ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo

criterion ldquoIntra-raterrdquo related to the examinerrsquos

criterion

Validity and reliability are closely related

A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid




Sources of reliability in assessment

Source of reliability

Description Measures Definitions Comments

Internal consistency

- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well

- We would expect high correlation between item scores measuring a single construct

- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument

- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability

Split-half reliability

Kuder-Richardson 20

Cronbachrsquos alpha

- Correlation between scores on the first and second halves of a given instrument

- Similar concept to split-half but accounts for all items

- A generalized form of the

Kuder-Richardson formulas

- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust

-Assumes all items are equivalent measure a single construct and have dichotomous responses

- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data




Temporal stability

Parallel forms

Agreement (inter-rater

reliability)

Does the instrument produce similar results when administered a second

time

Do different versions of the ldquosamerdquo instrument produce similar results

When using raters does it matter who does the rating

Is one raterrsquos score similar to anotherrsquos

Test-retest reliability

Alternate forms reliability

Percent agreement

Phi

Kappa

Kendallrsquos tau

Intraclass correlation coefficient

Administer the instrument to the same person at different times

Administer different versions of the instrument to the same individual at the same or

different times

identical responses

Simple correlation

Agreement corrected for chance

Agreement on ranked data

ANOVA to estimate how well ratings from different raters coincide

Usually quantified using correlation (eg Pearsonrsquos r)


Does not account for agreement that would occur by chance

Does not account for chance




Generalizability theory

How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process

Generalizability coefficient

Complex model that allows estimation of multiple sources of error

As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed

For example it can determine the relative contribution of internal consistency and

inter-rater reliability to the overall reliability of a given instrument

ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)

Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)


Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory


Different types of assessments require different kinds of reliability

Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory


Reliability ndash How high

1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)


How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings

Conclusion

Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower

References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method

for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation

of a widely disseminated educational framework for evaluating

clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-

Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia

Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching

A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported

patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA

2003290953-958

Referen

ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues in objective scale development Psychol Assess 19957309-319

Resources

For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar

studentsreportitemanalysisphp For a more extensive list of item-writing tips

httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf

httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf

For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf

woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25

Measurement experts (and many educators) believe that every measurement device should possess certain qualities

The two most common technical concepts in measurement are reliability and validity




et al 1993)









Old concept















Internal structure


Consequences



domain




- Logicalempirical







scanningscoring














measurement (SEM)




Functioning (DIF)



internalexternal

- Similar tests






on studentssociety


learning





bull False +ve-ve








Sources of validity







criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25




et al 1993)









Old concept















Internal structure


Consequences



domain




- Logicalempirical







scanningscoring














measurement (SEM)




Functioning (DIF)



internalexternal

- Similar tests






on studentssociety


learning





bull False +ve-ve








Sources of validity







criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25









Old concept















Internal structure


Consequences



domain




- Logicalempirical







scanningscoring














measurement (SEM)




Functioning (DIF)



internalexternal

- Similar tests






on studentssociety


learning





bull False +ve-ve








Sources of validity







criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25














Internal structure


Consequences



domain




- Logicalempirical







scanningscoring














measurement (SEM)




Functioning (DIF)



internalexternal

- Similar tests






on studentssociety


learning





bull False +ve-ve








Sources of validity







criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25








Sources of validity







criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25




criterion















Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25













Kuder-Richardson 20












Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25




Temporal stability

Parallel forms


reliability)


time






Percent agreement

Phi

Kappa

Kendallrsquos tau




different times

identical responses

Simple correlation






























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25























Conclusion










2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25









2003290953-958

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25

Referen


Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25

Resources






woodfordpdf


Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

References

Slide 25

validity and reliability in assessment

Education

ann intern

american psychological

med educ

american council

assessment

hre ns

considered

assessment