Validity and Reliability in Assessment
DESCRIPTION
Describes the essential components of reliability and validity of assessment methods, with special emphasis on medical education.
TRANSCRIPT
Validity and Reliability in Assessment
This work is a summarization of previous efforts by great educators. A humble presentation by Dr. Tarek Tawfik Amin.
Measurement experts (and many educators) believe that every measurement device should possess certain qualities. The two most common technical concepts in measurement are reliability and validity.
Reliability Definition (Consistency)
The degree of consistency between two measures of the same thing (Mehrens and Lehman, 1987).
The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time (Worthen et al., 1993).
Validity definition (Accuracy)
Truthfulness: does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurements (Mehrens and Lehman, 1987).
The degree to which tests accomplish the purpose for which they are being used (Worthen et al., 1993).
The usual concepts of validity
The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable, being at once relevant and meaningful” (Messick S, 1995).
The older concept recognized several types of validity:
- “Content”: related to objectives and their sampling.
- “Construct”: referring to the theory underlying the target.
- “Criterion”: related to concrete criteria in the real world; it can be concurrent or predictive. “Concurrent”: correlating highly with another measure already validated. “Predictive”: capable of anticipating some later measure.
- “Face”: related to the test's overall appearance.
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity but has multiple facets (Downing S, 2003).
Construct (concepts, ideas, and notions):
- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCEs of history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement (Downing, 2003).
The five sources of validity evidence (Downing, 2003; Cook, 2007):
- Content: do instrument items completely represent the construct?
- Response process: the relationship between the intended construct and the thought processes of subjects or observers.
- Internal structure: acceptable reliability and factor structure.
- Relations to other variables: correlation with scores from another instrument assessing the same construct.
- Consequences: do scores really make a difference?
Sources of validity in assessment
Content:
- Examination blueprint
- Representativeness of test blueprint to achievement domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logical/empirical relationship of content tested to domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review

Response process:
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining scores from different formats
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses
- Accuracy of applying pass-fail decision rules to scores
- Quality control of score reporting

Internal structure:
- Item analysis data: item difficulty/discrimination, item/test characteristic curves, inter-item correlations, item-total correlations (point-biserial)
- Score scale reliability
- Standard errors of measurement (SEM)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)

Relationship to other variables:
- Correlation with other relevant variables (exams)
- Convergent correlations (internal/external): similar tests
- Divergent correlations (internal/external): dissimilar measures
- Test-criterion correlations
- Generalizability of evidence

Consequences:
- Impact of test scores/results on students/society
- Consequences on learners/future learning
- Reasonableness of method of establishing pass-fail (cut) score
- Pass-fail consequences: P/F decision reliability-accuracy, conditional standard error of measurement
- False positives/negatives
Sources of validity: 1. Internal structure
Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability
2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations
3. Scale factor structure
4. Dimensionality studies
5. Differential item functioning (DIF) studies
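To make the item-analysis evidence above concrete, here is a minimal sketch (not from the lecture, using hypothetical response data) of the two most common item statistics: difficulty (proportion correct) and discrimination as an item-total point-biserial correlation.

```python
# Illustrative item-analysis sketch (hypothetical data, not from the lecture):
# rows = examinees, columns = items; 1 = correct, 0 = incorrect.
from statistics import mean, pstdev

responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n_items = len(responses[0])
totals = [sum(row) for row in responses]  # each examinee's total score

def difficulty(item: int) -> float:
    """Proportion of examinees answering the item correctly (the p-value)."""
    return mean(row[item] for row in responses)

def point_biserial(item: int) -> float:
    """Item-total correlation: does the item separate high from low scorers?
    (Uncorrected: the item's own score is included in the total.)"""
    scores = [row[item] for row in responses]
    sx, sy = pstdev(scores), pstdev(totals)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean(x * y for x, y in zip(scores, totals)) - mean(scores) * mean(totals)
    return cov / (sx * sy)

for i in range(n_items):
    print(f"item {i}: difficulty={difficulty(i):.2f}, discrimination={point_biserial(i):.2f}")
```

A very easy or very hard item (difficulty near 1.0 or 0.0) and a low or negative point-biserial both flag items worth reviewing.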
Sources of validity: 2. Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
Keys of reliability assessment
- “Stability”: related to consistency over time.
- “Internal”: related to the instrument itself.
- “Inter-rater”: related to the examiners' criteria.
- “Intra-rater”: related to a single examiner's criterion.
Validity and reliability are closely related: a test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable but not necessarily valid.
Sources of reliability in assessment

Internal consistency
Description:
- Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.)
- We would expect high correlation between item scores measuring a single construct.
- Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument.
- Because instrument halves can be considered “alternate forms,” internal consistency can be viewed as an estimate of parallel-forms reliability.
Measures:
- Split-half reliability: correlation between scores on the first and second halves of a given instrument. Rarely used, because the “effective” instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
- Kuder-Richardson 20: similar concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
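As a minimal sketch (hypothetical data, not from the lecture), Cronbach's alpha can be computed from item variances and total-score variance; with 0/1 items the same formula reduces to Kuder-Richardson 20.

```python
# Minimal internal-consistency sketch (hypothetical 0/1 response data).
# Cronbach's alpha; for dichotomous items it equals Kuder-Richardson 20.
from statistics import pstdev

def cronbach_alpha(responses):
    """responses: list of examinee rows, one score per item."""
    k = len(responses[0])  # number of items
    item_vars = [pstdev(row[i] for row in responses) ** 2 for i in range(k)]
    total_var = pstdev(sum(row) for row in responses) ** 2
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"alpha (= KR-20 for dichotomous items): {cronbach_alpha(data):.3f}")  # 0.800
```

Note that alpha rises both when items correlate more strongly and when the number of items grows, which is one reason longer written tests are more reliable.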
Sources of reliability in assessment

Temporal stability
- Description: does the instrument produce similar results when administered a second time?
- Measure: test-retest reliability. Administer the instrument to the same person at different times. Usually quantified using correlation (e.g., Pearson's r).

Parallel forms
- Description: do different versions of the “same” instrument produce similar results?
- Measure: alternate-forms reliability. Administer different versions of the instrument to the same individual at the same or different times. Usually quantified using correlation (e.g., Pearson's r).

Agreement (inter-rater reliability)
- Description: when using raters, does it matter who does the rating? Is one rater's score similar to another's?
- Measures:
  - Percent agreement: proportion of identical responses. Does not account for agreement that would occur by chance.
  - Phi: simple correlation. Does not account for chance.
  - Kappa: agreement corrected for chance.
  - Kendall's tau: agreement on ranked data.
  - Intraclass correlation coefficient: uses ANOVA to estimate how well ratings from different raters coincide.
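The contrast between raw agreement and chance-corrected agreement can be sketched as follows (hypothetical pass/fail ratings from two raters, not from the lecture):

```python
# Sketch of inter-rater agreement statistics (hypothetical ratings).
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

def percent_agreement(a, b):
    """Proportion of identical responses; ignores chance agreement."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    categories = set(a) | set(b)
    # Chance agreement from each rater's marginal category frequencies.
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")  # 0.75
print(f"kappa: {cohen_kappa(rater_a, rater_b):.2f}")                    # 0.47
```

The drop from 0.75 to 0.47 shows why percent agreement alone overstates rater consistency: with only two categories, raters agree often purely by chance.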
Sources of reliability in assessment

Generalizability theory
- Description: how much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: the generalizability coefficient, from a complex model that allows estimation of multiple sources of error.
- As the name implies, this elegant method is “generalizable” to virtually any setting in which reliability is assessed. For example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: “Items” are the individual questions on the instrument. The “construct” is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area. The Spearman-Brown “prophecy” formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

(Cook and Beckman, Validity and Reliability of Psychometric Instruments, 2007)
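The Spearman-Brown “prophecy” formula mentioned above is simple enough to sketch directly (the example numbers are hypothetical):

```python
# Sketch of the Spearman-Brown "prophecy" formula: predicted reliability
# when test length changes by a given factor. Example values are hypothetical.
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Adjusting a split-half correlation of 0.60 up to full length (factor 2):
print(f"{spearman_brown(0.60, 2):.3f}")  # 0.750

# Doubling a test whose scores currently have reliability 0.75:
print(f"{spearman_brown(0.75, 2):.3f}")  # 0.857
```

This is the correction applied to split-half estimates, and it also quantifies why adding (good) items raises reliability, with diminishing returns.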
Keys of reliability assessment
Different types of assessments require different kinds of reliability:
- Oral exams: rater reliability; generalizability theory
- Observational assessments: rater reliability; inter-rater agreement; generalizability theory
- Performance exams (OSCEs): rater reliability; generalizability theory
- Written (MCQs): scale reliability; internal consistency
- Written (essay): inter-rater agreement; generalizability theory
Keys of reliability assessment
Reliability: how high should it be?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
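One way to see why higher stakes demand higher reliability is the standard error of measurement (SEM), listed earlier as internal-structure evidence. A minimal sketch with hypothetical numbers:

```python
# Sketch relating reliability to score precision via the standard error
# of measurement: SEM = SD * sqrt(1 - reliability). Values are hypothetical.
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement of observed scores."""
    return sd * math.sqrt(1 - reliability)

# A licensure-level exam (0.90) vs a quiz (0.60), both with score SD = 10:
print(f"high stakes: SEM = {sem(10, 0.90):.2f}")  # 3.16
print(f"low stakes:  SEM = {sem(10, 0.60):.2f}")  # 6.32
```

At reliability 0.60 an observed score carries roughly twice the measurement error of the same scale at 0.90, which is tolerable for a quiz but not for a pass-fail licensure decision.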
Keys of reliability assessment
How to increase reliability:
For written tests:
- Use objectively scored formats
- Use at least 35-40 MCQs
- Use MCQs that differentiate high- from low-performing students
For performance exams:
- Use at least 7-12 cases
- Use well-trained standardized patients (SPs)
- Monitor quality control
For observational exams:
- Use many independent raters (7-11)
- Use standard checklists/rating scales
- Make ratings timely
Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence is needed
- Multiple sources or methods
- Ongoing research studies
Reliability:
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Measurement experts (and many educators) believe that every measurement device should possess certain qualities
The two most common technical concepts in measurement are reliability and validity
Reliability Definition (Consistency)
The degree of consistency between two measures of the same thing (Mehrens and Lehman 1987(
The measure of how stable dependable trustworthy and consistent a test is in measuring the same thing each time (Worthen
et al 1993)
Validity definition (Accuracy)
Truthfulness Does the test measure what it purports to measure the extent to which certain inferences can be made from test scores or other measurement (Mehrens and Lehman 1987)
The degree to which they accomplish the purpose for which they are being used (Worthen et al 1993(
The usual concepts of validity
The term ldquovalidityrdquo refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are ldquowell-grounded or justifiable being at once relevant and meaningfulrdquo (Messick S 1995) Contentrdquo related to objectives and their sampling
ldquoConstructrdquo referring to the theory underlying the target ldquoCriterionrdquo related to concrete criteria in the real world It can be
concurrent or predictive ldquoConcurrentrdquo correlating high with another measure already
validated ldquoPredictiverdquo Capable of anticipating some later measure ldquoFacerdquo related to the test overall appearance
Old concept
Sources of validity in assessment
Usual concepts of validity
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully In contemporary usage all validity is construct validity which requires multiple sources of evidence construct validity is the whole of validity but has multiple facets (Downing S 2003)
Construct (Concepts ideas and notions) - Nearly all assessments in medical education deal with constructs intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory- Educational achievement is a construct inferred from performance on assessments written tests over domain of knowledge oral examinations over specific problems or cases in medicine or OSCE history-taking or communication skills- Educational ability or aptitude is another example of construct ndash a construct that may be even more intangible and abstract than achievement (Downing 2003)
Downing 2003 Cook S 2007
Content do instrument items completely represent the construct
Response process the relationship between the intended construct and the thought processes of subjects or observers
Internal structure acceptable reliability and factor structure
Relations to other variables correlation with scores from another instrument assessing the same construct
Consequences do scores really make a difference
Sources of validity in assessment
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Reliability Definition (Consistency)
The degree of consistency between two measures of the same thing (Mehrens and Lehman 1987(
The measure of how stable dependable trustworthy and consistent a test is in measuring the same thing each time (Worthen
et al 1993)
Validity definition (Accuracy)
Truthfulness Does the test measure what it purports to measure the extent to which certain inferences can be made from test scores or other measurement (Mehrens and Lehman 1987)
The degree to which they accomplish the purpose for which they are being used (Worthen et al 1993(
The usual concepts of validity
The term ldquovalidityrdquo refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are ldquowell-grounded or justifiable being at once relevant and meaningfulrdquo (Messick S 1995) Contentrdquo related to objectives and their sampling
ldquoConstructrdquo referring to the theory underlying the target ldquoCriterionrdquo related to concrete criteria in the real world It can be
concurrent or predictive ldquoConcurrentrdquo correlating high with another measure already
validated ldquoPredictiverdquo Capable of anticipating some later measure ldquoFacerdquo related to the test overall appearance
Old concept
Sources of validity in assessment
Usual concepts of validity
Sources of validity in assessment
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity but has multiple facets (Downing S, 2003).
Construct (concepts, ideas, and notions):
- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles that are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCE stations on history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement (Downing, 2003).
The five sources of validity evidence (Downing, 2003; Cook, 2007):
- Content: do instrument items completely represent the construct?
- Response process: the relationship between the intended construct and the thought processes of subjects or observers.
- Internal structure: acceptable reliability and factor structure.
- Relations to other variables: correlation with scores from another instrument assessing the same construct.
- Consequences: do scores really make a difference?
Sources of validity in assessment: examples of evidence

Content:
- Examination blueprint
- Representativeness of test blueprint to achievement domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logical/empirical relationship of tested content to domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review

Response process:
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining scores from different formats
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses
- Accuracy of applying pass-fail decision rules to scores
- Quality control of score reporting

Internal structure:
- Item analysis data:
  1. Item difficulty/discrimination
  2. Item/test characteristic curves
  3. Inter-item correlations
  4. Item-total correlations (point-biserial)
- Score scale reliability
- Standard errors of measurement (SEM)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)

Relationship to other variables:
- Correlation with other relevant variables (exams)
- Convergent correlations (internal/external): similar tests
- Divergent correlations (internal/external): dissimilar measures
- Test-criterion correlations
- Generalizability of evidence

Consequences:
- Impact of test scores/results on students and society
- Consequences for learners and future learning
- Reasonableness of the method of establishing the pass-fail (cut) score
- Pass-fail consequences:
  1. P/F decision reliability/accuracy
  2. Conditional standard error of measurement
- False positives/negatives
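Two of the item-analysis statistics listed under internal structure, item difficulty and the item-total (point-biserial) correlation, can be computed directly from a score matrix. The following is an illustrative sketch only (the function names and the scores are hypothetical, not from the slides):

```python
def item_difficulty(item_scores):
    """Proportion of examinees answering the item correctly (the p-value)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item (0/1) and examinees' total scores.

    Higher values mean the item better separates strong from weak examinees.
    """
    n = len(item_scores)
    mean_total = sum(total_scores) / n
    sd_total = (sum((t - mean_total) ** 2 for t in total_scores) / n) ** 0.5
    p = item_difficulty(item_scores)   # proportion correct
    q = 1 - p                          # proportion incorrect
    mean_correct = (
        sum(t for s, t in zip(item_scores, total_scores) if s == 1)
        / sum(item_scores)
    )
    return (mean_correct - mean_total) / sd_total * (p / q) ** 0.5

# Hypothetical data: one item's 0/1 scores and total test scores of 4 examinees
item = [1, 1, 0, 0]
totals = [10, 8, 6, 4]
```

A difficulty near 0.5 combined with a clearly positive point-biserial is generally desirable; items with near-zero or negative discrimination warrant review.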
Sources of validity: 1. Internal structure
Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability.
2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations.
3. Scale factor structure.
4. Dimensionality studies.
5. Differential item functioning (DIF) studies.
Sources of validity: 2. Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
Keys of reliability assessment
- "Stability": related to time consistency.
- "Internal": related to the instruments.
- "Inter-rater": related to the examiners' criterion.
- "Intra-rater": related to the examiner's criterion.

Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.
Sources of reliability in assessment

Internal consistency
- Description: Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.) We would expect high correlation between item scores measuring a single construct. Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered "alternate forms," internal consistency can be viewed as an estimate of parallel forms reliability.
- Measures:
  - Split-half reliability: the correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
  - Kuder-Richardson 20: similar in concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
  - Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
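As a concrete sketch of these internal-consistency measures (illustrative only; the scores and function names are hypothetical), Cronbach's alpha and the Spearman-Brown adjustment can be computed as follows:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha from item-score columns.

    items: list of k lists, each holding one item's scores across the
    same n examinees. alpha = k/(k-1) * (1 - sum(item variances) / total variance)
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # per-examinee total score
    item_variance = sum(pvariance(col) for col in items)
    return (k / (k - 1)) * (1 - item_variance / pvariance(totals))

def spearman_brown(r_half):
    """Adjust a split-half correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)

# Hypothetical dichotomous scores: 4 items x 5 examinees (1 = correct)
scores = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
]
alpha = cronbach_alpha(scores)
```

For dichotomous items like these, alpha coincides with Kuder-Richardson 20, consistent with alpha being the generalized form.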
Sources of reliability in assessment (continued)

Temporal stability
- Description: Does the instrument produce similar results when administered a second time?
- Measure: test-retest reliability. Administer the instrument to the same person at different times; usually quantified using correlation (e.g., Pearson's r).

Parallel forms
- Description: Do different versions of the "same" instrument produce similar results?
- Measure: alternate forms reliability. Administer different versions of the instrument to the same individual at the same or different times; usually quantified using correlation (e.g., Pearson's r).

Agreement (inter-rater reliability)
- Description: When using raters, does it matter who does the rating? Is one rater's score similar to another's?
- Measures:
  - Percent agreement: identical responses; does not account for agreement that would occur by chance.
  - Phi: simple correlation; does not account for chance.
  - Kappa: agreement corrected for chance.
  - Kendall's tau: agreement on ranked data.
  - Intraclass correlation coefficient: ANOVA to estimate how well ratings from different raters coincide.
Sources of reliability in assessment (continued)

Generalizability theory
- Description: How much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: generalizability coefficient. A complex model that allows estimation of multiple sources of error. As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed; for example, it can determine the relative contributions of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: "Items" are the individual questions on the instrument. The "construct" is what is being measured, such as a knowledge, attitude, skill, or symptom in a specific area. The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

Source: Cook and Beckman, Validity and Reliability of Psychometric Instruments (2007).
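The Spearman-Brown "prophecy" formula mentioned in the notes can be sketched as follows (illustrative only; the numbers are hypothetical):

```python
def spearman_brown_prophecy(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    length_factor > 1 models adding equivalent items; < 1 models shortening.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a 20-item test with alpha = 0.70 to 40 equivalent items
predicted = spearman_brown_prophecy(0.70, 2)
```

Doubling the test length raises the predicted reliability from 0.70 to about 0.82, which is why "more items" is a standard lever for increasing reliability.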
Keys of reliability assessment
Different types of assessments require different kinds of reliability:
- Written (MCQs): scale reliability, internal consistency.
- Written (essay): inter-rater agreement, generalizability theory.
- Oral exams: rater reliability, generalizability theory.
- Observational assessments: rater reliability, inter-rater agreement, generalizability theory.
- Performance exams (OSCEs): rater reliability, generalizability theory.
Keys of reliability assessment
Reliability: how high?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
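These reliability targets translate into score precision through the classical standard error of measurement, SEM = SD × √(1 − reliability). A small illustrative sketch (the SD and reliability values are hypothetical):

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Typical error band around an observed score, from score SD and reliability."""
    return sd * sqrt(1 - reliability)

# A test whose scores have SD = 10: higher reliability narrows the error band
sem_high = standard_error_of_measurement(10, 0.91)   # licensure-level reliability
sem_low = standard_error_of_measurement(10, 0.60)    # quiz-level reliability
```

With reliability 0.91 the SEM is 3 points; at 0.60 it roughly doubles, showing why high-stakes pass-fail decisions demand high reliability.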
Keys of reliability assessment
How to increase reliability:
- Written tests: use objectively scored formats; at least 35-40 MCQs; MCQs that differentiate between high- and low-performing students.
- Performance exams: at least 7-12 cases; well-trained standardized patients (SPs); monitoring and quality control.
- Observational exams: many independent raters (7-11); standard checklists/rating scales; timely ratings.
Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data.
- The higher the test's stakes, the more evidence is needed.
- Multiple sources or methods.
- Ongoing research studies.
Reliability:
- Consistency of the measurement.
- One aspect of validity evidence.
- Higher reliability is always better than lower.
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Validity definition (Accuracy)
Truthfulness Does the test measure what it purports to measure the extent to which certain inferences can be made from test scores or other measurement (Mehrens and Lehman 1987)
The degree to which they accomplish the purpose for which they are being used (Worthen et al 1993(
The usual concepts of validity
The term ldquovalidityrdquo refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are ldquowell-grounded or justifiable being at once relevant and meaningfulrdquo (Messick S 1995) Contentrdquo related to objectives and their sampling
ldquoConstructrdquo referring to the theory underlying the target ldquoCriterionrdquo related to concrete criteria in the real world It can be
concurrent or predictive ldquoConcurrentrdquo correlating high with another measure already
validated ldquoPredictiverdquo Capable of anticipating some later measure ldquoFacerdquo related to the test overall appearance
Old concept
Sources of validity in assessment
Usual concepts of validity
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully In contemporary usage all validity is construct validity which requires multiple sources of evidence construct validity is the whole of validity but has multiple facets (Downing S 2003)
Construct (Concepts ideas and notions) - Nearly all assessments in medical education deal with constructs intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory- Educational achievement is a construct inferred from performance on assessments written tests over domain of knowledge oral examinations over specific problems or cases in medicine or OSCE history-taking or communication skills- Educational ability or aptitude is another example of construct ndash a construct that may be even more intangible and abstract than achievement (Downing 2003)
Downing 2003 Cook S 2007
Content do instrument items completely represent the construct
Response process the relationship between the intended construct and the thought processes of subjects or observers
Internal structure acceptable reliability and factor structure
Relations to other variables correlation with scores from another instrument assessing the same construct
Consequences do scores really make a difference
Sources of validity in assessment
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
The usual concepts of validity
The term ldquovalidityrdquo refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are ldquowell-grounded or justifiable being at once relevant and meaningfulrdquo (Messick S 1995) Contentrdquo related to objectives and their sampling
ldquoConstructrdquo referring to the theory underlying the target ldquoCriterionrdquo related to concrete criteria in the real world It can be
concurrent or predictive ldquoConcurrentrdquo correlating high with another measure already
validated ldquoPredictiverdquo Capable of anticipating some later measure ldquoFacerdquo related to the test overall appearance
Old concept
Sources of validity in assessment
Usual concepts of validity
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully In contemporary usage all validity is construct validity which requires multiple sources of evidence construct validity is the whole of validity but has multiple facets (Downing S 2003)
Construct (Concepts ideas and notions) - Nearly all assessments in medical education deal with constructs intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory- Educational achievement is a construct inferred from performance on assessments written tests over domain of knowledge oral examinations over specific problems or cases in medicine or OSCE history-taking or communication skills- Educational ability or aptitude is another example of construct ndash a construct that may be even more intangible and abstract than achievement (Downing 2003)
Downing 2003 Cook S 2007
Content do instrument items completely represent the construct
Response process the relationship between the intended construct and the thought processes of subjects or observers
Internal structure acceptable reliability and factor structure
Relations to other variables correlation with scores from another instrument assessing the same construct
Consequences do scores really make a difference
Sources of validity in assessment
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity: 1- Internal structure
Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability
2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations
3. Scale factor structure
4. Dimensionality studies
5. Differential item functioning (DIF) studies
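The item analysis data listed above start from two simple statistics, item difficulty and discrimination. A minimal sketch with hypothetical responses (the upper-lower method shown here is one common way to estimate discrimination):

```python
def item_difficulty(responses):
    """p-value of a dichotomously scored item: proportion answering correctly."""
    return sum(responses) / len(responses)

def discrimination_index(item, totals, frac=0.27):
    """Upper-lower discrimination: proportion correct in the top-scoring
    group minus the bottom-scoring group, groups formed from total scores."""
    n = max(1, round(frac * len(totals)))
    order = sorted(range(len(totals)), key=totals.__getitem__)
    lower, upper = order[:n], order[-n:]
    return (sum(item[i] for i in upper) - sum(item[i] for i in lower)) / n

# One item's 0/1 responses and total test scores for ten examinees (hypothetical)
item   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
totals = [30, 28, 10, 25, 12, 27, 29, 8, 26, 11]
p = item_difficulty(item)                 # 0.6: moderate difficulty
d = discrimination_index(item, totals)    # 1.0: item separates high from low scorers
```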
Sources of validity: 2- Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
Keys of reliability assessment
- "Stability": consistency of results over time.
- "Internal": consistency within the instrument itself.
- "Inter-rater": consistency of scoring across different examiners.
- "Intra-rater": consistency of scoring by the same examiner.
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable yet not necessarily valid.
Sources of reliability in assessment

Internal consistency
Description:
- Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.)
- We would expect high correlation between item scores measuring a single construct.
- Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument.
- Because instrument halves can be considered "alternate forms," internal consistency can be viewed as an estimate of parallel-forms reliability.
Measures:
- Split-half reliability: the correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
- Kuder-Richardson 20: similar in concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
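For illustration, Cronbach's alpha can be computed from an item-by-examinee score matrix; with 0/1 scoring it coincides with Kuder-Richardson 20. A minimal sketch with hypothetical data:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per item (same examinee order in each)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per examinee
    item_variance = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))

# Three dichotomous items answered by five examinees (hypothetical data)
items = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
]
alpha = cronbach_alpha(items)   # ~0.79
```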
Sources of reliability in assessment

Temporal stability
- Question: does the instrument produce similar results when administered a second time?
- Measure: test-retest reliability. Administer the instrument to the same person at different times; usually quantified using correlation (e.g., Pearson's r).

Parallel forms
- Question: do different versions of the "same" instrument produce similar results?
- Measure: alternate-forms reliability. Administer different versions of the instrument to the same individual at the same or different times; usually quantified using correlation (e.g., Pearson's r).

Agreement (inter-rater reliability)
- Question: when using raters, does it matter who does the rating? Is one rater's score similar to another's?
- Percent agreement: the proportion of identical responses. Does not account for agreement that would occur by chance.
- Phi: simple correlation. Does not account for chance.
- Kappa: agreement corrected for chance.
- Kendall's tau: agreement on ranked data.
- Intraclass correlation coefficient: ANOVA used to estimate how well ratings from different raters coincide.
Sources of reliability in assessment

Generalizability theory
- Question: how much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: the generalizability coefficient, from a complex model that allows estimation of multiple sources of error.
- As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed. For example, it can determine the relative contributions of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: "Items" are the individual questions on the instrument. The "construct" is what is being measured, such as knowledge, attitude, skill, or symptoms in a specific area. The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

Source: Cook and Beckman, Validity and Reliability of Psychometric Instruments (2007).
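The Spearman-Brown "prophecy" formula mentioned in the note can be written directly; the same formula also corrects a split-half correlation up to full test length:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Doubling a test whose current alpha is 0.70 (hypothetical value)
doubled = spearman_brown(0.70, 2)   # ~0.82
# Correcting a split-half correlation of 0.60 up to full length
full = spearman_brown(0.60, 2)      # 0.75
```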
Keys of reliability assessment
Different types of assessments require different kinds of reliability:
- Written (MCQs): scale reliability, internal consistency
- Written (essays): inter-rater agreement, Generalizability Theory
- Oral exams: rater reliability, Generalizability Theory
- Observational assessments: rater reliability, inter-rater agreement, Generalizability Theory
- Performance exams (OSCEs): rater reliability, Generalizability Theory
Keys of reliability assessment
How high should reliability be?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCEs)
- Low stakes: > 0.60 (quizzes)
Keys of reliability assessment
How to increase reliability:
For written tests:
- Use objectively scored formats
- Use at least 35-40 MCQs
- Use MCQs that differentiate between high- and low-scoring students
For performance exams:
- Use at least 7-12 cases
- Use well-trained standardized patients (SPs)
- Monitor quality control
For observational exams:
- Use many independent raters (7-11)
- Use standard checklists/rating scales
- Make ratings in a timely fashion
Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence is needed
- Multiple sources or methods
- Ongoing research studies
Reliability:
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org
- Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Construct (concepts, ideas, and notions)
- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles, which are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct, inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCE stations on history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement (Downing 2003).

The five sources of validity evidence (Downing 2003; Cook 2007):
- Content: do the instrument's items completely represent the construct?
- Response process: the relationship between the intended construct and the thought processes of subjects or observers.
- Internal structure: acceptable reliability and factor structure.
- Relations to other variables: correlation with scores from another instrument assessing the same construct.
- Consequences: do scores really make a difference?
Sources of validity in assessment
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Usual concepts of validity
Sources of validity in assessment
All assessments in medical education require evidence of validity to be interpreted meaningfully In contemporary usage all validity is construct validity which requires multiple sources of evidence construct validity is the whole of validity but has multiple facets (Downing S 2003)
Construct (Concepts ideas and notions) - Nearly all assessments in medical education deal with constructs intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory- Educational achievement is a construct inferred from performance on assessments written tests over domain of knowledge oral examinations over specific problems or cases in medicine or OSCE history-taking or communication skills- Educational ability or aptitude is another example of construct ndash a construct that may be even more intangible and abstract than achievement (Downing 2003)
Downing 2003 Cook S 2007
Content do instrument items completely represent the construct
Response process the relationship between the intended construct and the thought processes of subjects or observers
Internal structure acceptable reliability and factor structure
Relations to other variables correlation with scores from another instrument assessing the same construct
Consequences do scores really make a difference
Sources of validity in assessment
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Different types of assessments require different kinds of reliability:
- Written (MCQs): scale reliability; internal consistency.
- Written (essay): inter-rater agreement; generalizability theory.
- Oral exams: rater reliability; generalizability theory.
- Observational assessments: rater reliability; inter-rater agreement; generalizability theory.
- Performance exams (OSCEs): rater reliability; generalizability theory.

Reliability: how high?
- Very high stakes (e.g., licensure tests): 0.90 or above.
- Moderate stakes (e.g., OSCE): at least about 0.75.
- Low stakes (e.g., quiz): above 0.60.

How to increase reliability:
- Written tests: use objectively scored formats; include at least 35-40 MCQs; prefer MCQs that differentiate between high- and low-performing students.
- Performance exams: use at least 7-12 cases; train standardized patients (SPs) well; monitor quality control.
- Observational exams: use many independent raters (7-11); use standard checklists or rating scales; make ratings promptly.
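The item-count guidance above follows from the same Spearman-Brown relationship; solving it for the lengthening factor gives a rough estimate of how much longer a test must be to reach a stakes-appropriate reliability target. The figures below are hypothetical:

```python
def length_factor_needed(r_current, r_target):
    """Spearman-Brown solved for the lengthening factor needed to move
    from the current reliability to a target reliability."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

# Hypothetical: a 20-item quiz with reliability 0.60, aimed at a
# moderate-stakes target (0.75) and a very-high-stakes target (0.90)
for target in (0.75, 0.90):
    k = length_factor_needed(0.60, target)
    print(f"target {target}: x{k:.1f} length, about {20 * k:.0f} items")
```

Under these assumed numbers, reaching 0.75 requires doubling to about 40 items (consistent with the 35-40 MCQ guidance above), while reaching 0.90 would require roughly 120 equivalent items, which is why licensure-level reliability usually demands long examinations.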
Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data.
- The higher the test stakes, the more evidence is needed.
- Multiple sources or methods.
- Ongoing research studies.
Reliability:
- Consistency of the measurement.
- One aspect of validity evidence.
- Higher reliability is always better than lower.
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips:
  http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf
  http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Sources of validity in assessment (Downing 2003; Cook 2007):
- Content: do the instrument's items completely represent the construct?
- Response process: the relationship between the intended construct and the thought processes of subjects or observers.
- Internal structure: acceptable reliability and factor structure.
- Relations to other variables: correlation with scores from another instrument assessing the same construct.
- Consequences: do scores really make a difference?
Sources of validity in assessment

Content:
- Examination blueprint
- Representativeness of test blueprint to achievement domain
- Test specifications
- Match of item content to test specifications
- Representativeness of items to domain
- Logical/empirical relationship of content to the tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review

Response process:
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining scores from different formats
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses
- Accuracy of applying pass-fail decision rules to scores
- Quality control of score reporting

Internal structure:
- Item analysis data: item difficulty/discrimination, item/test characteristic curves, inter-item correlations, item-total correlations (point-biserial)
- Score scale reliability
- Standard errors of measurement (SEM)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)

Relationship to other variables:
- Correlation with other relevant variables (e.g. other examinations)
- Convergent correlations (internal/external): similar tests
- Divergent correlations (internal/external): dissimilar measures
- Test-criterion correlations
- Generalizability of evidence

Consequences:
- Impact of test scores/results on students and society
- Consequences for learners and future learning
- Reasonableness of the method used to establish the pass-fail (cut) score
- Pass-fail consequences: decision reliability/accuracy, conditional standard error of measurement, false positives/negatives
Sources of validity: 1. Internal structure
Statistical evidence of the hypothesized relationship between test item scores and the construct:
1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability
2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations
3. Scale factor structure
4. Dimensionality studies
5. Differential item functioning (DIF) studies
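As an illustrative sketch (not part of the original slides), classical item analysis can be computed from a scored response matrix; the function name and data layout here are assumptions for the example:

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """Classical item analysis for a dichotomously (0/1) scored test.

    responses: examinees x items matrix of 0 (wrong) / 1 (correct).
    Returns per-item difficulty (proportion answering correctly) and
    discrimination (corrected point-biserial: correlation between the
    item score and the rest-of-test score, excluding the item itself).
    """
    n_items = responses.shape[1]
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for i in range(n_items):
        rest = total - responses[:, i]  # total score without this item
        discrimination[i] = np.corrcoef(responses[:, i], rest)[0, 1]
    return difficulty, discrimination
```

Items answered correctly by almost everyone (very high difficulty index) or with near-zero or negative discrimination are the usual candidates for review.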
Sources of validity: 2. Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
Keys of reliability assessment
- "Stability": consistency over time
- "Internal": consistency across the instrument's items
- "Inter-rater": consistency between different examiners' judgments
- "Intra-rater": consistency within a single examiner's judgments

Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable; likewise, results from a test can be reliable without necessarily being valid.
Sources of reliability in assessment

Internal consistency
Description: Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.) We would expect high correlation between item scores measuring a single construct. Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered "alternate forms," internal consistency can also be viewed as an estimate of parallel-forms reliability.
Measures:
- Split-half reliability: the correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
- Kuder-Richardson 20: similar in concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
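To make the alpha formula concrete, here is a minimal sketch (not from the slides) of Cronbach's alpha computed from an examinees-by-items score matrix; the function name is an assumption:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an examinees x items score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)

    For dichotomous (0/1) items this reduces to Kuder-Richardson 20.
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variance
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

When every item carries the same signal (perfectly correlated items), alpha reaches its maximum of 1.0; uncorrelated items drive it toward zero.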
Sources of reliability in assessment (continued)

Temporal stability
Description: Does the instrument produce similar results when administered a second time?
Measure: test-retest reliability. Administer the instrument to the same person at different times; usually quantified using a correlation (e.g. Pearson's r).

Parallel forms
Description: Do different versions of the "same" instrument produce similar results?
Measure: alternate-forms reliability. Administer different versions of the instrument to the same individual at the same or different times; usually quantified using a correlation (e.g. Pearson's r).

Agreement (inter-rater reliability)
Description: When using raters, does it matter who does the rating? Is one rater's score similar to another's?
Measures:
- Percent agreement: the proportion of identical responses. Does not account for agreement that would occur by chance.
- Phi: a simple correlation. Does not account for chance.
- Kappa: agreement corrected for chance.
- Kendall's tau: agreement on ranked data.
- Intraclass correlation coefficient: uses ANOVA to estimate how well ratings from different raters coincide.
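The difference between raw agreement and chance-corrected agreement can be illustrated with a short sketch (an assumption-laden example, not part of the original slides):

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of cases where two raters give identical ratings."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa: two-rater agreement corrected for chance.

    kappa = (p_observed - p_chance) / (1 - p_chance), where p_chance
    is the agreement expected from each rater's marginal frequencies.
    """
    n = len(r1)
    po = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[cat] * c2[cat] for cat in set(r1) | set(r2)) / n ** 2
    return (po - pe) / (1 - pe)
```

For example, two examiners rating four students pass/fail as ['p', 'p', 'p', 'f'] and ['p', 'p', 'f', 'f'] agree on 75% of cases, but chance agreement from their marginals is 50%, so kappa is only 0.5.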
Sources of reliability in assessment (continued)

Generalizability theory
Description: How much of the error in measurement is the result of each factor (e.g. item, item grouping, subject, rater, day of administration) involved in the measurement process?
Measure: the generalizability coefficient, from a complex model that allows estimation of multiple sources of error. As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed; for example, it can determine the relative contributions of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: "Items" are the individual questions on the instrument. The "construct" is what is being measured, such as knowledge, attitude, skill, or symptoms in a specific area. The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).
Source: Cook and Beckman, Validity and Reliability of Psychometric Instruments (2007).
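The Spearman-Brown "prophecy" formula mentioned in the notes above is simple enough to state directly; this is a minimal sketch, with the function name an assumption:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Spearman-Brown 'prophecy' formula.

    Predicts score reliability when a test is lengthened (or shortened)
    by `length_factor` using comparable items:

        r_new = n * r / (1 + (n - 1) * r)

    length_factor > 1 lengthens the test; < 1 shortens it.
    """
    n, r = length_factor, reliability
    return n * r / (1 + (n - 1) * r)
```

For instance, doubling a test whose scores have reliability 0.70 predicts a new reliability of 1.4 / 1.7, about 0.82, which is why adding comparable MCQs is a standard way to raise reliability.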
Keys of reliability assessment
Different types of assessments require different kinds of reliability evidence:
- Written (MCQs): scale reliability, internal consistency
- Written (essays): inter-rater agreement, generalizability theory
- Oral exams: rater reliability, generalizability theory
- Observational assessments: rater reliability, inter-rater agreement, generalizability theory
- Performance exams (OSCEs): rater reliability, generalizability theory
Keys of reliability assessment
Reliability: how high should it be?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
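A reliability coefficient also translates directly into a standard error of measurement (SEM), which shows what these thresholds mean for an individual's score. A small illustrative sketch (the function names and numbers are assumptions, not from the slides):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def score_interval(score: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate 95% confidence band around an observed score."""
    e = z * sem(sd, reliability)
    return score - e, score + e
```

With a score SD of 10, a reliability of 0.91 gives an SEM of 3.0, so an observed score of 75 carries a band of roughly 69 to 81; at a quiz-level reliability of 0.60 the band would be more than twice as wide, which is why higher-stakes decisions demand higher reliability.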
Keys of reliability assessment
How to increase reliability:
- Written tests: use objectively scored formats; include at least 35-40 MCQs; favor MCQs that differentiate high- from low-performing students.
- Performance exams: use at least 7-12 cases; well-trained standardized patients (SPs); monitoring and quality control.
- Observational exams: many independent raters (7-11); standard checklists/rating scales; timely ratings.
Conclusion
Validity = meaning:
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence is needed
- Multiple sources or methods
- Ongoing research studies
Reliability:
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA 2003;290:953-958.
References (continued)
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/woodford.pdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Sources of validity in assessment
Content Response process
Internal structure
Relationship to other variables
Consequences
- Examination blueprint
- Representativeness of test blueprint to achievement
domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logicalempirical
relationship of content tested domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review
- Student format familiarity
- Quality control of electronic
scanningscoring
- Key validation of preliminary scores
- Accuracy in combining different formats scores
- Quality controlaccuracy of final scoresmarksgrades
- Subscoresubscale analyses
1-Accuracy of applying pass-fail decision rules to scores
2-Quality control of score reporting
bull Item analysis data
1 Item difficultydiscrimination
2 Itemtest characteristic curves
3 Inter-item correlations
4 Item-total correlations (PBS)
bull Score scale reliability
bull Standard errors of
measurement (SEM)
bull Generalizability
bull Item factor analysis
bull Differential Item
Functioning (DIF)
bull Correlation with other relevant variables (exams)
bull Convergent correlations -
internalexternal
- Similar tests
bull Divergent correlations internalexternal
- Dissimilar measures
bull Test-criterion correlations
bull Generalizability of evidence
bull Impact of test scoresresults
on studentssociety
bull Consequences on learnersfuture
learning
bull Reasonableness of method of establishing pass-fail (cut) score
bull Pass-fail consequences
1 PF Decision reliability-accuracy
2 Conditional standard error of measurement
bull False +ve-ve
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Sources of validity 1-Internal Structure
Statistical evidence of the hypothesized relationship between test item scores and
the construct1 -Reliability (internal consistency)
1048708 Test scale reliability1048708 Rater reliability1048708 Generalizability
2 -Item analysis data1048708 Item difficulty and discrimination1048708 MCQ option function analysis1048708 Inter-item correlations
3- Scale factor structure4 -Dimensionality studies
5 -Differential item functioning (DIF) studies
Sources of validity
2-Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the
construct1048710 Criterion-related validity studies1048710 Correlations between test scoressubscores and other measures1048710 Convergent-Divergent studies
Keys of reliability assessment
ldquoStabilityrdquo related to time consistency ldquoInternalrdquo related to the instruments ldquoInter-raterrdquo related to the examinersrsquo
criterion ldquoIntra-raterrdquo related to the examinerrsquos
criterion
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References

- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. Produced by the American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
Resources

- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips: http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf and http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Sources of validity

2. Relationship to other variables
Statistical evidence of the hypothesized relationship between test scores and the construct:
- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies
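As a sketch of criterion-related (concurrent) evidence, the correlation between scores on a new test and an already-validated measure can be computed directly; the names and data below are invented for illustration:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical concurrent-validity study: 6 students take a new test
# and an already-validated criterion measure
new_test  = [55, 60, 70, 72, 80, 90]
criterion = [50, 58, 65, 70, 78, 88]
print(round(pearson_r(new_test, criterion), 3))
```

A high correlation with a concurrent measure supports concurrent validity; the same computation against a later outcome (e.g., clerkship performance) would be predictive evidence instead.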
Keys of reliability assessment

- "Stability": related to consistency over time
- "Internal": related to the instrument itself
- "Inter-rater": related to agreement between different examiners
- "Intra-rater": related to consistency within a single examiner

Validity and reliability are closely related

A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable yet not necessarily valid.
Sources of reliability in assessment

Source of reliability: Internal consistency

Description:
- Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.)
- We would expect high correlation between item scores measuring a single construct.
- Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument.
- Because instrument halves can be considered "alternate forms," internal consistency can be viewed as an estimate of parallel-forms reliability.

Measures:
- Split-half reliability: correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown† formula can adjust for this.
- Kuder-Richardson 20: similar concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
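As a minimal sketch of how internal consistency is computed, the following calculates Cronbach's alpha for a small invented set of dichotomously scored MCQ results (with 0/1 items and sample variances used throughout, alpha is the generalization of KR-20):

```python
from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of scores: one row per examinee,
    one column per item."""
    k = len(scores[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*scores)]  # per-item variance
    total_var = variance([sum(row) for row in scores])   # total-score variance
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical results: 6 examinees x 4 dichotomously scored MCQs
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))  # 0.67
```

An alpha this low is unsurprising for a four-item test; internal consistency generally rises as comparable items are added.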
Sources of reliability in assessment

Source of reliability: Temporal stability
Description: Does the instrument produce similar results when administered a second time?
Measure: Test-retest reliability. Administer the instrument to the same person at different times; usually quantified using correlation (e.g., Pearson's r).

Source of reliability: Parallel forms
Description: Do different versions of the "same" instrument produce similar results?
Measure: Alternate-forms reliability. Administer different versions of the instrument to the same individual at the same or different times; usually quantified using correlation (e.g., Pearson's r).

Source of reliability: Agreement (inter-rater reliability)
Description: When using raters, does it matter who does the rating? Is one rater's score similar to another's?
Measures:
- Percent agreement: proportion of identical responses. Does not account for agreement that would occur by chance.
- Phi: simple correlation. Does not account for chance.
- Kappa: agreement corrected for chance.
- Kendall's tau: agreement on ranked data.
- Intraclass correlation coefficient: ANOVA used to estimate how well ratings from different raters coincide.
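The agreement indices above can be illustrated for two raters scoring the same candidates; the pass/fail ratings below are invented:

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of cases where two raters gave identical ratings."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement corrected for chance: kappa = (po - pe) / (1 - pe),
    where pe is the agreement expected from the raters' marginal rates."""
    n = len(r1)
    po = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[cat] * c2[cat] for cat in set(r1) | set(r2)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical pass/fail ratings of 10 OSCE candidates by two examiners
rater1 = ["pass", "pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "pass"]
rater2 = ["pass", "pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"]
print(percent_agreement(rater1, rater2))          # 0.9
print(round(cohens_kappa(rater1, rater2), 2))     # 0.78
```

Kappa comes out lower than raw agreement because some of the nine matches would be expected by chance alone, which is exactly the correction the table describes.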
Sources of reliability in assessment

Source of reliability: Generalizability theory
Description: How much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process?
Measure: Generalizability coefficient, from a complex model that allows estimation of multiple sources of error. As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed; for example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: "Items" are the individual questions on the instrument. The "construct" is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area. †The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

Cook and Beckman, Validity and Reliability of Psychometric Instruments (2007).
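As a rough illustration of the generalizability approach, here is a one-facet sketch (persons crossed with raters) that estimates variance components from a two-way ANOVA and forms a relative G coefficient; the design, helper name, and rating data are all invented for illustration:

```python
from statistics import mean

def g_coefficient(ratings):
    """One-facet (persons x raters) generalizability sketch.
    `ratings` has one row per person and one column per rater.
    Returns the relative G coefficient for this number of raters."""
    n_p, n_r = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    p_means = [mean(row) for row in ratings]
    r_means = [mean(col) for col in zip(*ratings)]
    # Mean squares from a two-way crossed ANOVA (one observation per cell)
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ss_res = sum((ratings[i][j] - p_means[i] - r_means[j] + grand) ** 2
                 for i in range(n_p) for j in range(n_r))
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    # Variance components: person (true-score) variance vs. residual error
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    return var_p / (var_p + ms_res / n_r)

# Hypothetical scores: 4 examinees each rated by the same 3 raters
ratings = [[7, 8, 7], [5, 6, 5], [9, 9, 8], [4, 5, 4]]
print(round(g_coefficient(ratings), 2))  # 0.99
```

With more facets (cases, occasions, item sets) the same logic partitions measurement error among all of them at once, which is what makes the method so broadly applicable.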
Keys of reliability assessment
Validity and reliability are closely related
A test cannot be considered valid unless the measurements resulting from it are reliable Likewise results from a test can be reliable and not necessarily valid
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Internal consistency
- Do all the items on an instrument measure the same construct (If an instrument measures more than one construct a single score will not measure either construct very well
- We would expect high correlation between item scores measuring a single construct
- Internal consistency is probably the most commonly reported reliability statistic in part because it can be calculated after a single administration of a single instrument
- Because instrument halves can be considered ldquoalternate formsrdquo internal consistency can be viewed as an estimate of parallel forms reliability
Split-half reliability
Kuder-Richardson 20
Cronbachrsquos alpha
- Correlation between scores on the first and second halves of a given instrument
- Similar concept to split-half but accounts for all items
- A generalized form of the
Kuder-Richardson formulas
- Rarely used because the ldquoeffectiverdquo instrument is only half as long as the actual instrument Spearman-Browndagger formula can adjust
-Assumes all items are equivalent measure a single construct and have dichotomous responses
- Assumes all items are equivalent and measure a single construct can be used with dichotomous or continuous data
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips:
  - http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf
  - http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple-choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Sources of reliability in assessment

Internal consistency
- Description: Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.) We would expect high correlation between scores on items measuring a single construct. Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered "alternate forms," internal consistency can be viewed as an estimate of parallel forms reliability.
- Measures:
  - Split-half reliability: the correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
  - Kuder-Richardson 20: similar in concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
  - Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
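As a concrete illustration of these measures, Cronbach's alpha can be computed directly from item-level scores; with dichotomous (0/1) items the result is identical to Kuder-Richardson 20. A minimal sketch in Python (the item data are invented for demonstration):

```python
# Illustrative sketch: internal consistency from item-level scores.
# The scores below are made up for demonstration.

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: one inner list per item, each holding every examinee's
    score on that item. Returns alpha = k/(k-1) * (1 - sum(item var) / var(total))."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per examinee
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Five dichotomous items (rows) answered by six examinees (columns);
# with 0/1 data alpha equals Kuder-Richardson 20.
items = [
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))  # 0.718
```

Note that the ratio of item variance to total-score variance is unchanged whether population or sample variances are used, as long as the choice is consistent.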
Sources of reliability in assessment (continued)

Temporal stability
- Description: Does the instrument produce similar results when administered a second time?
- Measure: Test-retest reliability. Administer the instrument to the same person at different times. Usually quantified using correlation (e.g., Pearson's r).

Parallel forms
- Description: Do different versions of the "same" instrument produce similar results?
- Measure: Alternate forms reliability. Administer different versions of the instrument to the same individual at the same or different times. Usually quantified using correlation (e.g., Pearson's r).

Agreement (inter-rater reliability)
- Description: When using raters, does it matter who does the rating? Is one rater's score similar to another's?
- Measures:
  - Percent agreement: proportion of identical responses. Does not account for agreement that would occur by chance.
  - Phi: simple correlation. Does not account for chance.
  - Kappa: agreement corrected for chance.
  - Kendall's tau: agreement on ranked data.
  - Intraclass correlation coefficient: ANOVA used to estimate how well ratings from different raters coincide.
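The simpler agreement statistics are easy to compute by hand. A minimal Python sketch of percent agreement and Cohen's kappa for two raters scoring the same ten examinees pass (1) or fail (0); the ratings are invented for demonstration:

```python
# Illustrative sketch: inter-rater agreement statistics.
# Ratings below are made up for demonstration.

def percent_agreement(a, b):
    """Proportion of identical responses between two raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    po = percent_agreement(a, b)  # observed agreement
    cats = set(a) | set(b)
    # chance agreement from each rater's marginal category frequencies
    pe = sum((a.count(c) / len(a)) * (b.count(c) / len(b)) for c in cats)
    return (po - pe) / (1 - pe)

rater1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
rater2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
print(percent_agreement(rater1, rater2))          # 0.8
print(round(cohens_kappa(rater1, rater2), 3))     # 0.524
```

The example shows why kappa is preferred: the raters agree on 80% of examinees, but because both pass most examinees, much of that agreement would occur by chance, and the chance-corrected value is noticeably lower.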
Sources of reliability in assessment (continued)

Generalizability theory
- Description: How much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: Generalizability coefficient, from a complex model that allows estimation of multiple sources of error.
- Comments: As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed. For example, it can determine the relative contributions of internal consistency and inter-rater reliability to the overall reliability of a given instrument.

Notes: "Items" are the individual questions on the instrument. The "construct" is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area. The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

Source: Cook and Beckman, Validity and Reliability of Psychometric Instruments (2007).
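The Spearman-Brown "prophecy" formula mentioned in the note, r_new = k*r / (1 + (k - 1)*r) where k is the factor by which test length changes, can be sketched as follows (the reliabilities and lengths are illustrative):

```python
# Illustrative sketch of the Spearman-Brown "prophecy" formula:
# predicted reliability when an instrument is lengthened or
# shortened by a factor k. Numbers are made up for demonstration.

def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * reliability / (1 + (k - 1) * reliability)

# A test with reliability 0.75, doubled in length (k = 2):
print(round(spearman_brown(0.75, 2), 3))    # 0.857
# The same test cut in half (k = 0.5), as in split-half reliability:
print(round(spearman_brown(0.75, 0.5), 3))  # 0.6
```

Solving the same relation for k also shows why adding items is the usual route to higher reliability: moving a 0.75-reliable test to the 0.90 expected of high-stakes examinations requires tripling its length, not a marginal increase.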
Keys of reliability assessment

Different types of assessments require different kinds of reliability evidence:
- Written (MCQs): scale reliability; internal consistency
- Written (essay): inter-rater agreement; Generalizability Theory
- Oral exams: rater reliability; Generalizability Theory
- Observational assessments: rater reliability; inter-rater agreement; Generalizability Theory
- Performance exams (OSCEs): rater reliability; Generalizability Theory

Reliability: how high should it be?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)

How to increase reliability:
- Written tests: use objectively scored formats; at least 35-40 MCQs; MCQs that differentiate high- from low-performing students
- Performance exams: at least 7-12 cases; well-trained standardized patients (SPs); quality-control monitoring
- Observational exams: many independent raters (7-11); standard checklists and rating scales; timely ratings

Conclusion
- Validity = meaning: evidence to aid interpretation of assessment data; the higher the test stakes, the more evidence is needed; multiple sources or methods; ongoing research studies.
- Reliability: consistency of the measurement; one aspect of validity evidence; higher reliability is always better than lower.
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Temporal stability
Parallel forms
Agreement (inter-rater
reliability)
Does the instrument produce similar results when administered a second
time
Do different versions of the ldquosamerdquo instrument produce similar results
When using raters does it matter who does the rating
Is one raterrsquos score similar to anotherrsquos
Test-retest reliability
Alternate forms reliability
Percent agreement
Phi
Kappa
Kendallrsquos tau
Intraclass correlation coefficient
Administer the instrument to the same person at different times
Administer different versions of the instrument to the same individual at the same or
different times
identical responses
Simple correlation
Agreement corrected for chance
Agreement on ranked data
ANOVA to estimate how well ratings from different raters coincide
Usually quantified using correlation (eg Pearsonrsquos r)
Usually quantified using correlation (eg Pearsonrsquos r)
Does not account for agreement that would occur by chance
Does not account for chance
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Sources of reliability in assessment
Source of reliability
Description Measures Definitions Comments
Generalizability theory
How much of the error in measurement is the result of each factor (eg item item grouping subject rater day of administration) involved in the measurement process
Generalizability coefficient
Complex model that allows estimation of multiple sources of error
As the name implies this elegant method is ldquogeneralizablerdquo to virtually any setting in which reliability is assessed
For example it can determine the relative contribution of internal consistency and
inter-rater reliability to the overall reliability of a given instrument
ldquoItemsrdquo are the individual questions on the instrument The ldquoconstructrdquo is what is being measured such as knowledge attitude skill or symptom in a specific areaThe Spearman Brown ldquoprophecyrdquo formula allows one to calculate the reliability of an instrumentrsquos scores when the number of items is increased (or decreased)
Cook and Beckman Validity and Reliability of Psychometric Instruments (2007)
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Keys of reliability assessment
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References National Board of Medical Examiners United States Medical Licensing Exam Bulletin Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners Available at httpwwwusmleorgbulletin2005testinghtm Norcini JJ Blank LL Duffy FD Fortna GS The mini-CEX a method
for assessing clinical skills Ann Intern Med 2003138476-481 Litzelman DK Stratos GA Marriott DJ Skeff KM Factorial validation
of a widely disseminated educational framework for evaluating
clinical teachers Acad Med 199873688-695 Merriam-Webster Online Available at httpwwwm-wcom Sackett DL Richardson WS Rosenberg W Haynes RB Evidence-
Based Medicine How to Practice and Teach EBM Edinburgh Churchill Livingstone 1998 Wallach J Interpretation of Diagnostic Tests 7th ed Philadelphia
Lippincott Williams amp Wilkins 2000 Beckman TJ Ghosh AK Cook DA Erwin PJ Mandrekar JN How reliable are assessments of clinical teaching
A review of the published instruments J Gen Intern Med 200419971-977 Shanafelt TD Bradley KA Wipf JE Back AL Burnout and selfreported
patient care in an internal medicine residency program Ann Intern Med 2002136358-367 Alexander GC Casalino LP Meltzer DO Patient-physician communication about out-of-pocket costs JAMA
2003290953-958
Referen
ces - Pittet D Simon A Hugonnet S Pessoa-Silva CL Sauvan V Perneger TV Hand hygiene among physicians performance beliefs and perceptions Ann Intern Med 20041411-8- Messick S Validity In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Foster SL Cone JD Validity issues in clinical assessment Psychol Assess 19957248-260American Educational Research Association American Psychological Association National Council on Measurement in Education Standards for Educational and Psychological Testing Washington DCAmerican Educational Research Association 1999- Bland JM Altman DG Statistics notes validating scales and indexes BMJ 2002324606-607- Downing SM Validity on the meaningful interpretation of assessmentdata Med Educ 200337830-837 2005 Certification Examination in Internal Medicine InformationBooklet Produced by American Board of Internal Medicine Availableat httpwwwabimorgresourcespublicationsIMRegistrationBook pdf- Kane MT An argument-based approach to validity Psychol Bull 1992112527-535- Messick S Validation of inferences from personsrsquo responses and performances as scientific inquiry into score meaning Am Psychol 199550741-749- Kane MT Current concerns in validity theory J Educ Meas 2001 38319-342 American Psychological Association Standards for Educational and Psychological Tests and Manuals Washington DC American Psychological Association 1966- Downing SM Haladyna TM Validity threats overcoming interference in the proposed interpretations of assessment data Med Educ 200438327-333- Haynes SN Richard DC Kubany ES Content validity in psychological assessment a functional approach to concepts and methods Psychol Assess 19957238-247- Feldt LS Brennan RL Reliability In Linn RL editor Educational Measurement 3rd Ed New York American Council on Education and Macmillan 1989- Downing SM Reliability on the reproducibility of assessment data Med Educ 2004381006-1012Clark LA Watson D Constructing validity basic issues 
in objective scale development Psychol Assess 19957309-319
Resources
For an excellent resource on item analysis httpwwwutexaseduacademicctlassessmentiar
studentsreportitemanalysisphp For a more extensive list of item-writing tips
httptestingbyueduinfohandbooksMultiple-Choice20Item20Writing20Guidelines20-20Haladyna20and20Downingpdf
httphomeschassutorontoca~murdockjteachingMCQ_basic_tipspdf
For a discussion about writing higher-level multiple choice items httpwwwasciliteorgauconferencesperth04procspdf
woodfordpdf
- Validity and Reliability in Assessment
- Slide 2
- Slide 3
- Slide 4
- Slide 5
- Slide 6
- Slide 7
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- Slide 12
- Slide 13
- Slide 14
- Slide 15
- Slide 16
- Slide 17
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- Slide 22
- Slide 23
- References
- Slide 25
-
Oral Exams1048708 Rater reliability1048708 Generalizability TheoryObservational Assessments1048708 Rater reliability1048708 Inter-rater agreement1048708 Generalizability TheoryPerformance Exams (OSCEs)1048708 Rater reliability1048708 Generalizability Theory
Keys of reliability assessment
Different types of assessments require different kinds of reliability
Written MCQs1048708 Scale reliability1048708 Internal consistencyWrittenmdashEssay1048708 Inter-rater agreement1048708 Generalizability Theory
Keys of reliability assessment
Reliability ndash How high
1048710 Very high-stakes gt 090 + (Licensure tests)1048710 Moderate stakes at least ~075 (OSCE)1048710 Low stakes gt060 (Quiz)
Keys of reliability assessment
How to increase reliabilityFor Written tests1048708 Use objectively scored formats1048708 At least 35-40 MCQs1048708 MCQs that differentiate high-low studentsFor performance exams1048708 At least 7-12 cases1048708 Well trained SPs1048708 Monitoring QCObservational Exams1048708 Lots of independent raters (7-11)1048708 Standard checklistsrating scales1048708 Timely ratings
Conclusion
Validity = Meaning1048708 Evidence to aid interpretation of assessment data1048708 Higher the test stakes more evidence needed1048708 Multiple sources or methods1048708 Ongoing research studiesReliability1048708 Consistency of the measurement1048708 One aspect of validity evidence1048708 Higher reliability always better than lower
References
- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf
- Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess 1995;7:309-319.
Resources
- For an excellent resource on item analysis: http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
- For a more extensive list of item-writing tips:
  http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf
  http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
- For a discussion about writing higher-level multiple choice items: http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
Keys of reliability assessment
Reliability: how high?
- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
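Whether a test clears these thresholds is typically checked by estimating internal consistency, e.g. Cronbach's alpha (equivalent to KR-20 for dichotomously scored MCQs). A minimal sketch on illustrative 0/1 response data (not from the slides):

```python
def cronbach_alpha(items):
    """Cronbach's alpha; each inner list holds one item's scores
    across all examinees (population variances used throughout)."""
    k = len(items)          # number of items
    n = len(items[0])       # number of examinees

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_var = sum(variance(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Illustrative dichotomous responses: 4 items x 5 examinees.
scores = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
]
alpha = cronbach_alpha(scores)  # ~0.75 for this toy data
```

By the table above, an alpha around 0.75 would be acceptable for a moderate-stakes exam such as an OSCE, but not for a licensure test, which should exceed 0.90.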