Psychometrics

Timothy A. Steenbergh and Christopher J. Devers

Indiana Wesleyan University

Overview

A. Psychometrics
B. Classical Test Theory
C. Reliability
D. Validity

A. Psychometrics

• Psychological measurement
• Reliability
• Validity
• Tests
• Items

(Jones & Thissen, 2007; Kaplan & Saccuzzo, 2012)

B. Classical Test Theory

• Foundation for Reliability

(Kline, 2005)

For those who like pictures…

[Figure: Proportion of True to Observed Score. The observed BDI score (X) is split into depression level (true score) and measurement error (E).]

Adding it up…

Depression Level (True Score) + Measurement Error (E) = Observed Score (BDI Score, X)
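The additive picture above is the classical test theory model X = T + E, and reliability can be read as the proportion of observed-score variance that is true-score variance. The following is a minimal simulation sketch of that idea, not part of the original slides. The normal distributions, the mean of 20, and the standard deviations of 8 and 4 are made-up illustration values, not BDI norms.

```python
import random
import statistics


def simulate_scores(n_people, true_sd, error_sd, seed=0):
    """Simulate observed scores X = T + E for n_people test takers."""
    rng = random.Random(seed)
    # Hypothetical true scores (T), e.g. underlying depression level.
    true_scores = [rng.gauss(20.0, true_sd) for _ in range(n_people)]
    # Random measurement error (E) with mean zero.
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_people)]
    # Observed score (X) is the sum of the two parts.
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


true_scores, errors, observed = simulate_scores(5000, true_sd=8.0, error_sd=4.0)

# Reliability: share of observed-score variance that is true-score variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(f"estimated reliability: {reliability:.2f}")
```

With these assumed values the theoretical ratio is 64 / (64 + 16) = .80, so the simulated estimate should land close to that. In real testing the true scores are never observed, which is why reliability has to be estimated indirectly.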

C. Reliability

• What does it mean to be reliable?
• Consistency of scores over time, across test forms, or across variable testing conditions
• Types of Reliability
  • Test-Retest
  • Inter-item (internal)
  • Inter-rater

(Anastasi, 1988)

C.1. Test-Retest Reliability

• Are test scores stable over time?
• Give test to same group at 2 points in time and correlate test scores
• Must consider stability of construct when
  • establishing test-retest interval
  • interpreting test-retest correlation
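Test-retest reliability comes down to a correlation between two administrations of the same test. A minimal sketch follows, assuming a Pearson correlation is the statistic used. The scores in `time1` and `time2` are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same 8 people at two points in time.
time1 = [12, 25, 7, 31, 18, 22, 9, 15]
time2 = [14, 23, 9, 29, 17, 24, 8, 18]

test_retest_r = pearson_r(time1, time2)
print(f"test-retest r = {test_retest_r:.2f}")
```

A high coefficient here only indicates stable scores if the construct itself is expected to be stable over the chosen interval.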

C.2. Internal (inter-item) Consistency

• Assumption: A composite score has to be made up of items that are measuring the same phenomenon
• Heterogeneous items will produce a lower internal consistency reliability coefficient
• Measures of internal consistency:
  • Split Half
  • Cronbach’s Alpha (coefficient α)
  • Kuder-Richardson 20 (KR-20; for dichotomous items)

(Pedhazur & Schmelkin, 1991)
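As a rough illustration of coefficient α, the sketch below uses the standard formula α = k/(k-1) × (1 - Σ item variances / variance of total scores). The two small response sets are invented: one with four items that move together, and one where the fourth item is swapped for an unrelated one, to show how heterogeneous items lower the coefficient.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Transpose so each entry holds all respondents' scores on one item.
    items = list(zip(*item_scores))
    item_variances = [statistics.variance(col) for col in items]
    # Variance of each respondent's composite (summed) score.
    total_variance = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 6 respondents x 4 items scored 0-3, items move together.
homogeneous = [
    [3, 3, 2, 3],
    [1, 1, 1, 0],
    [2, 2, 3, 2],
    [0, 1, 0, 0],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]

# Same data, but the fourth item is replaced by an unrelated item.
heterogeneous = [
    [3, 3, 2, 2],
    [1, 1, 1, 1],
    [2, 2, 3, 0],
    [0, 1, 0, 2],
    [3, 2, 3, 1],
    [1, 0, 1, 3],
]

alpha_homogeneous = cronbach_alpha(homogeneous)
alpha_heterogeneous = cronbach_alpha(heterogeneous)
print(f"alpha (homogeneous items):   {alpha_homogeneous:.2f}")
print(f"alpha (heterogeneous items): {alpha_heterogeneous:.2f}")
```

For dichotomous (0/1) items the same calculation corresponds to KR-20 when population variances are used for both items and totals. This sketch uses sample variances throughout.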

Interpreting Reliability Coefficients

• What is a reasonable level of reliability?
  • Research ≥ .80
  • Clinical ≥ .90
• Factors to consider when evaluating a reliability coefficient:
  • Stability of construct
  • Dimensional nature of construct (uni- vs. multi-)
  • Number of items (short tests are less reliable)
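The point about number of items can be made concrete with the Spearman-Brown prophecy formula. The slides do not name this formula. It is a standard classical test theory result brought in here as an illustration, and it assumes the added or removed items are parallel to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    length_factor = 2 doubles the number of items, 0.5 halves it.
    Assumes the items added or dropped are parallel to the rest.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


current = 0.80  # hypothetical reliability of the full-length test
halved = spearman_brown(current, 0.5)
doubled = spearman_brown(current, 2)
print(f"half-length test:   {halved:.2f}")
print(f"double-length test: {doubled:.2f}")
```

Under these assumptions a test that meets the research benchmark of .80 would fall below it if cut to half its length.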

C.3. Inter-Rater Reliability

• Accuracy (consistency) with which different raters arrive at the same scores

• Extremely important for tests that require any rater judgment (e.g., WAIS vocabulary)

• Agreement is computed with the Kappa statistic
  • Ranges from -1.0 to +1.0
  • K = 1.0 perfect agreement, 0 chance agreement, -1.0 less than chance agreement
  • .40 to .75 “fair”
  • >.75 “excellent”

(Fleiss, 1981)
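A minimal sketch of the Kappa calculation follows, assuming two raters and Cohen's form of the statistic, κ = (observed agreement - chance agreement) / (1 - chance agreement). The item scores for the two raters are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same responses."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("both raters must score the same non-empty set")
    # Proportion of responses on which the raters give the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    # Note: undefined (division by zero) if chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Hypothetical 0/1/2 scores given by two raters to the same 10 responses.
rater_a = [2, 1, 0, 2, 1, 1, 0, 2, 1, 2]
rater_b = [2, 1, 0, 1, 1, 1, 0, 2, 2, 2]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 responses, but because some of that agreement is expected by chance, Kappa comes out lower than the raw 80% agreement.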

D. Validity

• If something is valid, what does that mean?
• Validity: degree to which a test measures that which it purports to measure
• Types
  • Content
  • Criterion-related
  • Construct

D.1. Content Validity

• How well does the instrument sample from the domain of interest?

• Lack of adequate item sampling can lead to invalid findings

• Examples
  • GBQ (see p. 144 of article)
  • WAIS

• Assess with Expert raters

D.2. Criterion-Related Validity

• Does the test score correlate with other measures as we would expect?
• Concurrent validity: test score relates to a criterion measured at the same time
• Predictive validity: test score predicts a future criterion
• Validity coefficient: correlation coefficient between test score and criterion measure
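Written out, the validity coefficient is commonly the Pearson correlation between the two sets of scores. The notation below is an assumption for illustration: X is the test score, Y is the criterion measure, and n is the number of people with both scores.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
              {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
```

For concurrent validity, X and Y are collected at the same time. For predictive validity, Y is collected later.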

D.3. Construct Validity

• Is there evidence that the measure adequately assesses the construct of interest?

• Do test scores change over time or as a result of certain events, as theorized?

• Are items homogeneous, or do certain items “hang together?” (Factor Analysis)

Factor Analysis

• Statistical method for examining underlying constructs (latent traits) within a test

• Uses correlation matrices to identify underlying relationships among test items

• Example: GBQ
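A full factor analysis needs a statistics package such as the ones listed under Resources. The sketch below covers only the starting point, the inter-item correlation matrix, and uses invented responses to show what items that "hang together" look like: items 1 and 2 correlate highly with each other, as do items 3 and 4, while correlations across the two pairs are weak.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """Inter-item correlation matrix for respondents-by-items data."""
    columns = list(zip(*rows))  # one tuple of scores per item
    k = len(columns)
    return [[pearson_r(columns[i], columns[j]) for j in range(k)] for i in range(k)]


# Hypothetical responses: 8 respondents x 4 items on a 1-5 scale.
responses = [
    [4, 4, 4, 5],
    [5, 4, 4, 4],
    [4, 5, 2, 2],
    [4, 4, 1, 2],
    [2, 2, 4, 4],
    [1, 2, 5, 4],
    [2, 1, 2, 2],
    [2, 2, 2, 1],
]

corr = correlation_matrix(responses)
for row in corr:
    print("  ".join(f"{r:+.2f}" for r in row))
```

A pattern like this would suggest two underlying constructs rather than one, which is the kind of structure factor analysis is used to identify formally.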

Overview

• Psychometrics
  • Psychological measurement
• Classical Test Theory
• Reliability
  • Test-Retest
  • Inter-item (internal)
  • Inter-rater
• Validity
  • Content
  • Criterion-related
  • Construct

(Trochim, 2006)

http://www.socialresearchmethods.net/kb/relandval.php

Resources

• Software
  • SPSS: http://www-01.ibm.com/software/analytics/spss/
  • PSPP: http://www.gnu.org/software/pspp/
  • R: http://www.r-project.org/
• Videos
  • Educator.com: http://www.educator.com/
  • CLI: Research Seminars: http://www.indwes.edu/CLI/Research/Research-Seminars/
  • Andy Field: https://www.youtube.com/playlist?list=PL343F1B5F55734D55
• Websites
  • Social Research Methods: http://www.socialresearchmethods.net/
  • Institute for Digital Research and Education: http://www.ats.ucla.edu/stat/
  • Statistics Help for Students: http://statistics-help-for-students.com/
  • Stat Pages: http://statpages.org/

References

Anastasi, A. (1988). Psychological testing (6th ed.). New York, NY: Macmillan.

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York, NY: John Wiley & Sons.

Jones, L. V., & Thissen, D. (2007). A history and overview of psychometrics. Handbook of statistics, 26, 1-28.

Kaplan, R., & Saccuzzo, D. (2012). Psychological testing: Principles, applications, and issues. Belmont, CA: Cengage Learning.

Kline, T. J. B. (2005). Classical test theory: Assumptions, equations, limitations, and item analyses. In T. J. B. Kline, Psychological testing: A practical approach to design and evaluation (pp. 91-106). Thousand Oaks, CA: Sage.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum.

Trochim, W. M. K. (2006). Reliability and validity. Retrieved from http://www.socialresearchmethods.net/kb/relandval.php

Questions

• [email protected]

• [email protected]

• EdProfessor.com

http://www.edprofessor.com/
