Applications of IRT Models

DIF and CAT

Which of these is the situation of a biased test?

The average score for males and females on an item is not the same.

The correlation between males’ scores on an item is stronger than that for the females’ scores.

A group of males and females with exactly the same ability achieve different scores on an item.

Disentangling the Terminology

Item impact
Item impact is evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item.

DIF
The differential probability of a correct response for examinees at the same trait level but from different groups. DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure.

Item bias
Item bias occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose.

Adverse impact
Adverse impact is a legal term describing the situation in which group differences in test performance result in disproportionate examinee selection or related decisions (e.g., promotion). This is not evidence for test bias.

No DIF

There are two types of DIF

Uniform DIF
The reference group always has a higher probability of a correct response than the focal group.

Non-uniform DIF
The direction of the advantage in one group’s likelihood of a correct response changes in different regions of the ability scale.

Uniform DIF

Non-uniform DIF
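The two DIF types can be told apart by the sign of the gap between the two groups' item response curves. The sketch below assumes a 2PL logistic response function with scaling constant D = 1.7; the function names and item parameters are hypothetical and only illustrate the definitions. It is a descriptive check on known curves, not a statistical test.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def dif_type(ref, focal, thetas):
    """Classify DIF from the sign pattern of P_ref - P_focal over a theta grid.

    ref and focal are hypothetical (a, b) parameter tuples for one item.
    """
    diffs = [irf_2pl(t, *ref) - irf_2pl(t, *focal) for t in thetas]
    signs = {d > 0 for d in diffs if abs(d) > 1e-6}
    if not signs:
        return "no DIF"
    return "uniform DIF" if len(signs) == 1 else "non-uniform DIF"

thetas = [x / 10 for x in range(-30, 31)]
print(dif_type((1.0, 0.0), (1.0, 0.0), thetas))  # identical curves
print(dif_type((1.0, 0.0), (1.0, 0.5), thetas))  # item harder for focal group
print(dif_type((1.5, 0.0), (0.7, 0.0), thetas))  # curves cross at theta = 0
```

A difference in b alone shifts one curve and gives uniform DIF; a difference in a makes the curves cross, which gives non-uniform DIF.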

Differential Test Functioning

[Figure: DTF Against Reference Group. Test characteristic curves for the focal and reference groups. X-axis: Theta (-3.0 to 3.0). Y-axis: Proportion Correct True Score (0.0 to 1.0).]

Relationship between IRT and CTST models

It has been shown that there is a relationship between 2PL normal ogive IRT models and the single-factor FA model (Lord & Novick, 1968)
• The b-parameter is related to the threshold parameter divided by the item factor loading
• The discrimination parameter is equal to the factor loading divided by the square root of the item's uniqueness (one minus the communality)
• Highly discriminating items will have high factor loadings
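As a rough illustration of this correspondence, the sketch below assumes standardized loadings and thresholds from a single-factor model and the usual normal-ogive conversions a = loading / sqrt(1 - loading^2) and b = threshold / loading. The function names are hypothetical.

```python
import math

def fa_to_irt(loading, threshold):
    """Convert a standardized loading and threshold to normal-ogive a and b."""
    uniqueness = 1.0 - loading ** 2          # one minus the communality
    a = loading / math.sqrt(uniqueness)
    b = threshold / loading
    return a, b

def irt_to_fa(a, b):
    """Inverse conversion: recover the loading and threshold from a and b."""
    loading = a / math.sqrt(1.0 + a ** 2)
    threshold = b * loading
    return loading, threshold

for loading in (0.3, 0.6, 0.9):
    a, b = fa_to_irt(loading, threshold=0.5)
    print(f"loading={loading:.1f}  a={a:.3f}  b={b:.3f}")
```

The printed values show discrimination rising sharply as the loading approaches 1, which is the sense in which highly discriminating items have high loadings.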

Examining Measurement Invariance in CTST

Examining factorial invariance
• Configural invariance
  Zero and non-zero loading patterns are the same across groups
• Pattern (metric) invariance
  The factor loadings are equal across groups
• Scalar (strong) invariance
  The factor loadings and intercepts are equal across groups
  Any group differences in means can be attributed to the common factors, which allows for meaningful group mean comparisons
• Strict invariance
  Factor loadings, intercepts, and unique variances are equal across groups
  Any systematic differences in group means, variances, or covariances are due to the common factors

Examining DIF in IRT

IRT tests of DIF examine whether the IRC (item response curve) is the same for the reference group as it is for the focal group.
• The focal group is the smaller group in question (the minority group).
• The reference group is the larger group that generally has the established parameters.
• If the curves differ, the probability that an individual with ability x in one group responds correctly differs from the probability that an individual with the same ability x in the other group gets the item correct.

DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.

DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
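A minimal sketch of the test characteristic curve idea, assuming 2PL logistic items with D = 1.7. The three-item test and its parameters are invented for illustration; only the second item differs between groups.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: sum of the item response functions."""
    return sum(irf_2pl(theta, a, b) for a, b in items)

# Hypothetical (a, b) parameters; item 2 is harder for the focal group.
reference_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
focal_items = [(1.0, -0.5), (1.2, 0.4), (0.8, 0.7)]

for theta in (-1.0, 0.0, 1.0):
    gap = tcc(theta, reference_items) - tcc(theta, focal_items)
    print(f"theta={theta:+.1f}  expected-score gap={gap:.3f}")
```

The gap is the difference in expected number-correct score at a given theta. Dividing each curve by the number of items gives a proportion-correct true score scale.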

Procedures for Detecting DIF/DTF

Parametric Procedures
• Compare item parameters from two groups of examinees
  Lord’s Chi-Square
  Likelihood Ratio Test
• Compare IRFs from two groups of examinees by measuring areas between them
  Raju’s Area Measures

Likelihood Ratio Test

Distributed as a chi-square with degrees of freedom equal to the difference in the number of parameters estimated in the compact and the augmented model
• The compact model assumes item parameters are the same for both groups
• The augmented model constrains anchor items to be equal, but allows items of interest to have parameters that vary across groups

$$G_j^2 = -2\log L(\text{compact model}) - \left[-2\log L(\text{augmented model})\right]$$
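The arithmetic of the likelihood ratio test can be sketched as follows. The log-likelihoods and parameter counts are made-up numbers standing in for output from an IRT calibration program, and `chi2_sf` is a small stdlib-only helper written for this example (valid for integer degrees of freedom), not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence on the regularized upper incomplete gamma function.
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lr_dif_test(loglik_compact, loglik_augmented, n_par_compact, n_par_augmented):
    """Return G^2, its degrees of freedom, and the chi-square p-value."""
    g2 = -2.0 * loglik_compact - (-2.0 * loglik_augmented)
    df = n_par_augmented - n_par_compact
    return g2, df, chi2_sf(g2, df)

# Hypothetical values: freeing a and b for one studied item adds 2 parameters.
g2, df, p = lr_dif_test(-5230.4, -5225.1, 40, 42)
print(f"G2={g2:.2f}, df={df}, p={p:.4f}")
```

A small p-value suggests that the studied item's parameters differ between groups, which is evidence of DIF for that item.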

Raju’s Area Measures

Signed and unsigned areas
• Indicate the area between two IRCs
• Require separate calibrations of the item parameters in each group, followed by a linear transformation to put them on the same scale

$$\text{Signed area} = b_2 - b_1$$

$$\text{Unsigned area} = \left| b_2 - b_1 \right| \quad \text{(when } a_1 = a_2\text{)}$$

$$\text{Unsigned area} = \left| \frac{2(a_2 - a_1)}{D a_1 a_2} \ln\left[ 1 + \exp\left( \frac{D a_1 a_2 (b_2 - b_1)}{a_2 - a_1} \right) \right] - (b_2 - b_1) \right|$$
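The area measures can be computed directly from the two sets of item parameters once they are on a common scale. This sketch assumes 2PL logistic items with D = 1.7 and uses Raju's closed-form expressions; the function names are hypothetical.

```python
import math

def signed_area(b1, b2):
    """Signed area between two IRCs with equal lower asymptotes."""
    return b2 - b1

def unsigned_area(a1, b1, a2, b2, D=1.7):
    """Unsigned area between two 2PL IRCs (closed form)."""
    if math.isclose(a1, a2):
        return abs(b2 - b1)
    k = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    # ln(1 + exp(k)), written to avoid overflow when k is large.
    log_term = k + math.log1p(math.exp(-k)) if k > 0 else math.log1p(math.exp(k))
    return abs(2.0 * (a2 - a1) / (D * a1 * a2) * log_term - (b2 - b1))

print(signed_area(-0.2, 0.3))
print(unsigned_area(0.8, -0.2, 1.4, 0.3))
```

When the discriminations differ, the curves cross, so the signed area can be near zero while the unsigned area is not. That pattern points to non-uniform DIF.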


Non-Parametric Procedures

Bivariate frequencies between item responses and group memberships, conditional on levels of ability or trait estimation
• Simultaneous Item Bias Test (SIBTEST)
• Mantel-Haenszel (MH)
• Logistic Regression


Simultaneous Item Bias Test (SIBTEST)
• Examinees are matched on a true-score estimate of ability
• Creates a weighted mean difference between the reference and focal groups, which is then tested statistically
• The means are adjusted to correct for differences in the ability distributions with a regression correction procedure
• Some studies have examined how Type I error rates change when the percentage of DIF items is large

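SIBTEST

$$H_0: \beta_{UNI} = 0$$

$$H_1: \beta_{UNI} \neq 0$$

$$\beta_{UNI} = \int B(\theta)\, f_F(\theta)\, d\theta$$

$$B(\theta) = P(\theta, R) - P(\theta, F)$$

$f_F(\theta)$ is the density function for $\theta$ in the focal group
$d\theta$ is the differential of theta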
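The population quantity that SIBTEST targets can be approximated numerically when the item response functions are known. The sketch below assumes 2PL logistic curves and a normal focal-group density, and integrates B(theta) f_F(theta) with a midpoint rule. It does not implement the operational SIBTEST estimator, which works from observed scores with a regression correction. All names and parameter values are illustrative.

```python
import math

def irf_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def normal_pdf(theta, mean=0.0, sd=1.0):
    """Normal density, used here as the assumed focal-group distribution."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def beta_uni(ref, focal, focal_mean=0.0, focal_sd=1.0, lo=-6.0, hi=6.0, n=1200):
    """Midpoint-rule approximation of the integral of B(theta) * f_F(theta)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * step
        b_theta = irf_2pl(theta, *ref) - irf_2pl(theta, *focal)
        total += b_theta * normal_pdf(theta, focal_mean, focal_sd) * step
    return total

print(round(beta_uni((1.0, 0.0), (1.0, 0.5)), 4))  # uniform DIF against focal group
print(round(beta_uni((1.5, 0.0), (0.7, 0.0)), 4))  # crossing curves largely cancel
```

The second call illustrates why a unidirectional index can miss non-uniform DIF: positive and negative differences offset each other in the integral.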

Mantel-Haenszel (MH)

• Compares the item performance of two groups who were previously matched on the ability scale
• Total test score can be used as the matching variable
• K 2x2 contingency tables are made for each item, one for each of K ability levels
• DIF is shown if the odds of correctly answering the item at a given score level are different for the two groups


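Response to Suspect Item

Group j            Right (1)    Wrong (0)
Reference group    A_j          B_j
Focal group        C_j          D_j

No DIF: the odds of a correct response are equal for the two groups at every score level j

$$\frac{p_{Rj}}{1 - p_{Rj}} = \frac{p_{Fj}}{1 - p_{Fj}}$$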


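The statistic for detecting DIF in an item is

$$\chi^2_{MH} = \frac{\left( \left| \sum_{j=1}^{K} A_j - \sum_{j=1}^{K} E(A_j) \right| - 0.5 \right)^2}{\sum_{j=1}^{K} \mathrm{Var}(A_j)}$$

$$\alpha_{MH} = \frac{\sum_{j=1}^{K} A_j D_j / N_j}{\sum_{j=1}^{K} B_j C_j / N_j}$$

$$\Delta\alpha_{MH} = -2.35 \ln(\alpha_{MH})$$

where $N_j$ is the total number of examinees at score level j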

• Type A items: negligible DIF with |ΔαMH| < 1

• Type B items: moderate DIF with 1 <= |ΔαMH| <= 1.5, and the MH test is statistically significant

• Type C items: large DIF with |ΔαMH| > 1.5
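The MH quantities can be computed from the K contingency tables as sketched below. The counts are invented for illustration, E(A_j) and Var(A_j) use the usual hypergeometric expressions, and treating a non-significant item in the middle range as Type A is an assumption of this sketch rather than something stated above.

```python
import math

def mantel_haenszel(tables):
    """tables: list of (A, B, C, D) counts, one 2x2 table per score level."""
    sum_a = sum_e = sum_var = num = den = 0.0
    for A, B, C, D in tables:
        n = A + B + C + D
        if n < 2:
            continue  # a level with fewer than two examinees carries no information
        n_ref, n_foc = A + B, C + D
        m_right, m_wrong = A + C, B + D
        sum_a += A
        sum_e += n_ref * m_right / n
        sum_var += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
        num += A * D / n
        den += B * C / n
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    return chi2, alpha, delta

def dif_category(delta, significant):
    """A/B/C label from the size of delta (assumed rule for the B range)."""
    size = abs(delta)
    if size < 1.0:
        return "A"
    if size <= 1.5:
        return "B" if significant else "A"
    return "C"

tables = [(40, 10, 30, 20), (30, 20, 20, 30), (20, 30, 10, 40)]
chi2, alpha, delta = mantel_haenszel(tables)
print(round(chi2, 2), round(alpha, 2), round(delta, 2))
print(dif_category(delta, significant=chi2 > 3.84))
```

With this sign convention a negative delta means the item favors the reference group, since alpha above 1 indicates higher odds of success for reference examinees.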

Logistic Regression

$$p(u = 1 \mid X) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

$$f(x) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG$$

$p(u = 1 \mid X)$ is the conditional probability of obtaining a correct answer given independent variables $X$
$G$ is the independent (group) variable
$X$ is the matching criterion (normally test score)

• If the group effect is significant and the interaction is not, then there is uniform DIF
• If the interaction is significant, then there is non-uniform DIF
• Conduct model comparisons by adding each successive model term
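To see why the group term corresponds to uniform DIF and the interaction term to non-uniform DIF, the sketch below evaluates the model for two hypothetical sets of coefficients. The coefficient values are invented and no model fitting is shown. The log-odds gap between groups is constant when the interaction coefficient is zero and changes with the matching score otherwise.

```python
import math

def p_correct(x, g, b0, b1, b2, b3):
    """P(u = 1 | X, G) with f(x) = b0 + b1*X + b2*G + b3*X*G."""
    f = b0 + b1 * x + b2 * g + b3 * x * g
    return math.exp(f) / (1.0 + math.exp(f))

def log_odds_gap(x, coefs):
    """Log-odds for group G=1 minus log-odds for group G=0 at score x."""
    p1 = p_correct(x, 1, *coefs)
    p0 = p_correct(x, 0, *coefs)
    return math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))

uniform = (-2.0, 0.2, -0.5, 0.0)      # group effect only
non_uniform = (-2.0, 0.2, -1.0, 0.1)  # group effect plus interaction

for x in (0, 10, 20):
    print(x, round(log_odds_gap(x, uniform), 3), round(log_odds_gap(x, non_uniform), 3))
```

In the second coefficient set the gap changes sign across the score range, so the group that is advantaged at low scores is disadvantaged at high scores.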

Computerized Adaptive Testing (CAT)

Goal: obtain precision of measurement equal to that of a linear test, but with greater efficiency.
• Give people only the items that are informative about them.
• Reduce testing time and opportunity for error.

CAT System

• Initial ability estimate: mean, prior
• Select first item: most discriminating, least discriminating
• Estimate ability: MLE, Bayesian methods
• Select items: maximum information, exposure control, content specifications
• Check stopping rule: SE stopping rule, maximum number of items
• Estimate ability
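The loop above can be sketched as a small simulation. This version assumes 2PL logistic items, a standard normal prior, maximum-information item selection, grid-based EAP (Bayesian) ability estimation, and a stopping rule based on either the posterior SD or a maximum test length. It leaves out exposure control and content specifications, and the item pool, names, and thresholds are all hypothetical.

```python
import math
import random

def irf(theta, a, b, D=1.7):
    """2PL logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def info(theta, a, b, D=1.7):
    """Fisher information of a 2PL item at theta."""
    p = irf(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

GRID = [x / 10 for x in range(-40, 41)]

def eap(responses):
    """Posterior mean and SD of theta on a grid, standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for (a, b), u in responses:
            p = irf(t, a, b)
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(pool, true_theta, se_target=0.35, max_items=20, seed=0):
    rng = random.Random(seed)
    theta, se = 0.0, 1.0                 # initial estimate: prior mean and SD
    available = list(pool)
    responses = []
    while available and len(responses) < max_items and se > se_target:
        item = max(available, key=lambda ab: info(theta, *ab))  # max information
        available.remove(item)
        u = rng.random() < irf(true_theta, *item)               # simulated response
        responses.append((item, u))
        theta, se = eap(responses)                              # re-estimate ability
    return theta, se, len(responses)

pool = [(0.6 + 0.1 * (i % 10), -2.5 + 0.1 * i) for i in range(50)]
theta_hat, se, n_used = run_cat(pool, true_theta=0.8)
print(round(theta_hat, 2), round(se, 2), n_used)
```

Swapping `eap` for a maximum likelihood estimator, or changing the `max(...)` selection rule, corresponds to the alternatives listed for each step.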

Issues of Research in a CAT system

Early Issues
• Precision of measurement
  Estimation procedure, prior estimates
• Equivalence
  Reliability of estimate, test form equivalence (test information), testing mode
• Efficiency
  Item selection methods, test length

Newer Issues
• Security
  Item exposure
• Testlet models

Item Exposure and Item Selection Methods

Sympson-Hetter
• Directly controls item exposure probabilistically
• Places a filter between item selection and item administration
• Items are administered below a prespecified maximum exposure rate
• P(S): probability that an item is selected as the best item
• P(A): probability that an item is administered
• P(A|S): conditional probability that an item is administered given that it is selected (the item exposure parameter)
• P(A) = P(A|S) * P(S) <= r_max
• P(A|S) is easy to determine if P(S) is known, but P(S) must be determined through an iterative process
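A minimal sketch of the filter, assuming the selection probabilities P(S) have already been estimated (in practice from repeated CAT simulations, since P(S) shifts once the filter is in place). The item names, probabilities, and the choice to return `None` when every candidate is filtered out are assumptions made for illustration.

```python
import random

def exposure_parameters(p_selected, r_max):
    """P(A|S) per item so that P(A) = P(A|S) * P(S) stays at or below r_max."""
    return {
        item: min(1.0, r_max / p_s) if p_s > 0 else 1.0
        for item, p_s in p_selected.items()
    }

def administer(ranked_items, k, rng):
    """Walk down the ranked candidates; give the first item that passes its filter."""
    for item in ranked_items:
        if rng.random() <= k[item]:
            return item
    return None  # every candidate was filtered out (assumed fallback)

# Hypothetical selection probabilities from a simulation.
p_selected = {"item1": 0.60, "item2": 0.25, "item3": 0.10}
k = exposure_parameters(p_selected, r_max=0.20)
print({item: round(value, 3) for item, value in k.items()})
print(administer(["item1", "item2", "item3"], k, random.Random(1)))
```

Items selected less often than r_max get an exposure parameter of 1 and are never blocked. Popular items are passed over some of the time in favor of the next-best candidate.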


Conditional Sympson-Hetter or SLC (Stocking and Lewis, 1998)
• SH controls item exposure for a population, but at various ability levels the exposure rates can be quite high
• P(A|S) is determined at specific trait levels rather than across a population


a-stratified design (STR CAT; Chang & Ying, 1996, 1999)
• Partition the item pool into multiple levels and multiple stages according to the discrimination parameters
• Start with the less discriminating items
• This approach seems to improve item pool utilization and balance item exposure rates
• Then use a b-matching item selection procedure
• It is less computationally complex
• No other restrictions on item exposure are imposed
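The two steps can be sketched as follows: first split the pool into strata of ascending discrimination, then, within the stratum for the current stage, pick the unused item whose difficulty is closest to the current ability estimate. The pool values and function names are hypothetical, and equal-sized strata are an assumption of this sketch.

```python
import math

def stratify_by_a(pool, n_strata):
    """Split (a, b) items into strata ordered from low to high discrimination."""
    ordered = sorted(pool, key=lambda item: item[0])
    size = math.ceil(len(ordered) / n_strata)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def select_item(stratum, theta, used):
    """b-matching: the unused item whose difficulty is closest to theta."""
    candidates = [item for item in stratum if item not in used]
    return min(candidates, key=lambda item: abs(item[1] - theta))

# Hypothetical pool of 12 items as (a, b) pairs.
pool = [(0.5 + 0.1 * i, ((i * 7) % 12 - 6) / 3) for i in range(12)]
strata = stratify_by_a(pool, n_strata=3)

used = set()
first = select_item(strata[0], theta=0.0, used=used)  # early stage: low-a stratum
used.add(first)
print(first)
```

Later stages would draw from `strata[1]` and `strata[2]`, so the most discriminating items are held back until the ability estimate has stabilized.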