Running head: IMPUTATION METHODS FOR NULL CATEGORIES
Imputation Methods for Handling Null Categories in Polytomous Items
Leslie Keng
The University of Texas at Austin
Ahmet Turhan
Pearson Educational Measurement
This paper will be presented at the annual meeting of the National Council on Measurement
in Education, Chicago, IL, April 2007.
Correspondence concerning this article should be addressed to Leslie Keng, Department of
Educational Psychology, 1 University Station, Mail Station D5800, University of Texas, Austin, TX,
78712. E-mail: lkeng@mail.utexas.edu.
Imputation Methods for Null Categories 1
Abstract
In large-scale assessments, low examinee motivation on field-tests can lead to the lack of
observed scores at the highest score levels of polytomous items. Such a score level is referred to as
an extreme null category. Imputation methods can be used to assign examinees to the extreme null
category so that the polytomous item can be calibrated. The current study compares three methods of
imputing extreme null categories. The three methods differ in the amount of information used to
decide on the target and frequency of imputation. Item parameters from 27 field-test forms are
taken from a recent administration of a statewide ELA assessment. These item parameters are used
to simulate item responses for 4,050,000 total examinees with ability parameters generated from a
N(-0.5, 1) distribution. The three imputation methods are applied to the simulated datasets and are
compared on several goodness-of-recovery (GOR) measures. The study finds that, when the overall
imputation demand is high, the use of historical performance data and random sampling to impute
scores leads to poor parameter recovery and negatively biased estimation of the highest step values
for imputed items. In contrast, the use of information from the current test form to impute extreme
null categories produces good overall parameter recovery, and is recommended as the ideal
imputation method because of its ease of implementation. Educational implications and limitations
of the study findings are also discussed.
Imputation Methods for Handling Null Categories in Polytomous Items
In high-stakes assessments, items are typically field-tested prior to inclusion on an
operational test. Because the quality and characteristics of the item are being tested, an examinee’s
performance on the field-test items typically does not count towards his or her overall score. In
many cases, field-test items are embedded into operational test forms so that the examinees cannot
distinguish between field-test items and operational items. Sometimes, however, embedding field-
test items is not feasible, and separate field-test forms are administered. In these cases, examinees
usually realize that their test scores do not count. As a result, their motivation on the test is low and
they do not try as hard as they would in an operational setting. Researchers have examined this field-
test effect and shown that students generally do not perform as well on low-stakes field tests as on
the high-stakes operational administrations (DeMars, 2000; Wolf & Smith, 1995).
One consequence of the field-test effect on a polytomously-scored item is that it can
negatively skew the item’s score distribution. In cases where the item itself is more difficult, the
field-test effect can skew the distribution to a point where no students receive a score in the highest
category. For example, for a 4-point essay item on an English Language Arts test, there may be no
students who attain a score point of “4” on the field test. This poses a problem during item
calibration. If, for instance, we use the Partial Credit model (Masters, 1982) to calibrate a
polytomous item, then the lack of scores in the top category would preclude us from estimating a
step value for that category.
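To see why, consider the Partial Credit model's category probabilities: each category's probability is built from a cumulative sum of (θ − step difficulty), and a category that no examinee reaches supplies no responses from which its step value can be estimated. A minimal sketch with hypothetical step values (Python here, purely for illustration):

```python
import math

def pcm_probs(theta, steps):
    """Partial Credit model category probabilities for scores 0..m,
    given ability theta and step difficulties [d1, ..., dm]."""
    cum = [0.0]                       # score 0 has an empty (zero) sum
    for d in steps:
        cum.append(cum[-1] + (theta - d))
    exps = [math.exp(v) for v in cum]
    total = sum(exps)
    return [e / total for e in exps]

# A hard 4-point item (hypothetical step values): even at theta = 0 the top
# category is rare, so a modest field-test sample may contain no 3s at all.
probs = pcm_probs(theta=0.0, steps=[-0.5, 0.8, 2.5])
print([round(p, 3) for p in probs])   # → [0.29, 0.478, 0.215, 0.018]
```

With the top category expected for fewer than 2% of such examinees, a few thousand unmotivated field-test takers can easily produce zero observations there.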
Wilson and Masters (1993) have termed a score category with zero frequency a null category.
Several approaches have been employed to deal with null categories. One common approach is to
collapse out a null score category by reducing the score of any higher categories by one. This
approach alters the relationship between the substantive framework and the scoring scheme for such
an item, and is generally not recommended (Wilson, 1991). Wilson and Masters (1993) describe a
simple reformulation of the Partial Credit model that allows all categories to be retained and their
step values estimated. This approach has been implemented in several Rasch calibration programs,
such as with the “STKEEP” option in WINSTEPS (Linacre, 2003). The approach, however,
requires the null category to be an intermediate one and does not apply when it is an extreme one, such
as the highest or lowest score category.
One way to handle an extreme null category is with imputation. With imputation, a group of
examinees is chosen and assigned the score value of the extreme null category. The group of
examinees to impute may be chosen based on their item response information for similar test items
or related test sections. Thus, a variety of imputation methods are possible in a given context.
Imputation Methods
Two key issues need to be considered when imputing data. The first is the frequency; that is,
how many examinees should we impute? The second is the target; that is, for which examinees
should we impute the highest score value? How a practitioner addresses these two issues leads to
imputation methods that vary in the amount of examinee item response information used. These
methods can be broadly classified into three types. The three types of imputation methods are
described below from one that uses the least amount of information to one that uses the most.
1. Use Fixed Percentage and Random Sampling. For imputation methods in this category, the frequency
of examinees imputed is a fixed percentage, and the target for imputation is sampled from the
entire group of examinees. The fixed percentage can be determined from historical data, such as
the mean or median percentage of examinees who achieved the highest score level for similar
items on previous administrations of the assessment. The fixed percentage of examinees is then
randomly sampled from all examinees who took the test form in question; and the sampled
examinees are assigned the highest score value for the polytomous item with a null category.
2. Use Information from Current Test Form Only. For imputation methods in this category, the
frequency and target of imputation are based on the performance of examinees in a related
section of the same test form. For example, one could use the performance of the examinees in
the dichotomously-scored multiple-choice section on the same test form. All examinees who
achieve the highest total score in the multiple-choice section can be assigned the highest score
value for the polytomous item in question.
3. Use Information from Related Test Forms. For this method, the performance of examinees on other
related test forms is used to determine the imputation frequency and target. For example, the
performance of examinees on similar polytomous items on the other test forms in the same field-
test administration can be examined. Similar items can be, for instance, items that are the same
item type and in the same item position on the other test forms. For all similar items that do not
have a null category, the percentage of students who have achieved the highest score level can be
computed. That same percentage (frequency) of examinees from the test form in question can
then be imputed. The examinees to impute (target) should match the ones who achieved the
highest score level on the other test forms. This could be done by sampling examinees that have
equivalent percentile ranks on total scores in a related section, such as the multiple-choice items.
Research Objective
Our review of the literature failed to reveal any research that evaluated or compared imputation
methods for handling extreme null categories. In practice, however, imputation is routinely used in
high-stakes assessments on polytomous items from field-test forms that have null categories for
their highest score level. The purpose of our study is to compare different methods of imputing
data for this type of null category. Specifically, our study aims to answer the question: What type of
information used in the imputation method leads to the most accurate item parameter estimates?
Common sense would lead one to believe that it is better to use as much information as
possible. However, using more information requires more time and effort on the part of the
practitioner to implement the method, and the trade-off between accuracy of parameter estimation
and implementation time and effort may not be favorable. Thus, to address our research question,
implementations of the three types of imputation methods described above are applied to simulated
item response data, and their performances are compared. The findings should help inform
researchers and practitioners on the appropriate amount of information to use when it is necessary
to impute data for field-test administrations of large-scale assessments.
Method
Sample
The study used simulated datasets based on the item parameters from a real dataset. The
real dataset was taken from a recent field-test administration of a statewide English language arts
(ELA) assessment. For this separate field-test administration, a total of 92,996 students took 31
field-test forms such that each test form had on average 3,000 students responding. The ELA
assessment consists of 42 multiple-choice items, 3 open-ended short response items, and 1 extended
response essay item. The maximum score an examinee could receive on an open-ended short
response item was 3; thus, open-ended items were scored on a 4-point scale (i.e. possible open-ended
item scores of 0, 1, 2, or 3). The maximum score for the extended response essay item was 4, so
essay items were scored on a 5-point scale (i.e. possible essay item scores of 0, 1, 2, 3, or 4). The Partial
Credit model (Masters, 1982) was used to calibrate all field-test items. The Partial Credit model
simplifies to the Rasch model (Rasch, 1980) in the calibration of the dichotomously-scored multiple
choice items.
The simulated datasets were constructed based on the estimated item parameters for 27 of
the 31 field-test forms in the real dataset. The item parameter estimates for these 27 forms were
used as the true parameter values in the simulated datasets. Thus, each test form in the simulated
dataset consisted of 42 multiple-choice items with true Rasch item difficulties matching those in one
of the real test forms. It also consisted of 3 open-ended items and 1 essay item with true step values
taken from the corresponding real test form.
Item responses were generated for 3,000 students per test form and a total of 50 replications
were conducted for each test form. This resulted in a total of (3,000 examinees × 50 replications =)
150,000 sets of simulated examinee responses per test form across the replications; and a grand total
of (150,000 examinee response sets × 27 test forms =) 4,050,000 examinee response sets across all
test forms replications. Within each test form replication, the 3,000 examinee ability parameters (θ)
were generated from a normal distribution with a mean of -0.5 and standard deviation of 1. A mean
of -0.5 was used to account for the field-test effect. This value is what has been historically
observed as the mean difference in ability estimates between the field tests and operational tests for
ELA students at this grade level. Each examinee’s item responses were generated based on the
Partial Credit model (Masters, 1982), which reduces to the Rasch model (Rasch, 1980) for the
dichotomous multiple-choice items. The item response generation code was implemented in SAS
(SAS Institute Inc., 2001).
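A minimal Python sketch of this generation step (the study's actual code was written in SAS; the step values below are hypothetical):

```python
import math
import random

def simulate_pcm_item(thetas, steps, rng):
    """Draw one Partial Credit model response per examinee for a single item."""
    responses = []
    for theta in thetas:
        # Category probabilities: exponentiated cumulative sums of (theta - d_j),
        # normalized over scores 0..m (the score-0 sum is empty, i.e. 0).
        cum = [0.0]
        for d in steps:
            cum.append(cum[-1] + (theta - d))
        exps = [math.exp(v) for v in cum]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Inverse-CDF draw of a score category.
        u, acc = rng.random(), 0.0
        score = len(probs) - 1  # fallback guards against floating-point undershoot
        for s, p in enumerate(probs):
            acc += p
            if u < acc:
                score = s
                break
        responses.append(score)
    return responses

rng = random.Random(2007)
# 3,000 abilities from N(-0.5, 1), mirroring the field-test effect in the study.
thetas = [rng.gauss(-0.5, 1.0) for _ in range(3000)]
scores = simulate_pcm_item(thetas, steps=[-0.4, 0.9, 2.6], rng=rng)
print(len(scores), min(scores), max(scores))
```

With a single-item loop like this run over 42 multiple-choice items (one step value each) and 4 polytomous items per form, the full 27-form, 50-replication design follows directly.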
Procedures
Implementations of the three types of imputation methods described above were used to
assign the highest score point for any open-ended or essay item in the simulated datasets that had
extreme null categories. The three imputation methods were also implemented in SAS (SAS Institute
Inc., 2001).
For the first imputation method (Method 1), a fixed percentage of examinees was randomly
sampled and imputed. The fixed percentage was determined from historical data. Specifically, the
percentage of examinees who attained the highest score level for each open-ended and essay item
was obtained for the past four (2003-2006) operational ELA administrations at this grade level. The
median of the four percentages was computed for each item and used as the fixed percentage to
impute. The median percentage, instead of the mean, was used to mitigate the influence of outliers;
in particular, the unusually large percentage of score point 4’s observed for the essay item in 2005.
Table 1 gives the historical percentages and median percentage for the open-ended and essay items.
Table 1: Historical (2003-2006) Percentages for Highest Score Level on the Open-ended and Essay Items for the Operational ELA Assessment

Year     Open-ended #1   Open-ended #2   Open-ended #3   Essay
2003         0.17%           0.44%           0.30%       3.55%
2004         0.52%           0.36%           0.89%       3.86%
2005         0.41%           0.69%           0.62%       6.50%
2006         0.31%           0.76%           0.22%       3.96%
Median       0.36%           0.57%           0.26%       3.70%
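A Python sketch of Method 1 (the study's implementation was in SAS). The toy responses below are randomly generated, and the historical percentages are the essay values from Table 1; the resulting count is illustrative and is not meant to reproduce the study's exact imputation frequencies:

```python
import random
import statistics

def method1_impute(scores, historical_pcts, top_score, rng):
    """Method 1 sketch: impute a fixed percentage (the median of historical
    percentages of top scores) by random sampling from all examinees."""
    pct = statistics.median(historical_pcts)
    n_impute = round(len(scores) * pct / 100.0)
    imputed = list(scores)
    for i in rng.sample(range(len(scores)), n_impute):
        imputed[i] = top_score
    return imputed

rng = random.Random(0)
# A toy essay item with an extreme null category: no examinee scored 4.
scores = [rng.choice([0, 1, 2, 3]) for _ in range(3000)]
post = method1_impute(scores, historical_pcts=[3.55, 3.86, 6.50, 3.96],
                      top_score=4, rng=rng)
print(post.count(4))
```

Because the percentage is fixed in advance and the form size is constant, the same number of examinees is imputed on every form, regardless of how examinees actually performed.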
For the second imputation method (Method 2), all examinees who achieved the highest total
score on the 42 multiple-choice items on the same test form were imputed. Thus, for example,
suppose that on Test Form #18 in the current replication, one of the open-ended items had a null
category (i.e. no examinees with a score point 3). Suppose further that the highest total score on
the multiple-choice section for the current form was 38 and it was attained by two of the simulated
examinees. Then, this method would assign a score point of 3 to these two examinees for the
open-ended item in question.
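The example above can be sketched directly in Python (hypothetical toy data; the study's implementation was in SAS):

```python
def method2_impute(poly_scores, mc_totals, top_score):
    """Method 2 sketch: assign the top polytomous score to every examinee who
    attained the highest multiple-choice total on the same test form."""
    best = max(mc_totals)
    return [top_score if mc == best else s
            for s, mc in zip(poly_scores, mc_totals)]

# Five toy examinees; the highest MC total (38) is attained by two of them,
# and the open-ended item has no score-3 responses observed.
mc_totals = [30, 38, 25, 38, 31]
poly_scores = [1, 2, 0, 2, 1]
print(method2_impute(poly_scores, mc_totals, top_score=3))  # → [1, 3, 0, 3, 1]
```

Note that the imputation frequency falls out of the data: however many examinees tie for the top multiple-choice total is how many scores are imputed.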
For the third imputation method (Method 3), the frequency and target of imputation were
based on the performance of examinees on similar items on other related test forms. Specifically, for
each of the 27 simulated test forms in a replication, examinees with percentile ranks of 75 or higher
(i.e. the top 25% of examinees) on the form's multiple-choice section were first identified. The proportion
of these top 25% examinees that attained the highest score point was determined and then averaged
across all 27 test forms for each polytomous item. For any test form with an extreme null category
in one of its polytomous items, the average proportion of top 25% examinees who achieved the
highest score point for the item in the same position was used to impute from the top 25%
examinees on the test form in question. So, for example, suppose that on Test Form #18, the first
open-ended item had an extreme null category. Suppose also that across all 27 forms in the current
replication, the mean percentage of top 25% examinees with a score point 3 on the first open-ended
item was .50%. Then, this method would randomly choose .50% of the top 25% examinees on Test
Form #18 and assign them a score of 3 for the first open-ended item.
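A sketch of this procedure with tiny hypothetical forms (8 examinees each, so the "top 25%" is simply the top two by multiple-choice total; the study's SAS version worked with 3,000 examinees per form):

```python
import random

def top25(mc_totals):
    """Indices of the top 25% of examinees by multiple-choice total (a simple cut)."""
    k = max(1, len(mc_totals) // 4)
    order = sorted(range(len(mc_totals)), key=lambda i: -mc_totals[i])
    return order[:k]

def method3_impute(null_form, related_forms, top_score, rng):
    """Method 3 sketch: the mean share of top-25% examinees earning the top score
    on related forms sets the imputation rate among the top 25% on the null form."""
    rates = []
    for poly, mc in related_forms:
        idx = top25(mc)
        rates.append(sum(1 for i in idx if poly[i] == top_score) / len(idx))
    rate = sum(rates) / len(rates)

    poly, mc = null_form
    idx = top25(mc)
    imputed = list(poly)
    for i in rng.sample(idx, round(rate * len(idx))):
        imputed[i] = top_score
    return imputed

# Two tiny related forms: on each, one of the top-2 MC scorers earned the top
# open-ended score of 3, so the mean rate is 0.5.
related = [
    ([3, 2, 1, 0, 1, 2, 0, 1], [42, 40, 30, 20, 10, 9, 8, 7]),
    ([2, 3, 1, 0, 0, 1, 2, 0], [35, 33, 28, 22, 11, 5, 4, 2]),
]
null_form = ([2, 1, 2, 0, 1, 0, 2, 1], [40, 38, 30, 25, 12, 6, 3, 1])  # no 3s observed
post = method3_impute(null_form, related, top_score=3, rng=random.Random(1))
print(post.count(3))   # → 1
```

The tie-handling at the 75th-percentile cutoff is a simplification here; any reasonable percentile-rank rule would serve in practice.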
Data Analysis
Within each replication, the original item responses on all test forms were imputed using the
three imputation methods. This resulted in (3 methods × 27 forms × 50 replications =) 4,050 post-
imputed datasets with item responses. Each post-imputed dataset is calibrated independently with
WINSTEPS (Linacre, 2003). WINSTEPS calibrates items using the Partial Credit model (Masters,
1982), which simplifies to the Rasch model (Rasch, 1980) for the dichotomous multiple-choice
items.
The performances of the three imputation methods were then evaluated and compared on
how well they recovered the true item parameter values. Two well-known and preferred goodness-
of-recovery (GOR) measures (Maris, 1999) were used to analyze parameter recovery. The first
GOR measure was the BIAS. The BIAS is the average difference between the estimated parameter
value and the true parameter value. That is,

$$\mathrm{BIAS}(\beta_j) = \frac{1}{50}\sum_{r=1}^{50}\left(b_{jr} - \beta_j\right)$$

where $\beta_j$ is the true value of parameter $j$, and $b_{jr}$ is the estimate of parameter $j$ in the $r$th replicated
dataset ($r = 1 \ldots 50$).
The second GOR measure was the root mean square deviation (RMSD). The RMSD is
defined as the square root of the average squared differences between the estimated and true
parameter values. That is,
$$\mathrm{RMSD}(\beta_j) = \sqrt{\frac{1}{50}\sum_{r=1}^{50}\left(b_{jr} - \beta_j\right)^2}$$

where $\beta_j$ is the true value of parameter $j$, and $b_{jr}$ is the estimate of parameter $j$ in the $r$th replicated
dataset ($r = 1 \ldots 50$). Lower mean BIAS and RMSD measures are considered more accurate in terms
of item parameter recovery.
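Both measures are straightforward to compute; a small sketch with toy replication estimates:

```python
import math

def bias(estimates, true_value):
    """Mean signed difference between replicated estimates and the true value."""
    return sum(b - true_value for b in estimates) / len(estimates)

def rmsd(estimates, true_value):
    """Root mean squared deviation of replicated estimates from the true value."""
    return math.sqrt(sum((b - true_value) ** 2 for b in estimates) / len(estimates))

# Four toy replication estimates of a step value whose true value is 1.0
# (the study used 50 replications per form).
ests = [0.8, 1.1, 0.9, 1.2]
print(round(bias(ests, 1.0), 6), round(rmsd(ests, 1.0), 4))  # → 0.0 0.1581
```

BIAS keeps the sign of the errors, so systematic over- or underestimation shows up even when the RMSD is modest; RMSD penalizes variability as well as systematic error.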
For each method, the BIAS and RMSD were computed for every item on each test form
within each replication. The results were aggregated over all 42 multiple-choice items on each test form.
The GOR measures were then averaged across the 27 test forms so that the mean BIAS and RMSD
of the multiple-choice items could be reported and compared for the three imputation methods,

$$\mathrm{BIAS}_{MC} = \frac{\displaystyle\sum_{k=1}^{27}\sum_{i=1}^{42}\mathrm{BIAS}(\beta_{ki})}{27 \times 42} \quad\text{and}\quad \mathrm{RMSD}_{MC} = \frac{\displaystyle\sum_{k=1}^{27}\sum_{i=1}^{42}\mathrm{RMSD}(\beta_{ki})}{27 \times 42}$$

where $\beta_{ki}$ is the true Rasch difficulty value of the $i$th multiple-choice item on test form $k$ ($i = 1 \ldots 42$
and $k = 1 \ldots 27$).
In addition, we separately computed the BIAS and RMSD of the individual step
difficulty values for the three open-ended items and the essay item on every test form. These GOR
measures were then averaged across all test forms so that the mean BIAS and RMSD could be
reported and compared for the three imputation methods,
$$\mathrm{BIAS}_{OE(m(j))} = \frac{\displaystyle\sum_{k=1}^{27}\mathrm{BIAS}(\beta_{m(j)k})}{27} \quad\text{and}\quad \mathrm{RMSD}_{OE(m(j))} = \frac{\displaystyle\sum_{k=1}^{27}\mathrm{RMSD}(\beta_{m(j)k})}{27}$$

where $\beta_{m(j)k}$ is the true $j$th step difficulty value for the $m$th polytomous item (open-ended or essay) on
test form $k$ ($k = 1 \ldots 27$).
Results
Imputation Frequencies
Table 2 gives the number of replications for which each test form required imputation on
the four polytomous items. It also gives the mean number (frequency) of examinee scores on each
test form that were imputed under the three imputation methods for the four polytomous items.
Table 2: Frequency of Imputation for the Three Imputation Methods (listed by test form)

             Open-Ended #1            Open-Ended #2            Open-Ended #3                Essay
Form   Reps   M1   M2   M3      Reps   M1   M2   M3      Reps   M1   M2   M3      Reps   M1    M2   M3
  1      -     -    -    -       18    17  1.7  1.9        9     9  1.9  1.0       16   134   1.5  1.1
  2      -     -    -    -       16    17  1.4  2.0       12     9  1.3  1.0        4   134   1.3  1.0
  3      3    11  1.3  5.3       21    17  1.6  2.0       36     9  1.6  1.0       19   134   1.5  1.1
  4      1    11  2.0  4.0        2    17  1.0  2.0       21     9  1.3  1.0       15   134   1.4  1.1
  5      -     -    -    -        3    17  1.3  2.0       13     9  1.3  1.0        4   134   1.3  1.0
  6      1    11  1.0  6.0        3    17  2.3  2.0       20     9  1.3  1.0       28   134   1.3  1.1
  7     39    11  1.6  5.3        4    17  1.0  2.3       21     9  1.7  1.0       27   134   1.4  1.1
  8      4    11  1.0  4.8        6    17  1.7  1.7       19     9  2.0  1.0       27   134   1.7  1.1
  9      -     -    -    -        2    17  1.5  2.0        5     9  1.2  1.0       33   134   1.6  1.1
 10     31    11  1.5  5.1       19    17  1.7  1.8       22     9  1.5  1.0        4   134   1.3  1.0
 11      -     -    -    -        9    17  1.7  1.9        9     9  2.1  1.0       20   134   2.0  1.1
 12      6    11  1.7  5.0       19    17  1.3  1.9       27     9  1.5  1.0       23   134   1.3  1.1
 13      6    11  1.8  4.8        -     -    -    -       10     9  1.3  1.0       20   134   1.5  1.0
 14      -     -    -    -        -     -    -    -       30     9  1.7  1.0       29   134   1.4  1.1
 15      9    11  2.1  4.9        -     -    -    -       33     9  1.7  1.0       32   134   1.8  1.0
 16      -     -    -    -       19    17  1.4  1.9        3     9  2.7  1.0       27   134   1.6  1.0
 17     13    11  1.5  4.9       16    17  1.6  1.9       34     9  1.5  1.0       30   134   1.7  1.1
 18     19    11  1.4  5.1       26    17  1.6  2.0       18     9  1.9  1.0       47   134   1.7  1.0
 19      -     -    -    -       26    17  1.7  2.0       35     9  1.7  1.0       11   134   1.9  1.0
 20      -     -    -    -       27    17  1.4  2.0       44     9  1.5  1.0       24   134   1.5  1.1
 21     11    11  1.4  5.2       33    17  1.4  1.9       38     9  1.5  1.0        8   134   1.4  1.1
 22     13    11  1.9  4.9        5    17  2.2  2.0       40     9  1.7  1.0        3   134   1.0  1.0
 23     12    11  1.7  5.4        3    17  1.3  2.0       47     9  1.6  1.0       16   134   1.5  1.0
 24      -     -    -    -        2    17  3.0  2.0       46     9  1.5  1.0       41   134   1.5  1.1
 25      -     -    -    -        7    17  1.6  2.1       29     9  1.9  1.0       19   134   2.3  1.1
 26      -     -    -    -        1    17  1.0  2.0       25     9  1.6  1.0        3   134   2.3  1.0
 27      -     -    -    -        5    17  1.6  2.0       48     9  1.9  1.0       19   134   1.8  1.1

a. Reps is the number of replications for this test form that required imputation for the item
b. M1 is "Method 1": the imputation method using fixed percentages and random sampling
c. M2 is "Method 2": the imputation method using information from the current test form only
d. M3 is "Method 3": the imputation method using information from related test forms
Figure 1 compares the mean number of examinees imputed for each polytomous item under
the three imputation methods, aggregated over the 27 test forms.
Figure 1: Mean Frequency of Imputation for the Three Imputation Methods (across all test forms)

[Bar chart. Mean examinees imputed per form for Open-Ended #1, #2, #3, and Essay, respectively: Method 1: 11.0, 17.0, 9.0, and 134.0; Method 2: 1.6 for all four items; Method 3: 5.1, 2.0, 1.0, and 1.1.]
Table 2 and Figure 1 both clearly indicate that Method 1 imputed scores for the most
examinees. This was particularly apparent for the essay item, where 134 examinees on each test
form were randomly sampled and imputed with an essay score of 4. The high frequency was a result
of the higher percentage of examinees that historically scored 4s on the essay item for the
operational ELA tests. Note that because the frequency of imputation for Method 1 was based on
historical percentages and the number of simulated examinees was the same (3,000) across test
forms, the same number of examinees was always imputed for a particular polytomous item,
regardless of test form.
Method 2 and Method 3 generally imputed scores for a similar number of examinees. Both
methods tended to impute scores for between 1 and 3 examinees, with one exception being the first
open-ended item under Method 3. In that case, a higher number of examinees (4 to 6) were
imputed. This is due to the fact that a relatively smaller number of test forms required imputing for
the first open-ended item. The average percentage of examinees that scored 3s for this item was
therefore considerably higher across all test forms and as a consequence, Method 3 imputed more
scores. In comparison, under Method 2, the imputation target and frequency were chosen based
solely on the examinees’ performances on the multiple-choice section on the current test form. Thus,
the frequency of imputation was quite stable across test forms (as seen in Table 2) and across the
four polytomous items (as seen in Figure 1).
An additional observation can be made from Table 2 about the general characteristics of the
polytomous items. Based on the number of times each item required imputing across test forms and
replications (i.e. the Reps column for each item), we can infer that the essay item was generally the
most difficult item (hence requiring the most imputing), while the first open-ended item tended to
be the easiest. This observation was consistent with the true item parameters used to generate the
item responses as the essay item on each test form tended to have the highest average step difficulty
value while the first open-ended item usually had the lowest average step value.
GOR for Multiple Choice Items
Table 3 lists, by test form and imputation method, the mean BIAS and RMSD measures
observed for the 42 multiple-choice items across 50 replications. It also aggregates the mean BIAS
and RMSD across the 27 test forms to give an overall mean BIAS and RMSD for each method (i.e.
$\mathrm{BIAS}_{MC}$ and $\mathrm{RMSD}_{MC}$).
Table 3: Mean BIAS and RMSD of the Multiple-Choice Items for the Three Imputation Methods (listed by form)

             BIAS_MC                  RMSD_MC
Form    M1     M2     M3         M1     M2     M3
  1    0.02   0.00   0.00       0.05   0.05   0.05
  2    0.01   0.00   0.00       0.05   0.05   0.05
  3    0.04   0.00   0.00       0.07   0.05   0.05
  4    0.01  -0.01  -0.01       0.05   0.05   0.05
  5    0.00  -0.01  -0.01       0.05   0.05   0.05
  6    0.02  -0.01  -0.01       0.06   0.05   0.05
  7    0.05   0.00   0.02       0.07   0.05   0.05
  8    0.02  -0.01  -0.01       0.06   0.05   0.05
  9    0.02  -0.01  -0.01       0.06   0.05   0.05
 10    0.03   0.00   0.01       0.06   0.05   0.05
 11    0.02  -0.01  -0.01       0.05   0.05   0.05
 12    0.03   0.00   0.00       0.07   0.05   0.05
 13    0.02   0.00   0.00       0.06   0.05   0.05
 14    0.03   0.00   0.00       0.06   0.05   0.05
 15    0.04   0.00   0.00       0.06   0.05   0.05
 16    0.02   0.00   0.00       0.06   0.05   0.05
 17    0.06   0.01   0.01       0.08   0.05   0.05
 18    0.07   0.00   0.01       0.08   0.05   0.05
 19    0.04   0.01   0.01       0.07   0.05   0.05
 20    0.07   0.02   0.02       0.09   0.06   0.06
 21    0.04   0.00   0.01       0.07   0.06   0.06
 22    0.02   0.00   0.00       0.06   0.05   0.05
 23    0.05   0.01   0.01       0.07   0.05   0.05
 24    0.06   0.01   0.02       0.08   0.05   0.05
 25    0.03   0.00   0.00       0.06   0.05   0.05
 26    0.01   0.00   0.00       0.05   0.05   0.05
 27    0.04   0.01   0.01       0.07   0.05   0.05
Mean   0.03   0.00   0.00       0.06   0.05   0.05
Table 3 shows that the three imputation methods performed similarly in recovering the true
parameter values for the multiple-choice items. At the test form level, the mean BIAS values for the
multiple-choice items ranged from about -0.01 to 0.07 for the three methods, and the overall mean
BIAS ($\mathrm{BIAS}_{MC}$) values were all close to zero. The same observation can be made about the RMSD
measures, as the three methods had mean RMSD values ranging from around 0.05 to 0.09 for the multiple-
choice items, and the overall mean RMSD ($\mathrm{RMSD}_{MC}$) values were all approximately 0.05. Thus, it appears
that the method with which we chose to impute scores for the polytomous items on a test did not
have any notable effects on the estimation of the dichotomous item parameters.
GOR for Open-Ended Items
Tables 4 and 5 summarize the aggregated mean BIAS and RMSD values ($\mathrm{BIAS}_{OE(m(j))}$ and
$\mathrm{RMSD}_{OE(m(j))}$) obtained for the four polytomous items.1 The mean GOR measures are given for
each step difficulty value (b1…b3 for the open-ended items and b1…b4 for the essay item) as well as
for the average step value (b-bar).
Table 4: Mean BIAS for Step Difficulty Values of Polytomous Items (aggregated across 27 forms)

                         Method 1                        Method 2                        Method 3
Item            b1    b2    b3    b4   b-bar    b1    b2    b3    b4   b-bar    b1    b2    b3    b4   b-bar
Open-Ended #1  0.01  0.08 -0.14   -   -0.01   -0.02  0.08  0.26   -    0.10   -0.02  0.08  0.03   -    0.03
Open-Ended #2  0.02  0.07 -0.68   -   -0.20   -0.01  0.08  0.06   -    0.04   -0.01  0.08 -0.11   -   -0.01
Open-Ended #3  0.03  0.10 -1.96   -   -0.61    0.01  0.14 -0.72   -   -0.19    0.01  0.13 -0.68   -   -0.18
Essay         -0.02  0.02  0.04 -2.34 -0.58   -0.04  0.05  0.15 -0.10  0.02   -0.03  0.05  0.15 -0.10  0.02
Table 5: Mean RMSD for Step Difficulty Values of Polytomous Items (aggregated across 27 forms)

                         Method 1                        Method 2                        Method 3
Item            b1    b2    b3    b4   b-bar    b1    b2    b3    b4   b-bar    b1    b2    b3    b4   b-bar
Open-Ended #1  0.06  0.13  0.80   -    0.27    0.05  0.13  0.63   -    0.21    0.05  0.13  0.68   -    0.23
Open-Ended #2  0.05  0.16  1.34   -    0.43    0.05  0.16  0.66   -    0.21    0.05  0.16  0.65   -    0.21
Open-Ended #3  0.07  0.23  2.32   -    0.74    0.06  0.25  1.13   -    0.34    0.06  0.24  1.00   -    0.30
Essay          0.05  0.08  0.22  3.37  0.87    0.06  0.08  0.23  0.68  0.16    0.06  0.08  0.23  0.59  0.14
In examining Tables 4 and 5, we see that the three imputation methods performed
equivalently well in recovering the lower step value parameters. That is, the mean BIAS and RMSD
values were similar and low for b1 and b2 of the three open-ended items and b1 to b3 for the essay
item. The mean BIAS values for these parameters ranged from -0.04 to 0.15 and the mean RMSD
measures ranged from 0.05 to 0.25. Thus, the method used to impute scores in the extreme null
1 For those interested in the BIAS and RMSD measures obtained for the polytomous items on each test form, please refer to the tables in the Appendix (Appendix Tables 1 to 8).
category did not appear to have an effect on the parameter estimation of the non-extreme, non-null
categories.
The contrast in the three imputation methods, however, could be seen in the estimation of
the highest step value for each of the four polytomous items. Figures 2 and 3 visually compare the
three methods’ mean BIAS and RMSD for the highest step values of the four polytomous items.
Figure 2: Comparison of Mean BIAS for the highest step value of each polytomous item (aggregated across 27 forms)

[Bar chart of absolute mean $\mathrm{BIAS}_{OE}$ for Open-Ended #1 (b3), Open-Ended #2 (b3), Open-Ended #3 (b3), and Essay (b4), by imputation method; vertical axis 0.00 to 3.50.]
Figure 3: Comparison of Mean RMSD for the highest step value of each polytomous item (aggregated across 27 forms)

[Bar chart of mean $\mathrm{RMSD}_{OE}$ for Open-Ended #1 (b3), Open-Ended #2 (b3), Open-Ended #3 (b3), and Essay (b4), by imputation method; vertical axis 0.00 to 3.50.]
It is clear from both figures that Method 1 performed poorly, especially in estimating b3 of
open-ended item #3 and b4 of the essay item. The absolute mean BIAS values for these two
parameters are around 2.00 or greater and the mean RMSD values are also greater than 2.00. This
was in contrast to the relatively low mean GOR measures for Methods 2 and 3. Also, the direction of
the mean BIAS values for Method 1 was negative for all four highest step values, meaning that the
method underestimated these extreme step values. This makes sense given that Method 1 imputed
far more scores than the other two methods, especially for the essay item, where 134 scores were
imputed for each form requiring imputation (see Table 2). Methods 2 and 3, on the other hand,
appear to have imputed a more reasonable number of examinees. Both of these methods produced
low absolute mean BIAS and mean RMSD values in recovering the highest step values, with one exception
being the estimation of b3 for open-ended item #3. Even though the GOR measures in that case
were substantially lower than those for Method 1, the mean BIAS was still around -0.70, while the mean
RMSD was approximately 1.00.
In addition to comparing the GOR measures of the individual step difficulty values, we also
considered how well the imputation methods estimated the average step difficulty values (b-bar) of
the four polytomous items. The average step difficulty value is simply the mean of a polytomous item’s
step values. It is often used as an aggregate indicator of a polytomous item’s overall difficulty so
that it can be more directly compared to other polytomous items as well as dichotomous items.
The GOR measures for the b-bar values were already given in Table 4. Figures 4 and 5 visually
compare the three methods’ absolute mean BIAS and mean RMSD in recovering the average step
difficulty value of each polytomous item.
Figure 4: Comparison of Absolute Mean BIAS for the average step value (b-bar) of polytomous items (aggregated across 27 forms)

[Bar chart of absolute mean BIAS for the average step value of each polytomous item, by imputation method; vertical axis 0.00 to 3.50.]
Figure 5: Comparison of Mean RMSD for the average step value (b-bar) of polytomous items (aggregated across 27 forms)

[Bar chart of mean RMSD for the average step value of each polytomous item, by imputation method; vertical axis 0.00 to 3.50.]
These two figures reflect the effect of Method 1's poor estimation of the highest step values on
the items' overall perceived difficulty. This is especially apparent for open-ended item #3 and the
essay item. From Table 4, we see that, as with the estimation of the highest step value, the direction
of the BIAS for the average step difficulty was negative for Method 1. This implies that the underestimation of
the highest step values in Method 1 led to an underestimation of the average step difficulty value,
making the imputed items appear easier than they actually are. In contrast, because Methods 2 and 3
performed well in recovering the individual step values, they also estimated the average step
difficulty fairly accurately. Figures 4 and 5 show that the absolute mean BIAS and mean RMSD
were equally low for the two methods, meaning that both produced good estimates of the
overall difficulty of the imputed items.
Discussion
In summary, our simulation found that all three methods performed equally well in
recovering the parameters of the multiple-choice items and the lower step values of the polytomous
items. Methods 2 and 3 were also fairly accurate in their estimation of the highest step values of the
polytomous items. Method 1, however, performed poorly in its recovery of such extreme step values.
This led to underestimation of the true overall difficulties of imputed items, making these items
appear easier than they actually are.
One may attribute the poor performance of Method 1 to the disproportionately high
number of examinees per form that were chosen for imputation. This was especially apparent for
the essay item, where score points of 4 were assigned to 134 examinees on each form that required imputing. While that does appear to be part of the reason, an additional factor should also be
considered. Looking back at the imputation frequencies in Table 2, we see that Method 1 did indeed
impute substantially higher numbers of examinee scores than the other two methods for all four
polytomous items. However, Figures 2 and 3 show that it performed as well as the other two
methods in estimating b3 for open-ended item #1. Also, of the three open-ended items, Method 1
actually imputed the fewest scores (9) per form for open-ended item #3 (compared to 11 and 17 for
open-ended item #1 and #2 respectively). Yet the estimation of b3 for open-ended item #3 was not
as good as the other two open-ended items. If the number of examinees imputed per form were the
only reason for the poorer performance of Method 1, then we would have expected the estimation to
be about equally poor for all three open-ended items. These trends, however, suggest that when
more test forms require imputing for a particular item (such as open-ended #3 or the essay item),
then the fact that Method 1 imputes a disproportionately high number of examinees per form leads
to its considerably larger BIAS and RMSD in estimating the highest step values. Thus, a more
plausible explanation for Method 1’s poor performance is the overall number of examinees across
test forms that require imputation.
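This explanation comes down to simple arithmetic: overall imputation demand is the per-form imputation count multiplied by the number of forms that need imputing. In the sketch below, the per-form counts (11, 17, 9, 134) come from the study, but the counts of affected forms are hypothetical placeholders for illustration.

```python
# Scores imputed per form under Method 1 (per-form counts reported in the study)
per_form = {"Open-Ended #1": 11, "Open-Ended #2": 17,
            "Open-Ended #3": 9, "Essay": 134}

# Number of the 27 forms requiring imputation -- hypothetical values
forms_needing_imputation = {"Open-Ended #1": 6, "Open-Ended #2": 12,
                            "Open-Ended #3": 20, "Essay": 22}

# Overall imputation demand across the whole administration
demand = {item: per_form[item] * forms_needing_imputation[item] for item in per_form}
for item, total in demand.items():
    print(f"{item}: {total} imputed scores in total")
```

Even though Open-Ended #3 has the smallest per-form count, a larger number of affected forms can give it a higher overall demand than Open-Ended #1, which matches the pattern described above.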
The implication of this finding for practitioners is that if a large-scale field-test assessment has
only a few test forms that require imputation and the number of scores that need to be imputed on
each form is reasonably low (e.g., fewer than 20), then the method of imputation does not seem to have a substantive effect on parameter estimation. In such cases, randomly imputing a fixed percentage of examinees based on historical performance is as good a method as other, more sophisticated imputation methods.
On the other hand, if a considerable number of test forms have polytomous items with
extreme null categories, then the method of imputation does make a difference. In our simulation,
Methods 2 and 3 both performed noticeably better than Method 1 in recovering the highest step
values of imputed items. This implies that the use of more information in deciding on the frequency
and target of imputation did lead to more accurate parameter estimation for such items. However,
the fact that Methods 2 and 3 performed equally well suggests that the use of information from other
test forms is not necessary, and it is reasonable to base imputation decisions solely on item
responses from the same test form. Imputation methods based solely on information from the
current test form are less cumbersome to implement and provide the same high degree of accuracy
in item calibration. Thus, methods such as Method 2 are recommended based on the results of our
study.
Study Limitations
The current study represents an initial examination of imputation methods applied to a scenario frequently encountered in practice but, to date, scarcely explored in research. It sheds some light on what information should be considered in deciding how many and which examinees to impute when scores must be imputed for extreme null categories. Several issues, however, still require further research.
One issue is to determine the effect that the imputation target and imputation frequency
each have on the accuracy of parameter estimation. In the current study, it is difficult to distinguish
how much of the poor performance of Method 1 is due to the large number of examinees imputed
and how much is due to the characteristics of the examinees chosen. Similarly, one wonders
whether the accuracy in parameter estimation for Methods 2 and 3 is because of the small number
of examinees imputed, the more proficient examinees that were chosen for imputation, or a
combination of the two factors. For example, would an imputation method that simply selects one
of the more proficient examinees for imputation (fixed frequency with selective target) do just as
well as Methods 2 and 3? Or, would a method that bases its imputation frequency on test form
information, but always randomly samples from the entire group of examinees (variable frequency
with non-selective target) be equally accurate in parameter estimation? Comparing implementations
of such methods with the ones in this study could further our understanding of the differential
effects of imputation target and frequency.
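The two hypothetical designs proposed above could be prototyped along these lines. Here `theta_hat` stands for any proficiency proxy (e.g., score on the rest of the form); the fixed count of one imputed examinee and the data layout are assumptions of this sketch, not the study's Methods 1-3.

```python
import random

def impute_fixed_selective(examinees, top_score, k=1):
    """Fixed frequency, selective target: assign the extreme null category
    to the k most proficient examinees on the form."""
    chosen = sorted(examinees, key=lambda e: e["theta_hat"], reverse=True)[:k]
    for e in chosen:
        e["item_score"] = top_score
    return chosen

def impute_variable_nonselective(examinees, top_score, n_to_impute):
    """Variable frequency, non-selective target: impute a form-determined
    number of scores by random sampling from the entire group."""
    chosen = random.sample(examinees, n_to_impute)
    for e in chosen:
        e["item_score"] = top_score
    return chosen

# Toy form: five examinees; the item's top category (3) is null
form = [{"id": i, "theta_hat": t, "item_score": s}
        for i, (t, s) in enumerate([(0.8, 2), (-0.2, 1), (1.5, 2), (-1.0, 0), (0.3, 2)])]

impute_fixed_selective(form, top_score=3)
print([e["id"] for e in form if e["item_score"] == 3])  # the single most proficient examinee
```

Running both variants on simulated forms and comparing their BIAS and RMSD against the methods in this study would separate the contribution of imputation frequency from that of the imputation target.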
Another future direction is to explore the relationship between the characteristics of
polytomous items and imputation effectiveness. In the current study, Methods 2 and 3 performed
well in estimating every item parameter except for the highest step value (b3) of open-ended item #3.
No explanation could be found for this apparent aberration based on the design of the current
study. For example, Method 2 imputed, on each test form, the exact same set of examinees for all
four polytomous items. The number of test forms that require imputation is also similar for open-
ended item #3 and the essay item. Thus, it is somewhat notable that this particular parameter was
not estimated well across the forms and replications. One plausible explanation lies in the characteristics of the items themselves, specifically those designated as the third open-ended item.
However, given that the true item parameters for this study are based on a large real dataset, it is
difficult to identify what distinguishes such an item from the other polytomous items. A study that
systematically compares item characteristics, such as the degree of separation of the step difficulty
values, may be able to shed some light on this unresolved phenomenon.
Lastly, the examinee abilities (θ) in this study were generated from a normal distribution with a mean of -0.5 to emulate the field-test effect. Given that there is a lower bound (0) on the score an examinee can achieve on a test, and given the low motivation typically observed in field-test administrations, it might be reasonable to consider other distributions, such as one that is negatively skewed, as well as varying degrees of field-test effect. It would be interesting to examine the effect that the distribution of θ and the size of the field-test effect have on the item response distribution and, consequently, on the effectiveness of different imputation methods.
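As one concrete direction, such a follow-up could draw abilities both from the N(-0.5, 1) distribution used here and from a negatively skewed alternative. Reflecting an Exponential(1) draw is one simple way to induce negative skew while holding the mean fixed; that construction is an assumption of this sketch, not part of the study design.

```python
import random

random.seed(42)
N = 10_000

# Field-test effect as modeled in this study: theta ~ N(-0.5, 1)
theta_normal = [random.gauss(-0.5, 1.0) for _ in range(N)]

# Negatively skewed alternative with the same mean (-0.5): reflect an
# Exponential(1) draw so the long tail points toward low ability, shift by +0.5
theta_skewed = [0.5 - random.expovariate(1.0) for _ in range(N)]

mean_normal = sum(theta_normal) / N
mean_skewed = sum(theta_skewed) / N
print(round(mean_normal, 2), round(mean_skewed, 2))  # both close to -0.5
```

Holding the mean fixed isolates the effect of distribution shape on the item response distribution; shifting the mean instead would vary the size of the field-test effect.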
Educational Implications
The findings of this and any further studies should help inform practitioners on what
information is important in deciding on the frequency and target for imputation. These decisions
have practical implications for the field-test calibration of high-stakes statewide assessments. Field-test item statistics are often used in the scoring of retest administrations. In these cases, the test-taking population is generally smaller, so no post-equating of the retest forms is conducted. Because retest scores are high stakes for examinees, motivation is generally not an issue in these administrations. Consequently, the full range of score points for polytomous items is usually observed on the retests. However, if the step value for an item’s highest score category was not estimated during field testing, then that item cannot be used to estimate the abilities of students in the highest score category. Knowing how to handle null categories in such polytomous items is essential to ensuring the integrity and defensibility of the testing program.
Appendix
Appendix Table 1: BIAS for Step Difficulty Values of Open-Ended Item #1 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.00 0.04 0.14 0.06 | -0.03 0.03 0.20 0.07 | -0.03 0.03 0.18 0.06
2 | -0.03 0.06 0.14 0.06 | -0.04 0.05 0.17 0.06 | -0.04 0.05 0.16 0.06
3 | 0.03 0.10 0.16 0.10 | -0.02 0.09 0.46 0.18 | -0.01 0.09 0.28 0.12
4 | -0.01 0.06 0.22 0.09 | -0.03 0.06 0.32 0.12 | -0.03 0.06 0.28 0.10
5 | -0.05 0.05 0.17 0.06 | -0.06 0.04 0.18 0.05 | -0.06 0.04 0.17 0.05
6 | 0.02 0.08 0.24 0.12 | -0.02 0.08 0.39 0.15 | -0.01 0.08 0.32 0.13
7 | 0.01 0.05 -2.79 -0.91 | -0.05 0.06 -0.80 -0.26 | -0.03 0.06 -2.09 -0.69
8 | -0.01 0.06 0.09 0.05 | -0.04 0.06 0.42 0.15 | -0.04 0.06 0.24 0.08
9 | 0.02 0.07 0.19 0.09 | -0.02 0.07 0.28 0.11 | -0.02 0.07 0.27 0.11
10 | 0.02 0.14 -1.87 -0.57 | 0.00 0.15 -0.35 -0.07 | 0.01 0.14 -1.33 -0.40
11 | 0.00 0.07 0.23 0.10 | -0.03 0.07 0.29 0.11 | -0.03 0.07 0.28 0.11
12 | 0.03 0.09 0.13 0.09 | -0.01 0.08 0.52 0.20 | -0.01 0.09 0.31 0.13
13 | 0.00 0.06 0.02 0.03 | -0.02 0.06 0.36 0.13 | -0.02 0.06 0.19 0.08
14 | 0.00 0.09 0.15 0.08 | -0.04 0.08 0.26 0.10 | -0.04 0.08 0.23 0.09
15 | 0.04 0.07 0.04 0.05 | -0.01 0.08 0.56 0.21 | 0.00 0.08 0.30 0.12
16 | -0.01 0.09 0.24 0.11 | -0.04 0.09 0.34 0.13 | -0.04 0.09 0.31 0.12
17 | 0.04 0.06 -0.58 -0.16 | -0.01 0.05 0.14 0.06 | -0.01 0.05 -0.26 -0.07
18 | 0.05 0.12 -0.98 -0.27 | -0.02 0.13 0.34 0.15 | -0.01 0.13 -0.47 -0.12
19 | -0.01 0.10 0.23 0.11 | -0.05 0.07 0.26 0.10 | -0.04 0.07 0.25 0.09
20 | 0.06 0.13 0.25 0.15 | 0.01 0.12 0.40 0.18 | 0.01 0.12 0.34 0.16
21 | 0.04 0.10 -0.37 -0.08 | 0.00 0.09 0.29 0.12 | 0.00 0.09 -0.14 -0.02
22 | 0.01 0.11 -0.31 -0.06 | -0.02 0.11 0.34 0.15 | -0.01 0.11 -0.05 0.01
23 | 0.04 0.08 -0.56 -0.15 | 0.00 0.08 0.11 0.06 | 0.00 0.07 -0.32 -0.08
24 | 0.07 0.12 0.26 0.15 | 0.00 0.10 0.37 0.16 | 0.01 0.10 0.34 0.15
25 | 0.01 0.05 0.27 0.11 | -0.02 0.05 0.37 0.13 | -0.02 0.05 0.33 0.12
26 | -0.01 0.10 0.33 0.14 | -0.02 0.10 0.38 0.15 | -0.02 0.09 0.35 0.14
27 | 0.01 0.08 0.20 0.10 | -0.03 0.07 0.31 0.12 | -0.03 0.07 0.26 0.10
Mean | 0.01 0.08 -0.14 -0.01 | -0.02 0.08 0.26 0.10 | -0.02 0.08 0.03 0.03
Appendix Table 2: RMSD for Step Difficulty Values of Open-Ended Item #1 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.04 0.08 0.37 0.13 | 0.05 0.07 0.42 0.14 | 0.05 0.07 0.40 0.13
2 | 0.05 0.11 0.47 0.16 | 0.06 0.11 0.48 0.16 | 0.06 0.11 0.48 0.16
3 | 0.05 0.14 0.66 0.23 | 0.04 0.14 0.77 0.27 | 0.04 0.14 0.63 0.22
4 | 0.06 0.16 0.58 0.19 | 0.06 0.16 0.63 0.21 | 0.06 0.16 0.59 0.20
5 | 0.07 0.10 0.41 0.14 | 0.08 0.10 0.42 0.14 | 0.08 0.10 0.42 0.14
6 | 0.05 0.11 0.58 0.20 | 0.04 0.11 0.70 0.24 | 0.04 0.11 0.62 0.21
7 | 0.05 0.12 2.95 0.97 | 0.06 0.12 0.99 0.32 | 0.05 0.13 2.20 0.72
8 | 0.06 0.11 0.65 0.22 | 0.07 0.11 0.71 0.24 | 0.07 0.11 0.57 0.19
9 | 0.05 0.12 0.43 0.16 | 0.05 0.12 0.49 0.17 | 0.05 0.12 0.48 0.17
10 | 0.05 0.22 2.20 0.68 | 0.05 0.23 0.63 0.17 | 0.05 0.23 1.55 0.47
11 | 0.05 0.11 0.35 0.13 | 0.05 0.10 0.40 0.14 | 0.05 0.10 0.38 0.14
12 | 0.06 0.12 0.77 0.26 | 0.05 0.12 0.78 0.27 | 0.04 0.12 0.67 0.23
13 | 0.06 0.12 0.63 0.21 | 0.05 0.12 0.62 0.21 | 0.05 0.12 0.50 0.17
14 | 0.05 0.12 0.54 0.19 | 0.06 0.11 0.60 0.21 | 0.06 0.11 0.57 0.20
15 | 0.05 0.11 0.88 0.29 | 0.04 0.12 0.83 0.29 | 0.04 0.12 0.73 0.25
16 | 0.07 0.14 0.58 0.20 | 0.07 0.14 0.63 0.22 | 0.07 0.14 0.61 0.21
17 | 0.07 0.12 1.08 0.35 | 0.04 0.13 0.90 0.29 | 0.04 0.12 0.68 0.22
18 | 0.07 0.15 1.46 0.44 | 0.04 0.16 0.67 0.24 | 0.04 0.16 0.89 0.27
19 | 0.05 0.12 0.33 0.13 | 0.06 0.10 0.36 0.12 | 0.06 0.10 0.34 0.12
20 | 0.08 0.16 0.50 0.21 | 0.05 0.15 0.62 0.24 | 0.05 0.15 0.56 0.22
21 | 0.06 0.15 0.99 0.31 | 0.04 0.14 0.58 0.21 | 0.04 0.14 0.67 0.22
22 | 0.04 0.18 1.04 0.34 | 0.04 0.17 0.65 0.24 | 0.04 0.17 0.69 0.23
23 | 0.06 0.13 1.05 0.34 | 0.04 0.13 0.62 0.20 | 0.04 0.13 0.74 0.24
24 | 0.08 0.15 0.43 0.18 | 0.05 0.13 0.51 0.19 | 0.05 0.13 0.48 0.18
25 | 0.05 0.12 0.55 0.18 | 0.04 0.12 0.59 0.20 | 0.04 0.12 0.57 0.19
26 | 0.05 0.15 0.67 0.24 | 0.05 0.15 0.72 0.25 | 0.05 0.15 0.69 0.24
27 | 0.05 0.11 0.51 0.18 | 0.05 0.10 0.59 0.20 | 0.05 0.10 0.54 0.18
Mean | 0.06 0.13 0.80 0.27 | 0.05 0.13 0.63 0.21 | 0.05 0.13 0.68 0.23
Appendix Table 3: BIAS for Step Difficulty Values of Open-Ended Item #2 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.01 0.06 -1.74 -0.56 | -0.01 0.08 -0.67 -0.20 | -0.01 0.08 -0.84 -0.26
2 | 0.00 0.04 -1.44 -0.46 | -0.01 0.05 -0.52 -0.16 | 0.00 0.05 -0.70 -0.22
3 | 0.02 0.06 -1.75 -0.56 | -0.02 0.08 -0.36 -0.10 | -0.01 0.07 -0.70 -0.21
4 | -0.01 0.04 -0.07 -0.01 | -0.03 0.03 0.13 0.04 | -0.03 0.03 0.06 0.02
5 | -0.03 0.03 0.09 0.03 | -0.04 0.02 0.28 0.09 | -0.04 0.02 0.24 0.07
6 | 0.00 0.08 -0.14 -0.02 | -0.04 0.07 0.11 0.05 | -0.04 0.07 0.07 0.04
7 | 0.05 0.12 -0.07 0.03 | 0.01 0.13 0.46 0.20 | 0.02 0.13 0.22 0.12
8 | 0.02 0.02 -0.14 -0.03 | -0.01 0.05 0.31 0.12 | -0.01 0.04 0.26 0.10
9 | 0.01 0.09 0.07 0.06 | -0.02 0.11 0.33 0.14 | -0.02 0.11 0.27 0.12
10 | 0.01 0.07 -1.15 -0.36 | -0.02 0.06 -0.06 -0.01 | -0.01 0.06 -0.22 -0.06
11 | 0.00 0.03 -0.72 -0.23 | -0.02 0.04 -0.11 -0.03 | -0.02 0.03 -0.24 -0.07
12 | 0.01 0.05 -1.26 -0.40 | -0.03 0.06 -0.01 0.01 | -0.03 0.06 -0.30 -0.09
13 | 0.02 0.05 0.15 0.07 | -0.01 0.05 0.25 0.10 | 0.00 0.06 0.22 0.09
14 | 0.03 0.11 0.12 0.09 | -0.01 0.11 0.26 0.12 | -0.01 0.11 0.22 0.11
15 | 0.03 0.09 0.26 0.13 | -0.01 0.09 0.44 0.17 | -0.01 0.09 0.37 0.15
16 | 0.01 0.05 -1.26 -0.40 | -0.02 0.06 -0.11 -0.02 | -0.02 0.06 -0.32 -0.09
17 | 0.05 0.05 -1.18 -0.36 | 0.00 0.06 -0.15 -0.03 | 0.00 0.06 -0.32 -0.09
18 | 0.07 0.08 -1.78 -0.54 | 0.00 0.11 0.02 0.04 | 0.01 0.11 -0.40 -0.09
19 | 0.06 0.07 -1.94 -0.60 | 0.04 0.10 -0.35 -0.07 | 0.04 0.09 -0.69 -0.19
20 | 0.07 0.10 -2.13 -0.65 | 0.03 0.15 -0.34 -0.05 | 0.03 0.14 -0.78 -0.20
21 | 0.04 0.11 -2.34 -0.73 | 0.01 0.14 -0.33 -0.06 | 0.01 0.14 -0.75 -0.20
22 | 0.01 0.08 -0.20 -0.03 | -0.01 0.07 0.13 0.06 | -0.01 0.07 0.05 0.04
23 | 0.03 0.07 0.06 0.05 | -0.01 0.06 0.36 0.13 | -0.01 0.06 0.25 0.10
24 | 0.04 0.08 -0.02 0.03 | -0.01 0.07 0.28 0.11 | -0.01 0.07 0.20 0.09
25 | 0.02 0.07 -0.14 -0.02 | -0.01 0.08 0.37 0.15 | -0.01 0.08 0.24 0.10
26 | 0.00 0.05 0.20 0.08 | -0.01 0.05 0.32 0.12 | -0.01 0.05 0.27 0.10
27 | 0.04 0.10 0.06 0.07 | 0.00 0.11 0.52 0.21 | 0.00 0.10 0.39 0.16
Mean | 0.02 0.07 -0.68 -0.20 | -0.01 0.08 0.06 0.04 | -0.01 0.08 -0.11 -0.01
Appendix Table 4: RMSD for Step Difficulty Values of Open-Ended Item #2 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.04 0.19 2.13 0.70 | 0.04 0.20 0.89 0.28 | 0.04 0.20 0.97 0.30
2 | 0.04 0.17 1.94 0.64 | 0.04 0.17 0.69 0.22 | 0.04 0.17 0.82 0.26
3 | 0.05 0.17 2.24 0.73 | 0.04 0.17 0.67 0.21 | 0.04 0.17 0.84 0.26
4 | 0.04 0.12 0.71 0.22 | 0.05 0.12 0.66 0.21 | 0.05 0.12 0.58 0.18
5 | 0.05 0.12 0.83 0.26 | 0.06 0.13 0.73 0.23 | 0.06 0.13 0.69 0.22
6 | 0.04 0.14 0.71 0.24 | 0.05 0.14 0.54 0.19 | 0.05 0.14 0.52 0.18
7 | 0.07 0.21 0.81 0.27 | 0.05 0.21 0.87 0.31 | 0.05 0.22 0.62 0.23
8 | 0.05 0.14 0.87 0.29 | 0.04 0.15 0.69 0.23 | 0.04 0.15 0.65 0.22
9 | 0.05 0.19 0.74 0.24 | 0.05 0.20 0.72 0.25 | 0.05 0.20 0.66 0.22
10 | 0.04 0.15 1.76 0.57 | 0.05 0.15 0.53 0.16 | 0.04 0.15 0.46 0.14
11 | 0.04 0.10 1.29 0.42 | 0.05 0.10 0.58 0.18 | 0.05 0.10 0.54 0.17
12 | 0.05 0.13 1.83 0.59 | 0.04 0.13 0.51 0.17 | 0.04 0.13 0.52 0.17
13 | 0.06 0.12 0.60 0.19 | 0.05 0.12 0.61 0.20 | 0.05 0.12 0.60 0.20
14 | 0.06 0.16 0.63 0.22 | 0.05 0.16 0.65 0.23 | 0.05 0.16 0.64 0.22
15 | 0.06 0.15 0.63 0.22 | 0.05 0.15 0.75 0.26 | 0.05 0.15 0.69 0.24
16 | 0.04 0.15 1.92 0.61 | 0.04 0.16 0.49 0.14 | 0.04 0.16 0.58 0.17
17 | 0.07 0.14 1.72 0.55 | 0.05 0.15 0.88 0.28 | 0.05 0.15 0.54 0.16
18 | 0.08 0.16 2.22 0.70 | 0.05 0.17 0.53 0.18 | 0.04 0.17 0.56 0.16
19 | 0.08 0.19 2.43 0.78 | 0.06 0.20 0.59 0.17 | 0.06 0.20 0.79 0.23
20 | 0.09 0.23 2.56 0.81 | 0.06 0.25 0.59 0.17 | 0.06 0.25 0.89 0.24
21 | 0.06 0.24 2.71 0.87 | 0.06 0.25 0.53 0.15 | 0.06 0.25 0.85 0.24
22 | 0.05 0.13 0.79 0.25 | 0.04 0.12 0.57 0.19 | 0.04 0.12 0.49 0.16
23 | 0.05 0.12 0.74 0.24 | 0.05 0.12 0.70 0.23 | 0.04 0.12 0.62 0.20
24 | 0.06 0.12 0.66 0.22 | 0.04 0.12 0.60 0.22 | 0.04 0.12 0.56 0.20
25 | 0.06 0.19 0.99 0.32 | 0.05 0.20 0.74 0.25 | 0.05 0.20 0.62 0.21
26 | 0.05 0.14 0.69 0.23 | 0.05 0.14 0.66 0.22 | 0.05 0.14 0.63 0.21
27 | 0.06 0.18 0.90 0.30 | 0.04 0.18 0.74 0.27 | 0.04 0.18 0.64 0.23
Mean | 0.05 0.16 1.34 0.43 | 0.05 0.16 0.66 0.21 | 0.05 0.16 0.65 0.21
Appendix Table 5: BIAS for Step Difficulty Values of Open-Ended Item #3 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.02 0.09 -0.18 -0.02 | 0.00 0.10 0.29 0.13 | 0.00 0.10 0.32 0.14
2 | 0.01 0.08 -0.32 -0.08 | -0.01 0.08 0.23 0.10 | -0.01 0.08 0.25 0.11
3 | 0.04 0.11 -2.50 -0.78 | 0.01 0.17 -0.71 -0.18 | 0.02 0.16 -0.71 -0.18
4 | 0.00 0.05 -1.00 -0.32 | -0.02 0.06 -0.04 0.00 | -0.02 0.05 -0.01 0.01
5 | -0.03 0.07 -0.48 -0.15 | -0.03 0.07 0.09 0.04 | -0.03 0.07 0.12 0.05
6 | 0.02 0.03 -0.86 -0.27 | -0.01 0.05 0.15 0.06 | -0.01 0.04 0.14 0.06
7 | 0.05 0.09 -1.22 -0.36 | 0.00 0.10 -0.08 0.01 | 0.01 0.11 -0.16 -0.01
8 | 0.02 0.12 -0.88 -0.25 | -0.01 0.15 -0.10 0.01 | -0.01 0.15 0.10 0.08
9 | 0.04 0.08 0.01 0.04 | 0.01 0.11 0.39 0.17 | 0.01 0.10 0.36 0.16
10 | 0.02 0.10 -1.25 -0.38 | -0.01 0.10 -0.23 -0.05 | 0.00 0.10 -0.22 -0.04
11 | 0.02 0.04 -0.12 -0.02 | 0.00 0.06 0.37 0.14 | 0.00 0.05 0.36 0.14
12 | 0.02 0.09 -1.74 -0.54 | -0.01 0.12 -0.41 -0.10 | -0.01 0.11 -0.40 -0.10
13 | 0.02 0.10 -0.94 -0.27 | 0.00 0.12 -0.35 -0.08 | 0.00 0.12 -0.39 -0.09
14 | 0.04 0.08 -2.06 -0.64 | 0.01 0.13 -0.71 -0.19 | 0.01 0.12 -0.56 -0.14
15 | 0.03 0.07 -2.38 -0.76 | 0.00 0.16 -0.75 -0.20 | 0.01 0.14 -0.70 -0.18
16 | 0.02 0.12 0.15 0.10 | -0.01 0.13 0.37 0.16 | -0.01 0.13 0.37 0.16
17 | 0.05 0.14 -2.86 -0.89 | 0.01 0.19 -1.14 -0.31 | 0.01 0.19 -1.14 -0.31
18 | 0.06 0.10 -0.92 -0.25 | 0.00 0.13 0.20 0.11 | 0.01 0.13 0.12 0.09
19 | 0.04 0.09 -3.15 -1.01 | 0.02 0.12 -1.53 -0.46 | 0.02 0.11 -1.46 -0.44
20 | 0.08 0.21 -5.37 -1.69 | 0.06 0.32 -3.05 -0.89 | 0.06 0.29 -3.14 -0.93
21 | 0.05 0.12 -2.90 -0.91 | 0.02 0.17 -1.12 -0.31 | 0.03 0.15 -1.07 -0.30
22 | 0.02 0.16 -3.25 -1.02 | 0.01 0.22 -1.47 -0.41 | 0.01 0.20 -1.34 -0.37
23 | 0.04 0.15 -4.52 -1.44 | 0.02 0.23 -2.39 -0.71 | 0.02 0.20 -2.25 -0.67
24 | 0.08 0.07 -4.99 -1.61 | 0.06 0.23 -2.61 -0.77 | 0.06 0.19 -2.60 -0.79
25 | 0.02 0.07 -2.06 -0.66 | -0.01 0.09 -0.81 -0.24 | -0.01 0.08 -0.67 -0.20
26 | 0.03 0.09 -2.47 -0.79 | 0.02 0.11 -1.38 -0.41 | 0.02 0.11 -1.28 -0.38
27 | 0.05 0.23 -4.71 -1.48 | 0.02 0.30 -2.69 -0.79 | 0.02 0.28 -2.37 -0.69
Mean | 0.03 0.10 -1.96 -0.61 | 0.01 0.14 -0.72 -0.19 | 0.01 0.13 -0.68 -0.18
Appendix Table 6: RMSD for Step Difficulty Values of Open-Ended Item #3 (listed by form)

Form | Method 1: b1 b2 b3 b-bar | Method 2: b1 b2 b3 b-bar | Method 3: b1 b2 b3 b-bar
1 | 0.06 0.17 0.82 0.26 | 0.05 0.18 0.67 0.23 | 0.05 0.18 0.69 0.24
2 | 0.05 0.17 0.98 0.31 | 0.05 0.17 0.60 0.21 | 0.05 0.17 0.59 0.21
3 | 0.07 0.25 2.72 0.85 | 0.05 0.29 0.96 0.27 | 0.06 0.28 0.78 0.19
4 | 0.06 0.13 1.46 0.47 | 0.06 0.14 0.41 0.13 | 0.06 0.14 0.29 0.10
5 | 0.05 0.15 1.08 0.35 | 0.06 0.15 0.36 0.13 | 0.06 0.15 0.35 0.13
6 | 0.06 0.13 1.36 0.43 | 0.06 0.14 0.50 0.17 | 0.06 0.14 0.39 0.14
7 | 0.07 0.16 1.61 0.50 | 0.05 0.17 0.63 0.20 | 0.05 0.17 0.45 0.14
8 | 0.05 0.21 1.37 0.43 | 0.05 0.24 0.86 0.28 | 0.05 0.23 0.44 0.17
9 | 0.07 0.17 0.68 0.23 | 0.05 0.19 0.71 0.25 | 0.05 0.18 0.69 0.24
10 | 0.05 0.17 1.64 0.52 | 0.05 0.17 0.44 0.13 | 0.04 0.17 0.34 0.09
11 | 0.05 0.14 0.82 0.27 | 0.04 0.14 0.72 0.24 | 0.04 0.14 0.65 0.22
12 | 0.07 0.19 2.04 0.65 | 0.06 0.20 0.67 0.20 | 0.06 0.20 0.51 0.14
13 | 0.05 0.20 1.30 0.41 | 0.05 0.21 0.58 0.17 | 0.05 0.21 0.56 0.16
14 | 0.07 0.19 2.37 0.74 | 0.05 0.22 0.89 0.25 | 0.05 0.22 0.65 0.17
15 | 0.07 0.26 2.60 0.83 | 0.06 0.31 0.95 0.27 | 0.06 0.29 0.75 0.20
16 | 0.06 0.18 0.68 0.25 | 0.05 0.19 0.71 0.26 | 0.05 0.19 0.73 0.26
17 | 0.06 0.26 3.07 0.96 | 0.04 0.29 1.37 0.40 | 0.04 0.29 1.18 0.32
18 | 0.08 0.18 1.34 0.41 | 0.05 0.20 0.57 0.21 | 0.05 0.20 0.42 0.16
19 | 0.07 0.24 3.34 1.07 | 0.06 0.26 1.62 0.50 | 0.06 0.25 1.49 0.45
20 | 0.11 0.41 5.43 1.71 | 0.10 0.47 3.14 0.92 | 0.10 0.46 3.16 0.93
21 | 0.07 0.26 3.07 0.97 | 0.06 0.30 1.21 0.34 | 0.06 0.28 1.10 0.30
22 | 0.06 0.32 3.42 1.08 | 0.06 0.37 1.59 0.44 | 0.05 0.35 1.38 0.38
23 | 0.06 0.32 4.56 1.45 | 0.05 0.37 2.48 0.74 | 0.05 0.35 2.26 0.68
24 | 0.11 0.44 5.05 1.63 | 0.10 0.50 2.69 0.79 | 0.10 0.47 2.64 0.79
25 | 0.05 0.20 2.32 0.75 | 0.04 0.21 0.98 0.31 | 0.04 0.20 0.72 0.22
26 | 0.07 0.28 2.72 0.87 | 0.07 0.28 1.47 0.45 | 0.07 0.28 1.33 0.40
27 | 0.06 0.33 4.73 1.48 | 0.05 0.39 2.76 0.81 | 0.04 0.37 2.38 0.69
Mean | 0.07 0.23 2.32 0.74 | 0.06 0.25 1.13 0.34 | 0.06 0.24 1.00 0.30
Appendix Table 7: BIAS for Step Difficulty Values of Essay Item (listed by form)

Form | Method 1: b1 b2 b3 b4 b-bar | Method 2: b1 b2 b3 b4 b-bar | Method 3: b1 b2 b3 b4 b-bar
1 | -0.02 0.02 0.08 -1.60 -0.38 | -0.03 0.04 0.17 0.17 0.09 | -0.03 0.04 0.16 0.18 0.09
2 | -0.04 0.02 0.08 -0.05 0.00 | -0.06 0.01 0.08 0.39 0.11 | -0.05 0.01 0.08 0.38 0.11
3 | 0.01 0.05 0.05 -1.93 -0.45 | -0.01 0.08 0.17 0.28 0.13 | 0.00 0.09 0.16 0.21 0.11
4 | -0.04 0.02 0.04 -1.31 -0.32 | -0.04 0.03 0.11 0.30 0.10 | -0.04 0.03 0.11 0.30 0.10
5 | -0.06 0.02 0.06 -0.11 -0.03 | -0.07 0.01 0.07 0.30 0.08 | -0.07 0.01 0.07 0.31 0.08
6 | -0.02 0.00 -0.02 -3.23 -0.82 | -0.03 0.06 0.15 -0.16 0.01 | -0.03 0.06 0.15 -0.18 0.00
7 | 0.00 0.02 0.03 -3.38 -0.83 | -0.03 0.04 0.18 -0.29 -0.02 | -0.02 0.06 0.18 -0.43 -0.05
8 | -0.03 -0.02 0.00 -3.13 -0.79 | -0.03 0.05 0.18 -0.20 0.00 | -0.03 0.04 0.17 -0.13 0.01
9 | -0.05 -0.02 -0.02 -4.20 -1.07 | -0.04 0.07 0.22 -0.67 -0.10 | -0.04 0.07 0.20 -0.53 -0.07
10 | -0.02 0.06 0.08 -0.16 -0.01 | -0.05 0.03 0.07 0.33 0.10 | -0.04 0.04 0.08 0.28 0.09
11 | -0.04 0.01 0.06 -2.13 -0.53 | -0.04 0.03 0.16 -0.05 0.02 | -0.04 0.03 0.15 0.04 0.05
12 | 0.00 0.03 0.05 -2.47 -0.60 | -0.02 0.06 0.20 0.19 0.11 | -0.01 0.07 0.19 0.10 0.08
13 | -0.03 0.01 0.05 -2.29 -0.57 | -0.04 0.04 0.18 -0.08 0.02 | -0.04 0.04 0.17 -0.05 0.03
14 | -0.03 -0.01 -0.02 -3.78 -0.96 | -0.04 0.05 0.17 -0.58 -0.10 | -0.04 0.05 0.16 -0.59 -0.11
15 | -0.04 0.01 0.02 -4.07 -1.02 | -0.06 0.05 0.19 -0.67 -0.12 | -0.06 0.06 0.18 -0.55 -0.09
16 | -0.04 -0.01 -0.01 -3.32 -0.84 | -0.05 0.03 0.14 -0.47 -0.09 | -0.05 0.03 0.13 -0.36 -0.06
17 | -0.02 0.02 0.00 -3.79 -0.95 | -0.05 0.05 0.16 -0.42 -0.07 | -0.04 0.05 0.15 -0.50 -0.08
18 | -0.01 -0.03 -0.08 -6.67 -1.69 | -0.04 0.05 0.25 -1.22 -0.24 | -0.03 0.06 0.22 -1.32 -0.27
19 | -0.02 0.04 0.10 -0.95 -0.21 | -0.05 0.03 0.13 0.23 0.08 | -0.05 0.03 0.13 0.25 0.09
20 | 0.01 0.06 0.06 -2.82 -0.67 | -0.03 0.06 0.16 -0.17 0.01 | -0.02 0.07 0.16 -0.22 0.00
21 | 0.01 0.06 0.10 -0.68 -0.13 | -0.03 0.04 0.11 0.25 0.09 | -0.03 0.04 0.11 0.19 0.08
22 | 0.00 0.07 0.11 0.10 0.07 | -0.02 0.05 0.11 0.50 0.16 | -0.01 0.06 0.12 0.44 0.15
23 | 0.01 0.03 0.05 -1.78 -0.42 | -0.02 0.04 0.13 0.03 0.04 | -0.01 0.04 0.12 0.01 0.04
24 | 0.01 0.00 0.00 -5.17 -1.29 | -0.01 0.07 0.23 -0.67 -0.09 | -0.01 0.07 0.22 -0.66 -0.09
25 | -0.01 0.03 0.04 -2.20 -0.54 | -0.02 0.07 0.18 -0.23 0.00 | -0.02 0.07 0.17 -0.10 0.03
26 | -0.05 0.04 0.10 0.02 0.03 | -0.06 0.04 0.10 0.34 0.11 | -0.06 0.04 0.10 0.36 0.11
27 | 0.00 0.04 0.03 -2.14 -0.52 | -0.01 0.07 0.16 -0.04 0.05 | -0.02 0.07 0.15 0.00 0.05
Mean | -0.02 0.02 0.04 -2.34 -0.58 | -0.04 0.05 0.15 -0.10 0.02 | -0.03 0.05 0.15 -0.10 0.02
Appendix Table 8: RMSD for Step Difficulty Values of Essay Item (listed by form)

Form | Method 1: b1 b2 b3 b4 b-bar | Method 2: b1 b2 b3 b4 b-bar | Method 3: b1 b2 b3 b4 b-bar
1 | 0.05 0.07 0.20 2.89 0.74 | 0.05 0.06 0.22 0.54 0.15 | 0.05 0.06 0.22 0.48 0.14
2 | 0.07 0.06 0.12 1.31 0.34 | 0.07 0.05 0.12 0.69 0.18 | 0.07 0.05 0.12 0.70 0.18
3 | 0.06 0.11 0.20 3.16 0.81 | 0.06 0.11 0.25 0.64 0.19 | 0.06 0.12 0.24 0.51 0.16
4 | 0.06 0.07 0.17 2.61 0.68 | 0.06 0.07 0.16 0.61 0.16 | 0.06 0.07 0.16 0.58 0.16
5 | 0.08 0.05 0.12 1.22 0.31 | 0.08 0.05 0.12 0.61 0.15 | 0.08 0.05 0.12 0.62 0.16
6 | 0.05 0.09 0.21 4.17 1.08 | 0.05 0.08 0.24 0.53 0.11 | 0.05 0.08 0.23 0.41 0.09
7 | 0.04 0.07 0.24 4.27 1.09 | 0.05 0.07 0.26 0.63 0.13 | 0.04 0.08 0.26 0.59 0.10
8 | 0.06 0.09 0.27 4.11 1.07 | 0.06 0.08 0.27 0.63 0.14 | 0.06 0.08 0.27 0.43 0.10
9 | 0.07 0.11 0.28 4.94 1.28 | 0.06 0.10 0.33 0.85 0.16 | 0.06 0.10 0.32 0.63 0.11
10 | 0.05 0.09 0.13 1.20 0.31 | 0.07 0.07 0.12 0.70 0.18 | 0.06 0.08 0.13 0.66 0.18
11 | 0.06 0.06 0.22 3.35 0.87 | 0.07 0.06 0.23 0.53 0.13 | 0.07 0.06 0.23 0.36 0.10
12 | 0.05 0.09 0.26 3.62 0.94 | 0.05 0.09 0.28 0.53 0.16 | 0.05 0.09 0.27 0.43 0.13
13 | 0.05 0.09 0.30 3.51 0.90 | 0.06 0.07 0.33 0.50 0.12 | 0.05 0.07 0.33 0.46 0.10
14 | 0.05 0.08 0.28 4.57 1.19 | 0.05 0.07 0.27 0.74 0.15 | 0.05 0.07 0.26 0.74 0.15
15 | 0.06 0.06 0.25 4.77 1.22 | 0.07 0.07 0.28 0.87 0.18 | 0.07 0.07 0.26 0.64 0.13
16 | 0.06 0.07 0.21 4.22 1.09 | 0.07 0.06 0.22 0.67 0.15 | 0.07 0.06 0.22 0.50 0.11
17 | 0.05 0.07 0.24 4.60 1.17 | 0.07 0.08 0.26 0.71 0.15 | 0.06 0.08 0.25 0.59 0.11
18 | 0.04 0.07 0.29 6.80 1.73 | 0.06 0.08 0.35 1.38 0.28 | 0.05 0.08 0.33 1.36 0.27
19 | 0.06 0.07 0.16 2.19 0.55 | 0.07 0.06 0.17 0.60 0.15 | 0.07 0.06 0.17 0.59 0.16
20 | 0.05 0.09 0.20 3.76 0.94 | 0.06 0.08 0.22 0.54 0.13 | 0.05 0.09 0.22 0.47 0.10
21 | 0.04 0.08 0.18 1.94 0.49 | 0.05 0.07 0.17 0.58 0.16 | 0.05 0.07 0.17 0.54 0.14
22 | 0.05 0.09 0.18 1.20 0.33 | 0.05 0.08 0.17 0.83 0.24 | 0.05 0.08 0.17 0.78 0.22
23 | 0.05 0.08 0.22 2.96 0.77 | 0.05 0.08 0.20 0.55 0.13 | 0.05 0.08 0.19 0.48 0.12
24 | 0.04 0.07 0.22 5.55 1.40 | 0.04 0.09 0.30 0.81 0.15 | 0.04 0.09 0.29 0.75 0.12
25 | 0.06 0.11 0.31 3.44 0.90 | 0.05 0.11 0.31 0.70 0.13 | 0.06 0.11 0.30 0.45 0.08
26 | 0.07 0.07 0.17 1.26 0.32 | 0.07 0.07 0.17 0.73 0.19 | 0.07 0.07 0.17 0.75 0.20
27 | 0.05 0.10 0.31 3.35 0.87 | 0.04 0.10 0.31 0.58 0.12 | 0.04 0.10 0.30 0.52 0.11
Mean | 0.05 0.08 0.22 3.37 0.87 | 0.06 0.08 0.23 0.68 0.16 | 0.06 0.08 0.23 0.59 0.14