Investigating Common-Item Screening Procedures in Developing a Vertical Scale
Annual Meeting of the National Council on Measurement in Education, New Orleans, LA
Marc Johnson
Qing Yi
April 2011
Abstract
Creating a vertical scale involves several decisions on assessment designs and statistical analyses to determine the most appropriate vertical scale. This research study investigates common-item stability check procedures used to arrive at vertical linking item sets, which in turn produce the constants needed for computing vertical theta (ability) estimates and scale scores on a vertical scale metric. The research reported in this paper examines the phenomenon of common items (across adjacent levels) that have lower difficulty estimates (i.e., appear "easier") at the lower level than at the upper level, and the vertical scales that result. A major finding of this research is that the presence of linking items that appeared to be easier at the lower level than at the upper level can still lead to patterns of increasing achievement growth from the lowest level of the scale to the highest level.
Investigating Common-Item Screening Procedures in Developing a Vertical Scale
Introduction
Vertical scaling is a process of placing scores on tests that measure similar constructs, but
at different educational levels, onto a common scale (Kolen & Brennan, 2004). Vertical scales,
therefore, are thought of as a progression of scale scores used to monitor academic achievement
across age/grade levels (hereafter, levels). The need for vertical scales has received much attention in the past decade due to No Child Left Behind (NCLB) requiring that assessment programs track academic progress. Despite the prevalence of vertical scales in both national and state assessment programs, the methodologies used to derive vertical scales are numerous and can often produce different results.
In deriving vertical scales, practitioners often must choose among scaling methodologies
(e.g., item response theory (IRT), Thurstone scaling), vertical linking strategies across levels
(e.g., concurrent, separate level-groups, level-by-level), and scaling designs (e.g., scaling test,
common items across levels, equivalent groups design). There are other factors that should be
considered when designing a vertical scale and there are studies that are devoted to analyzing
these factors and how various combinations of these factors affect resulting vertical scales (Ito,
Sykes, & Yao, 2008; Tong & Kolen, 2007). These research studies of vertical scaling have not provided clear guidance on which factors should be used in combination to produce the "best" vertical scales. Instead, practitioners of vertical scales, or those interested in designing them, often derive appropriate vertical scales by analyzing how combinations of these factors affect the resulting scales in relation to the expectation of growth within individual assessment programs.
One factor that deserves more attention in vertical scaling is the set of items that
ultimately is used to create the vertical link among levels. In other words, vertical scales are
created via a set of items, regardless of the scaling design, that are responded to by examinees of
differing levels. In the case of the common item approach, vertical linking items are assessed
within on-level test forms as well as within off-level test forms. Within the equivalent groups
design, examinees can be randomly assigned to respond to either an on-level test or an off-level
test. However, with a scaling test design, examinees respond to a “test” that consists of all
vertical linking items, across all levels. The scaling test is in addition to an on-level test from
which scores are linked to the scaling test. In practice, examinee performance on the vertical
linking items is compared between the off-level and on-level examinees. This comparison can
result in items being removed from the vertical linking item set prior to the construction of
vertical scales (analogous to common item screening in horizontal equating).
Common item screening methodologies used in vertical linking studies can be the same
procedures found in horizontal equating strategies (e.g., Robust Z analysis, perpendicular
distance). However, the assumptions of item instability are different in the vertical linking
context from those of conventional horizontal equating practices. In other words, in the context
of vertical linking, it is expected that the vertical linking items will exhibit a differential in
performance between on-level and off-level examinees whereas that expectation is irrelevant in
horizontal equating studies. This raises the question of whether the common item screening methodologies used in horizontal equating are appropriate within vertical linking contexts. Should items be removed at all in vertical linking studies when a differential in performance between on-level and off-level examinees exists?
The research interest expressed in this paper involves examining common item screening
methodologies for vertical linking items and the impact of removal decisions on vertical scales.
In other words, this study will investigate different procedures of adjusting vertical linking item
sets and how these decisions affect resulting vertical scales. It has already been stated that there
is some item performance differential expected in vertical linking studies, but this study will
investigate varying degrees of this expectation and justifiable decisions that can be made based
on the empirical differential in item performance.
Linking Items in Equating
In practice, horizontal equating (statistically placing a test form onto a particular measurement scale) is often accomplished through a set of items designated as linking items.
items. When a test form is being placed onto the measurement scale of another test form, the
linking items are those items that are common to both test forms. However, when a test form is
being placed onto the measurement scale of an item pool, the linking items can either be all
scored test items or a set of the items. In either situation, a measurement link is established that
allows a test form to be placed onto the same scale as a previous test form or the item pool.
The selection of linking items, in the case where they represent only a subset of the tested items, has been considered critical to the design of horizontal equating studies, and guidelines have been established that continue to be used in the psychometric analyses within large-scale assessment programs. These guidelines include test content representation relative to the entire test form, the position of the linking items throughout the test, the number of linking items in relation to the total number of test items, and the statistical properties of the intended linking items, usually based on past performance. Although important, the dissection of these guidelines is beyond the scope of this research study; readers are referred to texts that discuss these
guidelines in more detail (Klein & Jarjoura, 1985; Wingersky, Cook, & Eignor, 1987).
Vertical linking can be accomplished through a variety of methods. One method is through the use of linking items, analogous to horizontal equating. When used as common items across adjacent levels, vertical linking item sets will mostly consist of items that students at both adjacent levels can respond to correctly. Linking item guidelines of horizontal equating, mentioned above,
are applicable in the vertical linking context so that a strong measurement link can be established
that will foster a reasonable scale of growth across all levels. The scaling test method of vertical
linking, however, relies on examinees responding to an on-level test as well as a test that consists
of items spanning all levels (the scaling test; Kolen & Brennan, 2004).
Linking Item Performance in Equating – Stability Check Procedures
When using linking items to determine a measurement link between test forms or between a test form and an item pool, the newly obtained item statistics are analyzed and compared against previously obtained statistics. Under the Rasch model, the IRT statistics can be compared through procedures such as the Robust Z analysis (Huynh, Gleaton, & Seaman, 1992), the perpendicular distances mentioned earlier, and the 0.3-logit difference procedure (Miller, Rotou, & Twing, 2004). All of these procedures (discussed below), referred to as item stability checks, aim at identifying the items that show a greater than expected difference between the old and new statistics, each with its own criterion of acceptable difference. In practice, the items identified at this stage are candidates for removal from the linking item set before the final measurement link is established and raw scores are scaled to scale scores. However, each procedure has guidelines governing how common items are removed from the linking item set.
Robust Z Statistic
The Robust Z statistic is determined through the following formula:
$$ z_i = \frac{(b_{i1} - b_{i2}) - M_d}{0.74 \times IQR}, $$

where $b_{i1}$ is one difficulty estimate value for a given linking item, $b_{i2}$ is the other estimated item difficulty for that linking item, $M_d$ is the median difference of all potential linking items, and $IQR$ is the interquartile range of the differences for all potential linking items. In contrast, traditional z statistics are computed as $z = (\text{score} - \text{mean})/\text{standard deviation}$, which can be affected by outliers. The Robust Z statistic was designed to be "robust" in its calculation against outliers. Also, evaluating the Robust Z statistic against a predetermined value alone does not provide the mechanism for removing "unstable" linking items. This procedure incorporates the ratio of standard deviations and the correlation of the two sets of item difficulty estimates to determine if linking items should be dropped. The full procedure is outlined in Appendix A.
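For illustration, a minimal sketch of this computation in Python; NumPy and all variable names are assumptions for this sketch, not part of the original study:

```python
import numpy as np

def robust_z(b1, b2):
    """Robust Z for each linking item, given the two sets of Rasch
    difficulty estimates (e.g., lower-level and upper-level calibrations)."""
    d = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    md = np.median(d)                       # median difference, M_d
    q75, q25 = np.percentile(d, [75, 25])
    return (d - md) / (0.74 * (q75 - q25))  # 0.74 * IQR approximates the SD

# Illustrative difficulties; flag |Robust Z| > 1.645 per Appendix A
b_lower = [-0.52, 0.10, 0.85, 1.20, -0.15, 0.40]
b_upper = [-0.90, -0.35, 0.50, 0.95, -0.60, 1.30]
z = robust_z(b_lower, b_upper)
print(np.where(np.abs(z) > 1.645)[0])       # indices of flagged items
```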
0.3-Logit Difference
This procedure identifies items that have an absolute difference between item difficulty estimates of 0.3 logit or greater. These items are candidates for removal from the linking item set, following standard guidelines for removing items.
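Because this check is a simple absolute-difference rule, a sketch is short; the variable names follow the Robust Z sketch above and are assumptions:

```python
import numpy as np

def flag_logit_difference(b1, b2, criterion=0.3):
    """Flag linking items whose Rasch difficulty estimates differ by
    0.3 logit or more between the two calibrations."""
    d = np.abs(np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float))
    return d >= criterion   # Boolean flag per linking item
```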
Perpendicular Distance
Based on the delta-plot method (Angoff, 1972; Dorans & Holland, 1993) of item
difficulty differences, the perpendicular distance procedure evaluates the standard deviation of
the perpendicular distance to the line of best fit. Although this method has been applied to
differences in proportion correct values (item p-values; Karkee & Choi, 2005), the research study
presented in this paper uses this method to evaluate differences in Rasch item difficulty values.
Also, the computation of the statistics for this procedure is slightly different from what was presented by Karkee and Choi, based on the application of this procedure to equating studies for large-scale assessment programs. As computed here, the perpendicular distance is:

$$ D_i = \frac{A \cdot I_{1i} - I_{2i} + B}{\sqrt{A^2 + 1}}, $$

where $I_{1i}$ and $I_{2i}$ signify the two item difficulty estimates for linking item $i$;

$$ A = \frac{(\sigma_2^2 - \sigma_1^2) + \sqrt{(\sigma_2^2 - \sigma_1^2)^2 + 4 r_{12}^2 \sigma_1^2 \sigma_2^2}}{2 r_{12} \sigma_1 \sigma_2}, $$

which includes the variances ($\sigma_1^2$ and $\sigma_2^2$), standard deviations ($\sigma_1$ and $\sigma_2$), correlation ($r_{12}$), and squared correlation ($r_{12}^2$) of the item difficulty sets; and $B = \mu_2 - A\mu_1$, which includes the means of the item difficulty sets ($\mu_1$ and $\mu_2$). For this research study, the perpendicular distance for each linking item is transformed into a z-value by

$$ z_i = \frac{D_i - \bar{D}}{\sigma_D}, $$

where $\bar{D}$ is the mean distance and $\sigma_D$ is the standard deviation of the distances. From this, any linking item with a z-value greater than 3.0 is removed. It should be pointed out, though, that linking items to be removed with this procedure are removed one at a time, leading to a recalculation of distance estimates for the remaining linking items after each removal.
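A sketch of the procedure as reconstructed above, including the one-at-a-time removal loop; the 3.0 criterion comes from the text, while the function names and the use of sample (n - 1) standard deviations are assumptions:

```python
import numpy as np

def perpendicular_z(i1, i2):
    """Standardized perpendicular distances from the (I1, I2) difficulty
    pairs to the principal-axis line of the delta-plot scatter."""
    i1, i2 = np.asarray(i1, dtype=float), np.asarray(i2, dtype=float)
    s1, s2 = i1.std(ddof=1), i2.std(ddof=1)
    r = np.corrcoef(i1, i2)[0, 1]
    # Slope A of the principal axis and intercept B = mu2 - A * mu1
    a = ((s2**2 - s1**2) + np.sqrt((s2**2 - s1**2)**2
                                   + 4 * r**2 * s1**2 * s2**2)) / (2 * r * s1 * s2)
    b = i2.mean() - a * i1.mean()
    d = (a * i1 - i2 + b) / np.sqrt(a**2 + 1)
    return (d - d.mean()) / d.std(ddof=1)

def screen_perpendicular(i1, i2, criterion=3.0):
    """Drop linking items one at a time, recomputing the distances after
    each removal, until no standardized distance exceeds the criterion."""
    keep = list(range(len(i1)))
    while len(keep) > 2:
        z = perpendicular_z([i1[k] for k in keep], [i2[k] for k in keep])
        worst = int(np.argmax(np.abs(z)))
        if abs(z[worst]) <= criterion:
            break
        keep.pop(worst)      # remove the worst item, then recompute
    return keep              # indices of retained linking items
```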
Linking Item Performance in Equating – Horizontal vs. Vertical Linking
For horizontal equating, there is often the expectation that the linking items will perform similarly to their most recent test administration, so that the item stability checks should not result in any linking items being dropped prior to the scaling of raw scores. However, the unpredictable nature of testing and student responses can result in items showing large differences in performance across test administrations. When this occurs, the stability checks will result in items being dropped from the linking set so that these items do not impact the measurement link that is being sought for the purposes of equating.
Vertical linking, however, presents a slight challenge to the idea of item stability used in
horizontal equating. In vertical linking, the expectation of common items is that items presented
to students of two different (mainly, adjacent) levels will appear easier at the higher level than at
the lower level. This can be summarized as items performing better at the higher level than at the
lower level. Item stability checks are appropriate in this situation, though, to monitor these
differences and investigate closely those differences that are greater than expected. Therefore, as
with horizontal equating, stability procedures can result in vertical linking items being removed
from the analysis prior to a vertical link being established.
Within vertical linking, though, it can be found that items used as vertical links can display
better performance at the lower level than at the higher level. In other words, the items are easier
for the lower level students than for the higher level students. There are several reasons this may
occur, but the relevance of this phenomenon to the current paper is the appropriate handling of
these instances in creating a vertical scale. From this, a dilemma is introduced in creating vertical scales, since the goal of these scales is to show a progression of achievement from one level (e.g., level 1) through another (e.g., level 6). Anomalies in item performance between two adjacent levels may limit the perception of the progression of achievement.
Purpose of Research
One aspect analyzed in particular is the case in which vertical linking items show better performance at a lower level than at a higher level, to the degree that linking items were removed from the linking set. The expectation is that items presented at a higher level will result in lower item difficulty estimates than when those items are administered at a lower level; those items should appear easier at the higher level than at the lower level. However, it can be found that items perform better at a lower level than at the higher level. This may affect the
construction of vertical scales since the goal of these scales is to show a progression of
achievement from one level (e.g., level 1) through another (e.g., level 6).
Anomalies in item performance between two levels may limit the apparent progression of achievement. This situation is discussed because typical common item screening methodologies may not lead practitioners to discard items that perform better at a lower level from the vertical linking set, thus leaving items in the linking set that do not fully comply with expectations. This research study investigated various common item screening
methodologies to determine linking item sets that will lead to the development of vertical scales.
The primary goal of this research study was to analyze the pattern of vertical scales among the
different common item screening methodologies, noting how anomalies in item performance
across adjacent levels affect the trajectories of “growth”.
Method
Student data from a large-scale assessment program were used in this study. These data, obtained through the common-item non-equivalent-groups design (Young, 2006; Kolen & Brennan, 2004), reflected student responses to on-level test forms that included off-level items ("vertical linking items") according to the design shown in Table 1. As shown in Table 1, the
off-level items for level 1 were from the level 2 test only while the level 6 test included off-level
items from only the level 5 test. However, for levels 2 through 5, each test included off-level
items from one level above and one level below. With this design, 36 items were classified as
vertical linking items among adjacent levels.
All items were calibrated to the Rasch measurement model through WINSTEPS software
(Linacre, 2007). The non-linking items were calibrated first, and then used as anchors for the
calibration of the vertical linking items. The Rasch item difficulty estimates of the common
items were analyzed across adjacent levels and differences between these estimates were
examined through multiple item screening procedures: Robust Z, perpendicular distance, and
0.3-logit difference. The goal of this screening investigation was to identify vertical linking items
that (1) show a substantial differential in examinee performance across adjacent levels, noting (2)
the occurrence of items performing better at the lower level than at the higher level and how
inclusion of these items affects the vertical scales.
This research study used two conditions of linking item removal: directional and non-
directional. The directional approach removed only those linking items that were, in fact, easier
at the lower level than at the higher level while the non-directional approach removed linking
items based on the results of the item stability procedures, regardless of whether or not the items
were easier at the lower level. Also, the maximum number of linking items that could be
removed within any research condition was set at 7, approximately 20% of the original linking
item set. This maximum percentage is widely used in practice.
From the item screening methodologies, vertical linking constants were computed as the
difference in mean Rasch difficulty estimates between two adjacent levels. These constants were
added cumulatively to on-level theta estimates (obtained through WINSTEPS) using level 3 as
the base scale. For example, for level 1, the vertical linking constant computed between level 1
and level 2 and the vertical linking constant between level 2 and level 3 were added to each theta
estimate in level 1, placing all level 1 theta estimates onto the scale of level 3. From the adjusted
theta estimates, vertically linked scale scores (reportable scores on the vertical scale) were
derived using a procedure outlined by Kolen and Brennan (2004) which uses a unique slope and
intercept for each level.
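A minimal sketch of this cumulative adjustment toward the level 3 base scale; the constants below are illustrative, and the sign convention for levels above the base is an assumption, since the paper works through the level 1 example only:

```python
import numpy as np

# Adjacent-level linking constants keyed (lower, upper); adding c[(k, k + 1)]
# is taken to place level-k thetas onto the level-(k + 1) scale.
c = {(1, 2): -0.84, (2, 3): -0.80, (3, 4): 0.06, (4, 5): 0.73, (5, 6): -0.37}
BASE = 3

def to_base_scale(theta, level, c=c, base=BASE):
    """Cumulatively apply adjacent-level constants to place on-level
    theta estimates onto the base-level scale."""
    shift = 0.0
    if level < base:                 # chain upward, e.g., 1 -> 2 -> 3
        for k in range(level, base):
            shift += c[(k, k + 1)]
    else:                            # chain downward, e.g., 5 -> 4 -> 3;
        for k in range(base, level): # the reversed sign is an assumption
            shift -= c[(k, k + 1)]
    return np.asarray(theta, dtype=float) + shift

# Level 1 example from the text: both constants are added to each theta
print(to_base_scale([0.20, 0.55], level=1))   # shift = -0.84 + -0.80
```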
Using this approach, slopes were determined by
$$ \text{slope} = \frac{sc(y_2) - sc(y_1)}{y_2 - y_1}, $$

where $sc(y_2)$ is the desired scale score mean for the highest level of the vertical scale (e.g., level 6), $sc(y_1)$ is the desired scale score mean for the lowest level of the vertical scale (e.g., level 1), $y_2$ is the vertically linked ability estimate for the highest level corresponding to a cumulative percent of 75% (of the population of ability estimates), and $y_1$ is the vertically linked ability estimate corresponding to a cumulative percent of 75% for the lowest level. For this research
study, 250 was used as the desired mean scale score for level 6 while 200 was used for level 1, as
proposed by the authors of this approach.
Intercepts under this approach were determined by
$$ \text{intercept} = sc(y_1) - \frac{sc(y_2) - sc(y_1)}{y_2 - y_1} \, y_1, $$
where the terms are defined as they were for computing the slopes.
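A worked sketch of the transformation, using the study's target means of 200 (level 1) and 250 (level 6); the 75th-percentile theta values are illustrative:

```python
import numpy as np

def scale_transformation(y1, y2, sc_y1=200.0, sc_y2=250.0):
    """Slope and intercept per the Kolen and Brennan (2004) procedure,
    given the 75th-percentile vertically linked thetas y1 and y2."""
    slope = (sc_y2 - sc_y1) / (y2 - y1)
    return slope, sc_y1 - slope * y1

slope, intercept = scale_transformation(y1=-0.65, y2=0.88)  # illustrative
theta = np.array([-1.10, -0.16, 0.85])
print((slope * theta + intercept).round())  # vertically linked scale scores
```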
Evaluation
For each item screening methodology, descriptive statistics of the vertical scale scores
were plotted across levels to display average performance across levels and effect sizes were
computed and plotted across levels to provide an index of the separation of scale score
distributions among adjacent levels. From Kolen and Brennan (2004), the effect size index was
computed as follows:
$$ es = \frac{\mu(Y)_{upper} - \mu(Y)_{lower}}{\sqrt{\left[\sigma^2(Y)_{upper} + \sigma^2(Y)_{lower}\right] / 2}}, $$

where $\mu(Y)_{upper}$ is the mean scale score for the upper level, $\mu(Y)_{lower}$ is the mean scale score of the lower level, $\sigma^2(Y)_{upper}$ is the variance for the upper level, and $\sigma^2(Y)_{lower}$ is the variance of the lower level. Also, the vertical linking item sets were compared across the item screening methodologies to analyze the number of items used to create the vertical link that, in fact, performed better at the lower level than at the higher level.
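A minimal sketch of the effect size computation; whether sample or population variances were used is not stated, so ddof=1 here is an assumption:

```python
import numpy as np

def effect_size(scores_upper, scores_lower):
    """Kolen and Brennan (2004) effect size: mean difference divided by
    the square root of the average of the two variances."""
    u = np.asarray(scores_upper, dtype=float)
    l = np.asarray(scores_lower, dtype=float)
    return (u.mean() - l.mean()) / np.sqrt((u.var(ddof=1) + l.var(ddof=1)) / 2)
```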
Results
Table 2 shows the number of items removed through each item stability screening
procedure within each research condition, the number of items retained that were easier at the
lower level, and the vertical linking constants computed with the final linking item sets. The
directional and non-directional approached resulted in vertical linking items that were easier at
the lower level, but were kept for computation of the vertical linking constants. In the cases
where the number of linking items removed was less than 7 – the maximum allowed – the
remaining linking items were not flagged as “problematic” within the item stability check
procedures. However, in the cases where seven linking items were removed from the linking set,
some of the remaining linking items were flagged as "problematic" but could not be removed without violating the maximum allowed for removal.
It should be pointed out that the perpendicular distance procedure for both research
conditions (directional and non-directional) resulted in the same vertical level linking constants.
In both of these cases, no linking items were flagged for removal at any level. This result will manifest itself throughout the rest of the results of this study and will be discussed again in the discussion section of this paper.
Table 3 shows the average of the theta estimates after the vertical linking constants have
been applied as previously outlined. From these results, a few things are worth pointing out that
will manifest themselves throughout the rest of the results. First, the mean theta estimates
increase from level 1 to level 6 for all conditions/procedures except for the non-directional
approach with the 0.3-logit difference procedure. In this condition, the average theta estimate for
level 3 (the base level) is slightly lower than that of level 2 and the average theta estimate for
level 5 is lower than that of level 4.
Second, and with the exception of the perpendicular distance procedure, the mean theta
estimates for each item stability check procedure under the non-directional condition comprise a
smaller range and are smaller in magnitude than those under the directional condition. Third, the
mean theta estimates for the 0.3-logit difference procedure under the non-directional condition
resulted in the smallest (in magnitude) mean theta estimates with a level 6 mean theta of 0.5328
which – while being the highest value of all levels within this procedure – is the smallest mean
theta estimate for level 6 across all research conditions.
Table 4 shows the slopes and intercepts derived for each study condition. The slope and
intercept values for the 0.3-logit difference procedure under the non-directional approach of
removing linking items were much higher than the other values. This was due to the fact that the
vertical linking theta estimates that corresponded to a cumulative percent of 75% for level 1
(used in the calculation, but not presented in this paper) were much higher for this condition than
the others – a product of the vertical level linking constants from Table 2.
Table 5 shows the mean scale scores after theta transformation using the derived slopes
and intercepts. Figure 1 shows the mean scale scores across levels for each condition/procedure.
It should be noted that some minimum scale scores were negative; in particular, the Robust Z and 0.3-logit difference procedures under the non-directional condition had negative minimum scale scores at every level. Consistent with the mean theta
estimates, the mean scale scores for the non-directional condition are lower than those of the
directional condition, with the lowest mean scale scores from the 0.3-logit difference procedure.
Table 6 and Figure 2 show the standard deviations of scale scores across levels for each
condition/procedure. Here, the standard deviations from the non-directional condition were much
higher than those of the directional condition. In addition, the 0.3-logit difference procedure under the
non-directional approach resulted in standard deviations that were approximately five times
greater than those from the directional approach, indicating greater variability among the scale
scores.
As mentioned earlier, effect sizes provide an index of the separation of scale score
distributions among adjacent levels. Table 7 and Figure 3 depict the effect sizes computed for this research study. As the figure shows, the pattern of effect sizes is relatively consistent across conditions/procedures, but with large "jumps" from the effect size between level 4 and level 5 to that between level 5 and level 6. The majority of the effect sizes can be considered "small",
equal to or greater than 0.2 but less than 0.5 (Cohen, 1988). However, the 0.3-logit procedure
under the non-directional condition resulted in some “negligible” effect sizes, less than 0.2,
essentially indicating no separation of scale score distributions between adjacent levels.
Discussion
As can be seen throughout the results of this research study, the strategy for removing linking items during stability checks (directional vs. non-directional), as well as the item stability screening procedure itself, affects the items removed and, subsequently, the linking constants obtained for developing a vertical scale. More importantly, the results from Table 2 indicate that each vertical scale created within this research study was built with several vertical linking items that were easier at the lower level than at the upper level. The data used for this study
provided an invaluable opportunity to investigate how this item performance phenomenon
manifests itself in vertical scales.
In looking at Figure 1, the mean scale scores increase from level 1 to level 6 for all
research conditions except for the 0.3-logit difference procedure under the non-directional
approach for removing linking items. The vertical trend for this research condition shows “no
growth" from level 1 to level 2 and an unexpected "decrease" in growth from level 4 to level 5, followed by a large growth from level 5 to level 6. The variance of scale scores for this
stability screening procedure is also much higher than the other research conditions. It can be
inferred from the results of this study that the 0.3-logit difference procedure – at least under the
non-directional approach – was affected by the presence of linking items that were easier at the
lower level than at the upper level, given that some “unstable” items had been removed that were
easier at the upper level. The "negligible" effect sizes from the 0.3-logit difference procedure under the non-directional approach provide further evidence of the pattern of "growth" shown in Figure 1. This inferred effect of the linking item set for this research condition might cause practitioners some concern about adopting this option to create a vertical scale.
Limitations
Although the results of this study provide promising applications to future vertical scale
development, there are limitations worth mentioning. First, these data were not compared against
linking item sets in which items common to adjacent levels are easier at the upper level and more
difficult at the lower level – a general expectation of common-item performance across adjacent
levels. This comparison would shed more insight into whether this phenomenon is a major
concern to consider when developing a vertical scale through assessment items. Another
limitation of this research study is the criteria for identifying and removing “unstable” linking
items. As was previously mentioned, the criteria used for this study reflected criteria used for linking studies performed for operational large-scale assessment programs, which are sometimes different from the originally published criteria. A third limitation of the study is that only one data set was included in the research, which may limit the generalizability of the results.
References
Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Huynh, H., Gleaton, J., & Seaman, S. P. (1992). Technical documentation for the South Carolina high school exit examination of reading and mathematics: Paper No. 2 (2nd ed.). Columbia, SC: University of South Carolina, College of Education.
Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21, 187-206.
Karkee, T., & Choi, S. (2005, April). Impact of eliminating anchor items flagged from statistical criteria on test score classifications in common item equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.
Linacre, J. M. (2007). A user's guide to WINSTEPS MINISTEP Rasch-model computer programs. Chicago, IL.
Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits screening criterion in common item equating. Journal of Applied Measurement, 5(2), 172-177.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.
Wingersky, M. S., Cook, L. L., & Eignor, D. R. (1987). Specifying the characteristics of linking items used for item response theory item calibration (ETS Research Report No. 87-24). Princeton, NJ: Educational Testing Service.
Young, M. J. (2006). Vertical scaling. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Table 1. Common Item Design for Developing a Vertical Scale

Test Form   Item Blocks Administered
Level 3     Level 3 on-level; Level 4 off-level
Level 4     Level 3 off-level; Level 4 on-level; Level 5 off-level
Level 5     Level 4 off-level; Level 5 on-level; Level 6 off-level
Level 6     Level 5 off-level; Level 6 on-level; Level 7 off-level
Level 7     Level 6 off-level; Level 7 on-level; Level 8 off-level
Level 8     Level 7 off-level; Level 8 on-level

Note: the six test levels shown here correspond to the six levels (1 through 6) referenced in the text.
Table 2. Removed/Retained Linking Items and Vertical Linking Constants

Condition        Procedure               Levels  Items Removed  Items Kept*  Vertical Level Linking Constant
Directional      Robust Z                1-2     1              5            -0.8403
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     1              26           -0.3651
                 Perpendicular Distance  1-2     0              6            -0.7953
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     0              27           -0.3976
                 0.3-Logit Difference    1-2     2              4            -0.8776
                                         2-3     5              5            -1.0231
                                         3-4     7              11            0.3176
                                         4-5     1              4             0.7700
                                         5-6     7              20           -0.2176
Non-Directional  Robust Z                1-2     4              5            -0.6669
                                         2-3     5              10           -0.4665
                                         3-4     1              18           -0.0090
                                         4-5     2              5             0.6323
                                         5-6     4              26           -0.4638
                 Perpendicular Distance  1-2     0              6            -0.7953
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     0              27           -0.3976
                 0.3-Logit Difference    1-2     7              6            -0.4772
                                         2-3     7              10           -0.3760
                                         3-4     7              16           -0.1311
                                         4-5     7              5             0.4556
                                         5-6     7              20           -0.2176

*Items kept that performed better at the lower level.
Table 3. Mean Vertical Linking Theta Estimates

Condition        Procedure               Level 1   Level 2   Level 3   Level 4   Level 5   Level 6
Directional      Robust Z                -1.1362   -0.5780   -0.1589   0.3100    0.5304    0.8510
                 Perpendicular Distance  -1.0912   -0.5780   -0.1589   0.3100    0.5304    0.8185
                 0.3-Logit Difference    -1.3961   -0.8006   -0.1589   0.5664    0.8278    1.2959
Non-Directional  Robust Z                -0.6288   -0.2440   -0.1589   0.2398    0.3635    0.5854
                 Perpendicular Distance  -1.0912   -0.5780   -0.1589   0.3100    0.5304    0.8185
                 0.3-Logit Difference    -0.3186   -0.1535   -0.1589   0.1177    0.0647    0.5328
Table 4. Slope and Intercept Values for Scale Transformation

Condition        Procedure               Slope      Intercept
Directional      Robust Z                32.6158    200.3098
                 Perpendicular Distance  34.3525    198.7805
                 0.3-Logit Difference    22.3434    206.0193
Non-Directional  Robust Z                65.7895    167.2434
                 Perpendicular Distance  34.3525    198.7805
                 0.3-Logit Difference    125.8812   98.2754
Table 5. Mean Vertical Linking Scale Scores

Condition        Procedure               Level 1  Level 2  Level 3  Level 4  Level 5  Level 6
Directional      Robust Z                178      190      209      217      225      231
                 Perpendicular Distance  177      188      208      217      224      230
                 0.3-Logit Difference    185      194      212      223      229      237
Non-Directional  Robust Z                156      168      184      197      206      212
                 Perpendicular Distance  177      188      208      217      224      230
                 0.3-Logit Difference    112      112      131      140      134      177
Table 6. Standard Deviation of Vertical Linking Scale Scores

Condition        Procedure               Level 1  Level 2  Level 3  Level 4  Level 5  Level 6
Directional      Robust Z                34       31       31       30       33       27
                 Perpendicular Distance  36       33       33       31       34       28
                 0.3-Logit Difference    23       22       21       20       22       18
Non-Directional  Robust Z                69       63       62       60       66       54
                 Perpendicular Distance  36       33       33       31       34       28
                 0.3-Logit Difference    132      121      119      114      126      103
Table 7. Vertical Scale Effect Sizes

Condition        Procedure               Level 1/Level 2  Level 2/Level 3  Level 3/Level 4  Level 4/Level 5  Level 5/Level 6
Directional      Robust Z                0.36             0.61             0.29             0.23             0.21
                 Perpendicular Distance  0.31             0.61             0.29             0.23             0.18
                 0.3-Logit Difference    0.39             0.85             0.55             0.28             0.37
Non-Directional  Robust Z                0.19             0.26             0.21             0.13             0.11
                 Perpendicular Distance  0.31             0.61             0.29             0.23             0.18
                 0.3-Logit Difference    0.00             0.16             0.08             -0.05            0.37
[Figure 1. Mean Vertical Linking Scale Scores. Line plot of mean scale score (y-axis, 0-250) by level (x-axis, 1-6), with one series per condition/procedure: Directional and Non-Directional versions of Robust Z, Perpendicular Distance, and 0.3-logit Difference.]
[Figure 2. Standard Deviation of Vertical Linking Scale Scores. Line plot of scale score standard deviation (y-axis, 0-140) by level (x-axis, 1-6), with the same six condition/procedure series as Figure 1.]
[Figure 3. Vertical Scale Effect Sizes. Line plot of effect size (y-axis, -0.10 to 0.90) for adjacent-level pairs L1/L2 through L5/L6 (x-axis), with the same six condition/procedure series as Figure 1.]
Appendix A
Robust Z Stability Check Guidelines
1. Calculate the mean and standard deviation for both sets of item difficulties for all linking items.
2. Calculate the ratio of standard deviations.
3. Calculate the correlation between the sets of item difficulties.
4. Calculate the Robust Z statistic for each linking item and flag all linking items with an absolute value of the Robust Z greater than 1.645.
5. The ratio of the standard deviations (from step 2) must be between 0.9 and 1.1.
6. The correlation (from step 3) must be at least 0.95.
7. If the ratio of standard deviations or the correlation is outside of the prescribed bounds, then remove the item whose absolute Robust Z value is the largest and is greater than 1.645.
8. Recompute the ratio of standard deviations and the correlation with the remaining linking items.
9. Continue dropping items in a stepwise fashion until the ratio of standard deviations and the correlation are within the prescribed bounds, there are no items left with a Robust Z greater than 1.645, or 20% of the linking set has been dropped. Note that the Robust Z values are not recalculated each time; only the ratio of standard deviations and the correlation are.
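A sketch of the full stepwise procedure in Python; the robust_z helper mirrors the formula given in the body of the paper, the thresholds (1.645, 0.9-1.1, 0.95, 20%) come from the steps above, and everything else is an illustrative assumption:

```python
import numpy as np

def robust_z(b1, b2):
    d = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    q75, q25 = np.percentile(d, [75, 25])
    return (d - np.median(d)) / (0.74 * (q75 - q25))

def robust_z_screening(b1, b2, z_crit=1.645, sd_bounds=(0.9, 1.1),
                       r_min=0.95, max_drop_frac=0.20):
    """Stepwise Robust Z screening per Appendix A: Robust Z values are
    computed once; only the SD ratio and correlation are recomputed."""
    b1, b2 = np.asarray(b1, dtype=float), np.asarray(b2, dtype=float)
    z = np.abs(robust_z(b1, b2))       # steps 1 and 4: computed once
    keep = list(range(len(b1)))
    max_drops = int(max_drop_frac * len(b1))
    for _ in range(max_drops):         # step 9: cap at 20% dropped
        k1, k2 = b1[keep], b2[keep]
        sd_ratio = k1.std(ddof=1) / k2.std(ddof=1)   # step 2
        r = np.corrcoef(k1, k2)[0, 1]                # step 3
        in_bounds = sd_bounds[0] <= sd_ratio <= sd_bounds[1] and r >= r_min
        flagged = [i for i in keep if z[i] > z_crit]
        if in_bounds or not flagged:   # steps 5-7: stop when within bounds
            break                      # or nothing is left to drop
        keep.remove(max(flagged, key=lambda i: z[i]))  # step 7: worst item
    return keep                        # step 8 repeats via the loop
```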