Investigating Common-Item Screening Procedures in Developing a Vertical Scale
Annual Meeting of the National Council on Measurement in Education, New Orleans, LA
Marc Johnson
Qing Yi
April 2011
Abstract
Creating a vertical scale involves several decisions on assessment designs and statistical analyses to determine the most appropriate vertical scale. This research study investigates common-item stability check procedures used to arrive at vertical linking item sets, which in turn produce the constants needed for computing vertical theta (ability) estimates and scale scores on a vertical scale metric. The research reported in this paper examines the phenomenon of common items (across adjacent levels) that have lower difficulty estimates (i.e., appear "easier") at the lower level than at the upper level, and the vertical scales that result. A major finding of this research is that the presence of linking items that appeared to be easier at the lower level than at the upper level can still lead to patterns of increasing achievement growth from the lowest level of the scale to the highest level.
Investigating Common-Item Screening Procedures in Developing a Vertical Scale
Introduction
Vertical scaling is a process of placing scores on tests that measure similar constructs, but
at different educational levels, onto a common scale (Kolen & Brennan, 2004). Vertical scales,
therefore, are thought of as a progression of scale scores used to monitor academic achievement
across age/grade levels (hereafter, levels). The need for vertical scales has received much attention in the past decade due to No Child Left Behind (NCLB) requiring that assessment programs track academic progress. Despite the prevalence of vertical scales in both national and state assessment programs, the methodologies used to derive vertical scales are numerous and can often produce different results.
In deriving vertical scales, practitioners often must choose among scaling methodologies
(e.g., item response theory (IRT), Thurstone scaling), vertical linking strategies across levels
(e.g., concurrent, separate level-groups, level-by-level), and scaling designs (e.g., scaling test,
common items across levels, equivalent groups design). There are other factors that should be
considered when designing a vertical scale and there are studies that are devoted to analyzing
these factors and how various combinations of these factors affect resulting vertical scales (Ito,
Sykes, & Yao, 2008; Tong & Kolen, 2007). These research studies of vertical scaling have not provided clear guidance on which factors should be used in combination to produce the "best" vertical scales. Instead, practitioners of vertical scales, or those interested in designing them, often derive appropriate vertical scales by analyzing how combinations of these factors affect the resulting scales in relation to the expectation of growth within individual assessment programs.
One factor that deserves more attention in vertical scaling is the set of items that
ultimately is used to create the vertical link among levels. In other words, vertical scales are
created via a set of items, regardless of the scaling design, that are responded to by examinees of
differing levels. In the case of the common item approach, vertical linking items are assessed
within on-level test forms as well as within off-level test forms. Within the equivalent groups
design, examinees can be randomly assigned to respond to either an on-level test or an off-level
test. However, with a scaling test design, examinees respond to a “test” that consists of all
vertical linking items, across all levels. The scaling test is in addition to an on-level test from
which scores are linked to the scaling test. In practice, examinee performance on the vertical
linking items is compared between the off-level and on-level examinees. This comparison can
result in items being removed from the vertical linking item set prior to the construction of
vertical scales (analogous to common item screening in horizontal equating).
Common item screening methodologies used in vertical linking studies can be the same
procedures found in horizontal equating strategies (e.g., Robust Z analysis, perpendicular
distance). However, the assumptions of item instability are different in the vertical linking
context from those of conventional horizontal equating practices. In other words, in the context
of vertical linking, it is expected that the vertical linking items will exhibit a differential in
performance between on-level and off-level examinees whereas that expectation is irrelevant in
horizontal equating studies. This raises the question of whether the common item screening methodologies used in horizontal equating are appropriate within vertical linking contexts. Should items be removed at all in vertical linking studies when a differential in performance between on-level and off-level examinees exists?
The research interest expressed in this paper involves examining common item screening
methodologies for vertical linking items and the impact of removal decisions on vertical scales.
In other words, this study will investigate different procedures of adjusting vertical linking item
sets and how these decisions affect resulting vertical scales. It has already been stated that there
is some item performance differential expected in vertical linking studies, but this study will
investigate varying degrees of this expectation and justifiable decisions that can be made based
on the empirical differential in item performance.
Linking Items in Equating
In practice, horizontal equating (statistically placing a test form onto a particular measurement scale) is often accomplished through a set of items designated as linking items.
items. When a test form is being placed onto the measurement scale of another test form, the
linking items are those items that are common to both test forms. However, when a test form is
being placed onto the measurement scale of an item pool, the linking items can either be all
scored test items or a set of the items. In either situation, a measurement link is established that
allows a test form to be placed onto the same scale as a previous test form or the item pool.
The selection of linking items, in the case where they represent only a subset of the tested items, has been considered critical to the design of horizontal equating studies, and guidelines have been established that continue to be used in the psychometric analyses within large-scale assessment programs. These guidelines include test content representation relative to the entire test form, the position of the linking items throughout the test, the number of linking items in relation to the total number of test items, and the statistical properties of the intended linking items, usually based on past performance. Although important, the dissection of these guidelines is beyond the scope of this research study; readers are referred to texts that discuss these
guidelines in more detail (Klein & Jarjoura, 1985; Wingersky, Cook, & Eignor, 1987).
Vertical linking can be accomplished through a variety of methods. One method is through the use of linking items, analogous to horizontal equating. When used as common items across adjacent levels, vertical linking item sets will mostly consist of items that students at both adjacent levels can respond to correctly. Linking item guidelines of horizontal equating, mentioned above,
are applicable in the vertical linking context so that a strong measurement link can be established
that will foster a reasonable scale of growth across all levels. The scaling test method of vertical
linking, however, relies on examinees responding to an on-level test as well as a test that consists
of items spanning all levels (the scaling test; Kolen & Brennan, 2004).
Linking Item Performance in Equating – Stability Check Procedures
When using linking items to determine a measurement link between test forms or between a test form and an item pool, the newly obtained item statistics are analyzed and compared against previously obtained statistics. Under the Rasch model, the IRT statistics can be compared through procedures such as the Robust Z analysis (Huynh, Gleaton, & Seaman, 1992), the perpendicular distances mentioned earlier, and the 0.3-logit difference procedure (Miller, Rotou, & Twing, 2004). All of these procedures (discussed below), referred to as item stability checks, aim at identifying the items that show a greater than expected difference between the old and new statistics, each with its own criterion of acceptable difference. In practice, the items identified at this stage are candidates for removal from the linking item set before the final measurement link is established and raw scores are scaled to scale scores. However, each procedure has guidelines governing how common items are removed from the linking item set.
Robust Z Statistic
The Robust Z statistic is determined through the following formula:
$$ z_i = \frac{(b_{i1} - b_{i2}) - M_d}{0.74 \times IQR}, $$

where $b_{i1}$ is one difficulty estimate value for a given linking item, $b_{i2}$ is the other estimated item difficulty for that linking item, $M_d$ is the median difference of all potential linking items, and $IQR$ is the interquartile range of the differences for all potential linking items. In contrast, traditional z statistics are computed as $z = (\text{score} - \text{mean})/\text{standard deviation}$, which can be affected by outliers. The Robust Z statistic was designed to be "robust" in its calculation against outliers. Also, evaluating the Robust Z statistic against a predetermined value alone does not provide the mechanism for removing "unstable" linking items. This procedure incorporates the ratio of standard deviations and the correlation of the two sets of item difficulty estimates to determine if linking items should be dropped. The full procedure is outlined in Appendix A.
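For illustration, a minimal sketch of this computation in Python; NumPy and all variable names are assumptions for this sketch, not part of the original study:

```python
import numpy as np

def robust_z(b1, b2):
    """Robust Z for each linking item, given the two sets of Rasch
    difficulty estimates (e.g., lower-level and upper-level calibrations)."""
    d = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    md = np.median(d)                       # median difference, M_d
    q75, q25 = np.percentile(d, [75, 25])
    return (d - md) / (0.74 * (q75 - q25))  # 0.74 * IQR approximates the SD

# Illustrative difficulties; flag |Robust Z| > 1.645 per Appendix A
b_lower = [-0.52, 0.10, 0.85, 1.20, -0.15, 0.40]
b_upper = [-0.90, -0.35, 0.50, 0.95, -0.60, 1.30]
z = robust_z(b_lower, b_upper)
print(np.where(np.abs(z) > 1.645)[0])       # indices of flagged items
```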
0.3-Logit Difference
This procedure identifies items that have an absolute difference between item difficulty estimates of 0.3 logit or greater. These items are candidates for removal from the linking item set, following standard guidelines for removing items.
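Because this check is a simple absolute-difference rule, a sketch is short; the variable names follow the Robust Z sketch above and are assumptions:

```python
import numpy as np

def flag_logit_difference(b1, b2, criterion=0.3):
    """Flag linking items whose Rasch difficulty estimates differ by
    0.3 logit or more between the two calibrations."""
    d = np.abs(np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float))
    return d >= criterion   # Boolean flag per linking item
```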
Perpendicular Distance
Based on the delta-plot method (Angoff, 1972; Dorans & Holland, 1993) of item
difficulty differences, the perpendicular distance procedure evaluates the standard deviation of
the perpendicular distance to the line of best fit. Although this method has been applied to
differences in proportion correct values (item p-values; Karkee & Choi, 2005), the research study
presented in this paper uses this method to evaluate differences in Rasch item difficulty values.
Also, the computation of the statistics for this procedure is slightly different from what was presented by Karkee and Choi, based on the application of this procedure to equating studies for large-scale assessment programs. As computed here, the perpendicular distance is:

$$ D_i = \frac{A \cdot I_{1i} - I_{2i} + B}{\sqrt{A^2 + 1}}, $$

where $I_{1i}$ and $I_{2i}$ signify the two item difficulty estimates for linking item $i$;

$$ A = \frac{(\sigma_2^2 - \sigma_1^2) + \sqrt{(\sigma_2^2 - \sigma_1^2)^2 + 4 r_{12}^2 \sigma_1^2 \sigma_2^2}}{2 r_{12} \sigma_1 \sigma_2}, $$

which includes the variances ($\sigma_1^2$ and $\sigma_2^2$), standard deviations ($\sigma_1$ and $\sigma_2$), correlation ($r_{12}$), and squared correlation ($r_{12}^2$) of the item difficulty sets; and $B = \mu_2 - A\mu_1$, which includes the means of the item difficulty sets ($\mu_1$ and $\mu_2$). For this research study, the perpendicular distance for each linking item is transformed into a z-value by

$$ z_i = \frac{D_i - \bar{D}}{\sigma_D}, $$

where $\bar{D}$ is the mean distance and $\sigma_D$ is the standard deviation of the distances. From this, any linking item with a z-value greater than 3.0 is removed. It should be pointed out, though, that linking items to be removed with this procedure are removed one at a time, leading to a recalculation of distance estimates for the remaining linking items after each removal.
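A sketch of the procedure as reconstructed above, including the one-at-a-time removal loop; the 3.0 criterion comes from the text, while the function names and the use of sample (n - 1) standard deviations are assumptions:

```python
import numpy as np

def perpendicular_z(i1, i2):
    """Standardized perpendicular distances from the (I1, I2) difficulty
    pairs to the principal-axis line of the delta-plot scatter."""
    i1, i2 = np.asarray(i1, dtype=float), np.asarray(i2, dtype=float)
    s1, s2 = i1.std(ddof=1), i2.std(ddof=1)
    r = np.corrcoef(i1, i2)[0, 1]
    # Slope A of the principal axis and intercept B = mu2 - A * mu1
    a = ((s2**2 - s1**2) + np.sqrt((s2**2 - s1**2)**2
                                   + 4 * r**2 * s1**2 * s2**2)) / (2 * r * s1 * s2)
    b = i2.mean() - a * i1.mean()
    d = (a * i1 - i2 + b) / np.sqrt(a**2 + 1)
    return (d - d.mean()) / d.std(ddof=1)

def screen_perpendicular(i1, i2, criterion=3.0):
    """Drop linking items one at a time, recomputing the distances after
    each removal, until no standardized distance exceeds the criterion."""
    keep = list(range(len(i1)))
    while len(keep) > 2:
        z = perpendicular_z([i1[k] for k in keep], [i2[k] for k in keep])
        worst = int(np.argmax(np.abs(z)))
        if abs(z[worst]) <= criterion:
            break
        keep.pop(worst)      # remove the worst item, then recompute
    return keep              # indices of retained linking items
```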
Linking Item Performance in Equating – Horizontal vs. Vertical Linking
For horizontal equating, there is often the expectation that the linking items will perform similarly to their most recent test administration, so that the item stability checks should not result in any linking items being dropped prior to the scaling of raw scores. However, the unpredictable nature of testing and student responses can result in items showing large differences in performance across test administrations. When this occurs, the stability checks will result in items being dropped from the linking set so that these items do not impact the measurement link that is being sought for the purposes of equating.
Vertical linking, however, presents a slight challenge to the idea of item stability used in
horizontal equating. In vertical linking, the expectation of common items is that items presented
to students of two different (mainly, adjacent) levels will appear easier at the higher level than at
the lower level. This can be summarized as items performing better at the higher level than at the
lower level. Item stability checks are appropriate in this situation, though, to monitor these
differences and investigate closely those differences that are greater than expected. Therefore, as
with horizontal equating, stability procedures can result in vertical linking items being removed
from the analysis prior to a vertical link being established.
Within vertical linking, though, it can be found that items used as vertical links can display
better performance at the lower level than at the higher level. In other words, the items are easier
for the lower level students than for the higher level students. There are several reasons this may
occur, but the relevance of this phenomenon to the current paper is the appropriate handling of
these instances in creating a vertical scale. From this, a dilemma is introduced in creating vertical scales, since the goal of these scales is to show a progression of achievement from one level (e.g., level 1) through another (e.g., level 6). Anomalies in item performance between two adjacent levels may limit the perception of the progression of achievement.
Purpose of Research
One aspect analyzed in particular is the case in which vertical linking items show better performance at a lower level than at a higher level, to the degree that linking items were removed from the linking set. The expectation is that items presented at a higher level will result in lower item difficulty estimates than when those items are administered at a lower level; those items should appear easier at the higher level than at the lower level. However, it can be found that items perform better at a lower level than at the higher level. This may affect the
construction of vertical scales since the goal of these scales is to show a progression of
achievement from one level (e.g., level 1) through another (e.g., level 6).
Anomalies in item performance between two levels may limit the apparent progression of achievement. This situation is discussed because typical common item screening methodologies may not lead practitioners to discard items that perform better at a lower level from the vertical linking set, thus leaving items in the linking set that do not fully comply with expectations. This research study investigated various common item screening
methodologies to determine linking item sets that will lead to the development of vertical scales.
The primary goal of this research study was to analyze the pattern of vertical scales among the
different common item screening methodologies, noting how anomalies in item performance
across adjacent levels affect the trajectories of “growth”.
Method
Student data from a large-scale assessment program were used in this study. These data, obtained through the common-item non-equivalent-groups design (Young, 2006; Kolen & Brennan, 2004), reflected student responses to on-level test forms that included off-level items ("vertical linking items") according to the design shown in Table 1. As shown in Table 1, the
off-level items for level 1 were from the level 2 test only while the level 6 test included off-level
items from only the level 5 test. However, for levels 2 through 5, each test included off-level
items from one level above and one level below. With this design, 36 items were classified as
vertical linking items among adjacent levels.
All items were calibrated to the Rasch measurement model through WINSTEPS software
(Linacre, 2007). The non-linking items were calibrated first, and then used as anchors for the
calibration of the vertical linking items. The Rasch item difficulty estimates of the common
items were analyzed across adjacent levels and differences between these estimates were
examined through multiple item screening procedures: Robust Z, perpendicular distance, and
0.3-logit difference. The goal of this screening investigation was to identify vertical linking items
that (1) show a substantial differential in examinee performance across adjacent levels, noting (2)
the occurrence of items performing better at the lower level than at the higher level and how
inclusion of these items affects the vertical scales.
This research study used two conditions of linking item removal: directional and non-
directional. The directional approach removed only those linking items that were, in fact, easier
at the lower level than at the higher level while the non-directional approach removed linking
items based on the results of the item stability procedures, regardless of whether or not the items
were easier at the lower level. Also, the maximum number of linking items that could be
removed within any research condition was set at 7, approximately 20% of the original linking
item set. This maximum percentage is widely used in practice.
From the item screening methodologies, vertical linking constants were computed as the
difference in mean Rasch difficulty estimates between two adjacent levels. These constants were
added cumulatively to on-level theta estimates (obtained through WINSTEPS) using level 3 as
the base scale. For example, for level 1, the vertical linking constant computed between level 1
and level 2 and the vertical linking constant between level 2 and level 3 were added to each theta
estimate in level 1, placing all level 1 theta estimates onto the scale of level 3. From the adjusted
theta estimates, vertically linked scale scores (reportable scores on the vertical scale) were
derived using a procedure outlined by Kolen and Brennan (2004) which uses a unique slope and
intercept for each level.
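A minimal sketch of this cumulative adjustment toward the level 3 base scale; the constants below are illustrative, and the sign convention for levels above the base is an assumption, since the paper works through the level 1 example only:

```python
import numpy as np

# Adjacent-level linking constants keyed (lower, upper); adding c[(k, k + 1)]
# is taken to place level-k thetas onto the level-(k + 1) scale.
c = {(1, 2): -0.84, (2, 3): -0.80, (3, 4): 0.06, (4, 5): 0.73, (5, 6): -0.37}
BASE = 3

def to_base_scale(theta, level, c=c, base=BASE):
    """Cumulatively apply adjacent-level constants to place on-level
    theta estimates onto the base-level scale."""
    shift = 0.0
    if level < base:                 # chain upward, e.g., 1 -> 2 -> 3
        for k in range(level, base):
            shift += c[(k, k + 1)]
    else:                            # chain downward, e.g., 5 -> 4 -> 3;
        for k in range(base, level): # the reversed sign is an assumption
            shift -= c[(k, k + 1)]
    return np.asarray(theta, dtype=float) + shift

# Level 1 example from the text: both constants are added to each theta
print(to_base_scale([0.20, 0.55], level=1))   # shift = -0.84 + -0.80
```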
Using this approach, slopes were determined by
$$ \text{slope} = \frac{sc(y_2) - sc(y_1)}{y_2 - y_1}, $$

where $sc(y_2)$ is the desired scale score mean for the highest level of the vertical scale (e.g., level 6), $sc(y_1)$ is the desired scale score mean for the lowest level of the vertical scale (e.g., level 1), $y_2$ is the vertically linked ability estimate for the highest level corresponding to a cumulative percent of 75% (of the population of ability estimates), and $y_1$ is the vertically linked ability estimate corresponding to a cumulative percent of 75% for the lowest level. For this research
study, 250 was used as the desired mean scale score for level 6 while 200 was used for level 1, as
proposed by the authors of this approach.
Intercepts under this approach were determined by
$$ \text{intercept} = sc(y_1) - \frac{sc(y_2) - sc(y_1)}{y_2 - y_1} \, y_1, $$
where the terms are defined as they were for computing the slopes.
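A worked sketch of the transformation, using the study's target means of 200 (level 1) and 250 (level 6); the 75th-percentile theta values are illustrative:

```python
import numpy as np

def scale_transformation(y1, y2, sc_y1=200.0, sc_y2=250.0):
    """Slope and intercept per the Kolen and Brennan (2004) procedure,
    given the 75th-percentile vertically linked thetas y1 and y2."""
    slope = (sc_y2 - sc_y1) / (y2 - y1)
    return slope, sc_y1 - slope * y1

slope, intercept = scale_transformation(y1=-0.65, y2=0.88)  # illustrative
theta = np.array([-1.10, -0.16, 0.85])
print((slope * theta + intercept).round())  # vertically linked scale scores
```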
Evaluation
For each item screening methodology, descriptive statistics of the vertical scale scores
were plotted across levels to display average performance across levels and effect sizes were
computed and plotted across levels to provide an index of the separation of scale score
distributions among adjacent levels. From Kolen and Brennan (2004), the effect size index was
computed as follows:
$$ es = \frac{\mu(Y)_{upper} - \mu(Y)_{lower}}{\sqrt{\left[\sigma^2(Y)_{upper} + \sigma^2(Y)_{lower}\right] / 2}}, $$

where $\mu(Y)_{upper}$ is the mean scale score for the upper level, $\mu(Y)_{lower}$ is the mean scale score of the lower level, $\sigma^2(Y)_{upper}$ is the variance for the upper level, and $\sigma^2(Y)_{lower}$ is the variance of the lower level. Also, the vertical linking item sets were compared across the item screening methodologies to analyze the number of items used to create the vertical link that, in fact, performed better at the lower level than at the higher level.
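A minimal sketch of the effect size computation; whether sample or population variances were used is not stated, so ddof=1 here is an assumption:

```python
import numpy as np

def effect_size(scores_upper, scores_lower):
    """Kolen and Brennan (2004) effect size: mean difference divided by
    the square root of the average of the two variances."""
    u = np.asarray(scores_upper, dtype=float)
    l = np.asarray(scores_lower, dtype=float)
    return (u.mean() - l.mean()) / np.sqrt((u.var(ddof=1) + l.var(ddof=1)) / 2)
```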
Results
Table 2 shows the number of items removed through each item stability screening
procedure within each research condition, the number of items retained that were easier at the
lower level, and the vertical linking constants computed with the final linking item sets. The
directional and non-directional approached resulted in vertical linking items that were easier at
the lower level, but were kept for computation of the vertical linking constants. In the cases
where the number of linking items removed was less than 7 – the maximum allowed – the
remaining linking items were not flagged as “problematic” within the item stability check
procedures. However, in the cases where seven linking items were removed from the linking set,
some of the remaining linking items were flagged as "problematic" but could not be removed without violating the maximum allowed for removal.
It should be pointed out that the perpendicular distance procedure for both research
conditions (directional and non-directional) resulted in the same vertical level linking constants.
In both of these cases, no linking items were flagged for removal at any level. This result will manifest itself throughout the rest of the results of this study and will be discussed again in the discussion section of this paper.
Table 3 shows the average of the theta estimates after the vertical linking constants have
been applied as previously outlined. From these results, a few things are worth pointing out that
will manifest themselves throughout the rest of the results. First, the mean theta estimates
increase from level 1 to level 6 for all conditions/procedures except for the non-directional
approach with the 0.3-logit difference procedure. In this condition, the average theta estimate for
level 3 (the base level) is slightly lower than that of level 2 and the average theta estimate for
level 5 is lower than that of level 4.
Second, and with the exception of the perpendicular distance procedure, the mean theta
estimates for each item stability check procedure under the non-directional condition comprise a
smaller range and are smaller in magnitude than those under the directional condition. Third, the
mean theta estimates for the 0.3-logit difference procedure under the non-directional condition
resulted in the smallest (in magnitude) mean theta estimates with a level 6 mean theta of 0.5328
which – while being the highest value of all levels within this procedure – is the smallest mean
theta estimate for level 6 across all research conditions.
Table 4 shows the slopes and intercepts derived for each study condition. The slope and
intercept values for the 0.3-logit difference procedure under the non-directional approach of
removing linking items were much higher than the other values. This was due to the fact that the
vertical linking theta estimates that corresponded to a cumulative percent of 75% for level 1
(used in the calculation, but not presented in this paper) were much higher for this condition than
the others – a product of the vertical level linking constants from Table 2.
Table 5 shows the mean scale scores after theta transformation using the derived slopes
and intercepts. Figure 1 shows the mean scale scores across levels for each condition/procedure.
It should be noted that some minimum scale scores were negative; in particular, the Robust Z and 0.3-logit difference procedures under the non-directional condition had negative minimum scale scores at every level. Consistent with the mean theta
estimates, the mean scale scores for the non-directional condition are lower than those of the
directional condition, with the lowest mean scale scores from the 0.3-logit difference procedure.
Table 6 and Figure 2 show the standard deviations of scale scores across levels for each
condition/procedure. Here, the standard deviations from the non-directional condition were much
higher than those of the directional condition. In addition, the 0.3-logit difference procedure under the
non-directional approach resulted in standard deviations that were approximately five times
greater than those from the directional approach, indicating greater variability among the scale
scores.
As mentioned earlier, effect sizes provide an index of the separation of scale score
distributions among adjacent levels. Table 7 and Figure 3 depict the effect sizes computed for this research study. As the figure shows, the pattern of effect sizes is relatively consistent across conditions/procedures, but with large "jumps" from the effect size between level 4 and level 5 to that between level 5 and level 6. The majority of the effect sizes can be considered "small",
equal to or greater than 0.2 but less than 0.5 (Cohen, 1988). However, the 0.3-logit procedure
under the non-directional condition resulted in some “negligible” effect sizes, less than 0.2,
essentially indicating no separation of scale score distributions between adjacent levels.
Discussion
As can be seen throughout the results of this research study, the strategy for removing linking items during stability checks (directional vs. non-directional), as well as the item stability screening procedure itself, affects the items removed and, subsequently, the linking constants obtained for developing a vertical scale. More importantly, the results from Table 2 indicate that each vertical scale created within this research study was built with several vertical linking items that were easier at the lower level than at the upper level. The data used for this study
provided an invaluable opportunity to investigate how this item performance phenomenon
manifests itself in vertical scales.
In looking at Figure 1, the mean scale scores increase from level 1 to level 6 for all
research conditions except for the 0.3-logit difference procedure under the non-directional
approach for removing linking items. The vertical trend for this research condition shows “no
growth" from level 1 to level 2 and an unexpected "decrease" in growth from level 4 to level 5, followed by a large growth from level 5 to level 6. The variance of scale scores for this
stability screening procedure is also much higher than the other research conditions. It can be
inferred from the results of this study that the 0.3-logit difference procedure – at least under the
non-directional approach – was affected by the presence of linking items that were easier at the
lower level than at the upper level, given that some “unstable” items had been removed that were
easier at the upper level. The "negligible" effect sizes from the 0.3-logit difference procedure under the non-directional approach provide further evidence of the pattern of "growth" shown in Figure 1. This inferred effect of the linking item set for this research condition might cause practitioners some concern about adopting this option to create a vertical scale.
Limitations
Although the results of this study provide promising applications to future vertical scale
development, there are limitations worth mentioning. First, these data were not compared against
linking item sets in which items common to adjacent levels are easier at the upper level and more
difficult at the lower level – a general expectation of common-item performance across adjacent
levels. This comparison would shed more insight into whether this phenomenon is a major
concern to consider when developing a vertical scale through assessment items. Another
limitation of this research study is the criteria for identifying and removing “unstable” linking
items. As was previously mentioned, the criteria used for this study reflected criteria used for linking studies performed for operational large-scale assessment programs, which are sometimes different from the originally published criteria. A third limitation of the study is that only one data set was included in the research, which may limit the generalizability of the results.
References
Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the annual meeting of the American Psychological Association, Honolulu.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
Huynh, H., Gleaton, J., & Seaman, S. P. (1992). Technical documentation for the South Carolina high school exit examination of reading and mathematics: Paper No. 2 (2nd ed.). Columbia, SC: University of South Carolina, College of Education.
Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21, 187-206.
Karkee, T., & Choi, S. (2005, April). Impact of eliminating anchor items flagged from statistical criteria on test score classifications in common item equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.
Linacre, J. M. (2007). A user's guide to WINSTEPS MINISTEP Rasch-model computer programs. Chicago, IL.
Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits screening criterion in common item equating. Journal of Applied Measurement, 5(2), 172-177.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.
Wingersky, M. S., Cook, L. L., & Eignor, D. R. (1987). Specifying the characteristics of linking items used for item response theory item calibration (ETS Research Report No. 87-24). Princeton, NJ: Educational Testing Service.
Young, M. J. (2006). Vertical scaling. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Table 1. Common Item Design for Developing a Vertical Scale

Test Form   Item Blocks Administered
Level 3     Level 3 on-level; Level 4 off-level
Level 4     Level 3 off-level; Level 4 on-level; Level 5 off-level
Level 5     Level 4 off-level; Level 5 on-level; Level 6 off-level
Level 6     Level 5 off-level; Level 6 on-level; Level 7 off-level
Level 7     Level 6 off-level; Level 7 on-level; Level 8 off-level
Level 8     Level 7 off-level; Level 8 on-level

Note: the six test levels shown here correspond to the six levels (1 through 6) referenced in the text.
Table 2. Removed/Retained Linking Items and Vertical Linking Constants

Condition        Procedure               Levels  Items Removed  Items Kept*  Vertical Level Linking Constant
Directional      Robust Z                1-2     1              5            -0.8403
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     1              26           -0.3651
                 Perpendicular Distance  1-2     0              6            -0.7953
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     0              27           -0.3976
                 0.3-Logit Difference    1-2     2              4            -0.8776
                                         2-3     5              5            -1.0231
                                         3-4     7              11            0.3176
                                         4-5     1              4             0.7700
                                         5-6     7              20           -0.2176
Non-Directional  Robust Z                1-2     4              5            -0.6669
                                         2-3     5              10           -0.4665
                                         3-4     1              18           -0.0090
                                         4-5     2              5             0.6323
                                         5-6     4              26           -0.4638
                 Perpendicular Distance  1-2     0              6            -0.7953
                                         2-3     0              10           -0.8005
                                         3-4     0              18            0.0612
                                         4-5     0              5             0.7290
                                         5-6     0              27           -0.3976
                 0.3-Logit Difference    1-2     7              6            -0.4772
                                         2-3     7              10           -0.3760
                                         3-4     7              16           -0.1311
                                         4-5     7              5             0.4556
                                         5-6     7              20           -0.2176

*Items kept that performed better at the lower level.
Table 3. Mean Vertical Linking Theta Estimates

Condition        Procedure               Level 1   Level 2   Level 3   Level 4   Level 5   Level 6
Directional      Robust Z                -1.1362   -0.5780   -0.1589   0.3100    0.5304    0.8510
                 Perpendicular Distance  -1.0912   -0.5780   -0.1589   0.3100    0.5304    0.8185
                 0.3-Logit Difference    -1.3961   -0.8006   -0.1589   0.5664    0.8278    1.2959
Non-Directional  Robust Z                -0.6288   -0.2440   -0.1589   0.2398    0.3635    0.5854
                 Perpendicular Distance  -1.0912   -0.5780   -0.1589   0.3100    0.5304    0.8185
                 0.3-Logit Difference    -0.3186   -0.1535   -0.1589   0.1177    0.0647    0.5328
Table 4. Slope and Intercept Values for Scale Transformation

Condition        Procedure               Slope      Intercept
Directional      Robust Z                32.6158    200.3098
                 Perpendicular Distance  34.3525    198.7805
                 0.3-Logit Difference    22.3434    206.0193
Non-Directional  Robust Z                65.7895    167.2434
                 Perpendicular Distance  34.3525    198.7805
                 0.3-Logit Difference    125.8812   98.2754
Table 5. Mean Vertical Linking Scale Scores

Condition        Procedure               Level 1  Level 2  Level 3  Level 4  Level 5  Level 6
Directional      Robust Z                178      190      209      217      225      231
                 Perpendicular Distance  177      188      208      217      224      230
                 0.3-Logit Difference    185      194      212      223      229      237
Non-Directional  Robust Z                156      168      184      197      206      212
                 Perpendicular Distance  177      188      208      217      224      230
                 0.3-Logit Difference    112      112      131      140      134      177
Table 6. Standard Deviation of Vertical Linking Scale Scores

Condition        Procedure               Level 1  Level 2  Level 3  Level 4  Level 5  Level 6
Directional      Robust Z                34       31       31       30       33       27
                 Perpendicular Distance  36       33       33       31       34       28
                 0.3-Logit Difference    23       22       21       20       22       18
Non-Directional  Robust Z                69       63       62       60       66       54
                 Perpendicular Distance  36       33       33       31       34       28
                 0.3-Logit Difference    132      121      119      114      126      103
Table 7. Vertical Scale Effect Sizes

Condition        Procedure               Level 1/Level 2  Level 2/Level 3  Level 3/Level 4  Level 4/Level 5  Level 5/Level 6
Directional      Robust Z                0.36             0.61             0.29             0.23             0.21
                 Perpendicular Distance  0.31             0.61             0.29             0.23             0.18
                 0.3-Logit Difference    0.39             0.85             0.55             0.28             0.37
Non-Directional  Robust Z                0.19             0.26             0.21             0.13             0.11
                 Perpendicular Distance  0.31             0.61             0.29             0.23             0.18
                 0.3-Logit Difference    0.00             0.16             0.08             -0.05            0.37
[Figure 1. Mean Vertical Linking Scale Scores. Line plot of mean scale score (y-axis, 0-250) by level (x-axis, 1-6), with one series per condition/procedure: Directional and Non-Directional versions of Robust Z, Perpendicular Distance, and 0.3-logit Difference.]
[Figure 2. Standard Deviation of Vertical Linking Scale Scores. Line plot of scale score standard deviation (y-axis, 0-140) by level (x-axis, 1-6), with the same six condition/procedure series as Figure 1.]
[Figure 3. Vertical Scale Effect Sizes. Line plot of effect size (y-axis, -0.10 to 0.90) for adjacent-level pairs L1/L2 through L5/L6 (x-axis), with the same six condition/procedure series as Figure 1.]
Appendix A
Robust Z Stability Check Guidelines
1. Calculate the mean and standard deviation for both sets of item difficulties for all linking items.
2. Calculate the ratio of standard deviations.
3. Calculate the correlation between the sets of item difficulties.
4. Calculate the Robust Z statistic for each linking item and flag all linking items with an absolute value of the Robust Z greater than 1.645.
5. The ratio of the standard deviations (from step 2) must be between 0.9 and 1.1.
6. The correlation (from step 3) must be at least 0.95.
7. If the ratio of standard deviations or the correlation is outside of the prescribed bounds, then remove the item whose absolute Robust Z value is the largest and is greater than 1.645.
8. Recompute the ratio of standard deviations and the correlation with the remaining linking items.
9. Continue dropping items in a stepwise fashion until the ratio of standard deviations and the correlation are within the prescribed bounds, there are no items left with a Robust Z greater than 1.645, or 20% of the linking set has been dropped. Note that the Robust Z values are not recalculated each time; only the ratio of standard deviations and the correlation are.
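A sketch of the full stepwise procedure in Python; the robust_z helper mirrors the formula given in the body of the paper, the thresholds (1.645, 0.9-1.1, 0.95, 20%) come from the steps above, and everything else is an illustrative assumption:

```python
import numpy as np

def robust_z(b1, b2):
    d = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    q75, q25 = np.percentile(d, [75, 25])
    return (d - np.median(d)) / (0.74 * (q75 - q25))

def robust_z_screening(b1, b2, z_crit=1.645, sd_bounds=(0.9, 1.1),
                       r_min=0.95, max_drop_frac=0.20):
    """Stepwise Robust Z screening per Appendix A: Robust Z values are
    computed once; only the SD ratio and correlation are recomputed."""
    b1, b2 = np.asarray(b1, dtype=float), np.asarray(b2, dtype=float)
    z = np.abs(robust_z(b1, b2))       # steps 1 and 4: computed once
    keep = list(range(len(b1)))
    max_drops = int(max_drop_frac * len(b1))
    for _ in range(max_drops):         # step 9: cap at 20% dropped
        k1, k2 = b1[keep], b2[keep]
        sd_ratio = k1.std(ddof=1) / k2.std(ddof=1)   # step 2
        r = np.corrcoef(k1, k2)[0, 1]                # step 3
        in_bounds = sd_bounds[0] <= sd_ratio <= sd_bounds[1] and r >= r_min
        flagged = [i for i in keep if z[i] > z_crit]
        if in_bounds or not flagged:   # steps 5-7: stop when within bounds
            break                      # or nothing is left to drop
        keep.remove(max(flagged, key=lambda i: z[i]))  # step 7: worst item
    return keep                        # step 8 repeats via the loop
```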