A COMPARISON OF THREE TYPES OF ITEM ANALYSIS IN TEST DEVELOPMENT USING CLASSICAL
AND LATENT TRAIT METHODS
By
IRIS G. BENSON
A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1977
ACKNOWLEDGMENTS
I am deeply indebted to two special people who have greatly influenced my graduate education, Dr. William Ware, chairman of my guidance committee, and Dr. Linda Crocker, unofficial cochairman of my committee. Their continued encouragement and support have resulted in my reaching this point in my graduate studies. I shall always be extremely grateful to Dr. Ware and Dr. Crocker, for whatever skills I have developed as a researcher and as a teacher are in large part due to their advice and guidance. To them I owe the high value I place on objective, quantitative research methods. Further, I would like to acknowledge the tremendous amount of time they spent in molding the final copy of this manuscript.
I would also like to express my appreciation to the members of my committee, Dean John Newell and Dr. William Powell, for their suggestions and editorial comments on this dissertation. Special thanks are extended to Dr. Wilson Guertin for his assistance with portions of the study, and as an unofficial member of my committee.
I would like to thank Dr. Jeaninne Webb, Director of the Office of
Instructional Resources, and Mr. Robert Feinberg and Ms. Arlene Barry
of the Testing Division, for providing the data used in this study.
Finally, I would like to express my sincere appreciation to my
friends and family who stood by me during very trying times in my
graduate education.
TABLE OF CONTENTS

                                                                      PAGE

ACKNOWLEDGMENTS .... ii
LIST OF TABLES .... v
LIST OF FIGURES .... vii
ABSTRACT .... viii

CHAPTER

I.   INTRODUCTION .... 1
         The Problem .... 7
         Purpose of the Study .... 8
         Significance of the Study .... 10
         Organization of the Study .... 11

II.  REVIEW OF THE LITERATURE .... 13
         Item Analysis Procedures for the Classical Model .... 13
         Research Related to Classical Item Analysis in Test Development .... 18
         Simplified Methods of Obtaining Item Discrimination .... 21
         Item Analysis Procedures for the Factor Analytic Model .... 22
         Research Related to Factor Analysis in Test Development .... 24
         Comparison of Factor Analysis to Classical Item Analysis .... 25
         Item Analysis Procedures for the Latent Trait Model .... 27
         Research Related to Latent Trait Models in Test Development .... 32
         Comparison of the Rasch Model to Factor Analysis .... 34
         Summary .... 36

III. METHOD .... 39
         The Sample .... 39
         The Instrument .... 40
         The Procedure .... 42
         Design .... 42
         Item Selection .... 44
         Double Cross-Validation .... 46
         Statistical Analyses .... 48
         Summary .... 51

IV.  RESULTS .... 53
         Item Selection .... 54
         Double Cross-Validation .... 69
         Comparison of the 15 Item Tests on Precision .... 69
         Comparison of the 30 Item Tests on Precision .... 75
         Comparison of the 30 Item Tests on Efficiency .... 81
         Summary .... 82

V.   DISCUSSION AND CONCLUSIONS .... 87
         The Precision of the Tests Produced by the Three Methods of Item Analysis .... 87
         Internal Consistency .... 88
         Standard Error of Measurement .... 89
         Types of Items Retained .... 90
         Conclusions .... 91
         The Efficiency of the Tests Produced by the Three Methods of Item Analysis .... 93
         Conclusions .... 95
         Implications for Future Research .... 95

VI.  SUMMARY .... 99

REFERENCES .... 105

APPENDIX A: Mathematical Derivation of the Rasch Model .... 112

APPENDIX B: Relative Efficiency Values Used in Figure 2 for the Comparisons Among Item Analytic Methods .... 119

BIOGRAPHICAL SKETCH .... 120
LIST OF TABLES
TABLE PAGE
1 DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST
OF THE FLORIDA TWELFTH GRADE TEST 1975
ADMINISTRATION 41
2 SYSTEMATIC SAMPLING DESIGN OF THE STUDY, N = 5,235 ... 43
3 DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY 47
4 DEMOGRAPHIC BREAKDOWN BY ETHNIC ORIGIN AND
SEX FOR TOTAL SAMPLE 55
5 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED
ON CLASSICAL ITEM ANALYSIS FOR EACH SAMPLE SIZE .... 56
6 ITEM LOADINGS ON THE FIRST UNROTATED FACTOR FOR
THE 50 TEST ITEMS BASED ON FACTOR ANALYSIS FOR
EACH SAMPLE SIZE 58
7 SUMMARY STATISTICS ON THE 50 TEST ITEMS BASED
ON THE RASCH MODEL FOR EACH SAMPLE SIZE 61
8 DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES
BASED ON THE RASCH MODEL ACCORDING TO SAMPLE SIZE ... 65
9 THE 15 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC
PROCEDURE ACCORDING TO SAMPLE SIZE 66
10 THE 30 BEST ITEMS SELECTED UNDER EACH ITEM ANALYTIC
PROCEDURE ACCORDING TO SAMPLE SIZE 67
11 DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED OF
THE 15 BEST ITEMS SELECTED BY EACH ITEM
ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE 70
12 CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL
CONSISTENCY ESTIMATES BASED ON THE 15 ITEM
TESTS ACCORDING TO SAMPLE SIZE 72
13 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DIFFICULTY BY PROCEDURE AND SAMPLE SIZE 74
14 15 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM
DISCRIMINATIONS BY PROCEDURE AND SAMPLE SIZE 75
15 POST HOC COMPARISONS OF THE DIFFERENCES BETWEEN
THE MEAN ITEM DISCRIMINATIONS FOR THE 15 ITEM TESTS . . 76
LIST OF TABLES - Continued

TABLE PAGE

16 DESCRIPTIVE STATISTICS FOR THE TEST COMPOSED OF THE 30 BEST ITEMS
   SELECTED BY EACH ITEM ANALYTIC PROCEDURE ACCORDING TO SAMPLE SIZE ... 77

17 CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES
   BASED ON THE 30 ITEM TESTS ACCORDING TO SAMPLE SIZE ... 78

18 30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DIFFICULTY BY
   PROCEDURE AND SAMPLE SIZE ... 80

19 30 ITEM TESTS: DESCRIPTIVE STATISTICS FOR ITEM DISCRIMINATIONS BY
   PROCEDURE AND SAMPLE SIZE ... 80
LIST OF FIGURES
FIGURE PAGE
1 HYPOTHETICAL ITEM CHARACTERISTIC CURVES FOR THE FOUR LATENT
  TRAIT MODELS ... 29

2 RELATIVE EFFICIENCY COMPARISONS FOR THE THREE 30 ITEM TESTS,
  N = 995 ... 83
Abstract of Dissertation Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A COMPARISON OF THREE TYPES OF ITEM ANALYSIS IN TEST DEVELOPMENT USING CLASSICAL
AND LATENT TRAIT METHODS
By
Iris G. Benson
December 1977
Chairman: William B. Ware
Major Department: Foundations of Education
Test reliability and validity are determined by the quality of the
items in the tests. Through the application of item analysis procedures,
test constructors are able to obtain quantitative, objective information
useful in developing and judging the quality of a test and its items.
Classical test theory forms the basis for one method of test
development. An integral part of the development of tests based on the
classical model is selection of a final set of items from an item pool
based on classical item analysis or factor analysis. Classical item
analysis requires identification of single items which provide maximum
discrimination between individuals on the latent trait being measured.
The biserial correlation between item score and total score is commonly
used as an index of item discrimination.
An alternative method of test development, but based on the
classical model, is factor analysis. Factor analysis is a more complex
test development procedure than classical item analysis. It is a
statistical technique that takes into account the item correlation
with all other individual items in the test simultaneously. Thus,
classical item analysis can be viewed as a unidimensional basis for
item analysis, less sophisticated than the multidimensional procedure
of factor analysis.
Recently, the field of latent trait theory has provided a new approach
to test construction. Several latent trait models have been developed;
however, this study was concerned only with the one-parameter logistic
Rasch model. The Rasch model was chosen because it is the most parsi-
monious of the latent trait models and has recently been used in the
development and equating of tests.
A review of the literature revealed numerous studies conducted in
each of the three areas of item analysis, but no comparative studies
were reported among all three item analytic techniques. Therefore,
the present study was designed to compare the methods of classical
item analysis, factor analysis, and the Rasch model in terms of test
precision and relative efficiency.
An empirical study was designed to compare the effects of the
three methods of item analysis on test development across different
sample sizes of 250, 500, and 995 subjects. Item response data were
obtained from a sample of 5,235 high school seniors on a 50 item cogni-
tive test of verbal aptitude. The subjects were divided into nine
independent samples, one for each item analytic technique and sample
size. The study was conducted in three phases: item selection,
computation of item and test statistics for selected items on double
cross-validation samples, and statistical analyses of item characteris-
tics. For each item analytic procedure two tests were developed:
a 15 item test and a 30 item test. Four dependent variables were
obtained for each test to assess precision: internal consistency estimates, standard error of measurement, item difficulties, and item discriminations. In addition, the relative efficiencies of the 30 item tests developed by each item analytic technique were compared for the
sample of 995 subjects.
The results of the analysis revealed that there were no differences
between the tests developed by the three methods of item analysis,
in terms of the precision of measurement. In terms of efficiency,
substantive differences between the tests produced by the three item
analytic methods were observed. Specifically, the tests based on class-
ical test theory were more effective for measuring very low and very
high ability students. The Rasch developed test was more efficient for
assessing average and high ability students.
CHAPTER I
INTRODUCTION
The systematic approach to test development was initiated by Binet
and Simon in 1916. Since that time psychometricians have been concerned with the extent to which accurate measurement of a person's "ability"
is possible. Most measurement experts agree that upon repeated testing
an individual's observed score will vary even though his true ability
remains constant. This variability is the essence of classical test theory.
Classical test theory is based upon the assumption that a person's
observed score (X) is made up of a true score (T) and error score (E)
denoted:
X = T + E. (1)
Limited by few assumptions, this theory has wide applications. The few
assumptions pertain to the error score (Magnusson, 1966, p. 64):
1. The mean of an examinee's error scores on an infinite number of parallel tests is zero.
2. The correlation between examinee's error scores on parallel
tests is zero.
3. The correlation between examinees' error scores and true
scores is zero.
Relying upon these assumptions, psychometricians have used the observed score (X) to represent the best estimate of a person's true score (T).
The accuracy of the observed score (X) in representing an examinee's
true score (T) is described by the reliability coefficient. One definition
of reliability is given by the coefficient of precision. This coefficient
is the correlation between truly parallel tests, assuming the examinee's
true score does not change between two measurements. Lord and Novick
(1968) have defined truly parallel tests to be those for which, "the expected values [true scores] of parallel measurements are equal; and the observed score variances of parallel measurements are equal (p. 48)."
The reliability coefficient for the population is defined as (Lord and Novick, 1968, p. 134):

r_XX = σ_T²/σ_X² = 1 − σ_E²/σ_X² ,   (2)

where σ_T² is the true score variance, σ_X² is the observed score variance, and σ_E² is the error score variance. When this expression is used to
represent the coefficient of precision, it can be interpreted as the
extent to which unreliability is due solely to inadequacies of the test
form and testing procedure rather than due to changes in examinees over
time.
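As a concrete illustration of equation 2, the short sketch below computes the reliability coefficient from variance components; the variance values are invented for illustration and are not data from this study.

```python
# Illustration of equation 2 with hypothetical variance components.
# Under the classical model X = T + E, sigma_T^2 + sigma_E^2 = sigma_X^2.
sigma_T2 = 80.0                    # true score variance (assumed value)
sigma_E2 = 20.0                    # error score variance (assumed value)
sigma_X2 = sigma_T2 + sigma_E2     # observed score variance

r_xx = sigma_T2 / sigma_X2         # equivalently 1 - sigma_E2 / sigma_X2
print(r_xx)                        # 0.8
```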
The coefficient of precision is a theoretical value because the components σ_T² and σ_E² cannot be observed. The coefficient of precision
is usually estimated by internal consistency methods. Internal con-
sistency is a measure of the relationship between random parallel tests.
Random parallel tests are composed of items drawn from the same population
of items (Magnusson, 1966, p. 102-105). Scores on these tests may
differ somewhat from true scores in means, standard deviations, and
correlations because of random errors in the sampling of items. However,
random parallel tests are more often encountered in practice than are
truly parallel tests. Cronbach's coefficient alpha (1951) is the
internal consistency coefficient commonly used to represent the average
correlation among all possible tests created by dividing the domain
into random halves. Thus, the internal consistency coefficient indicates the extent to which all the items are measuring the same ability or trait.
Psychological traits are often described as latent because they cannot
be directly observed. Therefore, psychological tests are developed in
an attempt to measure these latent traits.
Classical test theory forms the basis for one method of test
development. An integral part of the development of tests based on the
classical model is the utilization of classical item analysis or factor
analysis. Classical item analysis is a procedure to obtain a description
of the statistical characteristics of each item in the test. This
approach requires identification of single items which provide maximum
discrimination between individuals on the latent trait being measured.
Theoretically, selecting items which have high correlations with total
test score will result in a discriminating test which is homogeneous with respect to the latent trait. Therefore, classical item analysis
is an aid to developing internally consistent tests.
An alternative method of test development, but based on the classi-
cal model, is factor analysis. Factor analysis is a more complex test
development procedure than classical item analysis. It is a statistical
technique that takes into account the item correlation with all other
individual items in the test simultaneously. Groups of similar items
tend to cluster together and comprise the latent traits (factors) under-
lying the test. Under the classical model then, classical item analysis
can be viewed as a unidimensional basis for item analysis, less
sophisticated than the multidimensional procedure of factor analysis.
The purpose of factor analysis is to represent a variable in terms
of one or several underlying factors (Harman, 1967). Depending upon the
objective of the analysis, two general approaches are used in
factor analysis: (a) common factor analysis, and (b) principal components analysis. A common factor solution would be warranted if the
researcher were interested in determining the number of common and
unique factors underlying a given test. A principal component solution
would be warranted if it were of interest to extract the maximum
amount of variance from a given test.
Regardless of the approach used, factor analysis is an item analytic technique in which all test items are considered simultaneously to pro-
duce a matrix of item correlations with factors. It is these correlations
or item loadings that indicate the strength of the factor and also the
number of factors underlying the test. However, factor analysis shares
the weakness of classical item analysis, that of being sample dependent.
Critics of classical test theory contend that a major weakness of
tests developed from this model is that the item statistics vary when
the examinee group changes; item statistics may also vary if a different
set of items from the same domain is used with the same examinee group
(Hambleton and Cook, 1977; Wright, 1968). Thus, the selection of a final
set of test items will be sample dependent.
Until recently, classical item analysis and factor analysis were the
only techniques described in measurement texts for use in item analysis
and test development (Baker, 1977). However, with the publication of
Lord and Novick's Statistical Theories of Mental Test Scores (1968) and
the availability of computer programs, considerable attention is being
directed now toward the field of latent trait theory as a new area in
test development. Latent trait theory dates back to Lazarsfeld (1950)
who introduced the concept; however, Frederic Lord is generally given
credit as the father of latent trait theory (Hambleton, Swaminathan,
Cook, Eignor, and Gifford, 1977). Proponents of this approach claim
that the advantages of latent trait theory over classical test theory
are twofold: (a) theoretically it provides item parameters which are
invariant across examinee samples which will differ with respect to
the latent trait, and (b) it provides item characteristic curves that
give insight into how specific items discriminate between students of
varying abilities. These properties of latent trait theory will be
presented in more detail in Chapter II.
Four latent trait models have been developed for use with
dichotomously scored data: the normal ogive, and the one-, two-, and
three-parameter logistic model (Hambleton and Cook, 1977; Lord and Novick,
1968). This study is concerned with the one-parameter logistic Rasch
model because it is the simplest of the four models.
Tests developed using the Rasch model are intended to provide
objective measurement of the examinee's true ability on the latent
trait in question, as well as providing for invariant item parameters
(Rasch, 1966; Wright, 1968). That is, any subset of items from a
population of items that have been calibrated by the Rasch model should
accurately measure the examinee's true ability regardless of whether the
items are very easy or very difficult; also, the item parameters should
remain constant over different examinees. In measurements obtained from
classical test theory this objective feature is rarely attained. The
item parameters associated with classical test theory are group and
item specific. That is, the item parameters are determined by the
ability of the people taking the test and the subset of items chosen.
Wright (1968) has stated, "The growth of science depends on the development of objective methods for transforming an observation into measurement (p. 86)." Latent trait theory is an attempt to develop mental
measurement into a technique similar to measurement in the physical
sciences.
Latent trait theory is based on strong assumptions that are re-
strictive and hence limit its application (Hambleton and Cook, 1977). The
assumptions required for the Rasch model are the following (Rasch, 1966):
1. The test is unidimensional, e.g., there is only one factor
or trait underlying test performance.
2. The item responses of each examinee are locally independent,
e.g., success or failure on one item does not hinder other item responses.
3. The item discriminations are equal, e.g., all items load
equally on the factor underlying the test.
Lord and Novick (1968) noted that the assumptions of unidimensionality
and local independence are synonymous. To say that only one underlying
ability is being tested means the items are statistically independent
for persons at the same ability level. The third assumption relates to
item characteristic curves. The item characteristic curve is a mathematical function that relates the probability of success on an item to the ability measured by the test. Curves vary in slope and intercept to reflect how items vary in discrimination and difficulty. The one-parameter logistic Rasch model (the one parameter is item difficulty) assumes all item discriminations are equal. Thus all item characteristic curves should be similar with respect to their slopes.
The Problem
Several studies have been conducted to verify the invariant properties of tests constructed using the Rasch model (Tinsley and Dawis, 1975; Whitely and Dawis, 1974; Wright, 1968). If we assume that tests
developed using latent trait theory possess the quality of invariant
item statistics, why then liasn't latent trait theory been more visible
in the psychometric community? There appear to be three main reasons
for this slow acceptance. First, the Rasch procedure is based on a
mathematical model involving restrictive assumptions, e.g., the uni-
dimensionality of the items, the local independence of the items, and
equal item discriminations. A further restriction of the Rasch model
is the assumption of minimal guessing. However, several researchers
have demonstrated the robustness of the model with regard to departures
from the basic assumptions (Anderson, Kearney and Everett, 1968; Dinero
and Haertel, 1976; Rentz, 1976). Second, latent trait theory has not been
used in practical testing situations because until recently there was a
lack of available computer programs to handle the complex mathematical
calculations. Hambleton et al. (1977) described four computer programs
now available to the consumer. Third, measurement experts who are
knowledgeable about latent trait models have been skeptical as to the
real gains that may be available through this line of research. Are
tests developed using latent trait models superior to tests developed
using classical item analysis or factor analysis?
The purpose of this study was to compare the precision and efficiency
of cognitive tests constructed by the three methods (classical item
analysis, factor analysis and the Rasch model) from a common item and
examinee population. Precision, as measured by internal consistency,
is an overall estimate of a test's homogeneity, but provides no infor-
mation on how the test as a whole discriminates for the various ability groups taking the test. For that reason, measures of test efficiency
(Lord, 1974a, 1974b) were incorporated into the study. Test efficiency
provides information on the effectiveness of one test over another as
a function of ability level. A cognitive college admissions subtest was
used in this study for several reasons. First, tests of this type are
widely used by educational institutions for a large number of examinees
each year, in the areas of selection, placement, and academic counseling.
Most college admission examinations traditionally have been developed
using classical item analysis. Second, because of the importance of
the decisions made using such test scores, it would be worth investing
considerable time and expense in the development of these instruments.
Thus, the use of factor analysis or the Rasch model would be justified
if superiority of either of these methods over classical item analysis
could be determined. Third, the items on college admission tests have
been written by experts, and each subtest is intended to be unidimensional, e.g., items measuring a single ability. Thus, assumptions from all
models should be met. Fourth, because of the time required to take such
examinations, it is important to maximize the precision and the effectiveness of the tests. The possibility of using fewer items while maintaining precision would be desirable. Therefore, the question of which test development procedure can best accomplish this is not a trivial one.
Purpose of the Study
The purpose of this study was to compare empirically the Rasch model
with classical item analysis and factor analysis in test development.
Five research questions guided this study.
1. Will the three methods of test development produce tests with superior internal consistency estimates when compared to the projected
internal consistency of the population as the number of items decreases?
2. Will the three methods of test development produce tests
with stable estimates of internal consistency when the number of
examinees decreases?
3. Will the three methods of test development produce tests
with similar standard errors of measurement?
4. Will the three methods of test development select items that
are similar in terms of difficulty and discrimination?
5. Will the three methods of test development produce equally
efficient tests for all ability levels?
Hypotheses
This study investigated the capacities of three methods of test
development to increase precision and efficiency of measurement in test
construction. The five questions posited in the previous section were
phrased as testable hypotheses:
1. There are no significant differences in the internal consistency
estimates of the tests produced by the three methods, as the number of items
decreases, when compared to the projected internal consistency estimates
for the population for tests of similar length.
The standard error of measurement (SEM) is defined in the classical sense as (Magnusson, 1966, p. 79):

SEM = S_X √(1 − r_XX) ,

where S_X is the standard deviation of the test and r_XX is the reliability coefficient. (A brief numeric illustration follows the list of hypotheses below.)
2. There are no differences in the internal consistency estimates
of the tests produced by the three methods when the number of examinees
is decreased.
3. There are no "meaningful" differences in the magnitude of the standard error of measurement of the tests produced by the three methods. (Because test scores are usually reported and interpreted in whole numbers, a "meaningful" difference in the standard error of measurement is defined as a difference of ± 1.00.)
4. There are no significant differences in the difficulties or
discriminations of the items selected by the three methods.
5. There are no differences across ability levels in the efficiency
of the tests produced by the three methods.
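The standard error of measurement defined in the footnote above can be illustrated with a minimal numeric sketch; the standard deviation and reliability values are invented, not results of this study.

```python
import math

S_X  = 8.0     # standard deviation of test scores (assumed value)
r_xx = 0.84    # reliability coefficient (assumed value)

SEM = S_X * math.sqrt(1.0 - r_xx)  # classical standard error of measurement
print(round(SEM, 2))               # 3.2
```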
Significance of the Study
Objective measurement has always been assumed in the physical
sciences. It has only been recently that objective measurement in the
behavioral sciences has been deemed possible with the advent of latent
trait theory. Since the introduction of latent trait theory by
Lazarsfeld (1950) and Lord (1952a, 1953a, 1953b) much of the research
on latent trait models has been confined to theoretical research journals.
Wright (1968), speaking at a conference on testing problems, discussed
at an applied level the need to seriously consider latent trait theory
and the Rasch model in particular as a major test development technique
far superior to classical item analysis and factor analysis. However,
even in 1968 computer programs were not yet available to run the analyses
should anyone beyond academicians be interested. Today this obstacle
has been overcome, but many test developers remain unconvinced of the
value of latent trait theory because its superiority to classical test
theory has not been conclusively demonstrated. This study is an attempt
to provide an empirical comparison of classical test theory and latent
trait theory methods of test construction.
Of the various logistic models that represent latent trait theory
the Rasch model was chosen for comparison with traditional item
analysis procedures in the present study because it is the most
parsimonious latent trait model and has been used recently in the
development and equating of tests (Rentz and Bashaw, 1977; Woodcock, 1974). The Rasch model provides a mathematical explanation for the
outcome of an event when an examinee attempts an item on a test. Rasch
(1966) stated that the outcome of an encounter is governed by the pro-
duct of the ability of the examinee and the easiness of the item and nothing more. The implication of this simple concept (objectivity of
measurement) would seem to revolutionize mental measurement. If invariant
properties of items and ability scores can be identified and used to
improve the psychometric quality of tests to an extent greater than now possible with classical and factor analytic procedures, then we truly are in the age of modern test theory.
Organization of the Study
The theoretical and empirical studies related to the three methods
of item analysis are described in Chapter II. An empirical investigation
to compare the three methods of item analysis under varying conditions
is described in Chapter III. The results of the study are reported in
Chapter IV. A discussion of the results, conclusions of the study, and
implications for future research in this area have been presented in
the fifth chapter. A summarization of the study has been provided in
Chapter VI.
CHAPTER II
REVIEW OF THE LITERATURE
The quality of the items in a test determines its validity and reliability. Through the application of item analysis procedures, test
constructors are able to obtain quantitative objective information
useful in judging the quality of test items. Item analysis thus pro-
vides an empirical basis for revising the test, indicating which items
can be used again and which items have to be deleted or rewritten
(Lange, Lehmann, and Mehrens, 1967). Item analysis data also help
settle arguments and objections to specific items that might be raised
by administrators, test experts, examinees, or the public.
This study is focused on three approaches to item analysis (classical
item analysis, factor analysis, and the Rasch model) as test construction
techniques. It is assumed throughout this study that the test under
construction is unidimensional, e.g., all items are measuring only one
ability. These three approaches to item analysis and the relevant
research related to each method are discussed in this chapter.
Item Analysis Procedures for the Classical Model
Item analysis as a test development technique emerged at the begin-
ning of this century. Binet and Simon (1916) were among the first to
systematically validate test items. They noted the proportion of
students at particular age levels passing an item. This statistic was
measuring the relative difficulty of the items for different age groups.
The item difficulty index, defined as the percentage of persons passing
an item and denoted by p, is one of the statistics used in classical
item analysis.
Item difficulty is related to item variance and hence to the
internal consistency of the test. Test constructors are usually con-
cerned with achieving high test reliability, e.g., precision of measure-
ment. Therefore, an item difficulty of .50 is considered to be the ideal
value necessary to maximize test reliability. This is because half
the examinees are getting the item correct and half the examinees are
missing the item. The proportion missing an item is defined as 1-p or
q. Tlius, when p is equal to .50, q is equal to .50. Uecause the
variance of a dichotomized item is p x q tlie maximum variation an item
can contribute to total test variance and ultimately to true-score
variance is .25. As an item's difficulty index deviates from .50, its
contribution to total test variance is always some value less than .25.
Hence test constructors have been advised (Gulliksen, 1945) to select
items with difficulty indices at or near .50. However, when items
are presented in multiple choice or alternate choice format, the ideal
level of difficulty is adjusted to accommodate for guessing.
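The relationship between item difficulty and item variance, and the guessing adjustment described in the footnote below, can be sketched as follows; the difficulty values are illustrative only.

```python
# Item variance p*q is largest at p = .50 and shrinks toward the extremes.
for p in (0.50, 0.62, 0.80, 0.95):
    q = 1.0 - p
    print(p, round(p * q, 4))      # .50 -> .25, .80 -> .16, ...

# Ideal difficulty adjusted for guessing on a four-option item:
# chance of guessing correctly = (1/4)(.50) = .12, which is added to .50.
ideal_p = 0.50 + (1.0 / 4.0) * 0.50
print(ideal_p)                     # 0.625, reported as .62 in the text
```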
A second important item statistic in classical item analysis is
the item discrimination index. An item discrimination index is a measure
The ideal value of p = .50 assumes there has been no guessing on the item. The effect of guessing on item difficulty tends to increase the ideal value of p. For example, on a four option multiple choice item the chance of guessing the correct answer is (1/4)(.50) = .12. The value of .12 is added to .50 to correct for the effect of guessing, and the ideal p would now be .62 (Lord, 1952b; Mehrens and Lehmann, 1975).
of how well the item discriminates between persons who have high test
scores and persons who have low test scores. The discrimination index
is often expressed as a correlation between the item and total test
score. When the criterion is total test score, the correlation coefficient indicates the contribution that item makes to the test as a whole. Thus, on tests of academic achievement it is a measure of item
validity as well as a contributor to internal consistency. Noting an
increasing use of item analytic procedures for the improvement of
objective examinations, Richardson (1936) pointed out that the
development of the procedures of item analysis had centered primarily
around the invention of various indices of association between the test
item and the total test score, e.g., item discrimination indices.
The two most popular item-test correlation indices are the biserial
and point biserial correlations. The point biserial was developed by
Pearson (1900) and is a special case of the more general Pearson Product
Moment (PPM) correlation coefficient (Magnusson, 1966). This index
is recommended when one of the variables being correlated (the item
score) represents a true dichotomy and the other variable (total test
score) is continuously distributed. Pearson (1909) also derived the
biserial correlation which is an estimate of the PPM. The biserial
correlation is recommended when one of the variables (the item score)
has an underlying continuous and normal distribution which has been
artificially dichotomized and the other variable (total test score) is
continuously distributed. The assumption for the point biserial
correlation is often hard to justify when it is suspected that knowledge
required to answer an item is continuously distributed.
In considering the dichotomized item (pass/fail), McNemar (1962)
has commented, "It is obvious that failing a test item represents any-
thing from a dismal failure up to a near pass, whereas passing the
item involves barely passing up to passing with the greatest of ease"
(p. 191). Thus, the biserial correlation is usually favored over the
point biserial correlation as a measure of item discrimination. Also,
the biserial is often chosen over the point biserial because the
magnitude of the point biserial correlation for an item is not in-
dependent of the item difficulty (Davis, 1951; Henrysson, 1971;
Swineford, 1956). Specifically, values of the point biserial are
systematically depressed as p approaches the extremes of .00 or 1.00.
Lord and Novick (1968) have pointed out that because of this bias, the
point biserial correlation tends to favor medium difficulty items over
easy or very difficult items.
The formulae for the biserial and point biserial correlations, respectively, are (Magnusson, 1966, pp. 200 & 203):

r_bis = [(M_p − M_X) / S_X] (p / y) ,   (3)

r_pbis = [(M_p − M_X) / S_X] √(p/q) ,   (4)

where M_p is the mean total score of the examinees passing the item, M_X is the mean total score of all examinees, S_X is the standard deviation of the total scores, p is the proportion passing the item, q = 1 − p, and y is the ordinate of the standard normal curve at the point dividing the distribution into proportions p and q.
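A minimal sketch of formulas 3 and 4 for a single dichotomously scored item follows; the inverse-normal routine is a simple bisection written only for illustration, and the item and total score vectors are invented.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_ppf(p):
    # Inverse of the standard normal CDF by bisection (illustration only).
    lo, hi = -8.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def item_discriminations(item, total):
    """Biserial and point biserial correlations of a 0/1 item with total score."""
    n = len(item)
    p = sum(item) / n
    q = 1.0 - p
    mean_all = sum(total) / n
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in total) / n)
    mean_pass = sum(t for x, t in zip(item, total) if x == 1) / sum(item)
    y = normal_pdf(normal_ppf(p))          # ordinate at the p/q dividing point
    r_bis = (mean_pass - mean_all) / sd_all * (p / y)
    r_pbis = (mean_pass - mean_all) / sd_all * math.sqrt(p / q)
    return r_bis, r_pbis

item  = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]            # invented item scores
total = [42, 38, 25, 40, 28, 35, 22, 30, 37, 41]  # invented total scores
print(item_discriminations(item, total))
```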
One of the main objectives of classical test theory is to improve
the internal consistency of the test under construction where internal
consistency was defined as the extent to which all items are measuring
the same ability. To ensure high internal consistency the random error
in the test must be minimized. As stated previously in Equation 2,
reliability, in the classical model, was defined as:
r = "t^ = 1 - °e2XX
0^2 0^2
Thus, the relationship among the test items can be noted in the
coefficient alpha formulae for estimating internal consistency for a
sample (Magnusson, 1966, pp. 116-117):
r_XX = [n/(n − 1)] [1 − (Σ S_i²)/S_X²] ,   (5)

or

r_XX = n² C̄_ik / S_X² ,   (6)

where n is the number of test items, Σ S_i² is the sum of the item variances, S_X² is the variance of the test, and C̄_ik is the mean of the item covariances. By comparing equation 2 with 5, it is seen that the sum of the unique item variances is used as an estimate of σ_E², and that when the unique item variation is minimized internal consistency will be high. Furthermore, the mean of the item covariances (equation 6) serves as an estimate of σ_T². The size of the covariance term is in turn determined by the intercorrelations and standard deviations of the items (Magnusson, 1966). Therefore, internal consistency is directly dependent upon the correlation among the items in the test.
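A minimal sketch of coefficient alpha in the form of equation 5 follows, computed from a small invented examinee-by-item score matrix.

```python
def coefficient_alpha(scores):
    """Coefficient alpha (equation 5) from a list of examinee score rows."""
    n_items = len(scores[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical responses of five examinees to four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(coefficient_alpha(scores))   # 0.8 for these invented data
```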
The item discrimination index provides a measure of how well an item contributes to what the test as a whole measures. When items with the highest item-test correlations are selected, the homogeneity of the test is increased; that is, σ_T² is increased. So it is the item
discrimination that directly affects test reliability. When items with low item-test correlations are eliminated, the remaining item intercorrelations are raised. When item-test correlations are high, the test is able to discriminate between high and low scorers and hence internal consistency is increased. If too few items are discarded in an item analysis, the internal consistency of the test tends to decrease because items with little power of measuring what the entire test is intended
to measure will dilute the measuring power of the efficient items
(Beddell, 1950).
Research Related to Classical Item Analysis in Test Development
Several articles have been published concerning standards for item
selection to maximize test validity and increase internal consistency.
Flanagan (1959) stated two considerations in selecting test items:
(a) the item must be valid, that is, it should discriminate between
high and low scorers, and (b) the level of item difficulty should be
suitable for the examinee group. Gulliksen (1945) agreed with Flanagan
on these two points and added a third; items selected with p = .50 would
produce the most valid tests; however, Gulliksen noted that current
practice was opposed to selecting items with difficulty near .50. Test
developers were selecting items based upon spreading difficulty indices
over a broad range.
Several studies have been conducted to examine the effects of
varying item difficulty on test development. Brogden (1946), in a
study of test homogeneity, has shown empirically that a test of 45
items with varying levels of item difficulty produced a reliability of .96 (measured by the Kuder-Richardson formula). However, a
similar but longer test of 153 items, that had item difficulties at
.50 for all items, produced a reliability of .99. Thus, Brogden concluded that effective item selection was based more on selecting a
test with fewer items that possessed varying difficulty, than a longer
test with equal item difficulty.
Davis (1951), in commenting on item difficulty, stated that if
all test items had a difficulty of .50 and were uncorrelated then
maximum discrimination was achieved. But when test items were cor-
related, maximum discrimination would only be achieved when the
difficulty index for all test items was spread out, e.g., several
difficult items, several easy items, and several items with difficulty
near .50. Davis recommended the latter procedure for test development
because test items are usually correlated to some degree. Davis also
recognized the need for the approval of subject matter specialists in
addition to statistical criteria in item selection.
In a study of test validity, Webster (1956) found results similar
to Brogden (1946), but different from Gulliksen (1945). By selecting
fewer items with high discrimination indices and varying item difficulty
levels, a more valid test was produced. Webster's results indicated
that a test of 178 items with difficulty indices near .50 had a validity
coefficient of .66. However, a test of 124 similar items with varying
item difficulties had a validity coefficient of .76, statistically
significant at p < .03 (based on r to z transformations).
Myers (1962), concerned by the current practice of selecting
items based on varying item difficulties instead of the theoretical idea of p = .50, compared the effect of the current practice to the theoretical idea on reliability and validity of a scholastic aptitude test. The ideal item difficulty ranged from .40 to .74 in what he
called the peaked test. Items selected by the current practice were
outside the above range, and Myers called this the U-shaped test.
Two sets of items were selected for the peaked test and the U-shaped
test, four tests in all. Myers reported no statistically significant
differences in test validity when the different tests were correlated
with freshman grades. Test reliability was statistically significant
at P < .02 (using the Wilcoxon matched pairs sign test) in favor of the
peaked test. The reliability of the peaked test was .69. The
reliability of the U-shaped test was .65. The author noted that the
results above were based on a 24 item test, and that when test length
was projected to 48 items (via the Spearman-Brown Prophecy Formula) there
were no significant differences in test reliability. The studies of Brogden (1946) and Webster (1956) indicate that selecting items of varying item difficulty tends to increase internal consistency and test validity. The results from Myers' (1962) study indicated just the
opposite, that item difficulty near .50 produced the more internally
consistent test. But this was only true for a relatively short test
of 24 items, and when the test length was projected to 48 items, there were no differences in the reliability of either test based upon the two methods of selecting items.
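Myers' projection from 24 to 48 items rests on the Spearman-Brown Prophecy Formula; a minimal sketch using the reliabilities reported above (.69 for the peaked test and .65 for the U-shaped test) and a doubling of test length:

```python
def spearman_brown(r, k):
    """Projected reliability when test length is changed by a factor of k."""
    return k * r / (1.0 + (k - 1.0) * r)

print(round(spearman_brown(0.69, 2), 3))   # peaked test doubled:   0.817
print(round(spearman_brown(0.65, 2), 3))   # U-shaped test doubled: 0.788
```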
Simplified Methods of Obtaining Item Discriminations
A second major group of articles on classical item analysis has
dealt with simplified methods of obtaining indices of item discrimination.
Because of the lack of computers in the early years of test develop-
ment many psychometricians concerned themselves with devising tables
to provide quick estimates of item discrimination. Kelley (1939)
found that in the computation of item discrimination only 54 percent of
the examinee group (based on total test score) needed to be used.
Considering the top 27 percent and the bottom 27 percent of the test
scorers resulted in a considerable savings in computational time.
Flanagan (1939) developed a table of item discriminations to estimate
the PPM correlation between item and test score based on Kelley's
extreme score groups of top and bottom 27 percent.
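The extreme-groups logic behind Kelley's and Flanagan's shortcuts can be sketched with the simple upper-lower difference index (proportion passing in the top 27 percent minus proportion passing in the bottom 27 percent). This is not Flanagan's table itself, only an illustration of the idea, and the data vectors are invented.

```python
def upper_lower_discrimination(item, total, fraction=0.27):
    """Difference in proportion passing between top and bottom scoring groups."""
    n_extreme = max(1, int(round(fraction * len(total))))
    order = sorted(range(len(total)), key=lambda i: total[i])
    low  = [item[i] for i in order[:n_extreme]]            # bottom 27 percent
    high = [item[i] for i in order[-n_extreme:]]           # top 27 percent
    return sum(high) / len(high) - sum(low) / len(low)

item  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]              # invented item scores
total = [40, 22, 35, 38, 25, 33, 20, 36, 41, 27, 30, 24]  # invented total scores
print(upper_lower_discrimination(item, total))
```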
Fan (1952) developed a table for the estimation of the tetrachoric
correlation coefficient using the upper and lower 27 percent of the
scorers. The tetrachoric correlation is similar to the biserial
correlation, where the correlation is between two variables, which are
assumed to have a normal and continuous underlying distribution, but
have been artificially dichotomized.
Guilford (1954) presented several short cut tabular and graphic
solutions for estimating various types of correlation coefficients to
measure test item validity. These methods result in saving a considerable amount of time when one is forced to use hand calculations.
Today these short cut methods can be used by classroom teachers who often
do not have the aid of calculators or computers. However, many test
constructors still use these classical methods of item analysis even
though computers are available with which more sophisticated item
analytic techniques such as factor analysis or latent trait models
can be used.
Item Analysis Procedures for the Factor Analytic Model
Charles Spearman (1904) proposed a theory of measurement based on
the idea that every test was composed of one general factor and a num-
ber of specific factors. In order to test his idea Spearman developed
the statistical procedure known as factor analysis.
"Factor analysis is a method of analyzing a set ofobservations from their intercorrelations to determinewhether tlie variations represented can be accounted foradequately by a number of basic categories smaller thanthat with which the investigation started" (Fruchter,1954, p. 1).
Factor analysis is a mathematical procedure which produces a linear
representation of a variable in terms of other variables (Harman, 1967).
In the case of test items being factor analyzed, a matrix of item
intercorrelations is obtained first. Subsequently, the matrix of item
correlations is submitted to the factoring process. There are two
basic alternatives within the framework of factor analysis for analyzing
a set of data: common factor analysis, based on the work of Spearman
and later Thurstone (1947); and principal components, developed by
Hotelling (1933). The major distinction between the two methods relates to the amount of variance analyzed, e.g., the values placed in the
diagonal of the intercorrelation matrix. Factoring of the correlation
matrix with unities in the diagonal leads to principal components, while
factoring the correlation matrix with communalities in the diagonal
The communality (h_j²) of a variable is defined as the sum of the squared factor loadings, h_j² = a_j1² + a_j2² + ... + a_jn² (Harman, 1967, p. 17); see formula 8.
leads to common factor analysis (Harman, 1967). If it is of interest
to know what the test items share in common, a common factor solution
is warranted. But if it is of interest to make comparisons to other
tests or other test development procedures, a principal components
solution is warranted. Since the present study was initiated to com-
pare three different test development techniques, a principal com-
ponents solution was used in this study to analyze the data under the
factor analytic model.
The linear model for the principal components procedure is
defined as (Harman, 1967, p. 15):
Z_ji = a_j1 F_1 + a_j2 F_2 + ... + a_jn F_n .   (7)

Z_ji is the variable for item j of interest, and a_j1 is the coefficient, more frequently referred to as the loading of variable Z_ji on component F_1. An important feature of principal components is that the
extracted components account for the maximum amount of variance from
the original variables. Each principal component extracted is a linear
combination of the original variables and is uncorrelated with sub-
sequent components extracted. Thus, the sum of the variances of all
n principal components is equal to the sum of the variances of the
original variables (Harman, 1967). According to Guertin and Bailey
(1970), the principal components solution was designed basically for
prediction, hence the need to use the maximum amount of variance in a
set of variables.
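A minimal sketch of the principal components step follows, assuming an item intercorrelation matrix has already been computed (the 3 x 3 matrix below is invented). The first-component loadings obtained here play the role of the a_j1 coefficients in equation 7; power iteration is used only to keep the example self-contained.

```python
import math

def first_component_loadings(R, iterations=200):
    """Loadings on the first principal component of a correlation matrix R."""
    n = len(R)
    v = [1.0] * n
    for _ in range(iterations):                       # power iteration
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(R[i][j] * v[j] for j in range(n)) for i in range(n))
    return [vi * math.sqrt(eigenvalue) for vi in v]   # scale eigenvector to loadings

# Invented item intercorrelations (e.g., tetrachorics) for three items.
R = [
    [1.00, 0.45, 0.40],
    [0.45, 1.00, 0.50],
    [0.40, 0.50, 1.00],
]
print([round(a, 2) for a in first_component_loadings(R)])
```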
Since factor analysis is based upon a matrix of intercorrelations, it is important that care be taken in selecting the appropriate
coefficient. Several item coefficients are available: phi, phi/phi
max, and the tetrachoric correlation coefficient. Carroll (1961)
pointed out several problems concerning the choice of a correlation coefficient to be used in factor analysis. The phi coefficient (used where both variables are true dichotomies) was found to be affected by disparate marginal distributions and often underestimated the PPM. The phi/phi max coefficient was developed to correct for the underestimation of phi, but the correction is not enough to counter the effect of extreme dichotomizations. Carroll recommended the tetrachoric coefficient as being the least biased by extreme marginal splits providing the variable under consideration was normally distributed in
the population. Wherry and Winer (1955) had made conclusions similar
to Carroll, but went on to say that when the normality assumption was
met and the regression of test score on the item was linear the PPM
and tetrachoric are identical. The tetrachoric correlation was used in the present study to obtain item intercorrelations.
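For comparison, a minimal sketch of the phi and phi/phi max coefficients for two dichotomous items is given below (the tetrachoric itself requires iterative estimation under the bivariate normal assumption and is not reproduced here); the response vectors are invented.

```python
import math

def phi_and_phi_over_max(x, y):
    """Phi coefficient and phi/phi max for two 0/1 item score vectors."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    phi = (p_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    lo, hi = min(p_x, p_y), max(p_x, p_y)
    phi_max = math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))  # ceiling set by the margins
    return phi, phi / phi_max

x = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]   # invented scores on item one
y = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]   # invented scores on item two
print(phi_and_phi_over_max(x, y))
```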
Research Related to Factor Analysis in Test Development
The early use of factor analysis to construct and refine tests
was suggested by the work of McNemar (1942) in revising the Stanford-
Binet scales, and Burt and John (1943) in analyzing the Terman-Binet
scales.
Several contemporary psychometricians have advocated the use of
factor analysis in developing unidimensional tests (Cattell, 1957;
Hambleton and Traub, 1975; Henrysson, 1962; Lord and Novick, 1968). A
unidimensional test was defined briefly in the introduction to this
chapter, but a more precise definition is warranted. Lumsden (1961)
noted that a unidimensional test can be determined by the examinee
response patterns. If the test items are arranged from easiest to
hardest, a person who misses item 1 will miss all the other items, and a person who gets item 1 correct but misses item 2 will miss all the subsequent items, and so on. The above statement assumes infallible
items. However, most tests constructed today contain fallible items,
thus the response pattern will be disturbed by random error. Lumsden
suggested in developing unidimensional tests factorially that the items be carefully selected on empirical grounds, thus reducing the problem of
too many heterogeneous items and the possibility of obtaining multiple
factors. By preselecting items one increases the chances of the items
converging on one factor.
The importance of developing unidimensional tests is demon-
strated most clearly in considering the concepts of test reliability
and validity. For a test to be valid it must actually measure the trait
it was intended to measure. For a test to be reliable it must provide
similar results upon repeated measurement. It should be easier to
estimate these two important aspects of a test when the test is
unidimensional than when the test is multidimensional, hence the use of
a unidimensional test in the present study.
Cattell (1957) has suggested that in the development of a factor
homogeneous scale, one should preselect items, carry out a preliminary
factor analysis, then select for further analysis those items which
load on the first factor. Cattell defined an index of unidimensionality
as the ratio of the variance of the first factor to the total test
variance. This index has no set criterion and the sampling distribution
is unknown.
Comparison of Factor Analysis to Classical Item Analysis
One measure of item validity, the biserial correlation was described
for classical item analysis procedures. This same index is also obtained
by factor analysis. When the test items are factor analyzed, the factor loading a_ij is the item-factor association that is considered a measure of item validity, e.g., the higher the factor loading, the greater the relationship between the item and the factor it measures.
The factor loadings can be viewed as similar to the biserial correlations discussed under classical test theory. This relationship between factor loadings and biserial correlations has been discussed by several authorities (Guertin and Bailey, 1970; Henrysson, 1962; Richardson,
1936).
Factor analysis as an item analytic technique was not realistically
possible for most psychometricians until the advent of high speed
computers. Guertin and Bailey (1970) have predicted that with the
increasing use of computers factor analysis will replace classical item
analysis as a test development technique. Because it is possible for
a test to reach the highest degree of homogeneity and yet be factorially
a very odd mixture of factors (Cattell and Tsujioka, 1964), classical
item analysis alone is not sufficient to determine if a test is
unidimensional. However, factor analysis not only provides a measure
of item-test correlation (the factor loading), it also provides an
indication of how many items form a unifactor test. Thus, factor
analysis has been advocated as a superior technique to classical item
analysis (Guertin and Bailey, 1970). Using factor analysis in test
development, psychometricians have advanced beyond an independent
analysis of item intercorrelations to a simultaneous analysis of item
intercorrelations with other individual items to obtain a measure of
test unidimensionality and item-factor association.
However, there is an inherent flaw in factor analysis as there
was in classical item analysis in test development. The flaw is that both procedures are sample dependent. When an item analysis procedure, or any procedure in general, is sample dependent, it means that the results will vary from group to group. When the groups are very dissimilar, there is much variability. Gulliksen (1950) noted that a significant advance in item analysis theory would be made when a
method of obtaining invariant item parameters could be discovered. To
that end latent trait theory is an attempt to identify invariant
item parameters.
Item Analysis Procedures for the Latent Trait Model
Latent trait theory specifies a relationship between the observable
examinee test performance and the unobservable traits or abilities
assumed to underlie performance on a test (Hambleton et al., 1977). The
relationship is described by a mathematical function; hence latent
trait models are mathematical models. As noted earlier, there are four
major latent trait models for use with dichotomously scored data: the
normal ogive, and the one-, two-, and three-parameter logistic models (Hambleton and Cook, 1977; Lord and Novick, 1968). All four models are
based on the assumption that the items in the test are measuring one
common ability and that the assumption of local independence exists
between the items and examinees. These two assumptions imply that a
test which measures only one trait or ability will have less measurement
error in the test score than a test that is multidimensional, and that the response of an examinee to one item is not related to his response on any other item. Where the latent trait models begin to differ is
with respect to the shape of their item characteristic curves.
The normal ogive, developed by Lord (1952a, 1953a), produces
an item characteristic curve based on the following formula:
P_g(θ) = ∫_{−∞}^{a_g(θ − b_g)} φ(t) dt ,   (8)

where P_g(θ) is the probability that an examinee with ability θ correctly answers item g, φ(t) is the normal density function, b_g represents item difficulty, and a_g represents item discrimination.
The item characteristic curve of the two-parameter logistic model
developed by Birnbaum (1968) has the same shape as the normal ogive,
and Baker (1961) has shown them to be equivalent mathematical procedures.
The shape of the item characteristic curve of the two-parameter logistic function is developed from the following formula:

P_g(θ) = e^{D a_g(θ − b_g)} / (1 + e^{D a_g(θ − b_g)}) .   (9)

P_g(θ), a_g, and b_g have the same interpretation as in the normal ogive. D is a scaling factor equal to 1.7 (the adjustment between the logistic function and the normal density function), and e is the base of the natural logarithm.
In Figure 1a the shape of the normal ogive and the two-parameter
logistic curve has been illustrated. In the Figure, item A is more
discriminating than item B as noted by the steepness of the slopes.
The three-parameter logistic model, also developed by Birnbaum (1968),
includes as an additional parameter, an index for guessing. The
mathematical form of the three-parameter logistic curve is denoted,
P_g(θ) = c_g + (1 − c_g) e^{D a_g(θ − b_g)} / (1 + e^{D a_g(θ − b_g)}) .   (10)

The parameter c_g, the lower asymptote of the item characteristic curve, represents the probability of low ability examinees correctly answering
an item (Hambleton et al., 1977). In Figure 1b the shape of the three-parameter logistic curve has been illustrated. In the Figure, item A is more discriminating and has less guessing involved than item B.

[Figure 1. Hypothetical item characteristic curves for the four latent trait models: (a) normal ogive and two-parameter logistic curve, (b) three-parameter logistic curve, (c) Rasch one-parameter logistic curve. Each panel plots the probability of a correct response against the ability continuum.]
The one-parameter logistic model, developed by Rasch (1960),
is commonly referred to as the Rasch model. The Rasch model, though
similar to the other latent trait models, was developed independently
from the other models. The Rasch model is based upon two propositions:
(a) the smarter an examinee, the more likely he is to answer the item
correctly, and (b) an examinee is more likely to answer an easy item
correctly than a difficult item. Mathematically the above propositions can be stated in terms of odds or probability of success on an item. The odds of an examinee with ability θ correctly answering an item with difficulty ε is given by the ratio of θ to ε (Rasch, 1960):

odds = θ / ε .   (11)
The derivation of equation 11 was presented in Appendix A. Equation 11
more formally written in the following equation is the Rasch model:

P(X_ki = 1 | β_k, δ_i) = e^(β_k − δ_i) / (1 + e^(β_k − δ_i)) .   (12)

In equation 12, the probability of examinee k making a correct response to item i, denoted X_ki = 1, given an examinee of ability β_k (where β_k is the log transformation of θ) taking an item of difficulty δ_i (where δ_i is the log transformation of ε), is a function of the difference between the examinee's ability and the item's difficulty. The derivation of equation 11 to equation 12 is presented in Appendix A.
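A minimal sketch of equations 11 and 12 follows, showing that the odds form on the original scale and the logistic form on the log scale give the same probability; the ability and difficulty values are invented.

```python
import math

def rasch_probability(beta, delta):
    """Equation 12: probability of a correct response from log ability beta
    and log item difficulty delta."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

# Equation 11 on the original scale: odds = theta / epsilon.
theta, epsilon = 2.0, 0.5              # invented ability and item difficulty
odds = theta / epsilon
print(odds / (1.0 + odds))             # probability implied by the odds: 0.8

# The same probability from the log scale parameters of equation 12.
beta, delta = math.log(theta), math.log(epsilon)
print(rasch_probability(beta, delta))  # 0.8
```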
The assumptions for the Rasch model were discussed in Chapter I.
Essentially the three assumptions are as follows:
31
1. There is only one trait underlying test performance.
2. Item responses of each examinee are statistically independent.
3. Item discriminations are equal.
The first two assumptions can be checked by conducting a factor analysis
of the test items, as suggested by Lord and Novick (1968) and Hambleton
and Traub (1973). The assumptions are met if one dominant factor
emerges from the analysis. The third assumption can be checked by
plotting item characteristic curves for each item. In Figure 1c the
item characteristic curves for two hypothetical items based on the Rasch
model have been illustrated. The difficulties for items A and B are .5 and
1.5 respectively (the point where p = .50), and the discriminations of the
two items are equal. The assumption that all items have equal dis-
criminations is quite restrictive; however, Rentz (1976) demonstrated,
in a simulation study, that the item slopes can deviate from 1 (where
all slopes are equal) by ±.25 and still fit the model. In a similar
simulation study, Dinero and Haertel (1976) concluded that the lack of
an item discrimination parameter in the Rasch model does not result
in poor item calibrations when discriminations are varied as much as .25.
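A minimal sketch of the first check (one dominant factor), assuming an inter-item correlation matrix has already been computed, is given below; the eigenvalue-ratio threshold is an illustrative choice, not a criterion taken from the studies cited above.

```python
import numpy as np

def dominant_first_factor(R, ratio_threshold=3.0):
    """Return True if the first eigenvalue of the item correlation matrix R
    dominates the second, a rough indicator of a single underlying trait."""
    eigvals = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, largest first
    return eigvals[0] / eigvals[1] >= ratio_threshold

# Example with a hypothetical 3 x 3 item correlation matrix.
R = np.array([[1.00, 0.60, 0.50],
              [0.60, 1.00, 0.55],
              [0.50, 0.55, 1.00]])
print(dominant_first_factor(R))
```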
The estimates for the Rasch parameters β_k and δ_i, the examinee ability
estimate and item difficulty estimate respectively, are sufficient,
consistent, efficient, and unbiased (Anderson, 1973; Bock and Wood, 1971).
That is, the examinee's test score will contain all the information
necessary to measure the person ability parameter β_k, and the sum of the
right answers to a given item will contain all the information used to
calibrate the item parameter δ_i (Wright, 1977). Of the latent trait
models, the Rasch model is unique in this respect.

The mathematical rationale of the Rasch model is based upon the
separation of the ability and item difficulty parameters. As shown
in Appendix A, the estimation of the item parameters is independent of
the distribution of ability, and the estimation of ability is independent
of the distribution of item difficulty (Rasch, 1966). Several studies have
demonstrated this (Anderson et al., 1968; Tinsley and Dawis, 1975; Whitely and
Dawis, 1974; Whitely and Dawis, 1976; Wright, 1968; Wright and
Panchapakesan, 1969). The separation of the ability and item param-
eters leads to what Rasch has termed specific objectivity. Specific
objectivity relates to the fact that the measurement of a person's
ability is not dependent upon the sample of items used, nor upon the
examinee group in which a person is tested. Once a set of items has been
calibrated to the Rasch model, any subset of the calibrated items will
produce the same estimate of the examinee's ability. This type of
objectivity is possessed by the physical sciences and is the goal toward
which mental measurement should be aimed in the future. Toward the
goal of objective measurement, several researchers have conducted
empirical studies comparing classical and factor analytic test development
procedures to the latent trait models, and comparisons have also been
made between the various latent trait models.
Research Related to Latent Trait Models in Test Development
Baker (1961) conducted one of the earlier comparative studies
between two latent trait models. He compared the effect of fitting the
normal ogive and the two-parameter logistic model to the same set of
data, a scholastic aptitude test. The two-parameter model as well as
the normal ogive provide item difficulty and item discrimination estimates.
The empirical results suggested there was little difference between the
two procedures as measured by a chi-square test of fit. However, Baker
noted that the computer running time of the logistic model was one-third
that of the ogive model; thus he concluded the logistic model was
more efficient in terms of cost than the ogive.
Hambleton and Traub (1971) compared the efficiency of ability
estimates provided by the Rasch model and the two-parameter model to
the three-parameter logistic model using Birnbaum's (1968) concept of
information. The three-parameter model provides item difficulty
and discrimination estimates as well as accounting for guessing on
each item. Eleven simulated tests of fifteen items each were generated
varying item discrimination and degree of guessing. The authors
sought to determine how efficient the one- and two-parameter logistic
models were under these conditions, taking the three-parameter model to
be the true model. The results indicated that when guessing was a
factor the three-parameter model was most efficient in providing ability
estimates, but when guessing was not a factor all models were equally
efficient. Since the Rasch model has fewer parameters to estimate, and
hence takes less computer time to run than the other two models, it would
be preferred in the absence of guessing. In considering item
discrimination, when the guessing parameter was set to zero, the Rasch
model was as efficient as the two-parameter model when item discrimination
varied from .59 to .79. As item discrimination deviated from this range
the two-parameter model was more efficient.
Hambleton and Traub (1975) compared the one- and two-parameter
models with three sets of real data (the verbal and mathematics subtests
of a scholastic aptitude test used in Ontario (items = 45 and 20
respectively), and the verbal section of the Scholastic Aptitude Test
(SAT, items = 80). Their results indicated that generally the two-
parameter model fit the data better than the one-parameter model. The
loss in predicting performance was greatest on the shorter mathematics
test and smallest on the longer SAT. These findings confirm Birnbaum's
conjecture (1968, p. 492) that if the number of items in a test is very
large, the inferences that can be made about an examinee's ability will
be much the same whether the Rasch model or the two-parameter logistic
model is used. The authors questioned whether the gain obtained with the
two-parameter model is worth the increased computer cost of estimating
the item discrimination parameter. Based on the results of these studies,
it is concluded that the Rasch model is the most efficient of the latent
trait models and hence will be used in comparison to the more traditional
methods of test development included in the present study.
Comparison of the Rasch Model to Factor Analysis
Two recent studies have been completed comparing the Rasch model to
factor analysis. Anderson (1976) posed two questions concerning the
Rasch model and factor analysis: (a) what types of items would be
excluded in terms of difficulty and discrimination using Rasch and
factor analysis as item analytic techniques, and (b) what effect would
the two procedures have on validity? Anderson chose to use 235 middle
school students' responses to a 15 item Likert-type scale that was
dichotomized for use with the Rasch model and the factor analytic
procedures. A principal component factor analysis based upon tetrachoric
correlation coefficients was compared to the Rasch model using the
CALFIT computer program (Wright and Mead, 1975). Only items fitting
the model were used. His results indicated that the Rasch procedure
eliminated the more difficult items and the factor analytic procedure
eliminated the easier items, a statistically significant difference as
determined by a chi-square test at p < .01. For item discriminations the
Rasch procedure eliminated very low and very high item discriminations,
while the factor analytic procedure tended to reject only very low
discriminations. The difference here was not statistically significant.
The second question of test validity showed very similar results for
the two procedures when test score was correlated with course grade
point average.
In a similar study Mandeville and Smarr (1976) developed a two-stage
design. First they compared the Rasch procedure to factor analysis;
then they combined the two analytic procedures. The authors felt the
combined approach would be a more effective item analytic approach
than any single method in determining which items fit the Rasch model.
Two cognitive data sets (one standardized and one classroom) and one
simulated set were used in the study. A rotated principal axis factor
analysis based upon phi correlation coefficients was compared to the
Rasch model using the CALFIT program.
The results indicated that for the standardized and simulated data
sets the double procedure of factor analyzing the items, then submitting
only the items loading on the first factor to the Rasch procedure was
not really useful. The Rasch procedure alone was just as effective as
the double procedure in selecting items that fit the model.
For the classroom data set the investigators found that 92 percent
of the items fit the Rasch model, but upon factor analyzing these
items only seven percent of the total test variance was associated with
the first factor. Their results tend to indicate that factor analysis
and the Rasch procedure do not always identify the same unidimensional
trait underlying test performance. However, the results of the
Mandeville and Smarr study may be suspect for three reasons. First,
the phi coefficient, which can be seriously affected when p and q
take on extreme values, was used as a basis to form the intercorrelation
matrix that was factor analyzed. The greater the difference between p and q,
the smaller the maximum correlation; hence very easy and very
difficult items will have systematically lower coefficients and will
tend to bias the results of the analysis in favor of moderately difficult
items. Second, the factor analysis was based on a principal axis
solution, using some value less than 1.00 in the diagonal; hence less
variance is being used in the total solution for comparison with the
Rasch procedure, which utilizes all the test variance available.
Third, the principal axis solution was rotated, so that the total
variance associated with the first factor was distributed out
among the other factors and was no longer as strong as it once had been.
Summary
In the development of tests based upon classical item analysis
two main statistics are used in reviewing and revising test items:
item difficulty and item discrimination. The item discrimination index
provides information as to the validity of the item in relation to total
test score, while item difficulty indicates how appropriate the item was
for the group tested. A serious limitation of classical item analysis
is that the statistics obtained for examinees and items are sample
dependent (Hambleton and Cook, 1977; Wright, 1968).

The same problem of sample dependency also exists for factor
analysis. However, factor analysis is viewed as a superior technique
to classical item analysis for two reasons: (a) factor analysis
considers the intercorrelations of each item with all other items simultaneously,
and (b) factor analysis provides an indication of how many factors
or abilities the test is measuring. Also, in factor analysis the
factor loading is comparable to the item discrimination index of
classical item analysis, thus providing a measure of item validity
for each item on each factor in the test.
Not until the development of latent trait models was a solution
suggested to the problem of sample dependency of the statistics for
items and examinees. The Rasch model in particular has been shown to
provide item statistics that are independent of the group on which they
were obtained, as well as examinee statistics that are independent of
the group of items on which they were tested. This feature of the
Rasch model provides for more objective mental measurement.

The Rasch model has been compared to other latent trait models
and has been shown to be as efficient in many cases as the more complex
models. The Rasch model has also been compared with factor analytic
procedures in determining test unidimensionality, validity, and types
of items retained and excluded by the two procedures. Missing from
this review is a comparative study of the three item analytic techniques
using the same data base and a comparison of the efficiency of tests
developed from the three techniques across ability levels. Also missing
from the literature is the effect of varying sample size and number of
items, as well as the kinds of items each of the three procedures would
either retain or exclude in test development.

An empirical investigation into these areas therefore seemed
warranted to determine which procedure, under the various
conditions, would produce the superior test in terms of internal
consistency and efficiency. It was for this reason that the present
study was undertaken comparing the three methods of classical item
analysis, factor analysis, and the Rasch model used in test development.
The design of the study is described in Chapter III.
CHAPTER III
METHOD
An empirical study was designed to compare the effects of three
methods of item analysis on test development for different sample sizes.
The three methods of item analysis studied were classical item analysis,
factor analysis, and Rasch analysis. The sample sizes used to compare
the three item analytic methods were 250, 500, and 995 subjects. The
study was designed in three phases: (a) item selection, (b) a double
cross-validation of the selected items, and (c) statistical analyses
of the selected items. For each item analytic procedure two tests
were developed, a 15 item test, and a 30 item test. Four dependent
variables were obtained for each test: (a) an estimate of internal
consistency, (b) the standard error of measurement, (c) item difficulty,
and (d) item discrimination. A description of the subjects, instrument
used, research design, and statistical analyses is presented in this
chapter.
The Sample
In the fall of 1975, all high school seniors in the State of
Florida (N = 78,751) were tested as part of the State assessment program.
The population was from 435 high schools throughout the state. From
this population a 1 in 15 systematic sample of 5,250 subjects was
chosen (Mendenhall, Ott, and Scheaffer, 1971). A systematic sample was
selected to ensure samples from every high school in the state. The
types of data obtained on each subject were sex, race, item responses,
and total score.
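The 1 in 15 systematic sampling step can be sketched as follows; the record list is hypothetical, and only the every-kth selection after a random start reflects the procedure described in the text.

```python
import random

def systematic_sample(records, k=15):
    """Take every k-th record after a random start (a 1-in-k systematic sample)."""
    start = random.randrange(k)
    return records[start::k]

# Hypothetical population of 78,751 student records ordered by school.
population = list(range(78_751))
sample = systematic_sample(population, k=15)
print(len(sample))   # roughly 5,250 subjects
```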
The data file was edited to remove those subjects who either
answered all the items correctly or incorrectly. The rationale for
this procedure was that the Rasch model cannot calibrate items when
a person has a perfect score or, alternatively, when a person has no
items correct (Wright, 1977). Through the editing procedure 15 subjects
were removed, thus the available sample size was 5,235. Because such
a small number of subjects were removed, it seems unlikely that the
elimination of these subjects would bias the results in favor of any
of the three item analytic techniques.
The Instrument
The instrument selected for use in this study was the Verbal
Aptitude subtest of the Florida Twelfth Grade Test, developed by the
Educational Testing Service. It is a statewide assessment battery
which has been administered every year since 1935 (Benson, 1975). The
Verbal Aptitude subtest is comprised of 50 verbal analogies, in a
multiple choice format, from which a single score based on the number
of items correct is reported. Descriptive information on the Verbal
Aptitude subtest for the population tested in 1975 is presented in
Table 1.
This particular instrument was selected for three reasons. First,
it is a cognitive measure of verbal ability and much of classical
test theory has been built upon tests in the cognitive domain. Second,
it is similar to and hence representative of other national aptitude
tests used for college admissions. Third, it has a large data pool
from which to sample.
TABLE 1
DESCRIPTIVE DATA ON THE VERBAL APTITUDE SUBTEST OF THE
FLORIDA TWELFTH GRADE TEST, 1975 ADMINISTRATION

Number of Schools = 435          Number of Students = 78,751

Number of items                       50
Mean                                  25.95
Standard Deviation                     8.23
Reliability*                            .88
Standard Error of Measurement          2.85

Note: Data obtained from the Florida Twelfth Grade Testing Program, Report No. 1-75, Fall 1975.
*Reliability based on the split-half method, and corrected by the Spearman-Brown formula.
Classical test theory has been built mainly around the development
of cognitive tests. Therefore, it seemed desirable to compare the
new procedures of latent trait theory, via the Rasch model, to the
procedures of classical test theory, e.g., factor analysis and classical
item analysis, by using a cognitive test. Thus, the results may be
more generalizable to the major types of tests developed by practitioners
in the field.
The Procedure
Design
The sample of 5,235 was divided into nine systematic samples in
the following manner:

Group 1 = three independent samples of 250 students each;
Group 2 = three independent samples of 500 students each;
Group 3 = three independent samples of 995 students each.

From the initial editing of the data file, previously described, 15
subjects were removed from the total sample of 5,250. Therefore,
it was decided that this loss of subjects would only affect Group 3,
since it was the largest. Thus, the number of subjects in each of
the three independent samples was reduced by five, resulting in three
independent samples of 995 subjects each.

The purpose of obtaining the three separate samples for the three
groups was to insure that each item analytic and double cross-validation
procedure used an independent sample, so that tests of statistical
significance could be performed. The scheme shown in Table 2 was used
to obtain the nine samples. In the present study the independent
variables were sample size and item analytic procedure.
TABLE 2
SYSTEMATIC SAMPLING DESIGN OF THE STUDY (N = 5,235)
[Table entries not legible in this transcript.]
The item data were analyzed in three phases: (a) selection of the
items, (b) computation of item and test statistics for selected items
on double cross-validation samples, and (c) statistical analyses of
item characteristics to test the hypotheses.
Item Selection
The three independent samples, within each of the groups of subjects
(N = 250, N = 500, N = 995), were submitted to one of the three item
analytic procedures (in accordance with Table 2) in order to select a
specified number of items, e.g., the "best" 15 and 30 items. Each of
these two sets of items comprised two separate tests; however, all of
the items on the 15 item tests were always included on each of the 30 item
tests. A different process for selecting the items was used with each
item analytic technique, and has been described in the following three
sections.
Classical item analysis. The definition of the "best" items was
based on the numerical magnitude of the items' biserial correlations.
The biserial correlation was defined as the correlation between the
artificially dichotomized item score (1 or 0) and total test score.
In using the biserial correlation the assumption was made that the
artificially dichotomized variable (the item) had a continuous and normal
distribution (Magnusson, 1966).

In order to obtain biserial correlations for the items under the
classical item analysis procedure, the 50 verbal items were submitted
to the item analysis program, GITAP, for each of the three sample sizes.
(The Generalized Item Analysis Program, GITAP, is a part of the test
analysis package developed by F. B. Baker and T. J. Martin, Occasional
Paper No. 10, Michigan State University, 1970.)
The 15 and 30 items with the highest biserial correlations were selected
as the best items from the total subtest. Item difficulties were also
obtained for the "best" 15 and 30 items selected. Item difficulty has
been defined as the proportion of persons getting a particular item
correct out of the total number of persons attempting that item
(Mehrens and Lehman, 1973).
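For illustration, the two classical statistics just defined can be computed as in the sketch below; this is a minimal sketch assuming 0/1 item scores and total test scores, not the GITAP implementation.

```python
import numpy as np
from scipy.stats import norm

def item_difficulty(item):
    """Proportion of examinees answering the item correctly (classical difficulty, p)."""
    return np.mean(item)

def biserial(item, total):
    """Biserial correlation between a dichotomous item (0/1) and total test score."""
    item, total = np.asarray(item, float), np.asarray(total, float)
    p = item.mean()
    q = 1.0 - p
    m1 = total[item == 1].mean()      # mean total score of those passing the item
    m0 = total[item == 0].mean()      # mean total score of those failing the item
    s = total.std()                   # standard deviation of total scores
    y = norm.pdf(norm.ppf(p))         # normal ordinate at the point dividing p and q
    return (m1 - m0) * p * q / (y * s)

# Hypothetical responses to one item and the corresponding total scores.
item = np.array([1, 0, 1, 1, 0, 1, 0, 1])
total = np.array([42, 20, 35, 40, 18, 38, 25, 44])
print(item_difficulty(item), biserial(item, total))
```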
Factor analysis . Item selection based on factor analysis was
accomplished using the computer programs developed for the Education
Evaluation Laboratory at the University of Florida. These programs
have been described by Guertin and Bailey (1970). The present study
was concerned only with the items that load on the first principal
component, in order to adhere to the unidimensionality assumption of
the test. The principal components analysis was based on a matrix of
tetrachoric item intercorrelations with unities in the diagonal.
The tetrachoric correlation was chosen to produce the intercorrela-
tion matrix for the same reason the biserial correlation was chosen:
knowledge of an item was assumed to be normally and continuously dis-
tributed. In the case of the tetrachoric correlation each item (scored
1 or 0) was correlated with every other item.
The 15 and 30 items with the highest loadings on the first unrotated
principal component were selected from the total subtest. These com-
ponent loadings are analogous to biserial correlations previously
described, where the loading refers to the relationship of the item to
the principal component or factor (Guertin and Bailey, 1970; Henrysson,
1962).
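A minimal sketch of this selection step, assuming the matrix of tetrachoric item intercorrelations has already been estimated (its computation is not shown), is:

```python
import numpy as np

def first_component_loadings(R):
    """Loadings of each item on the first principal component of the
    correlation matrix R (unities in the diagonal, no rotation)."""
    eigvals, eigvecs = np.linalg.eigh(R)
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])   # largest eigenvalue is last
    return np.abs(loadings)                            # sign of a component is arbitrary

def select_best(loadings, n_items):
    """Indices of the n_items with the highest first-component loadings."""
    return np.argsort(loadings)[::-1][:n_items]

# Hypothetical 4 x 4 tetrachoric correlation matrix.
R = np.array([[1.00, 0.50, 0.40, 0.20],
              [0.50, 1.00, 0.45, 0.25],
              [0.40, 0.45, 1.00, 0.30],
              [0.20, 0.25, 0.30, 1.00]])
loadings = first_component_loadings(R)
print(select_best(loadings, n_items=2))
# Summing the squared loadings and dividing by the number of items gives the
# percentage of variance accounted for by the first component.
```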
Rasch analysis. The selection of items based on the Rasch model
was accomplished in two stages. First, in order to check the assumption
of a unidimensional test, a factor analysis using a principal components
solution was used. Items were selected with loadings between .39 and
.79 on the first unrotated factor, to hold the discrimination index
of the items constant. Hambleton and Traub (1971) have shown that the
efficiency of a test developed using the Rasch model will remain very
high (over 95 percent) when the range on the discrimination index was
held between .59 and .79. Second, the items selected from the principal
components solution using the above criteria were submitted to a
Rasch analysis using the BICAL program (Wright and Mead, 1976). Items
were selected based upon the mean square fit of the items to the
Rasch model. The best 15 and 30 items fitting the model were chosen
from the total subtest, and their corresponding item difficulties
reported.
Double Cross-Validation
A double cross-validation design (Mosier, 1951) was used to obtain
item parameter estimates for the best 15 and 30 items selected by the
three item analytic techniques for the three sample sizes. In this
study a 3 x 3 latin square was used to reassign samples. This procedure
ensured that the estimates of the item parameters would be based
upon a different sample of subjects than the original sample used to
identify the best items. Each item analytic technique was randomly
reassigned, using a latin square procedure (Cochran and Cox, 1957,
p. 121), to a different sample within each of the three groups (N = 250,
N = 500, N = 995). The double cross-validation design is shown in
Table 3.
TABLE 3
DOUBLE CROSS-VALIDATION DESIGN OF THE STUDY

Group      Sample Number    Number Selected    Item Analytic Procedure    Double Cross-Validation Procedure
Group 1         1                250           Classical                  Factor Analysis
                2                250           Factor Analysis            Rasch
                3                250           Rasch                      Classical
Group 2         4                500           Classical                  Rasch
                5                500           Factor Analysis            Classical
                6                500           Rasch                      Factor Analysis
Group 3         7                995           Classical                  Rasch
                8                995           Factor Analysis            Classical
                9                995           Rasch                      Factor Analysis

Note: The sample number is the same as referred to in Table 2. Assignment to sample was based on a randomized 3 x 3 latin square procedure.

The best 15 and 30 items selected by each item analytic procedure
in the first phase of the study were submitted to a standard item
analysis program (GITAP) from which were obtained the dependent
variables in the study:

• indices of internal consistency as measured by the analysis of variance procedure (Hoyt, 1941)
• the standard error of measurement
• item difficulty
• biserial correlations

By submitting the best 15 and 30 items selected by each item analytic
procedure in the study to a common item analysis program, comparable
measures of the dependent variables were obtained.
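Hoyt's (1941) analysis of variance estimate of internal consistency, together with the standard error of measurement, can be sketched as follows; this is a minimal sketch assuming a persons-by-items matrix of 0/1 scores, not the GITAP program itself.

```python
import numpy as np

def hoyt_reliability(scores):
    """Hoyt (1941) internal consistency from a persons x items matrix of 0/1 scores:
    r = 1 - MS(residual) / MS(persons), from a persons-by-items analysis of variance."""
    X = np.asarray(scores, float)
    n, k = X.shape
    grand = X.mean()
    ss_persons = k * np.sum((X.mean(axis=1) - grand) ** 2)
    ss_items = n * np.sum((X.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((X - grand) ** 2)
    ms_persons = ss_persons / (n - 1)
    ms_residual = (ss_total - ss_persons - ss_items) / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons

def standard_error_of_measurement(scores):
    """SD of total scores times the square root of (1 - reliability)."""
    totals = np.asarray(scores, float).sum(axis=1)
    return totals.std(ddof=1) * np.sqrt(1.0 - hoyt_reliability(scores))

# Hypothetical 5 examinees x 4 items.
X = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1]]
print(hoyt_reliability(X), standard_error_of_measurement(X))
```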
Statistical Analyses
The third phase of the study focused on obtaining measures of
statistical significance for three of the dependent variables: internal
consistency, item difficulties, and biserial correlations. Only visual
comparisons were made for the remaining dependent variable, the standard
error of measurement. The internal consistency estimates from each test
were compared to the projected population value for tests of similar
length via confidence intervals as suggested by Feldt (1965). (Projected
population values were obtained using the Spearman-Brown Prophecy Formula.)
Item difficulties for the 15 and 30 best items were submitted to a
two-way analysis of variance, the two factors being sample size and item
analytic technique. This procedure was used to test for differences in
the types of items selected, in terms of item difficulty, by each technique.
If statistical significance was observed, with α = .05, Tukey's HSD
(honestly significant difference) post hoc procedure (Kirk, 1968) was
employed to determine which item analytic technique(s) resulted in a
test with the highest item difficulties.
The biserial correlations were transformed to an interval scale
of measurement using a linear function of z suggested by Davis (1946).
The linear transformation was based upon converting the biserial
correlation to z values, and then eliminating the decimals and negative
values of z by multiplying each z value by the constant 60.241 (Davis,
1946, pp. 12-15). Thus, the transformed biserials ranged
between 0 and 100. A two-way analysis of variance (sample size by item
analytic technique) was performed on the transformed biserial correlations
for the best 15 items. This type of analysis was used to test for
differences in the types of items selected, in terms of biserial correla-
tions, by each technique. If statistical significance was observed,
α = .05, Tukey's HSD post hoc procedure was employed to determine which
item analytic technique(s) resulted in higher transformed biserial
correlations.
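A sketch of this transformation as described above (Fisher z of the biserial, multiplied by 60.241) is given below; it follows the text's description rather than Davis's (1946) original tables, and for the positive biserials of retained items it yields values roughly on a 0 to 100 scale.

```python
import numpy as np

def davis_transform(r_biserial):
    """Convert biserial correlations to Fisher z and rescale by the constant 60.241,
    giving values on an approximately interval scale."""
    z = np.arctanh(r_biserial)   # Fisher z transformation
    return 60.241 * z

print(davis_transform(np.array([0.20, 0.55, 0.80])))
```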
The two-way analysis of variance and post hoc analysis, where
indicated, for the transformed biserial correlations were also performed on
the 30 best items. (The analysis of variance procedure is appropriate
only if the distribution of the item difficulties and transformed biserial
correlations approximates normality and the variances are homogeneous;
Ware and Benson, 1975.)
In addition to tests of statistical significance, a measure of
the efficiency of the 30 best items selected by each procedure was
compared for the sample of 995 subjects. Birnbaum (1968) defined the
relative efficiency of two testing procedures as the ratio of their
information curves. Lord (1974a) has described a procedure to compare
the relative efficiency of one test with another at different ability
levels. If two tests to be compared vary in difficulty, then the
relative efficiency of each will usually be different at different
ability levels (Lord, 1974b, 1977). In classical test theory it is
common to compare two tests that measure the same ability in terms of
their reliability coefficients, but this only gives a single overall
comparison. The formula developed by Lord for relative efficiency
provides a more precise way of comparing two tests that measure the
same ability. The formula for approximating relative efficiency is
(Lord, 1974b, p. 248):

$$R.E.(y,x) = \frac{n_y \, x(n_x - x) \, f_x^2}{n_x \, y(n_y - y) \, f_y^2}, \qquad (13)$$

where R.E.(y,x) denotes the relative efficiency of y compared to x, n_x and n_y
denote the number of items in the two tests, x and y are the number-
right scores having the same percentile rank, and f_x^2 and f_y^2 are the
squared observed frequencies of x and y. Lord has suggested that
formula 13 only be used with a large sample of examinees and tests that
are not extremely short; hence this comparison was restricted to the
case where N = 995 and the 30 item test.
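A minimal sketch of formula 13 is given below; the score values, test lengths, and observed frequencies in the example are hypothetical.

```python
def relative_efficiency(x, y, n_x, n_y, f_x, f_y):
    """Lord's approximation (formula 13) to the relative efficiency of test y
    compared with test x at number-right scores x and y sharing the same
    percentile rank; f_x and f_y are the observed frequencies at those scores."""
    return (n_y * x * (n_x - x) * f_x ** 2) / (n_x * y * (n_y - y) * f_y ** 2)

# Hypothetical values: two 30 item tests, scores 18 and 20 at the same percentile
# rank, with 60 and 45 examinees observed at those scores.
print(relative_efficiency(x=18, y=20, n_x=30, n_y=30, f_x=60, f_y=45))
```

With equal test lengths and equal scores, the ratio reduces to f_x^2 / f_y^2, which is the interpretation used in Chapter IV: the fewer examinees observed at a given percentile rank, the better the test discriminates there.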
Three relative efficiency comparisons using the 30 item tests were
made: (a) the test based on factor analysis was compared to the test
based on classical item analysis, (b) the test based on the Rasch
analysis was compared to the test based on classical item analysis, and
(c) the test based on the Rasch analysis was compared to the test based
on factor analysis.
Summary
An empirical study was designed to compare the effects of classical
item analysis, factor analysis, and the Rasch model on test development.
Item response data were obtained from a sample of 5,235 high school
seniors on a cognitive test of verbal aptitude.

The subjects were divided into 9 samples: three independent
groups of 250 subjects each, three independent groups of 500 subjects
each, and three independent groups of 995 subjects each. The independent
groups were obtained so that tests of statistical significance could
be performed.
The item response data were then analyzed in three phases. First,
the "best" 15 and 30 items were selected using each item analytic
technique. Under classical item analysis, the best 15 and 30 items
were selected based on the highest biserial correlations. For factor
analysis, the best 15 and 30 items were selected based on the highest
item loadings on the first (unrotated) principal component. The
selection of the best 15 and 30 items using the Rasch model was
based upon the mean square fit of the items to the model. These
procedures were used for each group of subjects. Second, a double
cross-validation design was employed to obtain estimates of the item
parameters for the best 15 and 30 items. The three item analytic
techniques were reassigned randomly to different samples of subjects
within each level of sample size. Then, the best 15 and 30 items
chosen by each method were submitted to a common item analytic procedure
in order to obtain estimates for comparing the three item analytic
methods. Third, a two-way analysis of variance and a Tukey post hoc
comparison test, when indicated, were used to test for differences in
the properties of items selected by each item analytic procedure. Also,
confidence intervals were calculated to compare the internal consistency
estimates to a population value. In addition, the relative efficiencies
of the 30 item tests developed by each item analytic technique were
compared for the sample of 995 subjects.
CHAPTER IV
RESULTS
The study was designed to compare empirically the precision and
efficiency of tests developed using three item analytic techniques:
classical item analysis, factor analysis, and the Rasch model. The
following five hypotheses were generated to compare the three techniques:
1. There are no significant differences in the internal consistency
estimates of the tests produced by the three methods as the number of
items decreases when compared to the projected internal consistency
estimates for the population for tests of similar length.
2. There are no differences in the internal consistency estimates
of the tests produced by the three methods when the number of examinees
is decreased.

3. There are no meaningful differences in the magnitude of
the standard error of measurement of the tests produced by the three
methods. (A meaningful difference was previously defined to be ≥ 1.00.)

4. There are no significant differences in the difficulties or
discriminations of the items selected by the three methods.
5. There are no differences across ability levels in the efficiency
of the tests produced by the three methods.
The Verbal Aptitude subtest of the Florida Twelfth Grade Test was
used to test the hypotheses. A sample of 5,235 examinees was
systematically selected from a population of 78,751. A demographic
breakdown of the sample by ethnic origin and sex is presented in Table 4.
The data were analyzed and reported in the following manner: item
selection, double cross-validation, comparison of the 15 item tests on
precision, and comparison of the 30 item tests on precision and efficiency.
These results were then summarized with respect to the five hypotheses.
Item Selection
The 50 items on the Verbal Aptitude subtest were submitted to each
of the three item analytic techniques. The means, medians, and standard
deviations of the biserial correlations and item difficulties, based
on classical item analysis, are presented in Table 5. These descriptive
statistics appear equivalent across the varying sample sizes.

From the factor analysis, the percentage of total test variance
accounted for by the 50 verbal items on the first unrotated principal
component has been reported in Table 6. The percentage of variance
accounted for by the first principal component was obtained by summing
the squared item loadings and dividing by the total number of items.
The percentages of variance accounted for by the first principal component
in each sample were very similar. A check on the unidimensionality
of the test was made by rotating the principal components solution for
the sample of 995 subjects. Upon rotation, the results indicated
one dominant factor remained.
[Table 4. Demographic breakdown of the sample by ethnic origin and sex. Table values not legible in this transcript.]

[Table 5. Means, medians, and standard deviations of the biserial correlations and item difficulties for the 50 items, based on classical item analysis, by sample size. Table values not legible in this transcript.]

[Table 6. Percentage of total test variance accounted for by the first unrotated principal component, by sample size. Table values not legible in this transcript.]
Three statistics are reported in Table 7 for the Rasch item
analysis procedure. For each sample size, the percentage of total
variance accounted for by the first unrotated principal component
and the means and standard deviations of the mean square fit statistic
and Rasch difficulties are presented for the items selected.

In order to select the best 15 and 30 items from the Rasch analysis,
all 50 items were submitted to a principal components solution. This
procedure was used to ensure that the items selected measured one
trait, as required by the assumption of test unidimensionality. As
noted in Table 7, the percentage of total test variance accounted for
by the first principal component, based on 50 items, was nearly equal
for each sample size.

From the principal components solution only items with loadings
between .39 and .79 were selected for the Rasch analysis as suggested
by Hambleton and Traub (1971), to adhere to the assumption of equal
item discriminations. Using this procedure the number of items (out
of 50) retained for the Rasch analysis varied slightly with sample
size; when N = 250, 53 items were retained, when N = 500, 35 items
were retained, and when N = 995, 33 items were retained. These items,
loading between .39 and .79, were then submitted to the Rasch analysis
to obtain mean square fit statistics and Rasch item difficulties.
These statistics have been reported in Table 7.

Wright and Panchapakesan (1969) developed a measure to assess the
fit of the item to the Rasch model. The measure, defined as the mean
square fit statistic, is:
[Table 7. Percentage of total variance accounted for by the first unrotated principal component, and means and standard deviations of the mean square fit statistics and Rasch item difficulties for the items selected, by sample size. Table values not legible in this transcript.]
$$\chi^2 = \sum_{i=1}^{k-1} \sum_{j=1}^{n} y_{ij}^2. \qquad (14)$$

The quantity χ² defined above has approximately the chi-square
distribution with degrees of freedom equal to (k-1)(n-1). The
value y_ij is the deviation of the item from the model, or item misfit,
and is determined by taking the difference between the observed and
expected frequency of the examinees at a given ability level who
answered a given item correctly. This difference was then divided
by the standard deviation of the observed frequency, squared, and
summed over items and score groups. The BICAL program standardizes
these deviations (y_ij) in computing the mean square fit statistic;
therefore, y_ij has a normal distribution with a mean of zero and
standard deviation of one (Hambleton et al., 1977). Items with large
mean square fit values are items which do not fit the model. As shown
in Table 7, the mean and standard deviation of the mean square fit
statistic increased with sample size.
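A minimal sketch of the unstandardized sum in equation 14 is given below; it assumes the observed and expected frequencies of correct responses and the standard deviations of the observed frequencies for each score group and item are already available, and it is not the BICAL computation, which further standardizes the deviations before reporting a mean square fit.

```python
import numpy as np

def fit_chi_square(observed, expected, sd_observed):
    """Equation 14: squared standardized deviations y_ij = (observed - expected) / sd,
    summed over the score groups (rows) and items (columns) supplied by the caller.
    The text gives the degrees of freedom as (k - 1)(n - 1)."""
    obs = np.asarray(observed, float)
    exp = np.asarray(expected, float)
    sd = np.asarray(sd_observed, float)
    y = (obs - exp) / sd
    return np.sum(y ** 2)

# Hypothetical 3 score groups x 2 items.
obs = [[12, 15], [20, 22], [28, 25]]
exp = [[10, 14], [21, 23], [27, 26]]
sd  = [[3.0, 3.1], [3.5, 3.4], [2.9, 3.0]]
print(fit_chi_square(obs, exp, sd))
```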
The item difficulty estimates based on the Rasch model also have
an expected mean of zero and standard deviation of one (Wright and
Mead, 1975). These estimates remained very similar across sample
size and exceptionally close to the expected values (Table 7).
The Rasch model does not provide a parameter for item discriminating
power, as all item discriminations are considered equal and centered
at one (Wright and Mead, 1975). The BICAL program provided, as part
of the normal output, estimates of each item's discriminating power
to check the fit of the data to the model. The item discriminations
were obtained by regressing the difficulty of the item for each ability
group on the ability estimate of the group (Wright and Mead, 1975, p. 11).
The means and standard deviations for the item discrimination estimates
were shown in Table 8 for each sample size.
TABLE 8
DESCRIPTIVE DATA ON ITEM DISCRIMINATION ESTIMATES BASED ON THE
RASCH MODEL ACCORDING TO SAMPLE SIZE

                        N = 250      N = 500      N = 995
                        K = 53       K = 35       K = 33
Mean                     1.03         1.02         1.03
Standard Deviation        .28          .19          .22

Note: N = sample size; K = number of items.
From the data in Table 8, the mean item discrimination estimates
appear nearly equal for each sample size, and quite close to the mean
expected value of one.
The best 15 and 30 items were then selected by each item analytic
procedure based on the information in Tables 5-7, and have been listed
in Tables 9 and 10 respectively.
The items selected under classical item analysis were determined by
the magnitude of the biserial correlation, i.e., the 15 and 30 items
having the highest biserial correlations with total test score were
selected. Indices of item difficulty have been reported for inspection,
but in no way influenced the selection of items for classical item
analysis.
[Table 9. The best 15 items selected by each item analytic procedure, by sample size. Item entries not legible in this transcript.]

[Table 10. The best 30 items selected by each item analytic procedure, by sample size. Item entries not legible in this transcript.]
The selection of items under factor analysis was determined by
the item loadings on the first unrotated principal component. The
15 and 30 items having the highest item-component biserial
correlations were selected.

The selection of the 15 and 30 items from the Rasch analysis was
determined by the mean square fit of the item to the Rasch model. The
closer the mean square fit was to zero, the better the item fit the
model; thus items with the lowest mean square fit statistics were
selected.
Double Cross-Validation
After the tests of the best 15 and 30 items were developed by each
procedure, they were scored on independent samples, in a double cross-
validation procedure as noted in Table 3, Chapter III. Item and
test statistics, needed to test the five hypotheses, were obtained for
the 15 and 30 item tests based on the cross-validation samples using the
GITAP program (Baker and Martin, 1970).
The GITAP program provided the following output:
• each subject's total test score
• test mean and standard deviation
• internal consistency estimates as measured by Hoyt's analysis of variance procedure
• estimates of the standard error of measurement
• indices of item difficulty and biserial correlations
Comparison of the 15 Item Tests on Precision
The descriptive statistics based on the double cross-validation
samples for the 15 item tests have been presented in Table 11.
[Table 11. Descriptive statistics for the 15 item tests, based on the double cross-validation samples, by item analytic procedure and sample size. Table values not legible in this transcript.]
The values of the internal consistency estimates for the tests
developed using the Rasch model were consistently lower than the
internal consistency estimates of the tests developed by classical
item analysis and factor analysis across all sample sizes.

The observed internal consistency estimates were tested for
significance using confidence intervals described by Feldt (1965), to
see if they were statistically different from the internal consistency
estimate for the projected population using the Spearman-Brown Prophecy
Formula.
The internal consistency estimate for the population based on the
original 50 item subtest was .88 (Table 1). By applying the
Spearman-Brown Prophecy Formula (Mehrens and Lehman, 1975) the projected
population internal consistency estimate for a 15 item test was found
to be .687. The value .687 was the expected internal consistency if 35
of the 50 items were randomly deleted. Thus, confidence intervals
were generated around the observed internal consistency estimates,
presented in Table 11, for each procedure across all sample sizes to
see if any of the three item analytic techniques would produce a more
reliable test than would be expected from mere random item deletion.
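The projected values .687 here and .814 in the following section follow directly from the Spearman-Brown formula applied to the population reliability of .88 with length ratios k = 15/50 and k = 30/50; a worked sketch is:

```latex
% Spearman-Brown projection r_k = k r / (1 + (k - 1) r), with r = .88.
\[
r_{15} = \frac{(0.3)(.88)}{1 + (0.3 - 1)(.88)} = \frac{.264}{.384} \approx .687,
\qquad
r_{30} = \frac{(0.6)(.88)}{1 + (0.6 - 1)(.88)} = \frac{.528}{.648} \approx .814.
\]
```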
The confidence intervals for the observed consistency estimates
for each procedure have been reported in Table 12.

When the sample sizes were 250 and 995, each item analytic technique
produced an internal consistency estimate that was significantly different
from the projected population estimate (.687) at a confidence level of
95 percent. Each of the three techniques systematically retained
the 15 most homogeneous items. These tests were more precise
in terms of internal consistency than would have been found if the
items were randomly deleted, as noted by comparisons to the projected
population reliability coefficient.
TABLE 12
CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES
BASED ON THE 15 ITEM TESTS ACCORDING TO SAMPLE SIZE

                          95% Confidence Interval
Procedure           N = 250          N = 500          N = 995
Classical           .748 - .828*     .792 - .838*     .810 - .841*
Factor Analysis     .760 - .837*     .786 - .854*     .797 - .831*
Rasch               .704 - .799*     .688 - .757      .711 - .759*

Note: The F values used in calculating the confidence intervals were obtained from Marascuilo (1971).
*Statistical significance is indicated when the population internal consistency estimate is not included in the confidence interval generated for each observed internal consistency estimate. The projected population value was .687.
Only two procedures, classical item analysis and factor analysis,
produced tests with internal consistency estimates significantly
different from the projected population estimate when the sample size
was 500.

As sample size decreased, in most cases, the internal consistency
for each method tended to decrease (Table 11). An exception was noted
for the Rasch tests: when the sample size decreased from 500 to 250,
internal consistency improved slightly.
The data reported in Table 11 indicated that the standard errors
of measurement for the 15 item tests based on the Rasch model were
consistently larger than the standard errors of measurement of the
tests developed from classical item analysis and factor analysis for each
sample size. However, these differences were not meaningful in that
the difference did not equal or exceed 1.00 for any of the three
procedures.
The differences in mean item difficulties and discriminations
were tested for statistical significance to determine whether there were
differences in the types of items retained by each item analytic method.
In this study item discriminations were measured by biserial correlations.
A two-way analysis of variance (fixed effects model) was performed
separately for the two dependent variables of item difficulty and item
discrimination. A check was made on the assumptions for the analysis
of variance to ensure that they were met. In these analyses, item
analytic technique and sample size were the two independent factors, each
with three levels.

For item difficulty, no significant differences were found for
item analytic technique, sample size, or their interaction, F (2,126) =
2.57, p > .05; F (2,126) = .45, p > .05; F (4,126) = .33, p > .05,
respectively. The means, standard deviations, and ranges of the item
difficulties based upon the 15 item tests have been reported in Table 13.

For the analysis of variance performed on the transformed
biserial correlations, a significant F ratio was observed for the factor
of item analytic technique, F (2,126) = 14.862, p < .05. No significant
differences were observed for sample size or the interaction of sample
size and item analytic technique for the transformed biserial
correlations, F (2,126) = .30, p > .05; F (4,126) = 1.16, p > .05,
respectively. The means, standard deviations, and ranges of the
transformed biserial correlations, based upon the 15 item tests, have
been presented in Table 14.
[Table 13. 15 item tests: descriptive statistics for item difficulty by procedure and sample size. Table values not legible in this transcript.]
The post hoc comparisons of the differences between the mean item
discriminations have been reported in Table 15.
[Table 14. 15 item tests: descriptive statistics for item discriminations (transformed biserial correlations) by procedure and sample size. Mean discriminations: Classical 53.60, Factor Analysis 54.49, Rasch 43.51; standard deviations and ranges not legible in this transcript.]
TABLE 15
POST HOC COMPARISONS OF THE DIFFERENCES BETWEEN THE MEAN
ITEM DISCRIMINATIONS FOR THE 15 ITEM TESTS

                                      Means
                                 54.49      53.60      43.51
Factor Analysis (54.49)            --        .889      10.98**
Classical Item Analysis (53.60)                        10.09**
Rasch Analysis (43.51)                                    --

Note: Based on transformed biserial correlations. The transformation was a linear transformation of the Fisher z statistic and multiplication by the constant 60.241, providing a range of 0-100 for the biserial correlation (Davis, 1946).
**p < .01, HSD = 6.88.
Comparison of the 30 Item Tests on Precision

By increasing the test length to 30 items, the internal consistency
estimate was increased across each method and sample size, but a pattern
similar to that for the 15 item tests emerged. The internal consistency
estimates from the tests based on the Rasch model were slightly lower
than the internal consistency estimates for the tests based on
classical item analysis and factor analysis. The observed internal
consistency estimates were tested for significance, using the confidence
intervals described in the previous section, to see if they were
statistically different from the internal consistency estimate for the
population.

The projected population internal consistency estimate for a 30
item test was found to be .814 (via the Spearman-Brown Prophecy Formula).
[Table 16. Descriptive statistics for the 30 item tests, based on the double cross-validation samples, by item analytic procedure and sample size. Table values not legible in this transcript.]
The value of .814 indicated the expected internal consistency
if 20 of the 50 items were randomly deleted.

Based on the observed internal consistency estimates reported
in Table 16, confidence intervals were generated for each item analytic
procedure and have been presented in Table 17.
TABLE 17
CONFIDENCE INTERVALS FOR THE OBSERVED INTERNAL CONSISTENCY ESTIMATES
BASED ON THE 30 ITEM TESTS ACCORDING TO SAMPLE SIZE

                          95% Confidence Interval
Procedure           N = 250          N = 500          N = 995
Classical           .800 - .865      .845 - .880*     .850 - .874*
Factor Analysis     .825 - .881*     .834 - .871*     .848 - .873*
Rasch               .794 - .860      .817 - .858*     .838 - .863*

Note: The F values used in calculating the confidence intervals were obtained from Marascuilo (1971).
*Statistical significance is observed when the population internal consistency estimate is not included in the confidence interval generated for each observed internal consistency estimate. The projected population value was .814.
For the sample of 250 examinees, only one item analytic technique
(factor analysis) produced an internal consistency estimate that was
statistically different from the projected population estimate at a
confidence level of 95 percent.
However, all three techniques produced tests with internal con-
sistency estimates significantly different from the projected population
estimate when the sample was increased to 500 and 995. Thus, when
the number of examinees was large, each of the three techniques
produced tests with higher internal consistency estimates than if the
test were produced by randomly deleting items.

For the 30 item tests, the effect of decreasing the sample size
tended to decrease internal consistency for each method (Table 16),
but the decrease was very slight.
The standard error of measurement was essentially the same for
the three methods of item analysis across the varying sample sizes.

Two-way analyses of variance were run on item difficulties and
item discriminations for the 30 item tests, similar to those run for
the 15 item tests. Again, the independent variables were item analytic
technique and sample size, each containing three levels.

No significant differences were observed for item difficulty
for the independent variables of item analytic technique, sample size,
or their interaction, F (2,261) = .46, p > .05; F (2,261) = .27, p > .05;
F (4,261) = .24, p > .05, respectively.

No significant differences were observed for the transformed
biserial correlations for the independent variables of item analytic
technique, sample size, or their interaction, F (2,261) = 1.97, p > .05;
F (2,261) = .74, p > .05; F (4,261) = .48, p > .05, respectively.
(When the actual biserial correlations were tested in the two-way
analysis of variance design, similar F values were observed.)
The means, standard deviations, and ranges of the item difficulties
and transformed biserial correlations based upon the 30 item tests
have been presented in Tables 18 and 19 respectively.

[Table 18. 30 item tests: descriptive statistics for item difficulty by procedure and sample size. Table values not legible in this transcript.]
Comparison of the 30 Item Tests on Efficiency

Lord (1974a, 1974b) proposed the formula used for approximating
the relative efficiency of two tests, stated previously in equation
13 as:

$$R.E.(y,x) = \frac{n_y \, x(n_x - x) \, f_x^2}{n_x \, y(n_y - y) \, f_y^2},$$

where R.E.(y,x) denotes the relative efficiency of y compared to x,
n_x and n_y are the numbers of items in the two tests, x and y are the
number-right scores having the same percentile rank, and f_x^2 and f_y^2 are
the squared observed frequencies of x and y obtained from frequency
distributions for similar groups of examinees. A careful examination
of the formula for relative efficiency indicated that when n_x = n_y and
x = y, it was the number of examinees at the specified ability
level (f_x^2 and f_y^2) that determined the efficiency of the test. That is,
the fewer examinees observed at a particular percentile rank, the better
the test discriminates at that percentile rank. Therefore, test
efficiency was equated with the level of discrimination the test
was able to make between examinees at various scores or percentile
ranks.
Three relative efficiency comparisons were made using the 30 item
tests based on the sample of 995 examinees. The three comparisons were:
(a) the test developed from factor analysis was compared to the test
developed by classical item analysis, (b) the test developed by Rasch
analysis was compared to the test developed by classical item analysis,
and (c) the test developed from the Rasch analysis was compared to the
factor analytically developed test.
The efficiency curves for the three comparisons are shown in
Figure 2. The relative efficiency value was plotted on the ordinate,
while the percentile rank (student ability level) was plotted
along the abscissa. Computed values for the relative efficiency
comparisons have been reported in Appendix B. A relative efficiency
of 1.00 would indicate that the tests are equally efficient.

[Figure 2. Relative efficiency curves for the three comparisons of the 30 item tests, plotted against percentile rank.]
The test developed by factor analysis was more efficient for the
lower tenth of the pupils when compared to the test developed from
classical item analysis. Both tests were about equally efficient
for the middle ability groups and high ability groups.

The Rasch developed test was more efficient than the test based
on classical item analysis for average to high ability students
(40th-90th percentile rank). However, it was less efficient than the
classical item analysis test for students with very low or very high
abilities (1st-20th percentile rank and 98th percentile rank).

When compared to the factorially developed test, the Rasch test
was again more efficient for students of average to high abilities
(50th-90th percentile rank). The factorially developed test appeared
more efficient for the very low and very high ability students
(1st-20th percentile rank and 98th percentile rank).
Summary

The results reported in this chapter are summarized for each of
the five hypotheses.
Hypothesis 1. There are no significant differences in the
internal consistency estimates of the tests produced by the three
methods, as the number of items decreases, when compared to the
projected internal consistency estimates for the population for tests
of similar length.

Confidence intervals were calculated to test for differences between
the observed internal consistency estimates and the internal consistency
estimate for the population. As reported in Tables 12 and 17, for the
15 and 30 item tests, 15 of the 18 confidence intervals (at the 95 percent
level) generated around the sample estimates did not contain the population
value. This means that 15 of the observed internal consistency estimates
were superior to the population values projected for subtests of similar
length created by random deletion of items. Therefore, hypothesis one
was not supported. The procedures that produced the three observed
internal consistency estimates that were not significantly different
from the population value, and hence no different than would be expected
by random item deletion, were the Rasch procedure (15 item test, N = 500;
30 item test, N = 250) and the classical item analysis procedure (30
item test, N = 250).
Hypothesis 2. There are no differences in the internal consistency
estimates of the tests produced by the three methods when the number
of examinees is decreased.

Hypothesis two was supported for the 15 and 30 item tests. Slight
decreases in internal consistency estimates were noted for the 15
item tests (Table 11) as sample size decreased, but only decreases of
one or two one-hundredths of a point. Even smaller decreases were
observed on the 30 item tests (Table 16).
Hypothesis 3. There are no meaningful differences in the
magnitude of the standard error of measurement of the tests produced
by the three methods.

Hypothesis three was supported for the 15 and 30 item tests.
Meaningful differences were defined to be ≥ 1.00, but none of the
three methods produced tests with standard errors of measurement that
differed by that much. In each case, the difference was approximately
one-tenth of a point or less (Tables 11 and 16).
Hypothesis 4. There are no differences in the difficulties or
discriminations of the items selected by the three methods.

Hypothesis four was supported for the 15 and 30 item tests with
respect to item difficulty. That is, the two-way analysis of variance
revealed no significant differences for either the 15 or 30 item tests
with regard to item difficulty.

Hypothesis four was also supported for item discrimination, but
only for the 30 item tests. The two-way analysis of variance for item
discrimination indicated no significant differences for the 30 item
tests; however, on the 15 item tests, a significant F ratio (p < .05)
for item analytic procedure was observed for item discrimination.
Tukey's HSD test revealed that items selected by the Rasch procedure
had significantly lower average biserial correlations than the items
selected by factor analysis and classical item analysis (Table 15).
This could have been expected because the range of the biserial
correlation was restricted when the items were originally selected
for the Rasch model. This procedure was necessary to meet one of the
assumptions for the Rasch model.
Hypothesis 5. There are no differences across ability levels in
the efficiency of the tests produced by the three methods.
Hypothesis five was not supported. The efficiency curves
illustrated in Figure 2 generally indicated that the tests based on
classical test theory were more effective for measuring students with
very low ability (20th percentile rank or less) and students with very
high abilities (98th percentile rank). The Rasch developed test was
most efficient for assessing average and high ability students (40th-
90th percentile rank).
CHAPTER V
DISCUSSION AND CONCLUSIONS
This study was conducted to determine which of the three item
analytic procedures (classical item analysis, factor analysis, and the
Rasch model) might produce the superior test in terms of the precision
and the efficiency of measurement. A common item and examinee population
was used to test five hypotheses. Of the five hypotheses, three dealt
with elements of test precision as measured by internal consistency
estimates. Another hypothesis treated the issue of item discriminations.
Thus, it too was related to internal consistency. The fifth hypothesis
focused on the relative efficiency of the tests produced by three item
analytic techniques. This hypothesis altered the emphasis of the study
from one overall specific measure of a test's accuracy, in terms of
internal consistency, to a general comparison of each method as a
function of ability level. The discussion of the results then has been
focused in two major areas: (a) the precision of the tests, and (b)
the efficiency of the tests produced by the three methods of item analysis.
The Precision of the Tests Produced by the
Three Methods of Item Analysis
Each of the three item analytic techniques was applied to an in-
dependent sample to select the best 15 and 30 items. The stability of
the summary statistics across each sample size for the three item analytic
techniques indicated a tendency for the nine samples to be very homogeneous.
The similarity of the means, standard deviations, and percentages of
variance accounted for were noted on Tables 5-8, with the exception
of the mean square fit statistic (Table 7) which increased with
sample size. (This exception is discussed later in this chapter.)
From these samples, items were selected by each item analytic technique
to maximize internal consistency.
The data reported in Tables 11 and 16 indicated the effectiveness
of each item analytic technique in producing internally consistent
tests. Before an overall decision can be made as to the superiority of
one technique over another, each of the hypotheses relating to precision
must be considered.
Internal Consistency
Data in Tables 11 and 16 indicate that the two tests based on
classical test theory (factor analysis and classical item analysis)
appeared superior in terms of internal consistency when compared to the
tests developed by the Rasch model.
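To make the internal consistency criterion concrete, the following is a minimal sketch, not the author's original program, of how a KR-20 estimate can be computed from a persons-by-items matrix of dichotomous scores; the simulated data and all variable names are illustrative assumptions.

    import numpy as np

    def kr20(responses):
        # KR-20 = (k/(k-1)) * (1 - sum(p*q) / variance of total scores),
        # where p is each item's proportion correct and q = 1 - p.
        responses = np.asarray(responses, dtype=float)
        n_persons, k = responses.shape
        p = responses.mean(axis=0)            # item difficulties (proportion correct)
        q = 1.0 - p
        total = responses.sum(axis=1)         # each examinee's raw score
        return (k / (k - 1)) * (1.0 - (p * q).sum() / total.var(ddof=1))

    # Illustrative use with simulated 0/1 responses: 500 examinees, 15 items.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(500, 1))
    difficulty = rng.normal(size=(1, 15))
    prob = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    data = (rng.random((500, 15)) < prob).astype(int)
    print(round(kr20(data), 3))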
To test whether any of the three methods produced tests with
greater internal consistency than a test created by random item deletion,
the internal consistency estimates were compared to the projected internal
consistency value for the population by using confidence intervals as
suggested by Feldt (1965). In order for a given sample internal consis-
tency estimate to be significant, the population value could not be
included in the confidence interval generated around that sample value.
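The interval itself can be sketched as follows. This is a hedged illustration of the form commonly attributed to Feldt (1965), assuming that (1 minus the population value) divided by (1 minus the sample value) follows an F distribution with N - 1 and (N - 1)(k - 1) degrees of freedom; the numbers passed in are invented.

    from scipy.stats import f

    def feldt_interval(alpha_hat, n_persons, n_items, conf=0.95):
        # Assumes (1 - alpha_pop)/(1 - alpha_hat) ~ F(n_persons - 1, (n_persons - 1)*(n_items - 1)).
        df1 = n_persons - 1
        df2 = (n_persons - 1) * (n_items - 1)
        tail = (1.0 - conf) / 2.0
        lower = 1.0 - (1.0 - alpha_hat) * f.ppf(1.0 - tail, df1, df2)
        upper = 1.0 - (1.0 - alpha_hat) * f.ppf(tail, df1, df2)
        return lower, upper

    # A projected population value falling outside this interval would be judged
    # significantly different from the observed estimate (illustrative numbers only).
    print(feldt_interval(alpha_hat=0.80, n_persons=500, n_items=15))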
For the 15 item tests, nine confidence intervals were calculated for
the nine estimates of internal consistency, one for each method at each
sample size. Eight of the nine sample values were shown to be significantly
greater than the population estimate at the 95 percent confidence level
(Table 12). Only the internal consistency estimate of the Rasch test,
based on the sample of 500 examinees, failed to reach a level significantly
greater than would have been expected by chance.
For the 30 item tests, nine confidence intervals were also calculated
for the nine estimates of internal consistency, one for each method
at each sample size. Seven of the nine sample internal consistency
estimates were shown to be significantly greater than the population
estimate at the 95 percent confidence level (Table 17). The tests
based on classical item analysis and Rasch analysis, for the sample of
250 examinees, were not significantly different from the projected
population value for a 30 item test created by random item selection.
Therefore, for smaller samples (N = 250) factor analysis appeared
to be superior to classical item analysis and the Rasch analysis in
producing the most precise test.
Generally, as the number of examinees decreased so did the internal
consistency estimates. However, the tests based on factor analysis were
least affected by decreasing the sample sizes used in the cross-validation
for the 15 and 30 item tests (Tables 11 and 16).
Standard Error of Measurement
The standard error of measurement is the standard deviation of the
distribution of errors surrounding an individual's observed score on
an infinite number of parallel tests. Hence the smaller the standard
error of measurement, the greater the precision of the measurement. This
statistic is often considered a more meaningful measure of an instrument's
reliability than the reliability coefficient itself (Magnusson, 1966,
p. 82). Based on the data for this study, the standard errors of
measurement were consistently smaller for both the 15 and 30 item tests
produced by classical test theory as compared to the 15 and 30 item
tests based on the Rasch model; however, the differences in the standard
errors of measurement did not equal or exceed 1.00 for any of the
methods.
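As a small illustration of the statistic being compared here, the standard error of measurement can be computed directly from a test's score standard deviation and its reliability estimate; the figures below are invented, not the study's values.

    import math

    def standard_error_of_measurement(sd, reliability):
        # SEM = SD * sqrt(1 - reliability): the spread of the errors assumed to
        # surround an observed score over repeated parallel measurements.
        return sd * math.sqrt(1.0 - reliability)

    # Two hypothetical tests with the same score spread but slightly
    # different internal consistencies differ only modestly in SEM.
    print(round(standard_error_of_measurement(sd=5.0, reliability=0.88), 2))
    print(round(standard_error_of_measurement(sd=5.0, reliability=0.84), 2))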
Types of Items Retained
Item difficulty. Item difficulties of the 15 and 30 item tests
were analyzed in separate two-way analyses of variance. The two
independent variables were sample size and item analytic technique.
No significant F ratios were observed for either the 15 or 30 item
tests on item difficulty. Therefore, each item analytic technique
tended to select items which had similar item difficulties on the
average.
Item discrimination . In this study, the item discriminations were
measured by biserial correlation. Transformed biserial correlations
for the 15 and 30 item tests were analyzed in separate two-way analyses
of variance. The two independent variables were sample size and item
analytic technique. For the 15 item tests, a significant F ratio
(£ < .05) was observed for the main effect of item analytic technique.
The mean biserial correlation for each 15 i tem test s were 44 for the
Rasch test, 54 for the classical item analysis test, and 54 for the
factorial ly developed test. Tukey's USD post hoc comparison
indicated that the items selected by the Rasch procedure had lower
biserial correlations, on the average, than items selected on the basis
The actual mean biserial correlations for the three testscorresponding to the transformed biserial correlations were:.62, .71, ,71 respectively.
91
of factor analysis or classical i t om analysis. It should be noted
that this finding was due to the fact that the range of tlie biserial
correlations was restricted to .33 - .79 on the items selected for
the Rasch calibration. This was necessary to meet the assumption
of equal item discriminations.
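For readers unfamiliar with the discrimination index used throughout, the following sketch shows one common way to compute a biserial correlation between a dichotomous item and the total score; the data are simulated and the helper name is illustrative, not taken from the study.

    import numpy as np
    from scipy.stats import norm

    def biserial(item, total):
        # r_bis = ((M1 - M0) / s_x) * (p * q / y), where p is the proportion
        # passing the item, q = 1 - p, and y is the standard normal ordinate
        # at the z value cutting off the upper proportion p.
        item = np.asarray(item)
        total = np.asarray(total, dtype=float)
        p = item.mean()
        q = 1.0 - p
        m1 = total[item == 1].mean()
        m0 = total[item == 0].mean()
        s_x = total.std(ddof=0)
        y = norm.pdf(norm.ppf(p))
        return ((m1 - m0) / s_x) * (p * q / y)

    # Illustrative call on made-up data: one item's 0/1 scores and the total scores.
    rng = np.random.default_rng(1)
    total = rng.normal(50, 10, size=300)
    item = (total + rng.normal(0, 12, size=300) > 52).astype(int)
    print(round(biserial(item, total), 2))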
However, when the test length was increased to 30 items, no
significant F ratios were observed for the variable of item discrimina-
tion. The difference in these two findings for the 15 and 30 item
tests can be explained by the way the 15 and 30 item tests were constructed.
The 15 item test was made up of the 15 items with the highest biserial
correlations. The 30 item test was made up of the above 15 items and
an additional set of 15 items with the next highest biserial correlations.
The addition of 15 more items meant that their average biserial correla-
tion was something less than the original 15 items. Hence, the mean
biserial correlations were reduced for the longer 30 item tests
(Table 19).
Conclusions
From the data presented for each of the four areas above, it was
concluded that each of the three item analytic techniques tended to
produce tests that were really no different in terms of the precision
of measurement.
Thus, the question to consider now is: Should practitioners
in the field of measurement spend their time learning to use the Rasch
model to develop cognitive norm-referenced tests, knowing the extra
work and sophistication of knowledge required to effectively use the
Rasch procedures? With the criterion of internal consistency as a
measure of test superiority, it appeared from this study that time
spent factorially developing tests, or, if computer facilities were not
available, the use of classical item analysis procedures, seems more
than adequate for good test construction.
(Footnote: The field of test development is limited to cognitive norm-referenced
tests because that was the type of instrument used in this study.)
However, it must be remembered that internal consistency may
not be a fair and sufficient criterion. Internal consistency is an
integral part of classical test theory and may be biased since it was
derived from the classical model. Following that reasoning, Whitely
and Dawis (1974) have commented on the precision of tests developed
using classical item analysis and Rasch analysis. They stated that if
the goal of item selection was to develop fixed-content tests, then
the classical techniques of item selection will yield the more precise
test since precision is specific to the trait distribution in a given
test. Whitely and Dawis indicated that the strength of the Rasch
analysis was in the individualized selection of items, as in tailored
testing, rather than the construction of fixed-content tests.
In considering the above situation, Lord (1974b) has stated that
internal consistency is an overall estimate of a test's homogeneity,
but provides no information on how the test as a whole discriminates
for the various ability groups taking the test. Thus, the three
techniques of test development were compared using an additional criterion,
relative efficiency.
The Efficiency of the Tests Produced by the
Three Methods of Item Analysis
The review of the literature concerning the relative efficiency
of a test presented in Chapter II cited studies mainly dealing with
latent trait theory (Birnbaum, 1968; Hambleton and Traub, 1971, 1973).
Studies comparing the relative efficiency of tests developed by latent
trait theory to classical test theory appear to be missing from the
literature on test efficiency.
The relative efficiency formula (Lord, 1974a, 1974b) was not
derived for any specific test development theory; therefore, relative
efficiency estimates should be applicable to any test development
technique. Lord (1974b) suggested his formula may not work well for
extremely short tests and that it should only be used on large samples
of examinees. Thus, in the present study, only the 30 item tests were
compared using the sample of 995 examinees. The three comparisons of
relative efficiency were: (a) the test based on factor analysis was
compared to the test based on classical item analysis, (b) the test
based on the Rasch analysis was compared to the test based on classical
item analysis, and (c) the test based on the Rasch analysis was compared
to the test based on factor analysis.
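Because relative efficiency is defined as the ratio of two tests' information functions at a given ability (Lord, 1974a, 1974b), such a comparison can be sketched as below; the logistic information formula, the scaling constant 1.7, and the item parameters are illustrative assumptions rather than the study's actual calibrations.

    import numpy as np

    def item_information(theta, b, a=1.0, d=1.7):
        # Logistic item information: (d*a)^2 * p * (1 - p).
        p = 1.0 / (1.0 + np.exp(-d * a * (theta - b)))
        return (d * a) ** 2 * p * (1.0 - p)

    def test_information(theta, items):
        # Test information is the sum of the item informations.
        return sum(item_information(theta, **it) for it in items)

    def relative_efficiency(theta, test_x, test_y):
        # RE(theta) = I_x(theta) / I_y(theta): values above 1 favor test x at theta.
        return test_information(theta, test_x) / test_information(theta, test_y)

    # Hypothetical 30-item parameter sets standing in for two of the tests compared.
    rng = np.random.default_rng(2)
    rasch_items = [{"b": b} for b in rng.normal(0.0, 1.0, 30)]
    classical_items = [{"b": b, "a": a}
                       for b, a in zip(rng.normal(0.0, 1.2, 30), rng.uniform(0.6, 1.4, 30))]
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(relative_efficiency(theta, rasch_items, classical_items), 2))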
Generally, the results indicated that the Rasch test was superior
to the two tests based on classical test theory for students of average
and high ability. The two tests based on classical test theory, how-
ever, were superior in efficiency to the Rasch developed test for very
low and very high ability students (Figure 2). Test efficiency has been
defined as a measure of how well a test is able to discriminate between
examinees of varying abilities. Therefore, the test constructor must
ask himself, for which segment(s) of the examinee population is the
test intended to discriminate?
In this study, the test under consideration was a verbal aptitude
college admissions test. Usually college admissions officers are
interested in selecting students who will be successful once admitted
to college. The examinees who score very high on college admissions
tests will generally be admitted to college without any question. Thus,
it is less important to be able to discriminate among the very high
scoring examinees than to discriminate among the students who score near
the mean or in the upper middle range on a college admissions test.
For these students it is difficult to decide who should be admitted
and who should be denied admittance. If it is known that the admissions
test discriminates very well for average to high ability students, then
the reliability of the selection process based on test scores should
be increased. The data in this study, therefore, indicate that the
test based on the Rasch analysis would be most efficient for selecting
the average to high ability students for admission to college.
Lord (1968) has illustrated a very important feature of test
information and relative efficiency curves. He has shown that the
contribution of each item to a test is independent of all other items.
Thus, when information curves are available on a pool of items they can
be added to a test to achieve a prespecified information or relative
efficiency curve for any subpopulation of examinees (Lord, 1968).
Therefore, by using measures such as relative efficiency and information
curves, psychometricians are able to develop very discriminating tests
for any segment of the population.
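The additivity Lord describes can be illustrated with a small sketch that greedily adds items from a calibrated pool until a prespecified information target is reached at chosen ability points; this is only a toy version of the idea, with an invented pool and target, not a procedure taken from Lord or from this study.

    import numpy as np

    def item_info(theta, b, a=1.0, d=1.7):
        # Logistic item information; additivity lets the selected items be summed.
        p = 1.0 / (1.0 + np.exp(-d * a * (theta - b)))
        return (d * a) ** 2 * p * (1.0 - p)

    def assemble(pool, target_thetas, target_info):
        thetas = np.asarray(target_thetas, dtype=float)
        selected, remaining = [], list(pool)
        current = np.zeros_like(thetas)
        while remaining and not np.all(current >= target_info):
            shortfall = np.clip(target_info - current, 0.0, None)
            # Choose the remaining item that adds the most information where it is needed.
            best = max(remaining, key=lambda it: float(
                np.dot(shortfall, [item_info(t, **it) for t in thetas])))
            selected.append(best)
            remaining.remove(best)
            current = current + np.array([item_info(t, **best) for t in thetas])
        return selected

    pool = [{"b": b} for b in np.linspace(-2.5, 2.5, 60)]   # hypothetical calibrated pool
    chosen = assemble(pool, target_thetas=[-0.5, 0.5, 1.5], target_info=6.0)
    print(len(chosen), "items selected")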
Conclusions
It has been suggested that because efficiency and test information
curves are a function of ability, these estimates ought to replace
the use of classical reliability estimates and the standard error of
measurement in test score information (Hambleton et al., 1977). This
suggestion certainly deserves some consideration in light of the present
study where it was shown that the three methods of item analysis
produced similar tests in terms of precision, but the three methods
produced very different tests in terms of efficiency. Today, with
the increasing use of computers in test construction, perhaps the
more meaningful question to be asked by psychometricians is: For
which ability group is the test superior? Only test information curves
and measures of relative efficiency can answer that question.
Implications for Future Research
The results of this empirical study revealed that tests developed
using classical test theory, in spite of its inherent weaknesses, were
no different with respect to precision of measurement than tests
developed using one of the latent trait models, the one-parameter
Rasch model. Comparisons of relative efficiency for the 30 item tests
showed that the tests based on classical test theory were superior to
the Rasch developed test for very low and very high scoring examinees,
and the Rasch developed test was more efficient for average to high
scoring examinees on the verbal aptitude college admissions subtest
used in this study.
Only one of the four latent trait models, the Rasch model,
was used in this comparative study of test development techniques.
Perhaps it was the very nature of this simple model that resulted in the
development of equivalent tests when compared to the tests developed
by classical item analysis and factor analysis in terms of precision.
It may be that the more technical two- and three-parameter logistic
models would have produced tests comparable or superior to those
developed by classical test theory in terms of precision of measurement
and overall relative efficiency.
The two-parameter model allows for varying item discriminations so
that the initial selection of items would not have to have been restricted
to a prespecified range. The three-parameter model not only allows
for varying item discriminations, but also for the effects of guessing
on the test. It is reasonable to suspect that guessing may have been
a factor in the item scores for the type of cognitive test used in the
present study. Thus, before the findings of the study can be generally
accepted, replication is needed using not only other populations and
other instruments, but also other latent trait models.
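For reference, the two- and three-parameter logistic item characteristic curves mentioned here have the standard forms sketched below; the particular parameter values are invented for illustration.

    import math

    def logistic_2pl(theta, a, b, d=1.7):
        # Two-parameter logistic ICC: the discrimination a may vary across items.
        return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))

    def logistic_3pl(theta, a, b, c, d=1.7):
        # Three-parameter logistic ICC: c is a lower asymptote allowing for guessing.
        return c + (1.0 - c) * logistic_2pl(theta, a, b, d)

    # Hypothetical item: moderate difficulty, above-average discrimination, and a
    # guessing floor of .20 (roughly a five-option multiple-choice item).
    for theta in (-2.0, 0.0, 2.0):
        print(theta,
              round(logistic_2pl(theta, a=1.3, b=0.2), 2),
              round(logistic_3pl(theta, a=1.3, b=0.2, c=0.20), 2))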
If other latent trait models are to be considered in addition to
the Rasch model, several points need to be evaluated. The Rasch model
is the only latent trait model that provides for the direct calibration
of items and abilities based on unweighted "number right" scoring
(Wright, 1977). The two- and three-parameter models require a more
complex scoring system where the item response is weighted in order to
estimate the discrimination and guessing parameters. The weighting
is an iterative process that may never converge or stabilize unless
arbitrary boundaries are established (Wright, 1977, p. 104). Because
of this complex scoring system, the two- and three-parameter logistic
models are less efficient for parameter estimation than the one-param-
eter model in terms of computer time.
A further criticism of the two- and three-parameter logistic models
has been offered by Wright (1977) concerning the additional item
parameters. If item discrimination parameters and item guessing
parameters are introduced into a theory of measurement, why not
person parameters for sensitivity to difficult items, and inclination
toward guessing? Wright questions whether psychometricians really
need to make measurement theory so complex.
A final point to consider when comparing the latent trait models
is that only the one-parameter Rasch model provides a ratio scale
of measurement in terms of the calibrated item and ability scores
(Hambleton et al., 1977). Success on a particular item is given by
the product of the person's ability and the item's easiness. Thus, the
person with no ability will have zero odds or probability of success
on any item. The same logic applies to items with no easiness (zero
easiness); they cannot be solved. Thus, measurements made with Rasch
calibrated items are on a ratio scale; and it is the ratio scale of
measurement that leads to the concept of specific objectivity.
Therefore, it seems that each latent trait model has its own
set of advantages and disadvantages and should be considered if com-
parisons are to be made to the classical models for test development
purposes.
Additional areas for future research may lead to actual comparisons
of the content of the items selected by each of the three item analytic
techniques. Davis (1951) and Cox (1965) have criticized the selection
of items solely on statistical criteria. The use of statistical criteria
alone may result in changing the nature of the trait being measured
by deleting the items essential to adequate content coverage. Whitely
and Dawis (1976) found this to be true in a study of verbal analogy
items where the type of relationship and content of specific analogy
items proved to be quite significant when studied in isolation.
A restricting factor in this study was that a prespecified
number of test items were selected, e.g., 15 and 30, by each item
analytic procedure. For the Rasch procedure, perhaps some number of
items less than 30, but greater than 15, may have provided a better fit
to the model. This procedure might have produced mean square fit
statistics that were equivalent across the sample sizes rather than
increasing with sample size as found in this study. Thus, selecting
the number of items precisely fitting the Rasch model and comparing
that number of items with classical item analysis and factor analysis
might have led to different conclusions than those made in the present
study with regard to the precision and relative efficiency of measurement.
CHAPTER VI
SUMMARY
The quality of the items in a test determines its validity and
reliability. Through the application of item analysis procedures, test
constructors are able to obtain quantitative, objective information useful
in developing and judging the quality of a test and its items.
Classical test theory forms the basis for one method of test
development. An integral part of the development of tests based on the
classical model is the utilization of classical item analysis or factor
analysis. Classical item analysis is a procedure to obtain a description
of the statistical characteristics of each item in the test. This
approach requires identification of single items which provide maximum
discrimination between individuals on the latent trait being measured.
Theoretically, selecting items which have a high correlation with total
test score will result in a discriminating test which is homogeneous
with respect to the latent trait. Therefore, classical item analysis
is an aid to developing internally consistent tests.
An alternative method of test development, but based on the
classical model, is factor analysis. Factor analysis is a more complex
test development procedure than classical item analysis. It is a
statistical technique that takes into account the item correlation with
all other individual items in the test simultaneously. Groups of
similar items tend to cluster together and comprise the latent traits
(factors) underlying the test. Thus, under the classical model,
classical item analysis can be viewed as a unidimensional basis for
item analysis, less sophisticated than the multidimensional procedure
of factor analysis.
Classical item analysis and factor analysis have long been the only
techniques described in measurement texts for use in test development
(Baker, 1977). However, with the publication of Lord and Novick's
Statistical Theories of Mental Test Scores (1968), considerable attention
is now being directed toward the field of latent trait theory as a new
area in test development. Proponents of this approach claim that the
advantages of latent trait theory over classical test theory are
twofold: (a) theoretically it provides item parameters that are in-
variant across examinee samples which will differ with respect to the
latent trait, and (b) it provides item characteristic curves that give
insight into how specific items discriminate between students of
varying abilities.
Four latent trait models have been developed for use with
dichotomously scored data: The normal ogive, and the one-, two-, and
three-parameter logistic model (Hambleton and Cook, 1977; Lord and
Novick, 1968). This study was concerned with the one-parameter logistic
Rasch model because it is the simplest of the four models.
A review of the literature revealed numerous studies conducted
in each of the three areas of item analysis, but relatively few
comparative studies were reported between the three methods. Missing
from the review were comparative studies among all three item analytic
techniques. Therefore, the present study was designed to compare the
methods of classical item analysis, factor analysis, and the Rasch model
on measures of precision and relative efficiency in test development.
An empirical study was designed to compare the effects of the three
methods of item analysis on test development across different sample
sizes. Item response data were obtained from a sample of 5,235 high
school seniors on a cognitive test of verbal aptitude.
The subjects were divided into 9 samples: three independent
groups of 250 subjects each, three independent groups of 500 subjects
each, and three independent groups of 995 subjects each. The independent
groups were obtained so that tests of statistical significance could
be performed.
The item response data were then analyzed in three phases. First,
the "best" 15 and 30 items were selected using each item analytic
technique. Under classical item analysis, the best 15 and 30 items
were selected based on the highest biserial correlations. For factor
analysis, the best 15 and 30 items were selected based on the highest
item loadings on the first (unrotated) principal component. The
selections of the best 15 and 30 items using the Rasch model were
based upon the mean square fit of the items to the model. These
procedures were used for each group of subjects. Second, a double
cross-validation design was employed to obtain estimates on the item and
test parameters for the best 15 and 30 items. The items selected from
the three item analytic techniques were scored for different samples
of subjects by randomly reassigning the samples which had been used
in the original item analysis.
Then, the best 15 and 30 items chosen by each method were submitted
to a common item analytic procedure in order to obtain estimates for
comparing the three item analytic methods. Third, a two-way analysis
of variance and a Tukey HSD post hoc comparison test, when indicated,
were used to test for differences in the properties of items selected
by each item analytic procedure. Also confidence intervals were
calculated to compare the internal consistency estimates to a population
value. In addition, the relative efficiencies of the 30 item tests
developed by each item analytic technique were compared for the sample
of 995 subjects.
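A rough sketch of the kind of analysis described in this phase is given below, using a two-way analysis of variance followed by a Tukey-type pairwise comparison; the data are simulated, the library routines are modern stand-ins rather than the programs actually used in 1977, and the group means are only loosely patterned on values reported in the text.

    import numpy as np
    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    from scipy.stats import tukey_hsd

    # Simulated item discriminations for items chosen by each method at each
    # sample size (the real values appear in the dissertation's tables).
    rng = np.random.default_rng(3)
    rows = []
    for method, mean in [("rasch", 0.44), ("classical", 0.54), ("factor", 0.54)]:
        for n in (250, 500, 995):
            for value in rng.normal(mean, 0.08, size=15):
                rows.append({"method": method, "sample_size": n, "disc": value})
    df = pd.DataFrame(rows)

    # Two-way ANOVA: item analytic technique by calibration sample size.
    model = ols("disc ~ C(method) + C(sample_size) + C(method):C(sample_size)", data=df).fit()
    print(anova_lm(model, typ=2))

    # Tukey-style pairwise follow-up on the method factor.
    groups = [df.loc[df.method == m, "disc"] for m in ("rasch", "classical", "factor")]
    print(tukey_hsd(*groups))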
The results of the analysis showed that there were no apparent
differences in the types of tests produced by the three methods of item
analysis in terms of the precision of measurement. The three methods were
compared on measures of internal consistency, the standard error of
measurement, mean item difficulty, and mean item discrimination.
Confidence intervals were generated around the observed internal
consistency estimates for the 15 and 30 item tests produced by each
method. The confidence intervals were obtained to compare the observed
internal consistency estimates from each test to the projected population
internal consistency for tests of a similar length as suggested by
Feldt (1965). The projected population value (obtained via the Spearman-
Brown prophecy formula) represented what the test's internal consistency
would have been for a test created by deleting items at random. Of the
18 confidence intervals calculated at a 95 percent level of confidence,
15 did not contain the projected population value. Therefore, it was
concluded that the three item analytic techniques were significantly
different from random item deletion in producing tests with higher
internal consistency estimates. It was also noted that as the
number of examinees decreased so did the internal consistency
estimates.
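The Spearman-Brown projection used for those population values has a simple closed form, sketched below with invented numbers (the parent test's length and reliability here are illustrative assumptions, not the study's figures).

    def spearman_brown(reliability, length_factor):
        # Projected reliability when the number of items is multiplied by
        # length_factor: rho_k = k * rho / (1 + (k - 1) * rho).
        k = length_factor
        return k * reliability / (1.0 + (k - 1.0) * reliability)

    # Illustrative only: shorten a hypothetical 60-item test with reliability .90
    # to randomly formed 15- and 30-item versions (length factors 15/60 and 30/60).
    print(round(spearman_brown(0.90, 15 / 60), 3))
    print(round(spearman_brown(0.90, 30 / 60), 3))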
The standard error of measurement was consistently smaller for
both the 15 and 30 item tests produced by classical test theory when
compared to the 15 and 30 item tests based on the Rasch model. However,
the differences in the standard errors of measurement did not equal or
exceed 1.00 for any of the methods.
No significant F ratios were observed for either the 15 or 30 item
tests on item difficulty. Therefore, each item analytic technique tended
to select items which had similar item difficulties on the average.
For the variable, item discrimination, on the 15 item tests, a
significant F ratio (p < .05) was observed for the main effect of the
item analytic technique. Tukey's HSD post hoc analysis indicated that
the Rasch test tended to contain items with lower biserial correlations,
on the average, than tests produced by factor analysis and classical
item analysis. This finding was probably due to the fact that the range
of the biserial correlations was restricted to .39 - .79 for items
retained in the Rasch analysis. However, when the test length was
increased to 30 items, no significant F ratios were observed for the
variable of item discrimination.
In terms of test efficiency, the results indicated substantive
differences in the tests produced by the three methods of item analysis.
The 30 item test for the sample of 995 examinees was used in this com-
parison. It was found that the Rasch developed test was superior to
the two tests based on classical test theory for students of average to
high ability. The tests based on classical test theory, however, were
superior in efficiency to the Rasch test for students of very low
or very high ability. In light of these findings, it was suggested
that measures of test efficiency ought to be incorporated into the
test development procedures, as they provide much more detailed
information on how the test discriminates for various ability groups
than does a single overall estimate of a test's homogeneity.
REFERENCES
Anderson, E. B. Goodness of fit for the Rasch model. Psychometrika, 1973, 38, 123-140.
Anderson, J., Kearney, G., and Everett, A. An evaluation of Rasch's structural model for test items. British Journal of Mathematical and Statistical Psychology, 1968, 21, 231-238.
Anderson, L. W. A comparison of classical item analytic procedures with affective data. A paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1976.
Baker, F. B. Empirical comparison of the item parameters based on the logistic and normal functions. Psychometrika, 1961, 26, 235-246.
Baker, F. B. Advances in item analysis. Review of Educational Research, 1977, 47, 151-178.
Baker, F. B., and Martin, T. Fortap: A Fortran test analysis package. Occasional Paper No. 10, Office of Research Consultation, College of Education, Michigan State University, 1970.
Bedell, B. J. Determination of the optimum number of items to retain in a test measuring a single ability. Psychometrika, 1950, 15, 419-430.
Benson, I. G. The Florida Twelfth Grade Testing Program: A factor analytic study of the aptitude and achievement subtests. Unpublished Master of Arts in Education thesis, University of Florida, 1975.
Binet, A. and Simon, T. (The development of intelligence in children) (E. S. Kite, trans.). Baltimore, Md.: Williams & Wilkins, 1916.
Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick, Statistical theories of mental test scores. Reading, Ma.: Addison-Wesley, 1968.
Bock, R. D. and Wood, R. Test theory. In P. H. Mussen and M. R. Rosenweig (Eds.), Annual review of psychology (Vol. 22). Palo Alto, Ca.: Annual Reviews, Inc., 1971.
Brogden, H. E. Variation in test validity with variation in the distribution of item difficulty, number of items, and degree of their intercorrelation. Psychometrika, 1946, 11, 197-214.
Burt, C. and John, E. A factorial analysis of the Terman-Binet tests. British Journal of Educational Psychology, 1943, 12, 156-161.
Carroll, J. B. The nature of the data, or how to choose a correlation coefficient. Psychometrika, 1961, 26, 347-372.
Cattell, R. B. Personality and motivation structure and measurement. New York: World Book, Inc., 1957.
Cattell, R. B. and Tsujioka, A. The importance of factor trueness and validity, versus homogeneity and orthogonality in test scales. Educational and Psychological Measurement, 1964, 24, 3-30.
Cochran, W. G. and Cox, G. M. Experimental designs. (2nd ed.) New York: John Wiley and Sons, Inc., 1957.
Cox, R. C. Item selection techniques and evaluation of instructional objectives. Journal of Educational Measurement, 1965, 2, 181-185.
Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-334.
Davis, E. Item analysis data: Their computation, interpretation and use in test construction. Cambridge, Ma.: Harvard University, 1946.
Davis, E. Item selection techniques. In E. F. Lindquist (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1951.
Dinero, T. E. and Haertel, E. A computer simulation investigating the applicability of the Rasch model with varying item discrimination. A paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, April 1976.
Fan, C. T. Item analysis table. Princeton, N. J.: Educational Testing Service, 1952.
Feldt, E. S. The approximate sampling distribution of Kuder-Richardson reliability coefficient twenty. Psychometrika, 1965, 30, 357-365.
Flanagan, J. General considerations in selection of test items and a short method of estimating the product moment coefficient from data at the tails of the distribution. Journal of Educational Psychology, 1939, 30, 674-680.
Florida Twelfth Grade Testing Program. (Report No. 1-75). Gainesville, Florida: Office of Instructional Resources, 1975.
Fruchter, B. Introduction to factor analysis. New York: D. Van Nostrand Co., Inc., 1954.
Guertin, W. and Bailey, J. Introduction to modern factor analysis. Ann Arbor, Mich.: Edward Bros., Inc., 1970.
Guilford, J. P. Psychometric methods. New York: McGraw-Hill, 1954.
Gulliksen, H. The relation of item difficulty and inter-item correlation to test variance and reliability. Psychometrika, 1945, 10, 79-91.
Gulliksen, H. Theory of mental tests. New York: John Wiley and Sons, Inc., 1950.
Hambleton, R. K. and Cook, L. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977, 14, 75-96.
Hambleton, R. K. and Traub, R. E. Information curves and efficiency of three logistic test models. British Journal of Mathematical and Statistical Psychology, 1971, 24, 275-281.
Hambleton, R. K. and Traub, R. E. Analysis of empirical data using two logistic latent trait models. British Journal of Mathematical and Statistical Psychology, 1973, 26, 195-211.
Hambleton, R. K., Swaminathan, H., Cook, L., Eignor, D., and Gifford, J. Developments in latent trait theory: A review of models, technical issues, and applications. A paper presented at the annual meeting of the American Educational Research Association, New York, April 1977.
Harman, H. H. Modern factor analysis. (2nd ed.) Chicago: University of Chicago Press, 1967.
Henrysson, S. The relationship between factor loadings and biserial correlations in item analysis. Psychometrika, 1962, 27, 419-424.
Henrysson, S. Gathering, analyzing, and using data on test items. In E. L. Thorndike (Ed.), Educational measurement. (2nd ed.) Washington, D. C.: American Council on Education, 1971.
Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 1933, 24, 417-441, 498-520.
Hoyt, C. Test reliability estimated by analysis of variance. Psychometrika, 1941, 6, 153-160.
Kelley, T. L. The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 1939, 30, 17-24.
Kirk, R. Experimental design: Procedures for the behavioral sciences. Belmont, Ca.: Wadsworth, 1968.
Lange, A., Lehmann, I., and Mehrens, W. Using item analysis to improve tests. Journal of Educational Measurement, 1967, 4, 65-68.
Lazarsfeld, P. F. The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer et al., Measurement and prediction. Princeton, N. J.: Princeton University Press, 1950.
Lord, F. M. A theory of test scores. Psychometric Monographs, 1952, No. 7. (a)
Lord, F. M. The relationship of the reliability of multiple-choice tests to the distribution of item difficulties. Psychometrika, 1952, 17, 181-194. (b)
Lord, F. M. An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika, 1953, 18, 57-75. (a)
Lord, F. M. The relations of test scores to the trait underlying the test. Educational and Psychological Measurement, 1953, 13, 517-548. (b)
Lord, F. M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 1968, 28, 989-1020.
Lord, F. M. The relative efficiency of two tests as a function of ability level. Psychometrika, 1974, 39, 351-358. (a)
Lord, F. M. Quick estimates of the relative efficiency of two tests as a function of ability level. Journal of Educational Measurement, 1974, 11, 247-254. (b)
Lord, F. M. Practical applications of item characteristic curve theory. Journal of Educational Measurement, 1977, 14, 117-138.
Lord, F. M. and Novick, M. Statistical theories of mental test scores. Reading, Ma.: Addison-Wesley, 1968.
Lumsden, J. The construction of unidimensional tests. Psychological Bulletin, 1961, 58, 122-151.
McNemar, Q. The revision of the Stanford-Binet scale. Boston: Houghton-Mifflin, 1942.
McNemar, Q. Psychological statistics. New York: John Wiley and Sons, Inc., 1962.
Magnusson, D. Test theory. Reading, Ma.: Addison-Wesley, 1966.
Mandeville, G. K. and Smarr, A. M. Rasch model analysis of three types of cognitive data. A paper presented at the annual meeting of the American Educational Research Association, San Francisco, April 1976.
Marascuilo, L. A. Statistical methods for behavioral science research. New York: McGraw-Hill, 1971.
Mehrens, W. and Lehmann, I. Measurement and evaluation in education and psychology. New York: Holt, Rinehart, and Winston, Inc., 1973.
Mendenhall, W., Ott, L., and Schaffer, R. Elementary survey sampling. Belmont, Ca.: Wadsworth Publishing Co., 1971.
Mosier, C. I. Problems and designs of cross validation. Educational and Psychological Measurement, 1951, 11, 5-11.
Myers, C. T. The relationship between item difficulty and test validity and reliability. Educational and Psychological Measurement, 1962, 22, 565-57.
Pearson, K. On the correlation of characters not quantitatively measurable. Royal Society Philosophical Transactions, Series A, 1900, 195, 1-47.
Pearson, K. On a new method of determining a correlation between a measured character of A, and a character of B, of which only the percentage of cases wherein B exceeds (or falls short of) intensity is recorded for each grade of A. Biometrika, 1909, 7, 96-105.
Rasch, G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institute, 1960.
Rasch, G. An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 1966, 19, 49-57.
Rentz, C. Rasch model invariance as a function of the shape of the sample distribution and degree of model-data fit. A paper presented at the annual meeting of the Florida Educational Research Association, January 1976.
Rentz, R. R. and Bashaw, W. L. The national reference scale for reading: An application of the Rasch model. Journal of Educational Measurement, 1977, 14, 161-179.
Richardson, M. W. Notes on the rationale of item analysis. Psychometrika, 1936, 1, 69-76.
Ryan, J. P. The rationale for the Rasch model. A paper presented at the pre-convention training session for the Rasch model, the annual meeting of the American Educational Research Association, New York, April 1977.
Spearman, C. General intelligence objectively determined and measured. American Journal of Psychology, 1904, 15, 201-293.
Swineford, F. Biserial r versus Pearson r as measures of test-item validity. Journal of Educational Psychology, 1936, 27, 471-472.
Tinsley, H. and Dawis, R. An investigation of the Rasch simple logistic model: Sample-free item and test calibration. Educational and Psychological Measurement, 1975, 35, 325-339.
Thurstone, L. L. Multiple factor analysis. Chicago: University of Chicago Press, 1947.
Ware, W. B. and Benson, J. Appropriate statistics and measurement scales. Science Education, 1975, 59, 575-582.
Webster, H. Maximizing test validity by item selection. Psychometrika, 1956, 21, 153-164.
Wherry, R. J. and Winer, B. J. A method for factoring large numbers of items. Psychometrika, 1953, 18, 161-179.
Whitely, S. and Dawis, R. The nature of objectivity with the Rasch model. Journal of Educational Measurement, 1974, 11, 163-178.
Whitely, S. and Dawis, R. The influence of the context on item difficulty. Educational and Psychological Measurement, 1976, 36, 329-337.
Woodcock, R. W. Woodcock Reading Mastery Test. Circle Pines, Minn.: American Guidance Service, 1974.
Wright, B. Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, N. J.: Educational Testing Service, 1968.
Wright, B. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977, 14, 97-116.
Wright, B. D. and Mead, R. J. CALFIT: Sample-free item calibration with a Rasch measurement model. Research Memorandum No. 18. Chicago: Statistical Laboratory, Department of Education, University of Chicago, 1975.
Wright, B. D. and Mead, R. J. BICAL: Calibrating rating scales with the Rasch model. Research Memorandum No. 23. Chicago: Statistical Laboratory, Department of Education, University of Chicago, 1976.
Wright, B. D. and Panchapakesan, N. A procedure for sample-free item analysis. Educational and Psychological Measurement, 1969, 29, 23-48.
APPENDIX A
MATHEMATICAL DERIVATION OF THE RASCH MODEL
(Summarized from Ryan, 1977)
If ability is equal to $\theta$ and item difficulty is equal to $\epsilon$, then
the odds of correctly solving an item are given by

\mathrm{odds} = \frac{\theta}{\epsilon} . \qquad (1)

When $\theta > \epsilon$, the person will get the item right; when $\theta < \epsilon$, the person will
get the item wrong; and when $\theta = \epsilon$, the person will have a 50-50
chance for success on the item.
The probability of a correct response can readily be derived from
the statement of the odds. In general,

P = \frac{\mathrm{odds}}{1 + \mathrm{odds}} . \qquad (2)

Substituting (1) into (2),

P = \frac{\theta/\epsilon}{1 + \theta/\epsilon} , \qquad (3)

which is equivalent to

P = \frac{\theta/\epsilon}{\epsilon/\epsilon + \theta/\epsilon} = \frac{\theta}{\epsilon + \theta} . \qquad (4)

Formula 4 is the probability of a correct response.
The formula for an incorrect response is represented by Q, which is
1 - P. Thus,

Q = (1 - P) = 1 - \frac{\theta}{\epsilon + \theta} , \qquad (5)

which is equivalent to

Q = \frac{\epsilon + \theta - \theta}{\epsilon + \theta} = \frac{\epsilon}{\epsilon + \theta} . \qquad (6)

Formula 6 is the probability of an incorrect response.
The relationship of these probability statements to the statement
of the odds is given by

\frac{P}{Q} = \frac{\theta/(\epsilon + \theta)}{\epsilon/(\epsilon + \theta)} = \frac{\theta}{\epsilon} . \qquad (7)

The right side of equation 7 is simply the odds as defined in equation 1.
The left side of the equation is the probability of a correct response
divided by the probability of an incorrect response. The probability of
a correct response, P, is estimated by the proportion of examinees in
a sample who correctly answer an item. On a test of k items, the pro-
bability of a person with a score of j (where $1 \le j \le k$) correctly
answering a particular item is simply the proportion of people with
the raw score j who correctly answered the item. This is nothing more
than the item difficulty for all the people who have a raw score of j.
The probability can be calculated in terms of the item difficulty for
all raw score groups from 1 to (k - 1) and across all items.
The probability of an incorrect response, Q, is the proportion of
people answering an item incorrectly. This is simply one minus the
proportion answering it correctly (1 - P). Both P and Q are easily
calculated from a set of data; hence the value of P/Q in formula 7 is
an easily derived statistic which estimates the odds.
Separating the Parameters
Consider again equation 7 and take the natural log (ln) of both
sides of the equation. This gives

\ln\left(\frac{P}{Q}\right) = \ln\left(\frac{\theta}{\epsilon}\right) . \qquad (8)

Since the ln of a ratio is the same as the difference of the lns,
equation 8 becomes

\ln\left(\frac{P}{Q}\right) = \ln \theta - \ln \epsilon . \qquad (9)
Person Free Item Difficulties
Next consider a score group with ability $\theta_1$ and two test items
with difficulties $\epsilon_1$ and $\epsilon_2$, respectively. The probability of a person
with ability $\theta_1$ correctly answering an item of difficulty $\epsilon_1$ is $P_{11}$.
The probability of an incorrect response is $Q_{11}$. The probability of
a person with ability $\theta_1$ correctly answering an item of difficulty $\epsilon_2$
is $P_{12}$, and the probability of an incorrect response is $Q_{12}$. By
equation 9,

\ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln \theta_1 - \ln \epsilon_1 , \quad \text{and} \qquad (10)

\ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln \theta_1 - \ln \epsilon_2 . \qquad (11)

If equation 11 is subtracted from equation 10, term by term, the result
is

\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = (\ln \theta_1 - \ln \epsilon_1) - (\ln \theta_1 - \ln \epsilon_2) . \qquad (12)

This is the same as

\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln \theta_1 - \ln \epsilon_1 - \ln \theta_1 + \ln \epsilon_2 , \qquad (13)

or

\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{12}}{Q_{12}}\right) = \ln \epsilon_2 - \ln \epsilon_1 . \qquad (14)

Equation 14 should be examined very carefully. On the left side of
the equation is an easily calculated statistic: the difference
between the ln odds of a correct response on item 1 compared to item 2.
The right side is significant for what it does not contain. There is
no parameter for the person's ability on the right side of the equation.
This same result occurs regardless of the ability of the person or
group examined. The difference between the difficulty of item 1 and
the difficulty of item 2 can be calculated independently of the
subject or group of subjects involved. In general, for any two items
of difficulty $\epsilon_l$ and $\epsilon_m$, the difference between the difficulty of the
two items is given by

\ln \epsilon_m - \ln \epsilon_l = \ln\left(\frac{P_{il}}{Q_{il}}\right) - \ln\left(\frac{P_{im}}{Q_{im}}\right) . \qquad (15)
Item Free Person Abilities
The discussion of person ability estimates is an exact parallel
to the discussion of item difficulty estimates. Instead of comparing
two items across any group of examinees, the discussion of ability
proceeds by comparing any two groups on any test item. Consider score
group 1 with ability $\theta_1$, score group 2 with ability $\theta_2$, and an item with
difficulty $\epsilon_1$. From equation 9,

\ln\left(\frac{P_{11}}{Q_{11}}\right) = \ln \theta_1 - \ln \epsilon_1 , \quad \text{and} \qquad (16)

\ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln \theta_2 - \ln \epsilon_1 . \qquad (17)

Subtracting equation 17 from equation 16 will yield

\ln\left(\frac{P_{11}}{Q_{11}}\right) - \ln\left(\frac{P_{21}}{Q_{21}}\right) = \ln \theta_1 - \ln \theta_2 . \qquad (18)

The difference between the abilities of the examinees in score group 1
and score group 2 (the right side of equation 18) is described without
reference to the item involved. In general, for any two groups with
abilities $\theta_i$ and $\theta_j$,

\ln \theta_i - \ln \theta_j = \ln\left(\frac{P_{il}}{Q_{il}}\right) - \ln\left(\frac{P_{jl}}{Q_{jl}}\right) \qquad (19)

for any item with difficulty $\epsilon_l$. In this case the abilities are being
compared independently of the difficulty of the item used to compare
them. This is often referred to as item free person ability estimation.
Formalizing the Model
To describe the Rasch model, let $\ln \theta$ and $\ln \epsilon$ be re-defined.
Specifically, let

\beta = \ln \theta , \quad \text{and} \qquad (20)

\delta = \ln \epsilon . \qquad (21)

Equations 20 and 21 simply define the ln ability as $\beta$ and the ln
difficulty as $\delta$.
If both sides of equations 20 and 21 are raised to the base of the
natural log system, e, we get

e^{\beta} = e^{\ln \theta} = \theta , \quad \text{and} \qquad (22)

e^{\delta} = e^{\ln \epsilon} = \epsilon . \qquad (23)

Recall equation 3,

P = \frac{\theta/\epsilon}{1 + \theta/\epsilon} , \qquad (24)

and substitute the equivalent terms for $\theta$ and $\epsilon$ as defined in equations
22 and 23. This gives

P = \frac{e^{\beta}/e^{\delta}}{1 + e^{\beta}/e^{\delta}} , \qquad (25)

or

P = \frac{e^{(\beta - \delta)}}{1 + e^{(\beta - \delta)}} . \qquad (26)

More formally this is

P(X_{ij} = 1 \mid \beta_j, \delta_i) = \frac{e^{(\beta_j - \delta_i)}}{1 + e^{(\beta_j - \delta_i)}} . \qquad (27)

Equation 27 is the Rasch model.
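Equation 27 can also be evaluated directly; the short sketch below simply restates it as a function, with arbitrary numerical values chosen for illustration.

    import math

    def rasch_probability(beta, delta):
        # Equation 27: P(X = 1 | beta, delta) = e^(beta - delta) / (1 + e^(beta - delta)).
        return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

    # When ability equals difficulty the odds are even, as in the derivation above.
    print(rasch_probability(beta=0.0, delta=0.0))             # 0.5
    print(round(rasch_probability(beta=1.0, delta=0.0), 3))   # above .5 when ability exceeds difficulty
    print(round(rasch_probability(beta=-1.0, delta=0.0), 3))  # below .5 when ability is below difficulty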
APPENDIX B
RELATIVE EFFICIENCY VALUES USED IN FIGURE 2 FOR
THE COMPARISONS AMONG ITEM ANALYTIC METHODS
TABLE B.1
RELATIVE EFFICIENCY VALUES USED IN FIGURE 2 FOR
THE COMPARISONS AMONG ITEM ANALYTIC METHODS
BIOGRAPHICAL SKETCH
Iris Benson was born January 19, 1946, in Charleston, South
Carolina. She graduated from Hialeah High School, Hialeah, Florida,
in June 1964.
She began attending college part-time in 1968, and graduated in
August 1971 from Santa Fe Junior College in Gainesville, Florida. Iris
then attended the University of Florida, and received a Bachelor of
Arts degree with honors in March 1973 with a major in Psychology.
In the fall of 1973, she began her graduate studies in the College
of Education at the University of Florida. Upon completing her master's
thesis, entitled The Florida Twelfth Grade Testing Program: A factor
analysis of the aptitude and achievement subtests, she received the
degree Master of Arts in Education in June 1975.
Iris began her doctoral program at the University of Florida in the
fall of 1975. While working on her graduate degrees she has held, at
various times, a graduate assistantship in the area of evaluation and
test development, and teaching assistantships in graduate level courses
in educational measurement and statistics. She was also an evaluation
intern at the Northwest Regional Educational Laboratory in Portland,
Oregon from September 1976 to March 1977.
Iris is a member of the American Educational Research Association,
and the National Council on Measurement in Education. She has co-authored
several articles and papers on the topics of examinee test-taking behavior,
appropriate statistics for different measurement scales, and the adversary
evaluation model.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

William B. Ware, Chairman
Professor of Foundations of Education
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Linda M. Crocker
Associate Professor of Foundations of Education
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

John M. Newell
Professor of Foundations of Education
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

William R. Powell
Professor of Instructional Leadership and Support
This dissertation was submitted to the Graduate Faculty of the Department of Foundations of Education in the College of Education and to the Graduate Council, and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1977

Chairman, Foundations of Education

Dean, Graduate School