
On the Construct Validity of Multiple-Choice Items for Reading Comprehension
Huub van den Bergh

University of Utrecht

In this study 590 third-grade students took one of four reading comprehension tests with either multiple-choice items or open-ended items. Each also took 32 tests indicating 16 semantic Structure-of-Intellect (SI) abilities. Four conditions or groups were distinguished on the basis of the reading comprehension tests. The four 33 x 33 correlation matrices were analyzed simultaneously with a four-group LISREL model. The 16 intellectual abilities explained approximately 62% of the variance in true reading comprehension scores. None of the SI abilities proved to be differentially related to item type. Therefore, it was concluded that item type for reading comprehension is congeneric with respect to the SI abilities measured. Index terms: construct validity, item format, free response, reading comprehension, Structure-of-Intellect model.

Since the early 1920s the question of whether items in different formats tap the same mental functions has been an active area of research. Unfortunately, the results of different studies are equivocal. Some authors have claimed that there are no differences between tests with different item formats (Bracht & Hopkins, 1968; Carter & Crone, 1940; Cook, 1955; Weiss & Jackson, 1983), whereas others reached the opposite conclusion (Birenbaum & Tatsuoka, 1987; Coombs, Milholland, & Womer, 1956; Heim & Watts, 1967; Traub & Fisher, 1977; Vernon, 1962; Ward, Frederiksen, & Carlson, 1980).

A theoretical basis has been lacking in most studies. In most cases differences in observed scores due to item format are interpreted and explained in relation to the differences between recall and recognition. In this case the essence of recall is the generation or construction of an answer to an item. The respondent must formulate the answer in oral or written form, or must describe an idea if that is the answer desired. The essence of recognition, on the other hand, is that one or more alternatives are presented to the respondent; there is no requirement for overt generation of an answer. The respondent may select one of two strategies: a strategy in which an answer is generated or constructed, and a compare-and-delete strategy on the basis of the cues provided in the alternatives. Hence recognition is sometimes mediated by process characteristics of recall, that is, processes that do not depend on the presence of alternatives (see Brown, 1976; Gillund & Shiffrin, 1984). The recall/recognition paradigm encompasses differences in verbal abilities tested by means of open-ended and multiple-choice items (Doolittle & Cleary, 1987; Murphy, 1982; Ward, 1982; Ward et al., 1980), as well as the assessment of partial knowledge measured by means of multiple-choice items (Coombs et al., 1956).

The lack of a theoretical framework might be one of the reasons for the different research conclusions. For instance, some researchers interpret differences in means between two tests, or differences in observed variances, as indications of a difference in the intellectual abilities tested (see Shohamy, 1984; van den Bergh, 1987). This seems a rather hasty conclusion, because the same process might be involved both in recognition and in recall.

There are other reasons for the differences in research results that may be of even greater importance. In none of the studies is an effect size specified; therefore the null hypothesis can always be rejected (see Cronbach, 1966). Nevertheless, several conceptions are implicit with regard to the size of the effects. For instance, some authors test whether the correlation coefficients (corrected for test unreliability) between scores on tests differing in item format differ from unity (Hogan, 1981; Mellenbergh, 1971). Others interpret correlation coefficients of at least .80 as proof of the null hypothesis, which is that there are no differences in intellectual skills measured on tests with open-ended and multiple-choice items (Hurd, 1930; Paterson, 1926).

A second inadequacy concerns the research design. Most research is carried out according to a repeated-measurement design. Almost invariably, the test with open-ended items is administered first, followed by one or more tests in alternative formats. Then the product-moment correlation between the scores on the different tests is calculated. If a respondent must answer the same question, although in a different format, up to six times in a few weeks (Traub & Fisher, 1977), it is questionable whether the same traits are measured on the first and on the last administration.

The administration procedures can be improved by randomizing the tests administered on each occasion (Hogan, 1981). In this type of study the main interest lies in the shared variance of the scores on tests differing in item format. Therefore, a repeated-measurement design in which all tests are administered on all occasions seems the most appropriate solution (van den Bergh, Eiting, & Otter, 1988). In that case the amount of variance shared by the scores on two tests differing in item format can be compared to the amount of variance shared by the scores on tests with the same format.

A third problem seems to be the questionable use of psychometrics. Correlation coefficients corrected for attenuation on the basis of the often-used coefficient alpha may be an overestimate of the true correlation between two variates (Lord & Novick, 1968, p. 138). Also, in many studies the criterion validity of tests differing in item format is investigated by testing differences in the correlation coefficients of scores on both types of test with a criterion (Culpepper & Ramsdale, 1983; Hopkins, George, & Williams, 1985; Hopkins & Stanley, 1981; Weiss & Jackson, 1983). Hence it is assumed that both test scores have an equal regression on the true score; it is assumed that both types of test are at least tau-equivalent.
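To make the attenuation issue concrete, here is a minimal sketch of the standard disattenuation formula; the reliability and correlation values are made up for illustration and are not taken from the studies cited.

```python
import math

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability of both measures.

    r_xy  : observed correlation between test X and test Y
    rel_x : reliability estimate of X (e.g., coefficient alpha)
    rel_y : reliability estimate of Y
    """
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: an observed correlation of .70 between an open-ended
# and a multiple-choice test with reliabilities .84 and .80.
r_true = disattenuate(0.70, 0.84, 0.80)
print(round(r_true, 3))  # 0.854 -- still below unity, so not proof of equivalence
```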

Strictly speaking, this assumption of tau-equivalence is not necessary; the research question translated into a testable form only concerns the congenericity (Jöreskog, 1971) of different test types. That is, the tests measure the same trait(s) except for errors in measurement. The scores on tests with multiple-choice items have a higher mean (Benson, 1981; Benson & Crocker, 1979; Kinney & Eurich, 1938; Samson, 1983; Shohamy, 1984) and lower observed and true-score variances than scores on tests with open-ended items (Carter & Crone, 1940; Croft, 1982; Frary, 1985; van den Bergh, 1987). Therefore, tests differing in format cannot be considered tau-equivalent a priori; in all cases a special test is needed to see whether this assumption is met.
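For reference, the distinction between congeneric, tau-equivalent, and parallel tests invoked here can be stated compactly; this is a standard classical test theory formulation (cf. Jöreskog, 1971; Lord & Novick, 1968), not notation taken from this article.

```latex
% Two tests X_1, X_2 intended to measure the same true score T:
\begin{align}
  X_i &= \lambda_i T + \epsilon_i  &&\text{congeneric (loadings and error variances free)} \\
  X_i &= T + \epsilon_i            &&\text{tau-equivalent (equal unit loadings)} \\
  X_i &= T + \epsilon_i,\ \operatorname{Var}(\epsilon_1)=\operatorname{Var}(\epsilon_2)
                                   &&\text{parallel}
\end{align}
```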

This study was not only concerned with whether there were differences in intellectual abilities tested due to item format, but was also concerned with the nature of any differences observed. Hence it is insufficient to take into account only theoretical notions of differences in item format. Intellectual abilities that are differentially involved in answering open-ended and multiple-choice items must be defined and taken into account as well.

One of the most concise models for intellectual abilities is the Structure-of-Intellect (SI) model (Guilford, 1971, 1979; Guilford & Hoepfner, 1971). In the SI model, abilities are defined on three dimensions. The operations dimension concerns the intellectual activities to be performed. Five operation types are distinguished: cognition (C), divergent production (D), evaluation (E), memory (M), and convergent production (N). The contents dimension characterizes basic kinds or areas of information: behavioral (B), figural (F), semantic (M), and symbolic (S) information. The products dimension concerns the organization of the information to be processed. A distinction is made between classes (C), implications (I), relations (R), systems (S), transformations (T), and units (U). The intersection of the three dimensions is indicated by a trigram; for example, CMT denotes cognition of semantic transformations, or the ability to see changes in meaning or interpretation.

The SI abilities involved in carrying out an assignment are, of course, dependent on the subject matter. In this study, where the subject area was reading comprehension, the primary interest was in abilities having to do with meaning or understanding. Therefore this study was restricted to the semantic part of the SI model (see Meuffels, 1982; Spache, 1963).

The difference between recognition and recall can be translated into semantic SI abilities. Recall, which refers to the retrieval of items from memory storage in order to meet certain objectives, can be assigned to the SI abilities of divergent and convergent production (Guilford, 1971). Divergent production refers to an intellectual ability that pertains primarily to information retrieval. Convergent production is the prevailing function when input information is sufficient to determine a unique answer, which also must be retrieved from memory.

In recognition two strategies may be involved. Convergent and divergent abilities may be essential if a respondent answers multiple-choice questions in the same way as open-ended questions. On the other hand, the presentation of alternatives may give rise to differences in cognition and evaluation abilities. Cognition abilities refer to immediate awareness, immediate discovery, or rediscovery of information, whereas evaluation abilities concern comparison according to a set criterion and making a decision about criterion satisfaction. Both seem important: The respondent may "know" the answer to a multiple-choice item as soon as the alternatives are seen, or may actually compare the different alternatives presented. Besides, the cognition abilities (e.g., CMS, CMT, and CMU) are crucial for reading comprehension (Guilford, 1979; Hoeks, 1985; Meuffels, 1982); for instance, word knowledge, or CMU, is one of the best (single) predictors of reading comprehension (Guilford, 1979; Hoeks, 1985; Mezynski, 1983; Sternberg & Powell, 1983).

This study explored the question of whether items for reading comprehension are congeneric irrespective of the format of the items, that is, whether they are congeneric with respect to the abilities tested. This general question can be formulated more precisely: It was hypothesized that open-ended questions for reading comprehension appeal to divergent- and convergent-production abilities, whereas cognition and evaluation abilities are prevalent in multiple-choice items. These differences will be observed in differences in the regression weights of reading comprehension scores on the SI ability scores.

Another question investigated was: What SI abilities are involved in answering items in traditional reading comprehension tests? It might be argued that apart from the already mentioned SI abilities (cognition, evaluation, convergent production, and divergent production), SI memory abilities might also be of importance for the answering of reading comprehension items.

The role of memory abilities is emphasized in current theories of text comprehension. In these models it is asserted that (experienced) readers identify and relate text topics as they read (Just & Carpenter, 1980; Kintsch & van Dijk, 1978; Lorch, Lorch, & Matthews, 1985). Hence the answering of reading comprehension items might require the activation and revision of the text representation (Lorch, Lorch, & Mogan, 1987). Therefore, an efficient method of encoding information seems crucial. The import of topic structure might then be reflected in the memory for semantic systems (MMS) and the evaluation of semantic systems (EMS); see Ackerman (1984a, 1984b). In short, it was hypothesized that memory abilities (especially MMS) are crucial for the answering of reading comprehension items, but are not differentially involved in answering open-ended and multiple-choice items.

Method

Examinees

The examinees were 590 third-graders from 12 different Dutch high schools for lower vocational education or lower secondary education. Each student was randomly assigned to one of the four conditions. Conditions were defined on the basis of the reading comprehension test administered (see Table 1). Administration time for all reading comprehension tests was 100 minutes.

Dependent Variables

Two reading comprehension tests were constructed on the basis of two central Dutch language exams for lower vocational education. All items were open-ended; some were essay questions and others were short-answer questions. The tests were constructed so that the distribution of items, according to the categories of a classification scheme for reading comprehension items (Davis, 1968), was equal.

The two tests were pretested in another group of 438 third-graders from both types of secondary education. The alternatives of the four-choice items constructed were based on incorrect answers provided by the students. This resulted in four tests with 25 items each: two tests with the original open-ended items and two tests with the same items in a multiple-choice format.

In the final study each student was randomly assigned to one of the four reading comprehension tests. The answers of the students to the open-ended items were all rated once by an experienced rater, whose stability proved to be high as indicated by the correlation between total reading comprehension scores on the first rating and on a second rating three weeks later (r = .91, N = 50). The answers of 100 students were also rated by a second rater. The agreement between both raters was satisfactory (r = .86, N = 100).

Because this study used a randomized group design, every respondent took only one reading comprehension test. Therefore, every respondent answered the reading comprehension items only once, in either a multiple-choice or an open-ended format. Hence any dependency of the answers due to memory effects from an earlier to a later administration was impossible. The SI ability tests were administered to all respondents.

Table 1 presents summary data for the four reading comprehension tests. In addition to the reliability estimate α of a reading comprehension test, the split-half reliability estimate is reported. The split-half estimate was obtained by ordering the items on the basis of both their difficulty (p) values and item-test correlations. The two items nearest on both dimensions were assigned randomly to one of the two subtests (Gulliksen, 1950). Then the reliability estimate was obtained by application of the Spearman-Brown formula for parallel tests to the correlation between the subtests.
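A minimal sketch of this matched split-half procedure, assuming a scored respondents-by-items matrix as input; the pairing rule below is my own approximation of the matching described above, not code from the study.

```python
import numpy as np

def matched_split_half(scores: np.ndarray, seed: int = 0) -> float:
    """Split-half reliability with matched item pairs and a Spearman-Brown step-up.

    scores: respondents x items matrix of item scores.
    Items are ordered on difficulty (mean score) and item-test correlation;
    neighbouring items form a pair and one member of each pair goes to each
    half at random. With an odd number of items the last item is left out.
    """
    rng = np.random.default_rng(seed)
    total = scores.sum(axis=1)
    p = scores.mean(axis=0)                                  # difficulty values
    rit = np.array([np.corrcoef(scores[:, j], total)[0, 1]   # item-test correlations
                    for j in range(scores.shape[1])])
    order = np.lexsort((rit, p))                             # sort by p, then by r_it
    half_a, half_b = [], []
    for k in range(0, len(order) - 1, 2):
        i, j = order[k], order[k + 1]
        if rng.random() < 0.5:
            i, j = j, i
        half_a.append(i)
        half_b.append(j)
    r_halves = np.corrcoef(scores[:, half_a].sum(axis=1),
                           scores[:, half_b].sum(axis=1))[0, 1]
    return 2 * r_halves / (1 + r_halves)                     # Spearman-Brown formula
```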

As could be expected, the means of both multiple-choice tests were higher than the means of the same tests in open-ended item format. None of the tests was very easy, nor do they appear to have been extremely demanding. The standard deviations and the homogeneity of both multiple-choice tests are lower than the respective figures for the open-ended tests. The small differences between the two reliability estimates are also worth noting; coefficient α is only a fraction lower than the theoretically more sound split-half reliability.

Table 1: Number of Examinees (N), Mean, Standard Deviation (SD), Coefficient α, and Split-Half Reliability Estimates for Two Reading Comprehension Tests

SI Ability Tests

In this study 16 of the 30 semantic abilities were measured by means of two tests each (see Table 2). For every ability a choice was made of two of the three tests designed to measure a specific SI ability (van den Bergh, 1989). The main criterion was the proportion of shared variance; the two tests with the highest communality on the SI ability (from a previous analysis; see van den Bergh, 1989) were selected (see Table 2).

Almost every student took all 32 tests; only about 10 percent of the students had missing scores on one or two tests. The correlations between the achievement scores on the SI tests supported the SI model with oblique ability factors; other models based on theories of Cattell (1963), Spearman (1927), Guttman (1965), and Thurstone (Thurstone & Thurstone, 1941) did not fit the data as well. The adjusted goodness-of-fit index and the root-mean-square residual for this model were estimated at .94 and .07, respectively (van den Bergh, 1989).

As can be seen in Table 2, the majority of the tests shared most of their variance with the other tests for the same ability. There are some exceptions, however; the communality of the test "Time order" is low, for instance. The estimated correlations between factors were low, ranging from .121 (between EMU and MMU) to .950 (between NMS and NMT), with a mean of .224.

Analyses

Four correlation matrices were calculated. For each group in which the respondents took the same reading comprehension test, the correlations among the achievement scores on the SI tests and the correlation with the reading comprehension score were calculated.
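A sketch of this step, assuming the scores are held in a respondents-by-33 array (32 SI tests plus the reading comprehension score) with a condition label per respondent; the array layout and the handling of missing scores are illustrative assumptions.

```python
import numpy as np

def groupwise_correlations(scores: np.ndarray, group: np.ndarray) -> dict:
    """Return one 33 x 33 correlation matrix per reading comprehension condition.

    scores: respondents x 33 array (columns: 32 SI tests + 1 reading score).
    group : length-respondents array with condition labels 1..4.
    Missing scores would need pairwise deletion or imputation; omitted here.
    """
    matrices = {}
    for g in np.unique(group):
        block = scores[group == g]
        matrices[g] = np.corrcoef(block, rowvar=False)  # variables are in the columns
    return matrices
```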

The four correlation matrices were analyzed simultaneously with LISREL (Jöreskog & Sörbom, 1986). Because of the criticism of model fitting, which in essence comes down to chance capitalization, a cross-validation was performed on the variables and the models to be fitted were specified in advance.

Because every SI ability was measured with two tests, the variable set could be split into two parts. In each variable set each SI ability was measured by one test. If the results converged, there was additional evidence for the model accepted. Only in the final analysis could all variables be taken into account, because in that case the SI factors were better defined, resulting in smaller standard errors of the regression weights. On the other hand, specification of the models to be fitted minimized chance capitalization, because none of the model specifications was altered in order to obtain a better fit.

Table 2: Estimated Communalities for the SI Tests

Models

Two measurement models and one structural model were specified for the analyses. Both measurement models define the relation between the observed and the latent scores, whereas the structural model specifies the relation between the latent variables. The two measurement models are:

y^(g) = λ_y^(g) η^(g) + ε^(g) ,   (1)

x^(g) = Λ_x^(g) ξ^(g) + δ^(g) .   (2)

In Equation 1, y^(g) refers to the observed reading comprehension score in the gth subpopulation; λ_y^(g) refers to the regression of the observed reading comprehension score on the latent score η^(g), which has unit variance because correlational analyses are performed. The last term in Equation 1, ε^(g), is the residual of the reading comprehension score, with variance θ_ε^(g). Because in every subpopulation only one reading comprehension test is administered, y^(g) refers to one dependent variable (which is therefore restricted; see below).

In Equation 2, x^(g) denotes a column vector of 16 independent variables (in the final analysis, 32 variables) measured in the gth subpopulation. Λ_x^(g) refers to a matrix of regression weights of the 16 variables (in the final analysis, 32 independent variables) on the 16 factors in the latent vector ξ^(g). Finally, δ^(g) refers to the residuals of the SI tests, with variances Θ_δ^(g).

It is assumed that there is no covariance between η^(g) and ε^(g), nor between ξ^(g) and δ^(g). Furthermore, it is assumed that there is no covariance among or between the residuals ε^(g) and δ^(g). Therefore the matrix Θ_δ^(g) is diagonal with the residual variances of the SI ability scores, and Θ_ε^(g) is a matrix with only one element. The correlations between the SI factors can be estimated in the matrix Φ^(g), which is a symmetric matrix with unities on the diagonal.

In the structural model the regression of the latent reading comprehension ability score on the latent SI abilities is defined as

η^(g) = Γ^(g) ξ^(g) + ζ^(g) .   (3)

In Equation 3, Γ^(g) is a row vector with regression weights of the latent reading comprehension score (η^(g)) on the SI ability factors (ξ^(g)) in the gth subpopulation, and ζ^(g) refers to the unexplained variance of the latent reading comprehension score in the gth subpopulation.

Equation 3 permits simultaneous analysis of the relationships between the latent variables in the four subsamples. Hypotheses can be tested by varying the restrictions over groups on the gamma parameters (γ^(g)), while holding other restrictions over groups and analyses invariant.
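To make the model concrete, the following sketch assembles the correlation matrix implied by Equations 1-3 for one group from illustrative parameter values; the dimensions and numbers are assumptions for demonstration, not estimates from this study.

```python
import numpy as np

# Illustrative dimensions: 16 SI factors, one indicator per factor (first variable set).
k = 16
rng = np.random.default_rng(1)

Lambda_x = np.eye(k) * 0.8            # loadings of the SI tests on the SI factors (assumed)
Theta_d  = np.eye(k) * (1 - 0.8**2)   # diagonal residual variances of the SI tests
Phi      = np.full((k, k), 0.2) + np.eye(k) * 0.8    # factor correlations, unit diagonal
Gamma    = rng.uniform(-0.1, 0.3, size=(1, k))       # regression of reading factor on SI factors
psi      = 0.4                        # unexplained variance of the reading factor (assumed)
lam_y    = np.sqrt(0.84)              # loading of the observed reading score (reliability-based)
theta_e  = 1 - 0.84                   # residual variance of the observed reading score

# Model-implied (co)variances:
Sigma_xx = Lambda_x @ Phi @ Lambda_x.T + Theta_d             # SI tests with each other
sigma_yx = lam_y * (Gamma @ Phi @ Lambda_x.T)                # reading score with the SI tests
sigma_yy = lam_y**2 * (Gamma @ Phi @ Gamma.T + psi) + theta_e

Sigma = np.block([[Sigma_xx, sigma_yx.T],
                  [sigma_yx, sigma_yy]])
print(Sigma.shape)   # (17, 17): 16 SI tests plus the reading comprehension score
```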

The invariant restrictions over groups and analyses concern the second measurement model (see Equation 2). Hence restrictions are placed on Λ_x^(g) and Θ_δ^(g) and on the correlations between the SI ability factors specified in the matrix Φ^(g). These restrictions are

Λ_x^(1) = Λ_x^(2) = Λ_x^(3) = Λ_x^(4) ,

Θ_δ^(1) = Θ_δ^(2) = Θ_δ^(3) = Θ_δ^(4) ,

Φ^(1) = Φ^(2) = Φ^(3) = Φ^(4) .

With respect to the first measurement model (Equation 1), a set of invariant restrictions over analyses is placed on both the λ_y^(g) and θ_ε^(g) parameters. In all analyses the λ_y^(g) parameters are fixed at the square root of the split-half reliabilities of the four reading comprehension tests (see Table 1), whereas θ_ε^(g) is fixed at 1 minus these reliability estimates. For instance, λ_y^(1) and θ_ε^(1) were fixed at .916 [= (.84)^1/2] and .16, respectively.

In the first model to be tested, it was assumed that there are no differences in regression of the reading comprehension factor on the SI factors due to the reading comprehension test. It was also assumed that there are no differences between the tests due to the format of the items. Hence the following restriction was imposed:

γ_j^(1) = γ_j^(2) = γ_j^(3) = γ_j^(4) ,

where the subscript j refers to the SI ability factor involved. This model is called the no-difference model.

In the second model, the format model, the restrictions are loosened a bit; differences in regression weights due to the reading comprehension test are allowed, but again no differences between formats are permitted. This can be written as

γ_j^(g) = γ_j^(g') for every pair of groups g and g' that took the same reading comprehension test in a different item format.

The difference in fit between the first and second models tested is in fact the effect size referred to above.

In the third model to be tested, no restrictions are placed on the regression of the reading comprehension factor on the cognition, evaluation, convergent-production, and divergent-production ability factors. All of these regression weights (γ parameters) and (variances of the) disturbance terms (ζ) must be estimated. Hence differences in regression weights as well as differences in residual variances are allowed. Note that in this model the regressions of the memory abilities are constrained over item format. This model is called the difference model.

Model Fit

Several procedures may be used to assess the fit of a model and to obtain parameter estimates. The maximum likelihood procedure is usually a very efficient method. If the scores have a multivariate normal distribution, the parameter estimates are precise (in large samples) and the fit of a model can be assessed using a chi-square-distributed statistic. However, if the test scores do not have a multivariate normal distribution, an estimation procedure without assumptions on the distribution of the test scores may be preferable (Jöreskog & Sörbom, 1986, p. 28). Because the SI ability test scores do not have a multivariate normal distribution (van den Bergh, 1989), the unweighted least-squares estimation method is preferred. In that case the fit of the models must be assessed using the goodness-of-fit index, indicating the total amount of variance and covariance accounted for by the model, and the root-mean-square residual (Jöreskog & Sörbom, 1986, p. 40). Note, however, that both statistics have an unknown sampling distribution (Jöreskog & Sörbom, 1986).
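For readers unfamiliar with these indices, the sketch below computes a goodness-of-fit index and a root-mean-square residual from an observed and a model-implied correlation matrix, using the common unweighted least-squares definitions; it is an illustration, not output of the analyses reported here.

```python
import numpy as np

def fit_indices(S: np.ndarray, Sigma: np.ndarray) -> tuple:
    """Goodness-of-fit index (ULS form) and root-mean-square residual.

    S     : observed correlation matrix
    Sigma : model-implied correlation matrix
    """
    resid = S - Sigma
    gfi = 1.0 - np.trace(resid @ resid) / np.trace(S @ S)
    # RMR: root mean square of the residuals in the lower triangle (incl. diagonal).
    tri = np.tril_indices_from(resid)
    rmr = np.sqrt(np.mean(resid[tri] ** 2))
    return gfi, rmr
```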

With respect to the question of when to reject the null hypothesis, substantial differences in abilities measured due to item format must appear across tests. The abilities measured by two tests with the same item format can then be divided into two kinds: abilities related to the latent trait measured (reading ability) and abilities related to the item format. Of course, two tests with the same format for the same latent trait will not correlate perfectly, because of differences in texts (Biemiller, 1977; Klein-Braley, 1983; Neville & Pugh, 1977) or differences in items. Therefore, effect size was related to the differences in abilities between two reading comprehension tests. That is, the results were interpreted as an effect due to item format only if the differences in regression weights of the reading comprehension test scores with a different format on SI ability test scores were larger than the differences in regression weights between two different reading comprehension tests with the same format.

The difference in fit between the format model and the difference model is the test of the first hypothesis that there are no differences due to the format of the items. The null hypothesis is rejected only if the difference in fit between the format model and the difference model is larger than the difference in fit between the no-difference model and the format model.
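The decision rule amounts to comparing two differences in fit; the small sketch below spells this out with hypothetical goodness-of-fit values (not the estimates reported in Table 3).

```python
def format_effect(gfi_no_difference: float, gfi_format: float, gfi_difference: float) -> bool:
    """Reject the no-format-effect hypothesis only if relaxing the format
    constraint improves fit more than allowing test (text) differences did."""
    gain_from_test_differences = gfi_format - gfi_no_difference
    gain_from_format_differences = gfi_difference - gfi_format
    return gain_from_format_differences > gain_from_test_differences

# Hypothetical values:
print(format_effect(0.90, 0.94, 0.95))  # False: format adds less than test differences did
```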

Results

Table 3 shows the fit of all three models for both variable sets. It can be seen in Table 3 that the fit of the no-difference model for both variable sets is clearly worse than that of the format model. Hence it must be concluded that there are differences in regression weights of the reading comprehension factor on the SI factors due to the reading comprehension test involved. The difference in fit between the format model and the difference model, based on either the goodness-of-fit index or the root-mean-square residual, is less than the difference in fit between the no-difference model and the format model. This holds again for both variable sets. According to the criterion specified above, it must be concluded that there is no substantial difference in regression weights of η^(g) on ξ^(g) due to the format of the items. Hence the answering of open-ended and multiple-choice items is congeneric with respect to the SI abilities measured.

Table 3: Goodness-of-Fit Index (GFI) and Root-Mean-Square Residuals (RMR) of Three Models for Two Variable Sets for the Relation Between 16 SI Abilities and Reading Comprehension

In Table 4 the parameter estimates are presented as estimated under the restrictions of the format model (see the format model restrictions above). The goodness of fit of this model, with all 32 SI variables included, did not differ very much from the goodness-of-fit estimates of the format model presented in Table 3 (goodness-of-fit index = .938; root-mean-square residual = .071).

As can be seen in Table 4, a relatively large part of the variance in the true reading comprehension scores is explained by the 16 SI abilities. The point estimates (1 - ψ) varied from 62% for Test A to 66% for Test B.

Table 4: Parameter Estimates Under the Restrictions of the Format Model

The parameter estimates in Table 4 imply that reading comprehension has a moderate regression on most SI abilities. The high regression weights of the memory factors (MMI, MMS, MMT, and MMU) and EMS coincide with current theories of reading comprehension. The low regression weights of reading ability on CMU indicate that variance in word knowledge did not play an important role in answering the items of both tests.

The negative weights of the two divergent-production abilities (DMR and DMU) indicate that students who had a high score on these tests had a low achievement score on the reading comprehension tests. These negative weights can also be interpreted as suggesting an interaction effect between divergent-production/evaluation abilities and reading comprehension. In order to inspect the data with respect to this hypothesis, the scores for DMR and DMU, and the sum of the scores for the evaluation ability tests (EMI, EMS, and EMU), were dichotomized at their medians. This resulted in two 2 x 2 x 2 cross-tables (see Table 5).

Table 5: Cross-Tabulation of the Number of Respondents per Cell for Divergent-Production Abilities (DMR and DMU), Evaluation Abilities (EMI + EMS + EMU), and Reading Comprehension (L = Low, H = High)

The interaction between the (2 x) four cell frequencies in Table 5 for the high-divergent-production ability group was tested by estimating this effect in a loglinear model (Fienberg, 1980) by

μ = Σ_ijk β_ijk ln(x_ijk) ,

where μ indicates the interaction effect between reading and evaluation ability for the high-divergent-production ability group, and β_ijk is a contrast assigned to the cell frequency of cell x_ijk (subject to the restriction Σ β_ijk = 0). If the contrasts are selected according to the specified hypothesis (i.e., 0 for the four low-divergent-production ability cells and 1, -1, -1, 1 for the high-divergent-production ability cells), μ can be estimated. The large-sample variance of the contrast was estimated by

Var(μ) = Σ_ijk β_ijk^2 / x_ijk .

The standardized effect was then estimated by dividing the estimated effects by their standard deviations. Table 6 presents the results of this analysis for both divergent-production abilities.

The data in Table 6 imply that both effects were significant. Students with high divergent-production ability scores and high evaluation ability scores evaluated their answers on the reading comprehension task. Students who had high divergent-production abilities but low evaluation abilities read less well. This supports the importance of evaluation abilities for answering reading comprehension items.

Table 6: Results of the Analysis of an Interaction Effect (μ) Between Divergent-Production Ability, Evaluation Ability, and Reading Comprehension for the High-Divergent-Production Ability Group
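A compact sketch of the whole check described above: dichotomize at the medians, cross-tabulate, and test the hypothesized contrast among the log cell frequencies; the function names and the example counts are hypothetical, not the study's data.

```python
import numpy as np

def dichotomize(scores: np.ndarray) -> np.ndarray:
    """1 for scores above the median, 0 otherwise."""
    return (scores > np.median(scores)).astype(int)

def contrast_test(table: np.ndarray, beta: np.ndarray) -> tuple:
    """Estimate a loglinear contrast among cell frequencies and its standardized value.

    table: 2 x 2 x 2 array of cell counts (divergent production x evaluation x reading).
    beta : contrast weights of the same shape, summing to zero.
    """
    x = table.astype(float).ravel()
    b = beta.ravel()
    mu = np.sum(b * np.log(x))       # estimated contrast among log cell frequencies
    var = np.sum(b**2 / x)           # large-sample variance of the contrast
    return mu, mu / np.sqrt(var)     # contrast and standardized effect

# Contrast: 0 for the low-divergent-production cells, 1, -1, -1, 1 for the high cells.
beta = np.zeros((2, 2, 2))
beta[1] = np.array([[1, -1], [-1, 1]])

# Hypothetical cell counts in the same layout; in practice they would be built by
# cross-tabulating dichotomize(dmr), dichotomize(eval_sum), and dichotomize(reading).
counts = np.array([[[40, 35], [30, 45]],
                   [[25, 50], [45, 20]]])
mu, z = contrast_test(counts, beta)
print(round(mu, 2), round(z, 2))
```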

Discussion

It was not possible to demonstrate a substantial difference in intellectual abilities measured with either open-ended items or multiple-choice items for reading comprehension. Therefore, open-ended and multiple-choice items for reading comprehension are evidently congeneric with respect to the SI abilities measured. Because a relatively large proportion of the variance in true reading comprehension scores was accounted for by the SI ability tests, it is tempting to conclude that if semantic abilities are involved differentially in answering reading comprehension items with a different format, they only play a minor role.

The lack of a substantial difference due to item format does not mean there are no differences at all between items in different formats. It only means that the differences in semantic abilities tested due to the format of the items seem small relative to the differences in abilities tested with (largely) comparable reading comprehension tests. Furthermore, the lack of a substantial difference due to item format only concerns semantic abilities, and no generalization to non-semantic abilities is warranted.

The results of this study do not suggest that the answering processes for different item formats are identical. Respondents may obtain the same total reading comprehension score using the same intellectual skills, although the processes involved differ, without one being more efficient than the other (Sternberg, 1980).



The results imply that answering multiple-choice questions is not solely a matter of comparing and deleting alternatives. Convergent- and divergent-production abilities appear to be important as well. Nor did the answering of open-ended items prove to be solely a question of production abilities, as indicated by the interaction between divergent-production abilities, evaluation abilities, and reading comprehension. Good readers seem to evaluate their answers on both divergent-production and reading comprehension items, whereas students with high divergent-production abilities and low evaluation abilities seem to go astray because of their lack of evaluation of their answer with reference to either the text or the item they were supposed to answer. This conclusion poses a problem for most research on the basis of the SI model, because it is usually assumed that intellectual tasks such as reading can be described as a linear combination of single SI abilities; interaction effects are usually neglected.

The results suggest, therefore, that students seem to construct their answers to multiple-choice items to the same degree as when they answer open-ended reading comprehension items. Of course, these results are not generalizable to a situation in which a compare-and-delete strategy for answering multiple-choice items is explicitly taught, as may be the case in the last weeks or months before important examinations (Gipps, Steadman, Blackstone, & Stiener, 1983; Goslin, 1967).

Memory abilities proved to be correlated with the answering of reading comprehension items. This is in concordance with models of text comprehension, in which the activation and revision of the text representation forms an essential part. The importance of memory for semantic systems, together with the importance of evaluation of semantic systems, both indicating the activation and revision of text representations, can be interpreted as support for these models.

The difference in abilities tested between the two reading comprehension tests observed in Table 4 has two possible explanations. The first concerns the different structures of the texts for which the items were constructed. In the second text (Test B) more transformations had to be recognized and produced, whereas the first text (Test A) was more explicit. On the other hand, the differences in abilities measured might also be due to small differences in items, although this did not appear in the classification of the items.

References

Ackerman, B. P. (1984a). The effects of storage and processing complexity on comprehension repair in children and adults. Journal of Experimental Child Psychology, 37, 303-343.
Ackerman, B. P. (1984b). Storage and processing constraints on integrating storage information in children and adults. Journal of Experimental Child Psychology, 38, 64-92.
Benson, J. (1981). A redefinition of content validity. Educational and Psychological Measurement, 41, 793-802.
Benson, J., & Crocker, L. (1979). The effects of item format and reading ability on objective test performance. Educational and Psychological Measurement, 39, 381-387.
Biemiller, A. (1977). A relationship between oral reading rates for letters, words and simple texts in the development of reading achievement. Reading Research Quarterly, 12, 259-266.
Birenbaum, M., & Tatsuoka, K. K. (1987). Open-ended versus multiple-choice response formats--it does make a difference for diagnostic purposes. Applied Psychological Measurement, 11, 385-395.
Bracht, G. H., & Hopkins, K. D. (1968). Comparative validities of essay and objective tests (Research Paper No. 20). Boulder CO: Laboratory of Educational Research, University of Colorado.
Brown, J. (1976). An analysis of recognition and recall and of problems in their comparison. In J. Brown (Ed.), Recall and recognition. London: Wiley.
Carter, H. D., & Crone, A. P. (1940). The reliability of new-type or objective tests in a normal classroom situation. Journal of Applied Psychology, 24, 353-368.
Cattell, R. B. (1963). Abilities--their structure, growth and action. Boston: Houghton Mifflin.
Cook, D. L. (1955). An investigation of three aspects of free-response and choice-type tests at college level. Dissertation Abstracts, 1351.
Coombs, C. H., Milholland, J. E., & Womer, F. B. (1956). The assessment of partial knowledge. Educational and Psychological Measurement, 16, 13-37.
Croft, A. C. (1982). Do spelling tests measure the ability to spell? Educational and Psychological Measurement, 42, 715-723.

Cronbach, L. J. (1966). New light in test strategy from decision strategy. In A. Anastasi (Ed.), Testing problems in perspective. Washington DC: American Council on Education.

Culpepper, M. M., & Ramsdale, R. (1983). A comparison of multiple-choice and essay tests of writing skill. Research in the Teaching of English, 41, 295-298.
Davis, F. B. (1968). Research in comprehension in reading. Reading Research Quarterly, 3, 499-535.
Doolittle, A. E., & Cleary, T. A. (1987). Gender-based differential item performance in mathematics achievement items. Journal of Educational Measurement, 24, 157-166.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge: MIT Press.
Frary, R. B. (1985). Multiple-choice versus free-response: A simulation study. Journal of Educational Measurement, 22, 21-31.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1-67.
Gipps, C., Steadman, S., Blackstone, T., & Stiener, B. (1983). Testing children: Standardized testing in local educational authorities and schools. London: Heinemann.
Goslin, D. A. (1967). Teachers and testing. New York: Russell Sage Foundation.
Guilford, J. P. (1971). The nature of human intelligence. London: McGraw-Hill.
Guilford, J. P. (1979). Cognitive psychology with a frame of reference. San Diego: EDITS.
Guilford, J. P., & Hoepfner, R. (1971). The analysis of intelligence. New York: McGraw-Hill.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Guttman, L. (1965). A faceted definition of intelligence. In K. Eiferman (Ed.), Studies in psychology, Scripta Hierosolymitana 14 (pp. 166-181). Jerusalem: Hebrew University.
Heim, A. W., & Watts, R. B. (1967). An experiment on multiple-choice versus open-ended answering in a vocabulary test. British Journal of Educational Psychology, 37, 339-346.
Hoeks, J. (1985). Vaardigheden in begrijpend lezen [Abilities in reading comprehension]. Unpublished doctoral dissertation, University of Amsterdam.
Hogan, T. P. (1981). Relationships between free-response and choice type tests of achievement: A review of literature. Washington DC: National Institute of Education.
Hopkins, K. D., George, C. A., & Williams, D. D. (1985). The concurrent validity of standardized achievement tests by content area using teachers' ratings as criteria. Journal of Educational Measurement, 22, 177-182.

Hopkins, K. D., & Stanley, J. C. (1981). Educational and psychological measurement and evaluation (6th ed.). Englewood Cliffs NJ: Prentice-Hall.
Hurd, A. W. (1930). Comparison of short-answers and multiple-choice type tests covering identical subject content. Journal of Educational Research, 26, 28-30.

Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-132.
Jöreskog, K. G., & Sörbom, D. (1986). LISREL: Analysis of linear structural relationships by maximum likelihood, instrumental variables and least squares methods. Uppsala, Sweden: University of Uppsala.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329-354.
Kinney, C. B., & Eurich, A. C. (1938). A summary of investigations comparing different types of tests. School and Society, 36, 540-544.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of discourse comprehension of simple prose. Psychological Review, 85, 363-394.
Klein-Braley, C. (1983). A cloze is a cloze is a question. In J. W. Oller (Ed.), Issues in language testing research. Rowley MA: Newbury House.
Lorch, R. F., Lorch, E. P., & Matthews, P. D. (1985). On-line processing of the topic structure of a text. Journal of Memory and Language, 24, 350-362.
Lorch, R. F., Lorch, E. P., & Mogan, A. (1987). Task effects and individual differences in on-line processing of the topic structure of a text. Discourse Processes, 10, 63-80.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Wesley.
Mellenbergh, G. J. (1971). Studies in studietoetsen [Studies in educational testing]. Unpublished doctoral dissertation, University of Amsterdam.
Meuffels, B. (1982). Studies over taalvaardigheid [Studies in language abilities]. Unpublished doctoral dissertation, University of Amsterdam.
Mezynski, K. (1983). Issues concerning the acquisition of knowledge: Effects of vocabulary training on reading comprehension. Review of Educational Research, 53, 253-279.
Murphy, R. J. (1982). Sex differences in objective test performance. British Journal of Educational Psychology, 52, 213-219.
Neville, M. H., & Pugh, A. K. (1977). Context in reading and listening: Variations in approach to cloze tasks. Reading Research Quarterly, 11, 12-31.
Paterson, D. G. (1926). Do old and new type examinations measure different mental functions? School and Society, 24, 246-248.
Samson, D. M. M. (1983). Rasch and reading. In J. van Weeren (Ed.), Practice and problems in language testing 5. Arnhem, The Netherlands: CITO.

Shohamy, E. (1984). Does the testing method make a difference? The case of reading comprehension. Language Testing, 1, 147-169.

Spache, G. D. (1963). Toward better reading. Champaign IL: Gerrard.
Spearman, C. (1927). The abilities of man. London: Macmillan.
Sternberg, R. J. (1980). A proposed resolution of curious conflicts in the literature of linear syllogisms. In R. S. Nickerson (Ed.), Attention and performance (Vol. VII). Hillsdale NJ: Erlbaum.
Sternberg, R. J., & Powell, J. S. (1983). Comprehending verbal comprehension. American Psychologist, 38, 878-893.
Thurstone, L. L., & Thurstone, T. G. (1941). Factorial studies of intelligence. Psychometric Monographs, No. 2.
Traub, R. E., & Fisher, C. W. (1977). On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement, 1, 355-369.
van den Bergh, H. (1987). Open en meerkeuzevragen tekstbegrip: Een onderzoek naar de relatie tussen prestaties op open en meerkeuzevragen [Open-ended and multiple-choice items: A study of the relationships between achievements on open-ended and multiple-choice items for reading comprehension]. Tijdschrift voor Onderwijsonderzoek, 12, 304-312.
van den Bergh, H. (1989). De correlationele structuur van enkele schriftelijke taalvaardigheden [The correlational structure of several language abilities]. Tijdschrift voor Taalbeheersing, 11, 38-59.
van den Bergh, H., Eiting, M., & Otter, M. (1988). Differentiële effecten van vraagvorm bij aardrijkskunde en natuurkunde examens [Differential effects of item format in geography and science exams]. Tijdschrift voor Onderwijsonderzoek, 13, 270-284.
Vernon, P. E. (1962). The determinants of reading comprehension. Educational and Psychological Measurement, 22, 269-286.
Ward, W. C. (1982). A comparison of free-response and multiple-choice forms of verbal aptitude tests. Applied Psychological Measurement, 6, 1-11.
Ward, W. C., Frederiksen, N., & Carlson, S. B. (1980). Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement, 17, 11-29.
Weiss, D., & Jackson, R. (1983). The validity of descriptive tests of language skills: Relationships to direct measures of writing skill ability and grades in introductory college English courses (Report LB-83-4; ETS-RR-83-27). New York: College Entrance Examination Board.

Acknowledgments

This research was funded by the Institute of Educational Research (SVO) in The Netherlands (Project 4075). The author wishes to thank Mindert Eiting, Jan Elshout, Gideon Mellenbergh, and Bert Schouten for their advice and comments.

Author's Address

Send requests for reprints or further information to Huub van den Bergh, University of Utrecht, Faculteit der Letteren, Trans 10, 3512 JK Utrecht, The Netherlands.
