
English Teaching, Vol. 65, No. 4 Winter 2010

Item Difficulty Predictors of a Multiple-choice Reading Test

Yuah V. Chon∗ (Hanyang University) Tacksoo Shin∗∗ (Myongji University)

Chon, Yuah V., & Shin, Tacksoo. (2010). Item difficulty predictors of a multiple-choice reading test. English Teaching, 65(4), 257-282.

The researchers' experiential knowledge demonstrates that the task of predicting and controlling the difficulty level of the multiple-choice items of the College Scholastic Ability Test (CSAT) for English is substantially left to the subjective judgment of experienced item writers. The present study accordingly recognizes a need to identify item difficulty predictors and to build an item difficulty prediction model to handle this pertinent issue. Taking particular interest in constructing a model for the multiple-choice reading subset of the CSAT, the study was conducted by identifying item difficulty predictor variables from previous research, and by validating the candidate predictors via questionnaires in which highly experienced teacher-raters were asked to analyze reading items from the English reading subset of the preliminary CSAT (i.e., yun-hap-hak-ryuk-pyung-ga) administered in March 2009. Using a multiple regression technique and restricted maximum likelihood estimation, an item difficulty prediction model was generated. In order to check the validity and applicability of the prediction model, the hypothetical model was finally tested on a subsequent version of the test administered in July 2009. This type of model building is expected to guide test developers in designing an item pool in accordance with special needs, such as constructing multiple test forms with similar mean difficulties.

I. INTRODUCTION

The growing importance of high-stakes tests cannot be overemphasized in the Korean context, where national tests such as the College Scholastic Ability Test (CSAT) are used as the primary measure of students' academic ability for gatekeeping to universities.

∗ Yuah V. Chon is the first author. ∗∗ Tacksoo Shin is the corresponding author.


In spite of the efforts of university admissions offices to incorporate more performance-based measures of students' ability through their high school academic records, the test remains the common indicator of students' relative potential for academic success, and this is no exception when assessing students' English language ability. Therefore, at the end of every academic year, the control of item difficulty becomes a critical factor to be considered for valid accounts of the students' English language ability (Chang, 2004; Jin & Park, 2004). However, there is a lack of studies on item difficulty predictor variables of reading tests, and much of the prediction of item difficulty in the actual development of the CSAT has so far depended on the subjective judgment of experienced item writers, so there is a need for a reliable model that can systematically predict the item difficulty of a test. Also, the recent announcement by MEST (Ministry of Education, Science and Technology) that 70% of the items for the English section of the CSAT will be adopted and/or adapted from the state-run EBS (Educational Broadcasting System) practice test booklets and online lecture materials (Kim, 2010) now makes item writers more accountable when selecting and adapting existing items.¹ As such, while recognizing the increasing importance of item writers being judgmentally attuned when selecting items for national tests, the purpose of the current study is to find a best-fitting prediction model that can provide objective information on the difficulty level of a test when reading items are developed for high-stakes tests such as the CSAT.

Before describing the study, however, two pragmatic limitations need to be explained. First, the item difficulty prediction model is based on Classical Test Theory. Unfortunately, more advanced psychometric machinery (i.e., item response theory; IRT) is not applicable to the CSAT, since items that have appeared in previous CSATs are released to the public and may not be reused. Also, psychometric information on CSAT items cannot be collected through preliminary CSATs (i.e., Su-Neung mo-i pyung-ga). Under these circumstances, it is difficult to develop an IRT model for measuring item difficulty, and we therefore found it unachievable to establish a proper test-equating system (see Choi, 2008 for more on the testing context in Korea).² A second limitation of the study was that, in order to devise an item difficulty prediction model, we needed data on the percentage of correct answers (PCA). However, for test security reasons (i.e., information on the PCA of the CSAT is available only to test administrators and the relevant authorities), we extrapolated the item difficulty prediction model based on the PCA of preliminary CSATs administered in March 2009 and July 2009.³ We were justified in doing this since the two tests share an identical test format (i.e., multiple-choice with one correct option and four distractors), assess identical constructs of English reading ability, and are administered to the same type of students. In fact, in the present study, our main interest is in predicting the difficulty of the English reading subset of the CSAT, and of other tests similar in form.

¹ The authors are aware of the recent interest in the nationally promoted standardized English proficiency test often referred to as the 'Korean TOEFL,' but this paper takes interest in the imminent policy of the Ministry of Education, Science and Technology of linking 70% of the items of the CSAT to EBS materials/online lectures.
² Discussion of test-equating methods within the psychometrics area is beyond the scope of the present study.
³ The preliminary CSAT (i.e., yun-hap-hak-ryuk-pyung-ga) is organized and administered four times a year by the 16 Metropolitan and Provincial Offices of Education so that university applicants can practice for the actual high-stakes CSAT.

II. BACKGROUND

1. Item Difficulty Predictor Variables and Reading Comprehension

Studies exploring variables related to reading item difficulty were in vogue mostly from the 1980s to the mid 1990s, when researchers in the areas of educational measurement and reading sought predictor variables that could substantially explain the variance in reading item difficulty. The influential study conducted by Drum, Calfee, and Cook (1981) aimed to predict difficulty for multiple-choice items on reading comprehension tests by rating texts, item stems, and distractors according to variables related to structure (e.g., number of words in the stem and options, word frequency measures for the text). They found overall that vocabulary variables had a stronger effect on difficulty than those related to syntactic properties of the texts. That is, the appearance of infrequent words in the stem affected difficulty, but the major contributing factor to item difficulty turned out to be the plausibility of distractors. In comparison, Embretson and Wetzel (1987) found that response decision variables had a greater effect on item difficulty than text-related variables when trying to predict item difficulty for multiple-choice paragraph comprehension items using models that dealt with the processing stages of both text representation and response decision.

Other variables historically considered to be relatively important in affecting comprehension difficulty are vocabulary level (Graves, 1986; Klare, 1984; Qian, 1999, 2002; Read, 2000) and syntactic complexity (Leow, 1993). Vocabulary level has been measured via syllable length, frequency of word usage (i.e., low-frequency vs. high-frequency words; Nation, 2001), or type of word (i.e., content vs. function word; Davey, 1988). In fact, Qian and Read have shown that readers' vocabulary knowledge routinely correlates highly with measures of reading comprehension and is often the single best predictor of text comprehension. Syntactic complexity has primarily been described in terms of sentence length (number of words per sentence), which is expected to make texts more difficult to understand, as can be inferred from its use, together with vocabulary level, in traditional readability formulas (Graves, 1986). Other difficulty variables include paragraph length (Hites, 1950, cited in Freedle & Kostin, 1993; Yano, Long & Ross, 1994), number of paragraphs (Freedle, Fine, & Fellbaum, 1981), and abstractness of text (Paivio, 1986), where longer paragraphs and more abstract texts are considered to make passages more difficult to comprehend.

Although not using multiple-choice methods to yield an index of comprehension difficulty, some empirical studies, by use of dependent measures such as recall of passages or decision time, have found certain variables to have an influence on comprehension difficulty. For instance, Abrahamsen and Shelton (1989) demonstrated improved comprehension of texts when full noun phrases were substituted in place of referential expressions. Their study suggests that texts with many referential expressions may be more difficult than ones with few referential expressions.

Kieras (1985) examined how students' perception of the relative location of main idea information in short paragraphs influenced item difficulty. He found that most students perceived the main idea to be located early in the paragraph, whereas a few thought the main idea occurred at or near the end of the paragraph. In the same vein, Hare, Rabinowitz, and Schieble (1989) systematically varied the location of a main idea across three positions (i.e., the opening sentence, the medial sentence, or the final sentence of a paragraph). They found that correct identifications were greatest when main idea sentences occurred initially, so their results give evidence that the position of clues influences item difficulty.

The prediction of item difficulty has also been conducted on the Scholastic Assessment Test (SAT) and the Graduate Record Examination (GRE) reading comprehension tests by Freedle and Kostin (1991, 1992). They found abstractness of text, the number of negations, paragraph length, and the rhetorical organization of the text to increase item difficulty. In a follow-up study, Freedle and Kostin (1993) found results similar to those they had found for SAT and GRE items when studying items on the Test of English as a Foreign Language (TOEFL).

Freedle and Kostin (1993) were able to categorize the predictor variables influencing item difficulty on the TOEFL reading test into three groups: item variables, text variables, and text-by-item overlap variables. Item variables include item type, item stem (e.g., words in stem, negatives in stem), and the item's correct option (e.g., answer position, number of words in correct option, number of referentials in correct option); text variables suggested by the researchers were vocabulary, concreteness/abstractness of text, subject matter of text, types of rhetorical organization, length of text, and text referentials. Text-by-item overlap variables are those related to the relative location of key information, for instance, as to whether the main idea is in the first sentence, in the middle, or at the end of the text. Results of the study indicated text or text-by-item variables to be better predictors of item difficulty than item-related variables.

Unlike the approach adopted in previous studies, Perkins, Gupta, and Tammana (1995) used an artificial neural network (ANN) to predict item difficulty for 29 multiple-choice reading comprehension items taken from TOEFL tests. The use of ANN, unlike the traditional statistical procedure of multiple regression, does not require the assumption of a linear relationship between the predictor variables and the dependent variable, and for this reason was preferred by the authors, but their study did not specify the types of variables for which the prediction was most sensitive.

The recent study conducted by Ozuru, Rowe, O'Reilly and McNamara (2008) examined the extent to which item and text characteristics predict item difficulty on the comprehension portion of the Gates–MacGinitie Reading Tests for the 7th–9th and 10th–12th grade levels. Detailed item-based analyses were performed on 192 comprehension questions on the basis of the cognitive processing model framework proposed by Embretson and colleagues (Embretson & Wetzel, 1987). Item difficulty was analyzed in terms of various passage features (e.g., word frequency and number of propositions) and individual-question characteristics (e.g., abstractness and degree of inferential processing), using hierarchical linear modeling. The results indicated that the difficulty of the items in the test for the 7th–9th grade level is primarily influenced by text features—in particular, vocabulary difficulty—whereas the difficulty of the items in the test for the 10th–12th grade level was less systematically influenced by text features.

As a whole, it should be noted that with the exception of Freedle and Kostin's study (1993) and that of Perkins et al. (1995), there is a lack of studies focused on predicting difficulty for multiple-choice items in the context of tests of English as a foreign language. It is reasonable to assume that the variables that influence the level of difficulty for non-native speakers might differ from those affecting native speakers. For instance, while native speakers' process of decoding words in a text may be quite automatic, the same may not necessarily be the case for non-native speakers.

2. Predictor Variables in the Multiple-choice Reading of CSAT in Korea

The few studies on item difficulty focusing on predictor variables of the CSAT, where English is tested as a foreign language, have been conducted by Chang (2004) and Jin and Park (2004). Jin and Park (2004) took interest in finding significant predictor variables of the CSAT, while Chang (2004) took separate interest in building a model for difficulty prediction of the CSAT.

Chang (2004) aimed to develop a statistical model for predicting the item difficulty of the English reading test of the CSAT. At the initial phase, the study investigated variables that were significantly correlated with the item difficulty of the English reading test. Using the correlated variables, an instrument was designed to gather data on the item difficulty of each item of the English reading tests of the 2002 and 2003 CSAT. Correlation analysis was conducted to obtain models which could predict the item difficulty of the tests, and when the results were validated by applying the models to the September 2003 preliminary CSAT and the 2004 CSAT, the linear regression equation model from the 2003 CSAT showed an acceptable level of stability and predictability. The study also revealed inferencing skills, grammatical judgment, and option plausibility to be the statistically significant predictors, although the proportion of variance accounted for by the predictors differed.

In a similar vein, Jin and Park (2004) aimed to investigate variables that would affect the difficulty of the English items of the CSAT. Hypothesized variables were drawn from previous literature and an analysis of the 2003 CSAT English items. With tests of correlation and multiple regression, 24 of the 46 hypothesized variables were found to be positively correlated with the percentage of correct answers (PCA) in the analysis of the 2003 CSAT English items. The variables found to be positively correlated with PCA, that is, related to the easiness of the items, were item type variables, such as inferring assertion, inferring the author's emotional status, understanding topic, understanding atmosphere, and understanding assertion; and text variables, such as number of text sentences, main idea in the first sentence, and information in the last sentence. The regression analysis also showed that the 14 variables from the correlation analysis explained 49% of item difficulty.

The literature reviewed so far thus recognizes that there is a lack of studies on item difficulty predictor variables in English as a foreign language contexts, and on constructing item difficulty prediction models for the CSAT. Having noticed this gap in research, the present study aims to gain a greater understanding of the predictor variables that are likely to influence the difficulty level of the English reading subset of the CSAT. To do this, the researchers re-categorized the item difficulty predictor variables reviewed in previous research, and used statistical analysis to cross-check the validity of the hypothetical model by applying it to another version of the test. The research questions are presented as follows:

1) Which predictor variables significantly influence reading item difficulty within the multiple-choice English reading CSAT?
2) How can item difficulty be predicted in a statistical model (i.e., multiple regression approach)? How can the model be used to construct multiple test forms of the English reading subsets of the CSAT with similar means of difficulty without using common items across tests?


III. METHOD

In this section, the researchers explain how the study was conducted to 1) select the item difficulty predictors, 2) construct and administer the questionnaires for the evaluation of the item difficulty predictors, 3) select and train the raters, and 4) analyze the data collected to arrive at a hypothetical model and cross-validate it on another version of the preliminary CSAT, which was needed to test the validity and applicability of the prediction model.

1. Selection of Item Difficulty Predictors

The candidate item difficulty predictor variables (henceforth predictors) were extracted from earlier studies and from item analysis reports of previous CSATs (i.e., the 2008 and 2009 CSAT), with attention to any variables that might affect item difficulty. A group of veteran high school teachers (expert judges) was also consulted on the validity of the predictor variables selected by the researchers. As such, collecting possible predictors of item difficulty and determining their effects on item difficulty established a basis for constructing a questionnaire. In sum, the reading predictors selected from the previous literature were paragraph length (Hites, 1950; Yano et al., 1994), incorrect option plausibility (Drum et al., 1981; Chang, 2004), information-processing unit (Freedle & Kostin, 1993), familiarity of topic (Freedle & Kostin, 1993), complexity of sentences (Klare, 1974-1975, cited in Freedle & Kostin, ibid.), frequency of referents (Abrahamsen & Shelton, 1989; Freedle & Kostin, ibid.), familiarity of vocabulary (Freedle & Kostin, 1993; Graves, 1986; Ozuru et al., 2008), and position of clues (Freedle & Kostin, 1993; Hare et al., 1989; Kieras, 1985).

In addition to the predictors selected from previous literature, language of options, time spent to solve item, and estimate of overall difficulty were also included by the researchers. The inclusion of language of options was considered logical based on the experiential knowledge of veteran teachers, who reported that the language of the options presented in multiple-choice questions may influence item difficulty (personal communication, Oct. 2009). Regarding time spent to solve item, since the average time taken to solve a reading item can be estimated at about 1.5 minutes, based on the number of items and the time given for the reading section (e.g., 50 minutes), an item can be predicted to be relatively more difficult when more time than average is needed to solve it. By also including estimate of overall difficulty (i.e., predicted percentage of correct answers: predicted PCA) as a predictor, we intended to compare that value with the mathematical model derived from the rest of the item difficulty predictors.


2. Questionnaire for Evaluation of Item Difficulty Predictors

1) Construction and Administration of Questionnaire

The questionnaire was constructed in Korean, and the predictors were presented according to three variable types - item variables, text variables, and item/text variables (Freedle & Kostin, 1991, 1993; see II. Background). Item variables included estimate of overall difficulty, language of options, incorrect option plausibility, time spent to solve item, and information-processing unit; text variables included familiarity of topic, paragraph length, complexity of sentences, frequency of referents, and familiarity of vocabulary; and the item/text variable was position of clues (see Appendix A for the questionnaire). The questionnaire was to be marked by raters (see 3. Raters and Rater Training below), who were asked to code the predictors after examining each item of the March 2009 preliminary CSAT (see 2) Reading Comprehension Items below for a description of the items).

Although an effort was made to find predictors that could apply to the full set of items in the test, exceptions occurred: the predictor position of clues could not be applied to item types on 'understanding literal information' (36, 37),⁴ 'understanding tables' (35), 'grammar' (21, 22), and 'vocabulary' (28, 29). Also, when the raters were asked to write an estimate of overall difficulty, they were asked to write it in percentages. With regard to paragraph length, the number of words was indicated for each passage so that the raters only needed to mark the range of words according to the provided options (e.g., between 111 and 130 words).

With the exception of language of options, for which only two options could be provided (i.e., Korean and English), the rest of the predictors (i.e., incorrect option plausibility, time spent to solve item, information-processing unit, familiarity of topic, complexity of sentences, frequency of referents, familiarity of vocabulary, and position of clues) were provided with 4-point Likert scales for the raters to mark. As an example, the scale for time spent to solve item, which can clearly be represented as a numerical concept, is presented below.

TABLE 1
Scale for Time Spent to Solve Item

Time spent (seconds):  60 or less | 70 | 80 | 90 or more
Scale:                 1          | 2  | 3  | 4
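To make the coding rule concrete, the sketch below maps an estimated solving time onto the 4-point scale of Table 1. The cutoffs between the anchor points (e.g., how to code 75 seconds) are not specified in the questionnaire, so the thresholds used here are an assumption.

    def time_scale(seconds: float) -> int:
        # Map an estimated solving time to the 4-point scale of Table 1.
        # Cutoffs between the anchors (60/70/80/90 s) are assumed, not given.
        if seconds <= 60:
            return 1
        elif seconds <= 70:
            return 2
        elif seconds <= 80:
            return 3
        else:  # 90 seconds or more (and anything above 80)
            return 4

    print(time_scale(55), time_scale(75), time_scale(95))  # -> 1 3 4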

In order to prevent raters from making biased or limited judgments, an exemplary item from the English section of the 2009 CSAT was also presented with the scales so as to provide criteria for rating difficulty. Regarding complexity of sentences, for instance, question no. 31 from the 2009 CSAT was presented as an example of what is likely to be considered 'very complex' by the students. Here an item marked 'very complex' would be considered more difficult than one marked 'complex' (see Table 2).

⁴ Numbers in parentheses indicate item numbers of the March 2009 preliminary CSAT.

TABLE 2
Scale for Complexity of Sentences

Complexity of sentences:  Quite simple and general | General | Complex | Very complex*
Scale:                    1                        | 2       | 3       | 4
Note: Very complex* = CSAT 2009, Q. 31

That is, an examination of no. 31 in Appendix B demonstrates the complexity of the sentences from the students' perspective. Although there are no inverted sentence structures, the use of negatives (i.e., rarely, do not) and the frequent use of referents (e.g., them, these marks, their carriers, their designers) are likely to disrupt the learners' automatic decoding of the text. As a whole, the item also requires some intersentential processing ability⁵ to arrive at the two most appropriate sentence connectors (see Appendix A for the scales of the other predictors).

⁵ Intersentential processing refers to the reader's ability to process text across sentences within paragraphs, involving the use of either adjacent or distant sentences in comprehending the text message.

2) Reading Comprehension Items

Reading items of the March 2009 preliminary CSAT were used for coding the difficulty levels with regard to the respective predictors. Since the focus of the study is on reading, 24 reading items were selected from the full 50-item set of the CSAT (see Table 3).

3. Raters and Rater Training

After constructing the questionnaire, raters were selected to code the scales of the predictors. The raters were two university faculty members and four high school teachers who were considered to be experts in the field of English education. Their ages ranged from 35 to 46, and the teachers each had more than 8 years of high school teaching experience. Most importantly, all the raters had on past occasions been sequestered as item writers for the English section of the high-stakes CSAT.

Before the actual coding of the scales, however, a training session was held with the six raters on five other CSAT items to ensure that the raters understood the procedure for rating the scales of the predictors. They were asked to check whether they understood the meaning of each predictor (e.g., incorrect option plausibility), the directions for marking the scales of the predictors (e.g., 'Indicate the number of options that may be considered as answers to the item.'), and how the predictors should be coded by marking one of the options (see Appendix A for an illustration of the options). After checking that the raters had understood the procedure for coding the predictors, the six raters were asked to code the predictors of the 24 reading comprehension items of the March 2009 preliminary CSAT (also refer back to 1) Construction and Administration of Questionnaire for how the procedure was designed to ensure that the raters maintained objectivity in rating the scales).

TABLE 3
Distribution of Reading Item Types

Domain                 Item Type                              March 2009 Preliminary CSAT
Literal Understanding  Understanding tables                   35
                       Understanding literal information (2)  36, 37
Inferences             Inferring referents (2)                18, 19
                       Inferring purpose of passage           20
                       Inferring missing word                 24
                       Inferring missing phrase (3)           25, 26, 27
                       Inferring connectives                  30
                       Inferring author's emotional status    31
                       Inferring the main idea (4)            32, 33, 38, 39
                       Inferring assertion                    34
                       Inferring title (2)                    41, 42
                       Inferring atmosphere                   43
Grammar                Grammar (2)                            21, 22
Vocabulary             Vocabulary (2)                         28, 29
TOTAL number of items: 24

4. Data Analysis

After the reliability of the predictor variables was investigated via questionnaires, by asking the trained raters⁶ to analyze reading items from the English reading subset of the preliminary CSAT (i.e., yun-hap-hak-ryuk-pyung-ga) administered in March 2009, bivariate correlations between variables were checked so as to pre-select variables that could affect item difficulty. Thereafter, the item difficulty prediction model was constructed: the test results of the March 2009 preliminary CSAT (dependent variable) were regressed onto the raters' responses for each of the item difficulty predictors (independent variables). Here the regression coefficients were obtained using restricted maximum likelihood estimation (REML).⁷ Model fits were further examined by chi-square tests, model AIC,⁸ and R-square values. In the final stage of data analysis, the hypothetical model was tested on the July 2009 preliminary CSAT in order to test the validity and applicability of the prediction model (the cross-validation step).

⁶ To obtain the bivariate correlations and construct the item difficulty prediction model, a total of 6 raters were asked to respond to the questionnaire, so the total sample size (n₁) equaled 144 (6 raters × 24 items). For the cross-validation analysis, 5 experts participated; thus, the sample size (n₂) was 120 (5 raters × 24 items).
⁷ Restricted ML is an estimation technique which, the literature notes, provides better (more robust) parameter estimates under small-sample conditions than ML estimation.
⁸ Model AIC = -2(log-likelihood) + 2p, where p is the number of parameters; smaller values suggest a better fit.
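For readers who want to reproduce this kind of estimation, the sketch below shows one plausible way to set it up in Python with statsmodels: a mixed model fit by REML with a random intercept per rater. The column names and the file ratings_march2009.csv are hypothetical stand-ins for the 144-row rater-by-item data described in footnote 6; the paper does not specify its exact model layout.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per rater-by-item response
    # (144 rows = 6 raters x 24 items); 'pia' is each item's observed PIA
    # repeated across raters, and the remaining columns are coded scales.
    df = pd.read_csv("ratings_march2009.csv")

    # Regress PIA onto the coded predictors, estimating by REML with a
    # random intercept per rater (one plausible reading of the setup).
    full = smf.mixedlm(
        "pia ~ lang_options + option_plaus + time_spent + info_unit"
        " + topic_famil + complexity + referents + clue_position",
        data=df,
        groups=df["rater"],
    ).fit(reml=True)
    print(full.summary())

    # Model AIC per footnote 8: -2(log-likelihood) + 2p
    p = len(full.params)
    print("AIC =", -2 * full.llf + 2 * p)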

IV. RESULTS AND DISCUSSION

1. Descriptive Information and Preliminary Analysis

Before the main analysis was conducted to build the difficulty prediction model, a preliminary analysis was carried out in order to exclude predictors that appeared statistically invalid. Descriptive statistics for the 10 predictors are first presented in Table 4, which includes the mean, SD, and the frequency and percentage of responses for each of the options on the 1-4 point Likert scales.⁹

⁹ For the frequency, correlation, and model-construction analyses, a 144 × 11 matrix data structure was formulated: 144 is obtained through 6 raters × 24 items, and 11 denotes the 10 item difficulty predictor variables plus one dependent variable, the percentage of incorrect answers (PIA).


TABLE 4
Item Difficulty Predictor Variables

Language of Options (Mean 1.71, Median 2, Mode 2, SD 0.456)
  1: 42 (29.2%); 2: 102 (70.8%); 3: N/A; 4: N/A
Option Plausibility (Mean 2, Median 2, Mode 1, SD 0.968)
  1: 53 (36.8%); 2: 52 (36.1%); 3: 25 (17.4%); 4: 14 (9.7%)
Time Spent (Mean 2.28, Median 2, Mode 2, SD 0.942)
  1: 34 (23.6%); 2: 51 (35.4%); 3: 44 (30.6%); 4: 15 (10.4%)
Information Processing Unit (Mean 3.03, Median 3, Mode 3, SD 0.885)
  1: 8 (5.6%); 2: 30 (20.8%); 3: 56 (38.9%); 4: 50 (34.7%)
Familiarity of Topic (Mean 2.24, Median 2, Mode 2, SD 0.795)
  1: 25 (17.4%); 2: 66 (45.8%); 3: 46 (31.9%); 4: 7 (4.9%)
Paragraph Length (Mean 1.94, Median 2, Mode 2, SD 0.600)
  1: 30 (20.8%); 2: 92 (63.9%); 3: 22 (15.3%); 4: 0 (0%)
Complexity of Sentences (Mean 2.25, Median 2, Mode 2, SD 0.753)
  1: 17 (11.8%); 2: 84 (58.3%); 3: 33 (22.9%); 4: 10 (6.9%)
Frequency of Referents (Mean 1.76, Median 2, Mode 1, SD 0.830)
  1: 64 (44.4%); 2: 58 (40.3%); 3: 15 (10.4%); 4: 7 (4.9%)
Familiarity of Vocabulary (Mean 2.01, Median 2, Mode 2, SD 0.757)
  1: 33 (22.9%); 2: 83 (57.6%); 3: 21 (14.6%); 4: 7 (4.9%)
Position of Clues (Mean 2.78, Median 3, Mode 2, SD 0.955)
  1: 12 (8.3%); 2: 43 (29.9%); 3: 47 (32.7%); 4: 42 (31.1%)

Note: SD = standard deviation; N/A = not applicable. Each scale option (1-4) is listed as frequency of responses (percentage of responses).


As seen in Table 4, the similar central tendencies (i.e., mean, median, and mode) of the predictors and the small deviations (i.e., standard deviations) across the predictors indicate that the raters coded the difficulty levels of the predictors at similar levels. However, when the reliability coefficients¹⁰ of the predictors were calculated with Cronbach's alpha¹¹ (to check consistency across raters on the set of measurements), the results showed that one variable might need to be excluded from the extrapolation of the model (see Table 5).

TABLE 5¹²
Reliability Coefficients of Item Difficulty Predictor Variables

Predictor                      Cronbach's Alpha
Familiarity of topic           .793
Complexity of sentences        .719
Time spent                     .844
Familiarity of vocabulary      .379
Incorrect option plausibility  .563
Position of clues              .654
Information processing unit    .587
Paragraph length               .964
Frequency of referents         .623

Results on the reliability coefficients showed the coefficient for familiarity of vocabulary to be relatively low, which suggests that the raters' responses on this variable varied for certain items. As such, it was deemed necessary to exclude the variable from the pool of predictors for building the item difficulty prediction model, due to its low consistency.

¹⁰ For the reliability analysis of the 10 item difficulty predictors, ten 6 (raters) × 24 (items) matrix data structures were created. Following the reliability formula, the sum of each item's variance and the total observed item variance were computed.
¹¹ The alpha coefficient is a measure of internal consistency reliability and indicates the extent to which the ratings from a group of judges hold together to measure a common dimension.
¹² The reliability coefficient for language of options is not presented in Table 5 since the scale of this predictor is not determined by the rater but is coded directly from the options of the test items (i.e., whether they are presented in Korean or English). The scales of language of options were therefore the same across raters, and its reliability coefficient is equal to one.
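The alpha computation sketched in footnotes 10 and 11 can be written down compactly. In the sketch below, the raters play the role of alpha's 'items,' and the demo matrix is randomly generated rather than the study's actual codings; this is one standard formulation, not necessarily the authors' exact implementation.

    import numpy as np

    def cronbach_alpha(ratings: np.ndarray) -> float:
        # ratings: a (raters x items) matrix of 1-4 difficulty codings.
        # alpha = k/(k-1) * (1 - sum of per-rater variances / variance of sums)
        k = ratings.shape[0]
        rater_vars = ratings.var(axis=1, ddof=1)     # each rater across items
        total_var = ratings.sum(axis=0).var(ddof=1)  # variance of rater sums
        return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    demo = rng.integers(1, 5, size=(6, 24))  # stand-in codings; random data
    print(round(cronbach_alpha(demo), 3))    # hovers near 0; consistent raters push it toward 1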

We conducted one more preliminary analysis, with tests of correlation, to examine the relationship between the predictors and the observed percentage of incorrect answers (PIA),¹³ which allowed us to further evaluate whether certain variables could properly predict item difficulty (see Table 6). The rationale is that if some variables are highly associated with the percentage of incorrect answers (PIA), then those variables are, likewise, highly likely to predict item difficulty. As seen in Table 6, language of options had the strongest relationship with the observed PIA (r = .586), whereas time spent to solve item (r = .506), information-processing unit (r = .402), incorrect option plausibility (r = .394), and position of clues (r = .382) showed moderate associations with PIA. The correlation coefficient between the expected PIA and the observed PIA was .560. Among the item predictor variables, paragraph length and familiarity of vocabulary were not significantly related to the observed PIA. As such, through the reliability and correlation analyses, we found that some variables (i.e., paragraph length, familiarity of vocabulary) would not be helpful for constructing the item prediction model, due to low reliability and nonsignificant relationships. Here the low correlation for paragraph length may coincide with the results of Mehrpour and Riazi (2004), which suggest that differences in text length affect performance on reading comprehension only to a very marginal extent, particularly with advanced language learners. Also, with respect to familiarity of vocabulary, the nonsignificant result may be surprising considering that vocabulary has been known to be a primary factor in verbal comprehension throughout much of the history of modern psychological and educational research (Hudson, 2007). However, this seems to have occurred because some of the raters did not see vocabulary as a primary predictor of item difficulty: in-service teachers and professors who have been sequestered to write items for past CSATs are aware that the words in the reading passages are controlled to be within the vocabulary level stipulated in the curriculum and the nationally authorized English textbooks. If a word does go beyond the curriculum, an annotation for the unknown vocabulary (e.g., *artifact: 인공물) is provided to help with the examinee's decoding of the text (see Appendix B, item no. 31).

¹³ The percentage of incorrect answers (PIA) was computed as 100 − PCA (percentage of correct answers). PIA is used in this study to maintain consistency across the scales of the variables: in the questionnaire, larger scale points indicated a higher level of difficulty, and likewise, items with a large percentage of incorrect answers are likely to be difficult. As such, PIA was used when building the model and evaluating its validity.

All in all, with regard to research question 1, the significant correlations observed for language of options, time spent to solve item, information-processing unit, incorrect option plausibility, and position of clues indicate that these variables deserve attention for the control of item difficulty. In particular, it is reasonable to expect language of options to be related to item difficulty, since options (distractors) written in English (L2), in comparison to Korean (L1), are likely to impose greater cognitive demand on the learners when they try to decode the options to find the key to a multiple-choice reading item.


2. Construction of Item Predictor Model

Based on the preliminary analysis, the 8 item predictor variables (i.e., language of options, incorrect option plausibility, time spent, information processing unit, familiarity of topic, complexity of sentences, frequency of referents, and position of clues) that had been pre-selected through the reliability and correlation analyses were considered for the difficulty prediction model using multiple regression techniques. This is called the full model. Then, the finalized model was constructed by excluding the item difficulty predictors that were not significantly related to the percentage of incorrect answers. The results are shown in Table 7.

TABLE 7
Item Difficulty Prediction Models

                               Regression Coefficient
Predictor Variables            Full Model         Finalized Model
Intercept                      3.853**            2.551**
Language of options            16.863*** (.660)   15.994*** (.663)
Incorrect option plausibility  .900* (.058)       1.261** (.109)
Time spent                     3.944** (.284)     4.132*** (.312)
Information processing unit    1.668** (.152)     1.577*** (.171)
Familiarity of topic           -1.212 (1.076)     -
Complexity of sentences        1.249* (.510)      -
Frequency of referents         -.921 (2.074)      -
Position of clues              1.685** (.148)     1.815*** (.169)
Model AIC                      225.017            231.254
Chi-square                     315.017***         321.254***
R-square                       .533 (.730)        .507 (.712)
Note: * p<.05; ** p<.01; *** p<.001. Parentheses indicate the standardized coefficients.

From the full model, familiarity of topic, complexity of sentences, and frequency of referents were negatively and nonsignificantly associated with the percentage of incorrect answers (PIA). Accordingly, these three variables were excluded and the model was rerun. With repetition of the same procedure, the finalized item predictors that positively and significantly affected PIA turned out to be language of options, incorrect option plausibility, time spent to solve item, information-processing unit, and position of clues. In the fit indices, the full model (R-square = .533) reproduced the data slightly better than the finalized model (R-square = .507). Also, when considering the chi-square and model AIC, the full model showed a better fit. However, these differences between the full and finalized models were not dramatic, which indicates that the simpler (i.e., finalized) model also fit acceptably. Since parsimonious structures can statistically yield increased efficiency in the form of statistical power, the finalized model was chosen to predict PIA in the present study.
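In code, this comparison amounts to refitting without the three excluded predictors and checking the fit indices. The sketch below continues the hypothetical data layout from Section III.4; the column names and CSV file remain stand-ins rather than the study's actual materials.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("ratings_march2009.csv")  # hypothetical file, as before

    def fit(formula):
        # REML fit with a random intercept per rater, as in the earlier sketch
        return smf.mixedlm(formula, data=df, groups=df["rater"]).fit(reml=True)

    full = fit("pia ~ lang_options + option_plaus + time_spent + info_unit"
               " + topic_famil + complexity + referents + clue_position")
    finalized = fit("pia ~ lang_options + option_plaus + time_spent"
                    " + info_unit + clue_position")

    # Compare fits with footnote 8's AIC: -2(log-likelihood) + 2p
    for name, res in [("full", full), ("finalized", finalized)]:
        print(name, "AIC =", -2 * res.llf + 2 * len(res.params))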

When the standardized solution indicating the degree of standardized relationship (i.e., -1.0 to 1.0) between the predictors and the dependent variable (i.e., PIA) in the finalized model was checked, language of options proved to be the most powerful variable for predicting PIA, with a standardized coefficient of .663. With regard to the other predictors, PIA was affected by the variables in the order of time spent to solve item (β = .312), information-processing unit (β = .171), position of clues (β = .169), and incorrect option plausibility (β = .109). Lastly, it was found that the association between the model-driven PIA and the observed PIA (r = .712) was higher than the correlation between the raters' expected PIA and the observed PIA (r = .560) (refer back to Table 7). This suggests that the systematic approach via the item difficulty prediction model is more reliable than predicting difficulty through the raters' holistic judgment. With the finalized model, the equation for the item difficulty prediction model can be written as follows, where Ŷ indicates the predicted PIA.

Ŷ = 2.551 + 15.994(Language of Options) + 1.261(Incorrect Option Plausibility)
    + 4.132(Time Spent) + 1.577(Information Unit) + 1.815(Position of Clues)    (1)
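Equation 1 transcribes directly into a small helper; the example codings at the bottom are hypothetical, chosen only to show the scale of the output.

    def predict_pia(lang_options, option_plaus, time_spent, info_unit, clue_position):
        # Finalized model (equation 1): inputs are the coded scales, where
        # language of options is 1 (Korean) or 2 (English) and the rest are
        # 1-4 Likert codings; the output is the predicted PIA in percent.
        return (2.551
                + 15.994 * lang_options
                + 1.261 * option_plaus
                + 4.132 * time_spent
                + 1.577 * info_unit
                + 1.815 * clue_position)

    # An English-option item coded mid-range on the other predictors:
    print(round(predict_pia(2, 2, 2, 3, 3), 1))  # -> 55.5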

3. Evaluation of the Validity of the Item Difficulty Prediction Model

To evaluate the validity of the item difficulty prediction model, the PIAs of the items of the July 2009 preliminary CSAT were regressed onto the difficulty prediction model (equation 1). For this analysis, we again asked raters to rate the scales of the predictors. The model-driven predicted PIA and the observed PIA were again highly correlated (r = .683). As such, the PIA obtained from the model showed better accuracy than the raters' expectation; the relationship between the raters' expected prediction and the observed PIA was as low as .387. As presented in Table 8, with this strong association, the overall percentage of incorrect answers (PIA) obtained from the model (51.013) was very close to the observed PIA (52.204), so the model produced in the present study can be considered reliable for predicting item difficulty for CSAT-type exams.

TABLE 8
Observed, Predicted, and Raters' Expected PIA (n₂ = 120)

       Observed PIA  Predicted PIA  Raters' Expected PIA
Mean   52.204        51.013         39.2
SD     16.218        10.890         7.026
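A cross-validation check of this kind reduces to a correlation and a mean comparison. The sketch below shows the computation with stand-in numbers rather than the study's data.

    import numpy as np

    def validate(predicted: np.ndarray, observed: np.ndarray) -> None:
        # Pearson r between model-driven and observed PIA, plus the overall
        # mean gap and the per-item differences inspected in Table 9.
        r = np.corrcoef(predicted, observed)[0, 1]
        print("r =", round(r, 3))
        print("mean observed - mean predicted =",
              round(observed.mean() - predicted.mean(), 3))
        print("per-item differences:", np.round(observed - predicted, 2))

    # Stand-in values for a handful of items (not the paper's data):
    pred = np.array([50.1, 58.0, 45.5, 66.7, 39.8])
    obs = np.array([52.0, 61.5, 40.2, 70.3, 41.1])
    validate(pred, obs)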

As a final step of the analysis, the study examined the differences between the observed PIA and the predicted PIA, as well as between the observed PIA and the raters' expected PIA, for each item.

TABLE 9
Difference in Percentage of Incorrect Answers (PIA) for Items

Item No.  Observed PIA - Predicted PIA  Observed PIA - Raters' Expected PIA
18        14.06                         12.5
19        -12.29                        7.1
20        17.26                         21.5
21        2.12                          14.4
22        27.65                         35.1
24        -16.27                        3.6
25        5.18                          21.7
26        23.93                         38
27        -6.93                         9
28        0.49                          15.2
29        -1.86                         8.9
30        -9.42                         5.4
31        -7.75                         10.7
32        -7.24                         -6.8
33        1.52                          17.6
34        -9.94                         -3.1
35        3.11                          8.3
36        -3.04                         2.4
37        -9.54                         -4.6

As seen in Table 9, the differences between the observed PIA and the predicted PIA tended to be smaller than those between the observed PIA and the raters' expected PIA, indicating again that our difficulty prediction model performed better than the holistic judgment of the raters. In fact, differences of over 20% were found between the observed and predicted PIA for items 22 and 26, but these large differences were due to the extremely low observed PIAs of the two items (8.50% and 14%, respectively).

Although the regression approach was used to find the best predicting line in a way that reduces the differences between observed and predicted points, a shortcoming of the approach is that a single regression line may not be effective for predicting very high and/or very low scores. However, this is expected to have only a minor effect on predicting the overall PIA, as seen in the difference between the mean observed PIA and the mean predicted PIA (see Table 9). In this regard, the model is considered effective for implementation when equating the overall PIA across test forms, such as for the English section of the CSAT.

V. CONCLUSION

1. Main Research Findings

The present study set out to produce an item difficulty prediction model that could be applied to test forms such as the CSAT. The study was conducted by identifying item difficulty predictor variables from previous research, and by validating the candidate predictors via questionnaires in which highly experienced teacher-raters were asked to analyze reading items from the English reading subset of the preliminary CSAT administered in March 2009. Using a multiple regression technique and restricted maximum likelihood estimation, an item difficulty prediction model was generated. The hypothetical model was then tested on a subsequent version of the test administered in July 2009 in order to check the validity and applicability of the prediction model. As a result, the study as a whole produced a model with strong predictability, showing correlations of .712/.683 between the model-driven PIA and the observed PIA. Given the nonsignificant difference between the observed PIA and the predicted PIA, the model can also be claimed to be effective for predicting the overall mean scores of different sets of the English section of the CSAT, and of tests similar in form. Through the use of the difficulty prediction model, we also expect item writers to be able to take control of overall mean scores when trying to equate multiple test forms for similar mean difficulties. However, the model extrapolated here would be most effective for predicting overall difficulty rather than the difficulty of individual test items of the CSAT type (i.e., multiple-choice formats) (see below for limitations).

In the model, language of options, incorrect option plausibility, time spent to solve item, information-processing unit, and position of clues turned out to be the statistically significant predictors of difficulty. In fact, previous studies (Chang, 2004; Drum et al., 1981) have also corroborated the importance of incorrect option plausibility for the prediction of difficulty. Ozuru et al. (2008) have indicated that the quality of distractor options in multiple-choice tests is an important attribute of assessment, although the variable itself is associated more strongly with general test-taking skills than with reading comprehension skills. This also indicates that tests should try to extensively measure the many different subcomponents underlying reading comprehension across a variety of reading circumstances, such as through a variety of passages with varying ranges of difficulty and different types of questions, most of which cannot be answered by simply eliminating distractors.

2. Limitations and Recommendations for Research

From a statistical perspective, a possible problem of the study is that, since the influence of one predictor variable, language of options (standardized coefficient = .663), was unusually large, research needs to be conducted on larger samples of test items for the prediction of difficulty and for model building. Also, since this model is based on Classical Test Theory (CTT), its purpose is to predict the overall PIA, not to obtain psychometric information on each item. On this point, the model would be limited in its use.

In the same vein, the present study also recognizes the limitation that it is difficult to extrapolate an item difficulty prediction model for each item of a test, but this would remain a problem even if Item Response Theory could be utilized in the context of our study. In fact, although the use of actual data from the high-stakes CSAT would have been more convincing for cross-validation of the prediction model, it must be understood that the Korean Ministry of Education, Science and Technology currently treats that information as confidential and does not allow its public use.

The situation is exacerbated by the fact that items from previous national tests, once released to the public, may not be reused, which makes it difficult for item developers to design tests that maintain similar difficulty levels across tests. Nevertheless, attempts to arrive at an item difficulty prediction model, such as that of the present study, are expected to yield a number of practical benefits. Test developers would be better able to design an item pool in accordance with special needs; content specifications for tests could be more precisely defined; and the items written for a particular text would be better adapted to the difficulty level of that text.

Another limitation of this study, which may be addressed by future research, is that the item difficulty predictors in the present study were only investigated through the evaluations of expert teacher-raters. Our prediction of test difficulty may need to be triangulated with ratings from a reliable sample of current high school students so as to investigate difficulty predictors that may be significantly meaningful from the students' perspective (Lee, 2002). In fact, the predictors that may have an influence on the outcome of tests such as the CSAT are currently being investigated by one of the researchers of this study, based on the actual (observed) percentages of correct answers of individual English CSAT items, the stanine levels (i.e., 1-9)¹⁴ of the students, and the students' use of study materials and study styles.

Another area for potential research is that, since the present study has been conducted only on the reading section of the CSAT, further investigation and model building are needed for the other language skills, such as listening, speaking, and writing. For example, studies on the other receptive skill (i.e., listening) have tried to predict the difficulty of TOEFL listening comprehension items (Freedle & Kostin, 1996) or to explore the relationship between a set of item characteristics and the difficulty of TOEFL dialogue items (Kostin, 2004; Nissan, DeVincenzi, & Tang, 1996). Also, with interest in and development of the nationally promoted standardized English proficiency test often referred to as the 'Korean TOEFL,' further item difficulty prediction models may need to be built, for instance, by using the predictor variables and models built in the present study as a foundation, since there is a scarcity of studies on predicting item difficulty in EFL contexts and for nationally administered English exams in the Korean context.

Last but not least, a word of guidance may be provided for forthcoming researchers in the field of item difficulty. For studies related to item difficulty, it is crucial to obtain the opinions of expert judges/teachers who are experienced in item writing and/or in dealing with the target students to be examined on the relevant tests. Also, in order to maintain reliability in the design of questionnaires for rating item difficulty predictors, it is equally important to provide objective criteria for the scale ratings (such as by providing sample items). For instance, what may be considered a 'complex' sentence by one rater may be considered 'simple' by another.

For future research, the authors see much value in conducting this type of study, particularly in the Korean context. When the use of common (anchor) items is not allowed for a high-stakes test such as the CSAT, an alternative solution may be achieved through the use of an item difficulty prediction model. The model would enable item writers and test administrators to construct different tests while maintaining similar means across tests.

14 A stanine is a type of scaled score used in many norm-referenced standardized tests, such as the CSAT. There are nine stanine units (the term is short for "standard nine-point scale"), ranging from 9 to 1. Typically, stanine scores are interpreted as above average (9, 8, 7), average (6, 5, 4), and below average (3, 2, 1).
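
For readers who want the conversion in operational terms, the following sketch implements the conventional stanine bands (4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of examinees); the cut points are the standard ones, not CSAT-specific values.

```python
# A sketch of the conventional stanine conversion: percentile ranks cut
# into nine bands of 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4%.
import bisect

# Upper percentile bounds of stanines 1 through 8 (stanine 9 covers the rest).
CUT_POINTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank: float) -> int:
    """Map a percentile rank (0-100) to a stanine (1 = lowest, 9 = highest)."""
    return bisect.bisect_right(CUT_POINTS, percentile_rank) + 1

print(stanine(50))  # 5 (average)
print(stanine(97))  # 9 (well above average)
```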


REFERENCES

Abrahamsen, E. P., & Shelton, K. C. (1989). Reading comprehension in adolescents with learning disabilities: Semantic and syntactic effects. Journal of Learning Disabilities, 22(10), 569-572.

Chang, K.-S. (2004). A model of predicting item difficulty of the reading test of College Scholastic Ability Test. Foreign Languages Education, 11(1), 111-130.

Choi, I.-C. (2008). The impact of EFL testing on EFL education in Korea. Language Testing, 25(1), 39-62.

Davey, B. (1988). Factors affecting the difficulty of reading comprehension items for successful and unsuccessful readers. Journal of Experimental Education, 56(2), 67-76.

Drum, P. A., Calfee, R. C., & Cook, L. K. (1981). The effects of sentence structure variables on performance in reading comprehension tests. Reading Research Quarterly, 16, 486-514.

Embretson, S. E., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11(2), 175-193.

Freedle, R., Fine, J., & Fellbaum, C. (1981, August). Predictors of good and bad essays. Paper presented at the annual Georgetown University Roundtable on languages and linguistics. Washington, DC.

Freedle, R., & Kostin, I. (1991). The prediction of SAT reading comprehension item difficulty for expository prose passages (P/J 969-60). Princeton, NJ: Educational Testing Service.

Freedle, R., & Kostin, I. (1992). The prediction of GRE reading comprehension item difficulty for expository prose passages (ETS Research Report RR-91-59). Princeton, NJ: Educational Testing Service.

Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading item difficulty: implications for construct validity. Language Testing, 10(2), 133-167.

Freedle, R., & Kostin, I. (1996). The prediction of TOEFL listening comprehension item difficulty for minitalk passages: Implications for construct validity (ETS RR-96-29). Princeton, NJ: Educational Testing Service.

Graves, M. F. (1986). Vocabulary learning and instruction. Review of Research in Education, 13, 49-89.

Hare, V., Rabinowitz, M., & Schieble, K. (1989). Text effects on main idea comprehension. Reading Research Quarterly, 24, 72-88.

Hites, R. W. (1950). The relation of readability and format to retention in communication. Unpublished doctoral dissertation, Ohio State University.

Hudson, T. (2007). Teaching second language reading. Oxford: Oxford University Press.


Jin, K.-A., & Park, C. (2004). The prediction of English item difficulty in college scholastic ability test. English Teaching, 59(1), 267-278.

Kieras, D. E. (1985). Thematic processes in prose. In B. Britton & J. Black (Eds.), Understanding expository text (pp. 11-64). Hillsdale, NJ: Erlbaum.

Kim, S.-Y. (2010). EBS Suneung yunjae chool je [CSAT items to be linked to EBS materials]. Chosun Ilbo. Retrieved May 20, 2010, from the World Wide Web: http://news.chosun.com/site/data/html_dir/2010/03/31/2010033101639.html.

Klare, G. R. (1974-1975). Assessing readability. Reading Research Quarterly, 10(1), 62-102.

Klare, G. R. (1984). Readability. In P. D. Pearson (Ed.), Handbook of reading research (pp. 681-744). New York: Longman.

Kostin, I. (2004). Exploring item characteristics that are related to the difficulty of TOEFL dialogue items (ETS RR-04-11). Princeton, NJ: Educational Testing Service.

Lee, J.-W. (2002). An exploratory study on reading comprehension test-taking processes and strategies in the EFL context. English Teaching, 57(4), 177-195.

Leow, R. (1993). To simplify or not to simplify: A look at intake. Studies in Second Language Acquisition, 15(3), 333-356.

Mehrpour, S., & Riazi, A. (2004). The impact of text length on EFL students' reading comprehension. Asian EFL Journal, 6(3). Retrieved May 20, 2010, from the World Wide Web: http://www.asian-efl-journal.com/Sept_04_sm_ar.pdf.

Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.

Nissan, S., DeVincenzi, F., & Tang, K. L. (1996). An analysis of factors affecting the difficulty of dialogue items in TOEFL listening comprehension (ETS RR-95-37). Princeton, NJ: Educational Testing Service.

Ozuru, Y., Rowe, M., O'Reilly, T., & McNamara, D. S. (2008). Where's the difficulty in standardized reading tests: The passage or the question? Behavior Research Methods, 40(4), 1001-1015.

Paivio, A. (1986). Mental representations. New York: Oxford University Press.

Perkins, K., Gupta, L., & Tammana, R. (1995). Predicting item difficulty in a reading comprehension test with an artificial neural network. Language Testing, 12(1), 34-53.

Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary knowledge in reading comprehension. The Canadian Modern Language Review, 56(2), 282-308.

Qian, D. D. (2002). Investigating the relationship between vocabulary knowledge and academic reading performance: An assessment perspective. Language Learning, 52(3), 513-536.

Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.

Yano, Y., Long, M. H., & Ross, S. (1994). The effect of simplified and elaborated texts on foreign language reading comprehension. Language Learning, 44(2), 189-219.


APPENDIX A
Questionnaire for Evaluation of Item Difficulty

Survey on the Item Difficulty of the Foreign Language (English) Section (Focusing on Reading)

Greetings. This survey is intended to gather the opinions of in-service teachers on the variables affecting item difficulty in the Foreign Language (English) section. Referring to the test booklet of the March 2009 National United Achievement Test for third-year high school students, please enter the corresponding number on the response sheet. Your responses will be put to valuable use in predicting the difficulty of items in the Foreign Language (English) section. Thank you very much for taking the time to complete this survey despite your busy schedule.

1. Expected percentage of correct answers: Enter a whole number in units of 1%; omit the % sign on the response sheet.

2. Language of the answer options
Scale: 1 = Korean, 2 = English

3. Attractiveness of the distractors: the number of distractors attractive enough to be mistaken for the correct answer
Scale: 1 = 1-2 distractors, 2 = 3 distractors, 3 = 4 distractors, 4 = 5 distractors

4. Time required to solve the item
Scale: 1 = 60 seconds or less (very short), 2 = about 70 seconds (short), 3 = about 80 seconds (somewhat long), 4 = 90 seconds or more (very long)

5. Unit of information processing: the unit of information that must be processed to solve the item
Scale: 1 = understanding of words or phrases, 2 = understanding of two or three sentences*, 3 = literal understanding of the whole passage, 4 = synthetic, inferential understanding of the whole passage
*Understanding of two or three sentences: item 43 of the 2009 CSAT

6. Unfamiliarity of the passage content or topic: the degree to which the content or subject matter of the passage feels unfamiliar
Scale: 1 = very familiar, 2 = familiar, 3 = unfamiliar, 4 = very unfamiliar*
*Very unfamiliar: item 43 of the 2009 CSAT (philosophical, abstract content)


7. Length of the passage [refer to the word count marked on the passage]
Scale: 1 = 110 words or fewer, 2 = 111-130 words, 3 = 131-150 words, 4 = 151 words or more*
*151 words or more: item 29 of the 2009 CSAT (158 words)

8. Complexity of sentence structure: the degree of complexity of the individual sentences making up the passage
Scale: 1 = very simple and ordinary, 2 = ordinary, 3 = complex, 4 = very complex*
*Very complex: item 31 of the 2009 CSAT

9. Frequency of referential expressions (per passage of about 100 words)
Scale: 1 = rarely used (0-1), 2 = used a little (2-3), 3 = used fairly often (4-5), 4 = used very often (6 or more)

10. Unfamiliarity of the vocabulary: the unfamiliarity of the words used in the passage
Scale: 1 = very familiar, 2 = familiar, 3 = unfamiliar*, 4 = very unfamiliar
*Unfamiliar: item 37 of the 2009 CSAT

11. Location of the clue: the location within the passage of the information needed to solve the item
[*Excluding the items on content match/mismatch (2 items), charts (1 item), vocabulary (2 items), and grammar (2 items)]
Scale: 1 = beginning of the paragraph, 2 = middle of the paragraph, 3 = end of the paragraph*, 4 = implied within the paragraph
*End of the paragraph: item 25 of the 2009 CSAT
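
As an example of how such a rubric can be operationalized, the sketch below maps a raw word count onto the 1-4 passage-length scale of predictor 7; the boundaries are taken directly from the questionnaire.

```python
# A sketch of operationalizing predictor 7: mapping a passage's raw word
# count onto the questionnaire's 1-4 passage-length scale.

def passage_length_scale(word_count: int) -> int:
    """Return the 1-4 scale value for a passage's word count."""
    if word_count <= 110:
        return 1
    if word_count <= 130:
        return 2
    if word_count <= 150:
        return 3
    return 4

# Item 29 of the 2009 CSAT had a 158-word passage, the example given above.
print(passage_length_scale(158))  # 4
```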

APPENDIX B
An Item from the 2009 College Scholastic Ability Test

31. Which of the following most appropriately fills blanks (A) and (B) in the passage below?

Sheets of paper exist almost entirely for the purpose of carrying information, so we tend to think of them as neutral objects. We rarely interpret marks on paper as references to the paper itself. (A) , when we see the text, characters, and images on artifacts that serve other purposes, we generally interpret these marks as labels that do refer to their carriers. Natural objects do not come with labels, of course, but these days, most physical artifacts do. (B) , their designers have chosen to shift part of the burden of communication from the form and materials of the artifact itself to lightweight surface symbols. So, for example, a designer of door handles might not worry about communicating their functions through their shapes, but might simply mark them 'push' and 'pull.'

*artifact: 인공물

(A)          (B)
① However …… Otherwise
② Likewise …… In contrast
③ However …… That is
④ Besides …… In contrast
⑤ Besides …… That is

Applicable levels: secondary education, tertiary education, general education
Key words: item difficulty prediction model, multiple-choice, reading comprehension, College Scholastic Ability Test

Yuah Vicky Chon
College of Education (English Education), Hanyang University
17 Haengdang-dong, Seongdong-gu
Seoul 133-791
Office: 02-2220-1144
Email: [email protected]

Shin, Tacksoo
College of Social Science (Youth Education and Leadership), Myongji University
50-3 Namgajwa-dong, Seodaemun-gu
Seoul 120-728
Office: 02-300-0621
Email: [email protected]

Received in September 2010
Reviewed in October 2010
Revised version received in December 2010