
RESEARCH REPORT
April 2003
RR-03-11

Validating LanguEdge™ Courseware Scores Against Faculty Ratings and Student Self-assessments

Donald E. Powers, Carsten Roever, Kristin L. Huff, and Catherine S. Trapani

Research & Development Division, Educational Testing Service, Princeton, NJ 08541


Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

Research Publications Office
Mail Stop 10-R
Educational Testing Service
Princeton, NJ 08541


Abstract

LanguEdge™ Courseware is a software tool designed to help teachers of English as a second language (ESL) build and assess the communicative skills of their students. The purpose of this study was to generate information to help LanguEdge Courseware users better understand the meaning (or validity) of the assessment scores based on the LanguEdge Courseware. Specifically, the objective was to describe, for each of the four sections of the LanguEdge assessment, relevant characteristics of test takers at various test score levels. To accomplish this objective, we gathered data that represent two different perspectives—those of instructors and those of students themselves.

Approximately 3,000 students each took one of two parallel forms of the LanguEdge assessment at domestic and international testing sites. Participants also completed a number of self-assessment questions about their English language skills. In addition, for some study participants, instructors rated selected language skills.

LanguEdge test scores related moderately (correlations mostly in the .30s and .40s) with student self-assessments. Of the four LanguEdge tests, Listening exhibited the strongest relationships to self-assessments; Speaking, the next strongest; Reading, the next; and Writing, the least.

The correlations of faculty ratings with each of the LanguEdge section test scores were generally in the .40s, with some reaching the .50s. The correlations between the various student self-assessment scales and faculty ratings were modest, mostly in the .30s. These correlations suggest that students and faculty had different perspectives on students’ English language skills.


As isolated entities, summary test scores, even when accompanied by normative data, are not especially informative about what test takers know and can do. In an effort to make test scores more useful, some testing programs—for example, the National Assessment of Educational Progress (NAEP)—have implemented relatively sophisticated reporting procedures in order to facilitate test score interpretations. One such effort, generally known as proficiency scaling, is usually but not always based on item response theory (IRT) methods (Beaton & Allen, 1992) and entails procedures such as the following. Several ability levels are selected on an overall ability/proficiency score scale. For each of these levels, individual items are selected such that, at a given level of ability, examinees have a specified probability (say, 80%) of answering each item correctly. At lower levels of ability, however, examinees have a significantly lower probability of answering each of these items correctly, but a high probability of answering some other set of items correctly. Experts then judge the items that examinees correctly answer at each level in order to characterize examinee proficiency at various score points (see, for example, Mullis & Jenkins, 1988; Beaton & Allen, 1992).
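To make the item-selection step concrete, here is a minimal sketch. It assumes a two-parameter logistic (2PL) item characteristic curve and invented item parameters and ability levels; the 80% threshold mirrors the example above. This is an illustration of the idea only, not NAEP's or ETS's actual procedure.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that an examinee of ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Hypothetical item parameters: (discrimination, difficulty).
ITEMS = {"item_1": (1.2, -1.0), "item_2": (0.9, 0.0), "item_3": (1.1, 1.2)}

def anchor_items(theta_level: float, threshold: float = 0.80) -> list[str]:
    """Items answered correctly with probability >= threshold at theta_level;
    these are the items experts would review to characterize that level."""
    return [name for name, (a, b) in ITEMS.items()
            if p_correct(theta_level, a, b) >= threshold]

for level in (-1.0, 0.0, 1.0, 2.0):
    print(f"theta = {level:+.1f}: {anchor_items(level)}")
```

At higher ability levels, more items clear the threshold; that progression is what the expert judges then interpret to describe proficiency at each score point.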

The resulting scales have a number of attractive features. They are, however, not entirely problem-free. For example, the proficiencies that underlie success on test items at various score levels are not always readily inferred, especially when the domains being tested are either multidimensional or ill defined. Such attempts can give rise to questionable inferences about examinee proficiency (Forsyth, 1991), possibly because test users do not adequately understand the score reports (Hambleton & Slater, 1994).

Another noteworthy aspect of proficiency scaling is that it is internally focused. That is, proficiency scales are given meaning by referencing performance on the test items that the scales comprise. Because score levels are interpreted according to the items that determine scores, the method may appear to be circular. At the least, the method has a bootstrapping nature insofar as it makes use of existing resources (i.e., test items) to improve an existing state (i.e., test score interpretations).

In contrast, the effort undertaken here approached test score meaning from an external perspective. The aim was to relate test score levels to nontest, external indicators of examinees’ language proficiency. The test scores of interest were those based on the LanguEdge Courseware software.


Overview

LanguEdge Courseware (http://www.toefl.org/languedge.html) is a professional development tool designed to help teachers of English as a second language (ESL) build and assess the communicative skills of their students. The courseware package consists of interactive software (two full-length tests of reading, writing, speaking, and listening) and supporting materials (a teacher's guide, a scoring handbook, and a score interpretation guide). The package is based on the likely test format of a future version of the Test of English as a Foreign Language™ (TOEFL®), which will employ tasks that integrate speaking and writing with reading and listening.

The purpose of this study was to try out procedures that might, eventually, prove useful for generating information to help LanguEdge Courseware users better understand the meaning (or validity) of LanguEdge test scores. Specifically, the objective was to describe, for each of the four sections of the test, relevant characteristics of test takers at various test score levels, thereby helping to establish the validity of test score distinctions among test takers. To accomplish this objective, we gathered data that represent two different perspectives—those of instructors and those of students themselves. The collection of multiple sources of information is consistent with commonly accepted standards for test validation (Messick, 1989; American Educational Research Association, 1999).

Instructors’ assessments of students’ English language skills were gathered because teachers seem well positioned to judge the academic skills of their students. The (less obvious) rationale for collecting student self-assessments was as follows. Self-assessments of various sorts—self-reports, checklists, self-testing, mutual peer assessment, diary-keeping, log books, behaviorally anchored questionnaires, global proficiency scales, and “can-do” statements (Oscarson, 1997)—have proven to be useful indicators in a variety of evaluation contexts, especially in the assessment of language skills. Upshur (1975), for instance, noted that language learners typically have a wider view of their successes and failures than do external evaluators. More generally, Shrauger and Osberg (1981) concluded that there is substantial evidence, both empirical and conceptual, that self-assessors frequently have both the information and the motivation to make effective judgments about themselves.


Methods

Sample Selection

In the spring of 2002, approximately 3,000 candidates were recruited both internationally and domestically (United States and Canada) to participate in a field study of the LanguEdge Courseware. Each of these students took one of two parallel forms of the LanguEdge assessment at one of 18 domestic and 12 international test sites. After deleting records for test takers whose motivation was questionable, usable test data were available for 2,703 test takers.

The field study sample was generally representative of the TOEFL population in terms of native language. A majority (60%) of field study participants came from the following native language groups: Chinese (18%), Spanish (13%), Arabic (7%), Korean (7%), Japanese (5%), French (4%), Indonesian (3%), and Latvian (3%). These groups constitute approximately 61% of the TOEFL test-taking population and are represented there in the following proportions: 23%, 5%, 5%, 12%, 13%, 2%, 1%, and <1%, respectively.

The field study sample was also generally representative of the TOEFL population in terms of level of English language proficiency as measured by the paper-and-pencil TOEFL. Both the domestic and international field study subsamples performed slightly better on each section of the TOEFL than did their domestic and international counterparts who took the operational TOEFL. The mean scores on the Listening, Structure, and Reading sections, which range from 20 to 67 (or 68), were, respectively, 53.7, 51.8, and 52.9 for the study sample. The corresponding mean scores for the TOEFL operational test population were 52.6, 49.3, and 51.6 (domestic test takers) and 50.5, 50.7, and 52.6 (international test takers). The differences between the study sample and the operational testing population were relatively small, ranging from approximately .03 to .34 standard deviation units on each of the three scales.

Procedure/Instruments

Each study participant took the LanguEdge assessment along with a retired paper-based TOEFL test (TOEFL PPT). LanguEdge has four sections, corresponding to the four modalities of communication: Listening, Reading, Speaking, and Writing. The LanguEdge assessment is composed of several different item types, including (a) conventional four-choice, single-correct-answer multiple-choice items, (b) multiple-choice items requiring one or more correct responses, (c) extended written response (essay) items, and (d) spoken response items. Productive response items (i.e., the Speaking and Writing items) require evaluation by trained human raters and are worth 1 to 5 points each.

With respect to scoring, raw score totals for Listening and Reading are calculated by summing the number of points awarded for each item answered correctly. Classical equipercentile equating methods were used to equate Listening and Reading scores across the two forms of the assessment. In addition to being equated, Listening and Reading scores were linearly scaled to have a minimum value of 1 and a maximum value of 25.
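The report does not spell out these computations, so the following is only a bare-bones sketch of classical equipercentile equating followed by the linear rescaling, with invented raw-score distributions and without the presmoothing a production equating program would apply.

```python
import numpy as np

def equipercentile_equate(scores_x, scores_y, x_value):
    """Map a form-X raw score to the form-Y raw score that has the same
    percentile rank (unsmoothed classical equipercentile equating)."""
    scores_x = np.sort(scores_x)
    pct = np.searchsorted(scores_x, x_value, side="right") / len(scores_x)
    return float(np.quantile(scores_y, pct))

def to_reported_scale(raw, raw_min, raw_max, lo=1, hi=25):
    """Linearly rescale an equated raw score onto the reported 1-25 scale."""
    return lo + (raw - raw_min) * (hi - lo) / (raw_max - raw_min)

# Invented raw Listening score distributions for the two parallel forms.
rng = np.random.default_rng(seed=0)
form_x = rng.binomial(n=40, p=0.60, size=1000)
form_y = rng.binomial(n=40, p=0.55, size=1000)

y_equiv = equipercentile_equate(form_x, form_y, x_value=25)
print(round(to_reported_scale(y_equiv, raw_min=0, raw_max=40), 1))
```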

There are five Speaking tasks and three Writing tasks in each form of LanguEdge. Several of these tasks are designed to reflect the integrated nature of communicative language ability. One of the Speaking tasks is integrated with Listening (Listening/Speaking) and another with Reading (Reading/Speaking). These tasks require examinees to read or listen to a stimulus and then speak about it. Similarly, there are two integrated Writing tasks that are administered as part of the Listening and Reading sections (i.e., Listening/Writing and Reading/Writing). The remaining tasks (three Speaking and one Writing) are referred to as independent tasks, as responses do not require examinees to read or listen to an extended verbal stimulus. Scores on the five Speaking tasks and on the three Writing tasks make up the Speaking and Writing total scores, respectively. Scores for these sections of the assessment have not been scaled or equated. Instead, scores are reported as the average of scores on the individual tasks.

Before they were tested, participants were also asked to complete a number of questions about their English language skills. Several kinds of self-assessment questions were developed. Two sets of can-do type statements were devised on the basis of reviews of existing statements (e.g., Tannenbaum, Rosenfeld, Breyer, & Wilson, 2003) and with regard to the claims being made for LanguEdge. Only statements that concerned academically related language competencies, not more general language skills, were written. One set (19 items) asked test takers to rate (on a 5-point scale ranging from “extremely well” to “not at all”) their ability to perform each of several language tasks. The other set (20 items) asked test takers to indicate the extent to which they agreed or disagreed (on a 5-point scale ranging from “completely agree” to “completely disagree”) with each of several other can-do statements. For each set, approximately equal numbers of questions addressed each of the four language modalities (Listening, Reading, Speaking, and Writing).


Test takers were also asked to compare (on a 5-point scale from “a lot higher” to “a lot lower”) their English language ability in each of the four language modalities with that of other students—both in classes they were taking to learn English and also, if applicable, in subject classes (biology or business, for example) in which the instruction was in English. Test takers were also asked to provide a rating (on a 5-point scale from “extremely good” to “poor”) of their overall English language ability.

Finally, test takers who had taken some or all of their classes in English were asked to indicate (on a 5-point scale ranging from “not at all difficult” to “extremely difficult”) how difficult it was for them to learn from courses because of problems with reading English or with understanding spoken English. They were also asked to indicate how much difficulty they had encountered when attempting to demonstrate what they had learned because of problems with speaking English or with writing English.

In addition to completing self-assessment questions about their language skills, study participants who tested at U.S. sites (but not international sites) were also asked to contact two people who had taught them during the past year and to give each one the Faculty Assessment Form. Study participants were asked to contact only people who had had some opportunity to observe their English language skills. Participants were told that, after faculty completed the forms, faculty would mail the envelopes directly to us.

The instructions that accompanied the Faculty Assessment Form asked faculty to provide their opinions about the student’s English language skills. Specifically, instructors were told that Educational Testing Service was developing a new TOEFL to facilitate the admission and placement of nonnative speakers of English in academic programs in North America and that, in conjunction with this effort, we were gathering a variety of information about the students who had taken the first version of the test in order to establish more firmly the meaning of scores on the new assessment. Instructors were also told that they had been asked to provide information because they had had relevant contact with the student who had contacted them. Finally, they were informed that their assessment would be treated confidentially and would not be shared with anyone, including the student.

The Faculty Assessment Form asked faculty to indicate (on a 5-point scale ranging from “not successful at all” to “extremely successful”) how successful the student had been at

(1) understanding lectures, discussions, and oral instructions
(2) understanding (a) the main ideas in reading assignments and (b) written instructions for exams/assignments
(3) making him/herself understood by you and other students during classroom and other discussions
(4) expressing ideas in writing and responding to assigned topics.

Faculty were also asked to compare (on a 7-point scale ranging from “well below average” to “well above average”) the student’s overall command of English with that of other nonnative English students they had taught.

For each question, instructors were allowed to omit their rating, if appropriate, and to respond instead that they had not had adequate opportunity to observe the student’s language skills. Instructors were also asked to indicate their current position or title, the approximate number of nonnative speakers of English they had taught at their current and previous academic institutions, and just how much opportunity they had had to observe the student’s facility with the English language (little if any, some, a moderate amount, or a substantial amount).

A final item on the form requested the faculty member’s telephone number and e-mail address “for verification purposes only.” This item was included only to discourage study participants from completing the form themselves.

Results

Student Self-assessments

It was important to first establish the extent to which test takers were consistent in reporting about their own language skills. For this purpose, 4-, 5-, or 6-item scales were formed by summing responses to individual items having the same response format (e.g., “how well” or “agree”) for each language modality. Table 1 shows the number of items that comprised each scale, as well as the internal consistency reliability estimate (coefficient alpha) for each of the various scales. As is clear, each of the various scales exhibits reasonably high internal consistency, ranging from a low of .81 (for four items asking students to compare their English language skills with those of other students in English language classes) to .95 for the five-item scale asking students to rate how well they could perform various reading tasks.
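Coefficient alpha can be computed directly from a respondents-by-items score matrix, as in the short sketch below; the response matrix is invented and merely shows the computation behind the Table 1 estimates.

```python
import numpy as np

def coefficient_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix:
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)."""
    k = responses.shape[1]
    item_var_sum = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Invented 5-point responses from six students to a four-item scale.
scale = np.array([[4, 5, 4, 5],
                  [2, 2, 3, 2],
                  [3, 3, 3, 4],
                  [5, 5, 4, 5],
                  [1, 2, 2, 1],
                  [4, 4, 5, 4]])
print(f"alpha = {coefficient_alpha(scale):.2f}")
```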


Table 1
Reliability Estimates for Language Skill Self-assessments

Scale                             Number of items   Coefficient alpha
“How well” scales
  Listening                       5                 .93
  Reading                         5                 .95
  Speaking                        5                 .93
  Writing                         4                 .89
  Composite                       19                .97
“Agreement” scales
  Listening                       4                 .88
  Reading                         6                 .92
  Speaking                        5                 .89
  Writing                         5                 .91
Comparison scales
  Students in ESL classes         4                 .81
  Students in subject courses     4                 .88
Overall English ability           4                 .84
Difficulty with English           4                 .85

Note. Ns for scales range from 2,235 to 2,629 due to nonresponse to some questions.

The internal consistency reliability estimates for the LanguEdge test sections were .88, .89, .80, and .76 for the Listening, Reading, Speaking, and Writing sections, respectively. The intercorrelations among LanguEdge section scores ranged from .57 between Reading and Speaking to .76 between Listening and Reading. All other intercorrelations were in the mid to high .60s.

Table 2 shows the correlations of each of the various student self-assessment scales with performance on each section of LanguEdge. Generally, test scores related least strongly to the scales on which students were asked to compare their abilities to those of other students. They related most strongly, generally, to the various can-do scales (both those using a “how well” response format and those using an “agree” format). Of the four LanguEdge tests, Listening most often exhibited the strongest relationships to self-assessments; Speaking, the next strongest; Reading, the next; and Writing, the least.

Table 2
Correlations of Self-assessment Scales With LanguEdge Scores

Self-assessment scale            M      SD    Listening   Reading     Speaking    Writing
“How well” scales
  Listening                     12.8    4.1   .47 (.50)   .31 (.33)   .49 (.55)   .29 (.33)
  Reading                       12.4    4.0   .46 (.49)   .41 (.43)   .42 (.47)   .31 (.36)
  Speaking                      14.0    4.1   .33 (.35)   .18 (.19)   .43 (.48)   .19 (.22)
  Writing                       11.4    3.1   .36 (.38)   .26 (.28)   .41 (.46)   .26 (.30)
  Composite                     51.0   13.5   .46 (.49)   .32 (.34)   .48 (.54)   .29 (.33)
“Agreement” scales
  Listening                      8.4    2.9   .48 (.51)   .34 (.36)   .46 (.51)   .28 (.32)
  Reading                       12.9    4.3   .51 (.54)   .43 (.46)   .44 (.49)   .32 (.37)
  Speaking                      11.0    3.7   .41 (.44)   .28 (.30)   .44 (.49)   .26 (.30)
  Writing                       11.4    3.7   .40 (.43)   .31 (.33)   .40 (.45)   .28 (.32)
  Composite                     43.8   13.3   .49 (.52)   .37 (.39)   .48 (.54)   .31 (.36)
Comparison scales
  Students in ESL classes       10.5    2.7   .25 (.27)   .14 (.15)   .33 (.37)   .16 (.18)
  Students in subject courses   11.1    3.1   .16 (.17)   .07 (.07)   .21 (.23)   .04 (.05)
Overall English ability         11.2    3.0   .36 (.38)   .22 (.23)   .44 (.49)   .21 (.24)
Difficulty with English          8.1    2.9   .40 (.43)   .29 (.31)   .40 (.45)   .24 (.28)

Note. Ns range from 2,235 to 2,616 for Reading and Listening, from 818 to 952 for Speaking, and from 1,117 to 1,303 for Writing. The different Ns mainly reflect that not all responses could be scored in time to meet the schedule for data analysis. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.
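Judging from the note, the parenthesized values apply the standard correction for attenuation, correcting only for the unreliability of the LanguEdge score (our reconstruction; the report does not print the formula):

\[
r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx}}},
\]

where \(r_{xx}\) is the internal consistency reliability of the LanguEdge section score. For example, with the Listening reliability of .88 reported above, the observed correlation of .47 for the “how well” Listening scale becomes \(.47 / \sqrt{.88} \approx .50\), which matches the parenthesized entry.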

Faculty Ratings

Faculty returned ratings for 819 of the study participants. For 637 participants, two ratings were available. The sample for whom faculty ratings were returned had slightly lower LanguEdge scores on average but was reasonably representative of the total study sample in terms of the range of test performances.

Faculty who returned rating forms described their positions or titles as follows: faculty member (45%), teaching assistant (11%), ESL instructor (38%), and other (6%). Nearly all respondents reported having had an opportunity to observe the student’s facility with English—either some (17%), a moderate amount (40%), or a substantial amount (41%). (About 1% of the respondents said they had had little if any opportunity to observe the student’s English language skills, and so they were deleted from the analysis.) Respondents reported having taught various numbers of nonnative speakers of English at their current and previous academic institutions, with 6% having taught fewer than 10 such students, 25% from 10 to 100, and 70% more than 100.

A scale consisting of all four faculty ratings (one for each language modality) was highly internally consistent, exhibiting a coefficient alpha of .91. Table 3 shows the agreement statistics between pairs of faculty raters for each of the four ratings, plus those for a fifth, an overall rating of students’ language skills. As can be seen, the agreement rates are modest, indicating that instructors did not agree completely about the English language skills of the students they taught, possibly because of different perspectives. Rates of exact agreement ranged from 39% to 50%, and rates of agreement that were exact or within one point ranged from 74% to 94%. Correlations between pairs of faculty raters ranged from .47 to .52, and Cohen’s kappa ranged from .21 to .26. Weighted kappas ranged from .33 to .39. (Kappa values of .21 to .40 have been described by Landis and Koch [1977] as “fair.”)
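For readers who wish to reproduce this style of rater-agreement summary, the sketch below computes exact and adjacent agreement, the Pearson correlation, Cohen's kappa, and a linearly weighted kappa for one pair of raters. The ratings are invented, and the linear weighting is our assumption; the report does not say which weighting scheme it used.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Invented 5-point ratings of the same ten students by two faculty raters.
rater1 = np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3])
rater2 = np.array([3, 3, 2, 4, 4, 4, 2, 5, 5, 2])

exact = np.mean(rater1 == rater2)                 # exact agreement rate
adjacent = np.mean(np.abs(rater1 - rater2) <= 1)  # exact or within one point
r, _ = pearsonr(rater1, rater2)                   # correlation between raters
kappa = cohen_kappa_score(rater1, rater2)         # chance-corrected agreement
wkappa = cohen_kappa_score(rater1, rater2, weights="linear")

print(f"exact = {exact:.2f}, adjacent = {adjacent:.2f}, r = {r:.2f}, "
      f"kappa = {kappa:.2f}, weighted kappa = {wkappa:.2f}")
```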

Table 3
Agreement Statistics for Faculty Ratings

Faculty rating                                      Exact (%)   Exact or adjacent (%)    r     Kappa   Weighted kappa
In general, how successful has this student been:
  in understanding lectures, discussions, and
    oral instructions                               49.8        94.0                    .52    .26     .39
  at understanding (a) the main ideas in reading
    assignments and (b) written instructions for
    exams/assignments                               47.3        92.7                    .47    n.e.    n.e.
  at making him/herself understood by you and
    other students during classroom and other
    discussions                                     47.0        90.5                    .51    .25     .37
  at expressing ideas in writing and responding
    to assigned topics                              44.9        89.2                    .47    .21     .33
Compared to other nonnative English students you
  have taught, how is this student’s overall
  command of the English language?                  38.9        73.7                    .49    .21     .36

Note. N = 637 test takers for whom two faculty ratings were available. n.e. = not estimable.

Table 4 shows the correlations of faculty ratings (mean of two ratings when available) with each of the LanguEdge section test scores. With few exceptions, these correlations are in the .40s, with some reaching the .50s. The correlations between the various student self-assessment scales and faculty ratings were modest, ranging from .09 to .41, with a majority (65%) falling in the .30s. These correlations suggest that students and faculty had different perspectives on students’ English language skills.

Table 4
Correlations of Instructor Ratings With LanguEdge Scores

Faculty rating                                      Listening   Reading     Speaking    Writing
In general, how successful has this student been:
  in understanding lectures, discussions, and
    oral instructions                               .49 (.52)   .42 (.45)   .47 (.53)   .36 (.41)
  at understanding (a) the main ideas in reading
    assignments and (b) written instructions for
    exams/assignments                               .47 (.50)   .45 (.48)   .42 (.47)   .40 (.46)
  at making him/herself understood by you and
    other students during classroom and other
    discussions                                     .43 (.46)   .36 (.38)   .42 (.47)   .35 (.40)
  at expressing ideas in writing and responding
    to assigned topics                              .45 (.48)   .43 (.46)   .42 (.47)   .42 (.48)
Composite rating (sum of the four above)            .52 (.55)   .47 (.50)   .51 (.57)   .44 (.50)
Compared to other nonnative English students you
  have taught, how is this student’s overall
  command of the English language?                  .51 (.54)   .45 (.48)   .53 (.59)   .41 (.47)

Note. Ns range from 400 to 465 for Writing, from 260 to 303 for Speaking, and from 716 to 819 for Listening and Reading. All correlations are significant at the .001 level or beyond. Entries in parentheses have been corrected for attenuation due to unreliability of LanguEdge scores.

Characteristics of Test Takers at LanguEdge Score Levels

Tables 5-8 show, for each LanguEdge test section, the relationships between score level and both student self-assessments and instructor ratings. Table entries are percentages of either students or instructors who gave various responses to each question. For instance, the first line in Table 5 shows, by test takers’ score level, the percentages of instructors who judged that students at that score level had been more than moderately successful (i.e., very successful or extremely successful) in understanding lectures, etc. Each table contains only the assessments and ratings for the language modality matching the test section. For example, Table 5 shows that, for Listening scores, 32% of faculty participants felt that test takers who scored at the lowest level (1-5) had been more than moderately successful in understanding lectures, discussions, and oral instructions. On the other hand, students who scored at the highest level on the Listening test (21-25) were judged much more often (by 90% of faculty raters) as being more than moderately successful.

The corresponding faculty ratings for (a) Reading (success at understanding main ideas in reading assignments and written instructions for exams/assignments), (b) Speaking (success at making him/herself understood by faculty and students during classroom and other discussions), and (c) Writing (success in expressing ideas in writing and responding to assigned topics) are shown in Tables 6, 7, and 8, respectively.

Table 5
Key Descriptors of LanguEdge Learners by Listening Score Level

                                                         Test score level
Descriptor                                          1-5   6-10   11-15   16-20   21-25
Faculty (%)
  Judging that students had been more than
    moderately successful at understanding
    lectures, discussions, and oral instructions    32    42     60      78      90
  Who felt that students’ overall command of
    English was at least somewhat above average
    when compared with other nonnative students
    they had taught                                 13    22     41      68      77
Students (%)
  Who agreed that they could:
    • remember the most important points in a
      lecture                                       34    37     45      63      78
    • understand instructors’ directions about
      assignments and their due dates               43    60     75      89      95
    • recognize which points in a lecture are
      important and which are less so               33    41     57      72      84
    • relate information that they hear to what
      they know                                     29    42     63      76      88
  Who said they did not perform well at:
    • understanding the main ideas of lectures
      and conversations                             31    29     14       5       2
    • understanding important facts and details
      of lectures                                   36    33     19       9       5
    • understanding the relationships among
      ideas in a lecture                            36    32     18      11       5
    • understanding a speaker’s attitude or
      opinion                                       38    28     15       9       5
    • recognizing why a speaker is saying
      something                                     43    32     16      10       4
  Who felt their listening ability was lower
    than that of other students in ESL classes      29    22     14      10       5
  Who felt that problems understanding spoken
    English made learning difficult                 52    41     35      22      10

Table 6
Key Descriptors of LanguEdge Learners by Reading Score Level

                                                         Test score level
Descriptor                                          1-5   6-10   11-15   16-20   21-25
Faculty (%)
  Judging that students had been more than
    moderately successful at understanding the
    main ideas in reading assignments and
    written instructions for exams                  36    54     73      83      92
  Who felt that students’ overall command of
    English was at least somewhat above average
    when compared with other nonnative students
    they had taught                                 20    33     50      66      83
Students (%)
  Who agreed that they could:
    • quickly find information in academic texts    42    49     62      75      86
    • understand the most important points when
      reading an academic text                      40    55     71      83      91
    • figure out the meaning of unknown words by
      using context and background knowledge        34    43     60      71      83
    • remember major ideas when reading an
      academic text                                 42    50     66      75      85
    • understand charts and graphs in academic
      texts                                         42    53     73      84      91
    • understand academic texts well enough to
      answer questions about them                   41    45     63      75      85
  Who said they did not perform well at:
    • understanding vocabulary and grammar          25    26     13       5       2
    • understanding major ideas                     20    13      6       2       0
    • understanding how the ideas in a text
      relate to each other                          26    23     10       6       3
    • understanding the relative importance of
      ideas                                         28    18      9       4       3
    • organizing or outlining the important
      ideas and concepts in texts                   29    26     12       7       4
  Who felt their reading ability was lower than
    that of other students in ESL classes           18    14      5       4       2
  Who felt that problems reading English made
    learning difficult                              43    29     16      10       6

Table 7
Key Descriptors of LanguEdge Learners by Speaking Score Level

                                                         Test score level
Descriptor                                          1-2   2-3   3-4   4-5
Faculty (%)
  Judging that students had been more than
    moderately successful at making himself/
    herself understood by faculty and students
    during classroom and other discussions          44    57    76    86
  Who felt that students’ overall command of
    English was at least somewhat above average
    when compared with other nonnative students
    they had taught                                 22    34    72    83
Students (%)
  Who agreed that they could:
    • state and support their opinion               31    51    68    85
    • make themselves understood when asking a
      question                                      56    70    81    93
    • talk for a few minutes about a familiar
      topic                                         39    66    73    90
    • give prepared presentations                   38    62    78    90
    • talk about facts or theories they know
      well and explain them in English              28    55    68    82
  Who said they did not perform well at:
    • speaking for one minute in response to a
      question                                      53    36    23    17
    • getting other people to understand them       26    16     7     5
    • participating in conversations or
      discussions                                   36    25    17     6
    • orally summarizing information from a
      lecture listened to in English                47    38    25     8
    • orally summarizing information they have
      read in English                               40    23    16     7
  Who felt their speaking ability was lower
    than that of other students in ESL classes      19    17    11     6
  Who felt that problems speaking English made
    it difficult to demonstrate learning            46    41    25    13

Table 8
Key Descriptors of LanguEdge Learners by Writing Score Level

                                                         Test score level
Descriptor                                          1-2   2-3   3-4   4-5
Faculty (%)
  Judging that students had been more than
    moderately successful at expressing ideas
    in writing and responding to assigned topics    35    68    77    83
  Who felt that students’ overall command of
    English was at least somewhat above average
    when compared with other nonnative students
    they had taught                                 30    61    73    88
Students (%)
  Who agreed that they could:
    • express ideas & arguments effectively when
      writing in English                            43    62    70    76
    • support ideas with examples or data when
      writing                                       48    63    77    77
    • write texts that are long enough without
      writing too much                              41    58    68    73
    • organize text so that the reader
      understands the main and supporting ideas     51    69    80    85
    • write more or less formally depending on
      the purpose and the reader                    42    58    68    75
  Who said they did not perform well at:
    • writing an essay in class on an assigned
      topic                                         30    19    11    10
    • summarizing & paraphrasing in writing
      information read in English                   26    15    10     9
    • summarizing in writing information that
      was listened to in English                    39    30    20    14
    • using correct grammar, vocabulary, spelling
      and punctuation when writing                  39    28    16    12
  Who felt their writing ability was lower than
    that of other students in ESL classes           20    13     8     8
  Who felt that problems writing English made it
    difficult to demonstrate learning               38    29    18    17

Student self-assessments are shown in a similar manner in each table. For example, Table 5 reveals that 34% of the students who obtained LanguEdge Listening scores of 1-5 agreed that they could remember the most important points in a lecture, whereas 78% of those at the highest level (21-25) agreed that they could do this. We note that for all but one of the various ratings (understanding vocabulary and grammar), percentages increase (or decrease) monotonically across score levels, as expected.
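The monotonicity claim is easy to check mechanically: each row of percentages should be entirely nondecreasing or entirely nonincreasing across score levels. The helper below is ours; the two rows are copied from Tables 5 and 6.

```python
def is_monotone(seq) -> bool:
    """True if the sequence never decreases or never increases."""
    nondecreasing = all(a <= b for a, b in zip(seq, seq[1:]))
    nonincreasing = all(a >= b for a, b in zip(seq, seq[1:]))
    return nondecreasing or nonincreasing

rows = {
    # Table 5: "remember the most important points in a lecture"
    "remember important points": [34, 37, 45, 63, 78],
    # Table 6: "understanding vocabulary and grammar" (the one exception)
    "vocabulary and grammar": [25, 26, 13, 5, 2],
}
for label, pcts in rows.items():
    print(label, "->", "monotone" if is_monotone(pcts) else "not monotone")
```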

Finally, it may be useful to LanguEdge users to know how test takers viewed the various tasks that make up the assessment, that is, how valid the tasks appeared to them to be. Table 9 shows the reactions of field study participants to each of the LanguEdge tasks. As can be seen, students generally viewed the tasks as appropriate ones on which to demonstrate their English language skills. With the exception of two speaking tasks (speaking about a lecture and speaking about a reading passage), each of the tasks was deemed by nearly 80% (or more) of test takers to have been a good way to demonstrate their skills.

Table 9
Test Taker Agreement With Statements About LanguEdge Tasks

Statement                                                       Percent agreeing or strongly agreeing
Writing about a general topic was a good way to demonstrate
  my ability to write in English.                                       90
This was a good test of my ability to understand
  conversations and lectures in English.                                82
Answering questions about single points or details in the
  reading text was a good way for me to demonstrate my
  reading ability.                                                      82
Answering questions by organizing information from the
  entire reading passage into a table was a good way for me
  to demonstrate my reading ability.                                    82
This was a good test of my ability to read and understand
  academic texts in English.                                            80
Writing about a reading passage was a good way to
  demonstrate my ability to write in English.                           79
Speaking about general topics was a good way to demonstrate
  my ability to speak in English.                                       78
Writing about a lecture was a good way to demonstrate my
  ability to write in English.                                          78
Speaking about a lecture was a good way to demonstrate my
  ability to speak in English.                                          65
Speaking about a reading passage was a good way to
  demonstrate my ability to speak in English.                           62

Note. Ns range from 2,685 to 2,694.

Discussion

Although faculty ratings and student self-assessments proved to relate only modestly to each other, both related significantly to scores on each section of the LanguEdge assessment. LanguEdge test scores related moderately (correlations mostly in the .30s and .40s) with student self-assessments. The correlations of faculty ratings with each of the LanguEdge section test scores were generally in the .40s, with some reaching the .50s. Moreover, individually, each of the faculty ratings and student self-assessment questions distinguished among test takers scoring at different levels on the assessments. This was true for each of the four LanguEdge test sections. The correlations between the various student self-assessment scales and faculty ratings were modest, mostly in the .30s, suggesting that students and faculty had different perspectives on students’ English language skills.

How do the correlations between self-assessments and test scores found in this study compare with those detected in other efforts? The answer is “generally quite favorably.” For instance, several reviews or meta-analyses have been conducted in which self-assessments have been shown to correlate, on average, about .35 with peer and supervisor ratings (Harris & Schaubroeck, 1988), about .29 with a variety of performance measures (Mabe & West, 1982), about .39 with teacher evaluations (Falchikov & Boud, 1989), and in the .60s for studies dealing with self-assessment in second and foreign languages (Ross, 1998).

The correlations computed here also compare favorably with those typically found in test validity studies. For instance, in the context of graduate admissions, Graduate Record Examinations® (GRE®) General Test scores generally correlate in the .20–.40 range with graduate grade averages (Briel, O’Neill, & Scheuneman, 1993; Kuncel, Hezlett, & Ones, 2001) and in the .30–.50 range with such criteria as faculty ratings and performance on comprehensive examinations (Kuncel, Hezlett, & Ones, 2001).

We believe, therefore, that the validity criteria employed here (i.e., faculty judgments and student self-assessments) may prove useful in providing additional meaning to LanguEdge test scores. An obvious limitation of the study, however, is that we have provided no validation of students’ self-assessments themselves. That is, we did not attempt to verify that students knew and could actually do what they said they could do (beyond, of course, obtaining somewhat similar ratings from faculty). Moreover, pairs of faculty members did not agree very strongly with regard to their assessments of the students they had taught. Despite this lack of agreement (which may simply reflect different but legitimate perspectives), LanguEdge scores correlated significantly with faculty ratings.

The strength of the study, we believe, is that, unlike previous efforts that have relied on internal anchors (i.e., the items constituting a test), we have enhanced test score meaning by referencing external “anchors.” A shortcoming of this study is that relatively few anchor items were administered, thus precluding a more selective identification of the most discriminating items for score interpretation. Consequently, no attempt was made to summarize and interpret performance at the various score levels by generalizing across sets of items, as has been the practice for internal methods, for which much larger numbers of test items have usually been available. Next steps in developing this methodology might be to take a more model-based (rather than solely data-driven) approach in order to provide more stable estimates of the relationships between test scores and validation criteria. In addition, a larger number of external anchors could be administered in order to select only those that exhibit the greatest ability to distinguish among score levels.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204.

Briel, J. B., O’Neill, K. A., & Scheuneman, J. D. (Eds.). (1993). GRE technical manual: Test development, score interpretation, and research for the Graduate Record Examinations Program (pp. 67–88). Princeton, NJ: Educational Testing Service.

Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis. Review of Educational Research, 59, 395–430.

Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10, 3–9, 16.

Hambleton, R. K., & Slater, S. (1994, October). Using performance standards to report national and state assessment data: Are the reports understandable and how can they be improved? Paper presented at the Joint Conference on Standard Setting for Large-Scale Assessments, Washington, DC.

Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer-supervisor ratings. Personnel Psychology, 41, 43–62.

Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). Comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127, 162–181.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

Mabe, P. A., & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280–296.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education.

Mullis, I. V. S., & Jenkins, L. B. (1988). The science report card: Elements of risk and recovery. Princeton, NJ: Educational Testing Service.

Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), The encyclopedia of language and education: Vol. 7. Language testing and assessment (pp. 175–187). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15, 1–20.

Shrauger, J. S., & Osberg, T. M. (1981). The relative accuracy of self-predictions and judgments by others in psychological assessment. Psychological Bulletin, 90, 322–351.

Tannenbaum, R. J., Rosenfeld, M., Breyer, F. J., & Wilson, K. (2003). Linking TOEIC scores to self-assessments of English-language abilities: A study of score interpretation. Manuscript submitted for publication.

Upshur, J. (1975). Objective evaluation of oral proficiency in the ESOL classroom. In L. Palmer & B. Spolsky (Eds.), Papers on language testing 1967-1974 (pp. 53–65). Washington, DC: TESOL.