Applied Measurement in Education, 27: 248–260, 2014
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print/1532-4818 online
DOI: 10.1080/08957347.2014.944309

Science Assessments and English Language Learners:Validity Evidence Based on Response Processes

Tracy Noble, Ann Rosebery, Catherine Suarez, and Beth Warren
TERC

Mary Catherine O’Connor
Applied Linguistics, Boston University

English language learners (ELLs) and their teachers, schools, and communities face increasingly high-stakes consequences due to test score gaps between ELLs and non-ELLs. It is essential that the field of educational assessment continue to investigate the meaning of these test score gaps. This article discusses the findings of an exploratory study of the response processes of grade 5 ELLs and non-ELLs on multiple-choice science test items from a high-stakes test. We found that the ELL students in our sample were more likely than the non-ELL students to answer incorrectly despite demonstrating knowledge of the science content targeted by the test items. Investigating the interview transcripts of ELL students who answered in this way revealed that ELL students’ interactions with specific linguistic features of test items often led to alternative interpretations of the items that resulted in incorrect answers. The implications of this work for the assessment of ELLs in science are discussed.

The science and mathematics tests administered in the United States to fulfill the requirements of the No Child Left Behind Act of 2001 (NCLB, 2002) have high-stakes consequences for students, teachers, and schools. The consequences of low test scores can include the denial of high school diplomas to students (Center on Education Policy [CEP], 2011a), the firing of teachers and administrators, and the restructuring of entire schools. These high-stakes consequences differentially affect English language learners (ELLs) and their teachers, schools, and communities, due to persistent test score gaps between ELLs and non-ELLs (CEP, 2011a; Perna & Thomas, 2009). Although the differences between the test scores of ELLs and non-ELLs are typically referred to as achievement gaps, we believe that they are more accurately described as test score gaps (Noble et al., 2012). Despite the stated goals of NCLB to close such test score gaps, they persist on large-scale tests at the state and national level (Barton & Coley, 2009; CEP, 2011b).

The most obvious contributor to test score gaps between ELL and non-ELL students is English proficiency. Given the requirements of NCLB that ELLs be tested using science and mathematics tests written in English after only three years in U.S. schools, many ELLs are being tested while they are still developing aspects of their English proficiency. Despite a significant body of research demonstrating that ELL students may need five to seven years to become fluent in academic English (see Solórzano [2008] for a brief review on this topic), many states begin testing ELLs with science and math tests written in English during ELL students’ first or second year in U.S. schools. Differences in cultural background between students and test item writers may also contribute to differences in item interpretations that lead to incorrect answers (Hill & Larsen, 2000; Solano-Flores & Trumbull, 2003). Thus, there are multiple reasons to question whether the test score gaps between ELLs and non-ELLs on science and mathematics tests reflect differences in science and mathematics knowledge and skills.

Correspondence should be addressed to Tracy Noble, Ph.D., TERC, 2067 Massachusetts Avenue, Cambridge, MA 02140. E-mail: [email protected]

The interpretation of a test score as a measure of a student’s science content knowledge is valid if the test measures science content knowledge and not other constructs, such as English language proficiency or cultural knowledge unrelated to science (Haladyna & Downing, 2004). Thus, it is essential that the field of educational assessment continue to investigate the validity of interpreting ELL student scores on such tests as measures of their science or mathematics knowledge and skills. According to thinking in the field of measurement, specific validity evidence from ELL students should be gathered when there is reason to believe that their test scores may differ in meaning from those of non-ELL students (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; WestEd & Council of Chief State School Officers [CCSSO], 2007; Young, 2009).

Research on the test scores of ELL and non-ELL students has generated substantial evidence that the meaning of test scores may differ for ELL and non-ELL students. Statistical studies have shown that large-scale tests in reading, science, and math show lower reliability and lower correlations between test scores and related ability measures for ELLs than for non-ELLs (Abedi, 2002). Additional studies have found that ELLs score lower than non-ELLs on test items with greater linguistic complexity when the compared students have the same total score on all the other items on the test (Martiniello, 2009; Wolf & Leon, 2009). Finally, linguistic simplification studies have demonstrated that the language of science and mathematics test items can be simplified without changing the tested science or mathematics content, causing small but statistically significant improvements in test scores for ELL students, without similar increases in the scores of non-ELL students (Abedi, Courtney, & Leon, 2003; Abedi, Courtney, Mirocha, Leon, & Goldberg, 2005; Sato, Rabinowitz, Gallagher, & Huang, 2010). Given the body of evidence indicating that science and mathematics test scores may differ in meaning for ELL and non-ELL students, there is an argument for collecting validity evidence to examine such differences.

One way to investigate the validity of interpreting test scores as measures of knowledge and skills is through cognitive labs in which students are interviewed about how they answered test items and/or asked to think aloud while answering the items. The goal is to determine whether the knowledge and skills students report using to answer test items are the same as the knowledge and skills the items are intended to test. In the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), this evidence based on response processes is regarded as one of the five main sources of validity evidence. Young (2009) has proposed that evidence based on students’ response processes should be used as one indicator of whether test score interpretations are equally valid for ELLs and non-ELLs. Nonetheless, evidence based on response processes has rarely been included in large-scale studies of the validity of interpretations of state test scores (Schafer, Wang, & Wang, 2009; Sireci, Han, & Wells, 2008).

One example of research on ELL students’ response processes is the work of Martiniello (2008), who found that grade 4, Spanish-speaking ELL students were sometimes unable to understand the language of mathematics test items sufficiently to know what they were being asked to do, despite demonstrating the ability to do the required mathematical task. Students’ difficulties with these test items often arose from linguistic or cultural features of the test items, such as words that were primarily used outside of school. The ELL students interviewed had fewer difficulties with words that were learned in school, such as mathematical terms.

From a sociocultural perspective, the language of a test item does not have a fixed interpretation; rather, an interpretation emerges from the interaction of a student with a test item. Science and mathematics test items are part of a specialized discourse that includes culturally and historically developed sets of expectations about how students will interpret the items and select their answers (Gee, 1996). When science and mathematics tests are based on the language and cultural norms of the European-American, native English-speaking, middle- and upper-income population, a linguistic mismatch may occur between the features of language in test items and the features of language with which students from other populations, such as ELLs, are familiar (Basterra, Trumbull, & Solano-Flores, 2011; Solano-Flores & Nelson-Barber, 2001). One of the main findings of sociocultural research on assessment is that variation in performance on test items is not solely due to student knowledge and ability factors, but also due to systematic error of measurement resulting from the interaction of knowledge and ability factors with student background factors, such as socioeconomic status, first language, and associated meaning-making practices (Solano-Flores & Trumbull, 2003). Investigating students’ response processes can help to illuminate these interactions (O’Connor, 2006).

METHODS

The study reported in this article is an exploratory study of students’ response processes in which 12 ELL and 24 non-ELL grade 5 students were interviewed about their responses to six publicly released science test items from the Grade 5 Science and Technology/Engineering Massachusetts Comprehensive Assessment System (STE MCAS) (Massachusetts Department of Education [MA DOE], 2003, 2004a, 2005). When we initially reviewed the transcripts of students’ interviews, we found cases in which students had answered test items incorrectly but nonetheless demonstrated the science content knowledge targeted by those test items during their interviews. We sought to explore this phenomenon further by finding out when and why it occurred.

We identified the science content knowledge targeted by each test item (hereafter called target knowledge) and scored student interview transcripts for the demonstration of this target knowledge. This allowed us to determine when students’ answer choices were consistent or inconsistent with the target knowledge they demonstrated in the interview. We explored this consistency for our sample as a whole and for each subgroup within our sample. In addition, we reviewed all student interview transcripts that showed inconsistency between answer correctness and demonstrated target knowledge and looked for patterns in students’ interactions with test items that led to these inconsistencies.


Participants

Thirty-six students were recruited for the interview study by their teachers or their school principals at 11 schools in three urban school districts in Massachusetts. Our sample of students was not random and consisted of volunteers from schools at which we had relationships with teachers and/or principals. The sample consisted of 12 ELLs and 24 non-ELLs. The 12 ELLs were classified as Formerly Limited English Proficient (FLEP) by their schools and had been considered Limited English Proficient (LEP) in the current school year or within the previous two years (MA DOE, 2004b). As they demonstrated in the interview context, they were still learning the English language skills needed to interpret science assessments. Few studies focus on FLEP students, who often have significant challenges in mainstream classrooms (Young et al., 2010). The students in our sample had a range of first languages that reflected the variation in first languages spoken by ELL students in Massachusetts and included Spanish, Haitian Creole, and a number of Asian languages. However, first language data was not systematically collected. Ten of the 12 FLEP students interviewed (83%) received free or reduced-price lunch (F/RL), which is consistent with the proportion of LEP and FLEP students who receive F/RL in the state (79%).

The non-ELL students in the sample were identified by their teachers or principals as neither FLEP nor LEP. From school records, we identified two groups of non-ELLs: (1) 12 non-ELL students who received F/RL at school, categorized as F/RL non-ELLs, and (2) 12 non-ELLs who did not receive F/RL at school and were categorized as middle class (MC) non-ELLs.

Test Items

We used six multiple-choice science test items from the publicly released Grade 5 STE MCAS (MA DOE, 2003, 2004a, 2005). We selected three test items keyed to one Life Science standard and three test items keyed to one Physical Sciences standard. We selected standards for which test items appeared consistently on the STE MCAS over the period 2003–2005. We selected sets of items that showed variation in student performance of 20 percentage points or more across the three items associated with an individual standard. We hypothesized that this level of variation might be due to factors independent of the science content of the items. An additional selection criterion was the general linguistic difficulty of the language of the test items, but item selection was made before the items were coded for specific linguistic features in a related study (Rosebery et al., 2012).
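
To make the variation criterion concrete, the short sketch below applies the 20-percentage-point rule to a set of hypothetical item statistics; the standard names echo the study, but the item labels and percent-correct values are invented for illustration only and are not the actual MCAS data.

```python
# Illustrative sketch of the item-selection criterion described above.
# For each standard, the set of three items is retained only if the spread in
# percent-correct across the items is at least 20 percentage points.
# The percent-correct values below are hypothetical.

percent_correct = {
    "Life Science standard": {"item_1": 81, "item_2": 58, "item_3": 72},
    "Physical Sciences Standard 3": {"item_4": 65, "item_5": 44, "item_6": 70},
}

MIN_SPREAD = 20  # percentage points

for standard, items in percent_correct.items():
    spread = max(items.values()) - min(items.values())
    decision = "select" if spread >= MIN_SPREAD else "skip"
    print(f"{standard}: spread = {spread} points -> {decision}")
```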

Interview Procedure

The interview protocol was intended to explore why each student answered each test item as he or she did. After a student completed all six test items, the interviewer asked a series of questions about each item, such as: “Which answer did you choose and why?”, “Were there any other good answers?”, and “Where did you learn about this?” Additional questions were asked about vocabulary and illustrations specific to each item, and the interviewer asked follow-up questions when a student’s answers were not clear or complete. A sample interview protocol is available in the online version of Noble et al. (2012), along with a more detailed description of the methods used in this study. Interviews lasted approximately one hour and were videotaped and transcribed. The interviews took place entirely in English.


Data Analysis

We defined the target knowledge of a test item as the portion of the science content knowledge defined by the standard associated with the item (MA DOE, 2006) that was needed to answer the particular test item. For instance, Physical Sciences Standard 3, “Describe how water can be changed from one state to another by adding or taking away heat,” concerns a range of different changes of state of water, while each test item generally involves only one change of state, such as changing from a liquid to a solid. The target knowledge did not include any science content knowledge not included in the standard but nonetheless associated with the test item.

We next determined whether each student demonstrated the target knowledge for each item during the portion of the interview concerning that item, giving a student a score of 1 or 0, respectively, for demonstrating and not demonstrating the target knowledge completely. No partial credit was given. Two coders with backgrounds in science education research independently reviewed all student utterances in the interview transcript section concerning a given test item and coded all the utterances that were related to science content, using NVivo qualitative data analysis software (QSR International, 2008). Their codes were reviewed by two additional coders with backgrounds in science and science education research, who scored each student as demonstrating or not demonstrating the target knowledge for a given item. Scoring discrepancies were discussed until consensus was reached.

We compared student scores for item correctness (1 for correct and 0 for incorrect) and demonstrated target knowledge (1 for demonstrated and 0 for not demonstrated) for each test item, and compiled results across the 36 students and six test items in our sample, for a total of 216 cases. For this exploratory analysis, we treated each answer choice as independent, although we recognize that students’ answers to different items on the same test form are likely to be related. We evaluated the correspondence between item correctness and demonstrated science knowledge as consistent when a student answered an item correctly and demonstrated the target knowledge for the item in the interview segment about that item, and when a student answered incorrectly and did not demonstrate the target knowledge. All other cases were evaluated as inconsistent. We then computed the percentage of total cases that were consistent and inconsistent and the percentage of each type of inconsistency (answer correct and target knowledge not demonstrated, or answer incorrect and target knowledge demonstrated).
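
As an illustration of this classification scheme, the sketch below pairs each case’s correctness score with its target-knowledge score, labels the case as consistent or inconsistent, and tallies the two types of inconsistency. The records shown are fabricated examples, not the study data.

```python
from collections import Counter

# Each case pairs answer correctness (1 = correct, 0 = incorrect) with whether
# the target knowledge was demonstrated in the interview (1 = yes, 0 = no).
# These example cases are fabricated for illustration only.
cases = [
    {"correct": 1, "knowledge": 1},  # consistent
    {"correct": 0, "knowledge": 0},  # consistent
    {"correct": 0, "knowledge": 1},  # inconsistent: knowledge shown, item missed
    {"correct": 1, "knowledge": 0},  # inconsistent: correct without demonstrated knowledge
]

def classify(case):
    """Label a case according to the correspondence described above."""
    if case["correct"] == case["knowledge"]:
        return "consistent"
    if case["correct"] == 0:
        return "incorrect, target knowledge demonstrated"
    return "correct, target knowledge not demonstrated"

tallies = Counter(classify(c) for c in cases)
total = len(cases)
for label, count in tallies.items():
    print(f"{label}: {count} ({100 * count / total:.0f}% of cases)")
```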

We used our evaluation of the correspondence between student answer choices and demonstrated target knowledge as a guide for further investigation of student interview transcripts. We began by creating summaries of the segment of each student’s interview regarding each test item. Following the model of other researchers who have taken a sociocultural perspective on the analysis of student interviews regarding assessment items (Hill & Larsen, 2000; O’Connor, 2006; Solano-Flores & Nelson-Barber, 2001; Solano-Flores & Trumbull, 2003), we sought to understand and describe each student’s path of reasoning for each test item. Two researchers with backgrounds in science education research collaboratively created a summary of each segment of each student’s interview. In addition, each transcript section was coded for key characteristics of student reasoning, such as how students interpreted particular words and phrases within a test item. Consensus coding was used by two researchers with backgrounds in science and science education research.

For inconsistent cases, we investigated students’ interview transcripts in more detail. In these instances, summaries of transcript sections and full transcripts were shared with a group of researchers with backgrounds in bilingual education, the learning sciences, linguistics, physics, biology, and science education research. This research group iteratively refined the interpretations of student transcripts, with consistent reference to the interview transcripts for justification of interpretations, and developed explanations for patterns seen across student transcripts (Jordan & Henderson, 1995). Linguistic analyses of the test items also informed the group discussions of interview transcripts (Rosebery et al., 2012).

RESULTS

We found inconsistencies between answer correctness and demonstrated target knowledge in 35% of the 216 cases. There were 72 cases (12 students × 6 items) for each subgroup of students, and we observed inconsistencies in 39%, 33%, and 32% of the cases for the ELL, F/RL non-ELL, and MC non-ELL subgroups, respectively. Inconsistent cases occurred slightly more often for ELLs than for either group of non-ELLs. For non-ELLs, the majority of the inconsistencies were due to students answering correctly but not demonstrating the target knowledge, while for ELL students, the majority of inconsistencies were due to answering incorrectly but demonstrating the target knowledge. Due to the small size of our sample of students and test items, and the fact that students’ answer choices and interview question responses for different test items are likely not to be independent, we do not attempt to generalize from these results. Instead, we used these results as a guide for further exploration of our qualitative data for the ELL students in our sample, particularly those cases in which ELL students answered incorrectly despite demonstrating the target knowledge for the test items. Details of our analysis of the interview data for the other two groups of students are reported elsewhere (Noble et al., 2012).
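
To connect these percentages with the case counts used in the sections that follow, a quick calculation using only the figures reported above recovers the approximate number of inconsistent ELL cases; the arithmetic is shown as a brief sketch.

```python
# Figures reported above: 12 ELL students, 6 items each, 39% of ELL cases inconsistent.
ell_cases = 12 * 6                          # 72 ELL cases in total
ell_inconsistent = round(0.39 * ell_cases)  # roughly 28 inconsistent ELL cases
print(ell_cases, ell_inconsistent)          # 72 28
```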

Analysis of interview transcripts revealed that in 10 of 18 cases in which ELL students answered incorrectly despite demonstrating the target knowledge, students’ interpretations of the language of the test item were different from the intended interpretations. To illustrate this phenomenon, we describe ELL students’ alternative interpretations of the Take Away Heat item in the next section, and explore the reasons for students’ alternative interpretations in the succeeding sections.

Take Away Heat Test Item

The Take Away Heat test item (Figure 1) is associated with Physical Sciences Standard 3: “Describe how water can be changed from one state to another by adding or taking away heat” (MA DOE, 2006), and the target knowledge for this test item is the following: “Taking away heat from water can cause liquid water to turn into a solid. Take away heat means cool.” Four ELL students answered Take Away Heat incorrectly, despite demonstrating the target knowledge for the item. All four of these students described an alternative interpretation of the language of the test item that led them to answer a different question than the one intended.

If enough heat is taken away from a container of water, what will happen to the water?

A. It will begin to boil.
B. It will become a solid.
C. It will turn into a gas.
D. It will increase in weight.

FIGURE 1 Take Away Heat MCAS test item (MA DOE, 2004a).

All four of these ELL students initially thought the item was asking what would happen when you heat the container of water, rather than what would happen when you cool it. For instance, one student explained that she picked (A) It will begin to boil, because “when it gets enough heat, it will begin to boil.” This student correctly interpreted “enough heat” as indicating heat that meets a threshold needed for a phase change (e.g., liquid turning to gas) to occur, but she thought that enough heat was being added to, rather than being removed from, the container of water. Another student used the phrase “take enough heat” in his interview to mean that the water was heating up, or taking heat from a heat source. All four of the students were able to describe what would happen to water when “enough heat” was transferred to the water: the water would begin to boil, or would turn into a gas. All four students were also able to demonstrate the target knowledge for the item during the interview, recognizing that to take away heat means to cool and that cooling the water can lead to freezing. Significantly, the students were able to demonstrate that they understood the words and phrases of this item, but when they initially read the item as a whole, they constructed an alternative interpretation. We conclude that for these four students, their interactions with the language of the test item interfered with their abilities to demonstrate what they knew about the science of this test item.

Atypical Perspective Feature

As a result of their interpretations of the language of the item, the four ELL students described in the previous section were answering a different scientific question than the one the item was intended to ask. In order to better understand why these students constructed alternative interpretations of the language of this item, we consider the findings of a study in which 54 grade 5 STE MCAS test items were coded for seven linguistic features hypothesized to interfere with ELL student performance (Rosebery et al., 2012). In this study, the Take Away Heat test item was coded as having the atypical perspective feature, that is, as taking an uncommon or atypical perspective on the concept or process it describes. A perspective was judged as atypical based upon its difference from commonly taught perspectives and everyday experiences of situations; the atypical perspective is generally the reverse of the more typical perspective (Rosebery et al., 2012).

In a related study, three popular, National Science Foundation (NSF)–funded elementary school science curricula on the topic of the change of state of water were reviewed to determine how this topic is commonly taught (Noble & Suarez, 2010). This study found that change of state involving heating was more commonly described than change of state involving cooling. In addition, the phrase “heat is taken away” was included in teacher background materials, but was not suggested as language to use in the classroom to describe lowering the temperature of water. Instead, the word “cooling” was suggested (Noble & Suarez, 2010, pp. 14–15). The findings of this review provide further explanation for students’ alternative interpretations of the language of the Take Away Heat test item. As the students worked to comprehend the whole item, they used specific words from the item text, such as “take,” “enough,” and “heat,” as evidence that the item indicated the typical perspective on heat and a container of water, that is, the perspective in which the water is being heated. The typical perspective is more commonly taught and more commonly associated with the word “heat.” As a result, when these four ELL students read this test item, they constructed an alternative interpretation in accord with the more typical perspective on this situation.

Out of 28 cases of inconsistency for ELL students, 18 were instances in which ELL students had incorrect item responses despite demonstrating the science knowledge for the item. Analyzing student responses further, we found that nine of these 18 cases involved interpreting a question with the atypical perspective feature (either Take Away Heat or the other item in our set of six with this feature) as if it had the typical perspective. Although some non-ELL students also had difficulty with the atypical perspective feature of these two test items (see Noble et al., 2012), we found that ELL students’ interpretations of the items with this feature were more likely to result in incorrect answers despite demonstrated target knowledge. Our results suggest that, while the atypical perspective feature may be intended to challenge students to view a scientific process from an alternative perspective, it can interfere with ELLs’ interpretations of the language of the test item. The result is that some ELL students answer a different scientific question from the one the item is intended to ask.

ELLs Who Did Not Demonstrate Target Knowledge

The four ELL students who answered Take Away Heat incorrectly despite displaying the target knowledge for the item did not immediately recognize the language “heat is taken away” as indicating cooling when they first read the item, although they were able to correctly define the phrase in their interviews. Thus, we considered the possibility that some students whose transcripts were coded as not demonstrating the target knowledge may have been unfamiliar with the phrase “heat is taken away” but otherwise knew how cooling changes the state of water. Knowledge of the phrase “heat is taken away” is clearly targeted by this test item, and is included in the standard with which the item is associated. Our investigation is not intended to challenge this definition of the target knowledge of this item, but instead to better understand why some ELL students did not demonstrate the target knowledge for this item.

To explore this question, we reviewed the interview transcripts of the five ELL students who did not demonstrate the target knowledge for Take Away Heat and who also answered the item incorrectly. These cases were labeled as consistent using our coding method, and thus not reviewed in depth previously. When we analyzed the transcripts of the portions of these five interviews that concerned the Take Away Heat test item, we found that three of the students did not interpret “heat is taken away from a container of water” as describing the cooling of the water. These three students had varying interpretations of this phrase. One student explicitly described his difficulty as follows: “I want to know what like, heat is taken away? Is it like, meaning putting it in a low temperature? Or like boiling it.” This student had initially guessed that “heat is taken away” meant boiling and had used this interpretation to select an incorrect answer. Two of these students demonstrated knowledge that cooling water can lead to freezing. We conclude that, for two of the five students who did not demonstrate target knowledge for the item, the phrase “heat is taken away” presented the main barrier to their demonstrating the target knowledge and answering the item correctly.


We did not find the same kinds of difficulties with the language “heat is taken away” among the non-ELL students in our study (Noble et al., 2012). We hypothesize that some of the students in our study did not encounter the phrase “heat is taken away” in school, and that the ELL students in our study had fewer resources for interpreting this phrase than their non-ELL peers due to less exposure outside of school to English vocabulary such as the phrasal verb “take away.” We conclude from this analysis that even in an interview context, and even for ELL students with relatively high levels of English proficiency, the language of a test item can interfere with the ability of ELL students to demonstrate what they know about science.

DISCUSSION

In our study, we examined the responses of ELL students who selected incorrect answers to multiple-choice items, even though they demonstrated the target knowledge for those items during an interview. We found that all of these students’ incorrect answer choices resulted from alternative interpretations of the meaning of the language of the item that led them to answer a different scientific question from the one that was intended. Students interpreted the Take Away Heat item as if it presented a typical perspective on the process described, when in fact it presented an atypical perspective. Looking across the six test items in the study, we found that half of the incorrect answer choices by ELL students who demonstrated the science knowledge for the item were answers to the two test items with the atypical perspective feature. Furthermore, we found that all of these answer choices resulted from interpreting the item as if it presented the typical perspective. These findings suggest that, for some ELL students, scores on science test items with the atypical perspective feature may not reflect their knowledge of science, but instead their alternative interpretations of the language of the items.

When ELL students answered incorrectly despite demonstrating the target knowledge for the Take Away Heat test item, we found that their challenges were in interpreting the language of the item as a whole. During the course of their interviews, the students were able to define specific words and phrases in the item, such as “heat is taken away,” but their reading of the item as a whole had led them to interpret it as taking the typical perspective, and to create an alternative scenario for the item. In contrast, when we analyzed the responses of the ELL students who answered incorrectly and did not demonstrate target knowledge for this test item, most of these students had difficulty interpreting the specific phrase “heat is taken away,” although they were familiar with the individual words. Their challenges appeared at the level of this phrase, as well as at the level of the item as a whole. Even when individual words such as “heat,” “taken,” and “away” are familiar to many students, constructing the intended meaning of phrases such as “heat is taken away” may be challenging for some ELLs. As a result, even an interview about the test item may not capture all that an ELL student knows about a science topic, due to the English language demands of demonstrating such knowledge during an interview conducted in English. Interviews conducted with a bilingual interviewer fluent in a student’s home language may reduce the language demands of the interview setting and allow students to display more of their science knowledge (Kachchaf et al., 2014; Noble et al., 2014). These findings emphasize the importance of using multiple measures to assess ELL students’ knowledge of science content, particularly measures that take into account ELL students’ levels of English proficiency and minimize the reliance on language (Kopriva et al., 2013).


CONCLUSIONS

Collecting validity evidence from the ELL students whose knowledge and skills will be assessed using a test is an essential part of test validation. Validity evidence based on ELL students’ response processes can help us to understand whether students are answering test items based upon the knowledge targeted by the items, or based upon other factors. Studies such as this one can help identify test items that are particularly open to alternative interpretations due to their linguistic features. Further research is needed to describe how ELL students with varying levels of English proficiency and varying language backgrounds interact with specific linguistic features of test items. Research on the assessment of ELLs in the United States and Francophone students in majority-English provinces in Canada is increasingly exploring the heterogeneity of these language minority populations, and finding that different groups of language minority students interact differently with assessments, due to differences in linguistic and cultural background (Ercikan, Roth, Simon, Sandilands, & Lyons-Thomas, 2014 [this issue]; Kopriva, 2008; Linquanti, 2011; Oliveri, Ercikan, & Zumbo, 2014 [this issue]; Solano-Flores, 2014 [this issue]). Future research on ELL students’ response processes should explore the relationship between language and cultural background variables and students’ interactions with test items.

This study points out the challenges inherent in using multiple-choice test items written in English to measure what ELL students know about science. As we have seen, ELL students do not necessarily interpret multiple-choice test items as item writers intend. However, we have found that using a student’s answer to a multiple-choice item as the starting point for even a brief conversation about the item can yield significantly more information about the student’s target knowledge than the answer choice alone. When used routinely in the process of test item development, this kind of investigation of students’ response processes has the potential to improve the validity evidence collected for ELL students, and the quality of test items developed. When used as a measurement tool, an interview or other non-traditional measure of students’ knowledge can improve the quality of the information collected about what a student knows about science, and can help avoid mistaking a student’s knowledge of a particular word or phrase in English for his or her knowledge of the science content targeted by the test item. The findings of this study serve as a caution against any interpretation of ELL test scores that does not recognize that English language proficiency is also being tested, and furthermore suggest that the use of multiple measures of a student’s science knowledge is critical to validity.

ACKNOWLEDGMENTS

The authors thank Josiane Hudicourt-Barnes, Rachel Kachchaf, and Curtis Killian of TERC; Christopher Wright of the University of Tennessee; Michael Russell of Measured Progress; and Rachel Kay of the Concord Consortium for their contributions to this work. In addition, the authors thank the editors of this special issue for their essential feedback on earlier versions of this article. The authors also thank the Massachusetts Department of Elementary and Secondary Education for making test items and performance data available so that research like ours can take place. We also thank Catherine Bowler, of the Massachusetts Department of Elementary and Secondary Education, whose interest in our work and willingness to help us to understand the MCAS has greatly facilitated this research. Finally, the authors thank all the students, parents, teachers, and administrators who made this work possible. This article is dedicated to them.

FUNDING

This research has been supported by the National Science Foundation, Grant REC-0440180, the Institute of Education Sciences, U.S. Department of Education, Grant R305A110122, and the Education Research Collaborative at TERC. Opinions, findings, conclusions, and recommendations expressed herein are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

Abedi, J. (2002). Standardized achievement tests and English language learners: Psychometric issues. Educational Assessment, 8(3), 231–257.

Abedi, J., Courtney, M., & Leon, S. (2003). Effectiveness and validity of accommodations for English language learners in large-scale assessments (CSE Report 608). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing. Retrieved from http://www.cse.ucla.edu/products/rsearch.asp

Abedi, J., Courtney, M., Mirocha, J., Leon, S., & Goldberg, J. (2005). Language accommodations for English language learners in large-scale assessments: Bilingual dictionaries and linguistic modification (CSE Report 666). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing. Retrieved from http://www.cse.ucla.edu/products/reports.php

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Barton, P. E., & Coley, R. J. (2009). Parsing the achievement gap II (Policy Information Report). Princeton, NJ: Educational Testing Service Policy Information Center.

Basterra, M. D. R., Trumbull, E., & Solano-Flores, G. (Eds.). (2011). Cultural validity in assessment: Addressing linguistic and cultural diversity. New York, NY: Routledge.

Center on Education Policy. (2011a). State high school tests: Changes in state policies and the impact of the college and career readiness movement. Retrieved from http://www.cep-dc.org/displayDocument.cfm?DocumentID=385

Center on Education Policy. (2011b). State test score trends through 2008–2009, Part 3: Student achievement at 8th grade. Retrieved from http://www.cep-dc.org/publications/index.cfm?selectedYear=2011

Ercikan, K., Roth, W.-M., Simon, M., Sandilands, D., & Lyons-Thomas, J. (2014). Inconsistencies in DIF detection for sub-groups in heterogeneous language groups. Applied Measurement in Education, 27, 273–285.

Gee, J. P. (1996). Social linguistics and literacies: Ideology in discourses (2nd ed.). London, UK: Falmer.

Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17–27.

Hill, C., & Larsen, E. (2000). Children and reading tests. Stamford, CT: Ablex.

Jordan, B., & Henderson, A. (1995). Interaction analysis: Foundations and practice. The Journal of the Learning Sciences, 4(1), 39–103.

Kachchaf, R., Noble, T., Rosebery, A., Wang, Y., Warren, B., & O’Connor, M. C. (2014). The impact of discourse features of science test items on ELL performance. Paper presented at the annual meeting of the American Educational Research Association, Philadelphia, PA.

Kopriva, R. (2008). Improving testing for English language learners. New York, NY: Routledge.

Kopriva, R., Winter, P., Triscari, R., Carr, T. G., Cameron, C., & Gabel, D. (2013). Assessing the knowledge, skills, and abilities of ELs, selected SWDs, and controls on challenging high school science content: Results from randomized trials of ONPAR and technology-enhanced traditional end-of-course biology and chemistry tests. Retrieved from http://onpar.us/research/reports.html

Linquanti, R. (2011). Strengthening assessment for English learner success: How can the promise of the common core standards and innovative assessment systems be realized? Cambridge, MA: Rennie Center for Education Research & Policy.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. Harvard Educational Review, 78(2), 333–368.

Martiniello, M. (2009). Linguistic complexity, schematic representations, and differential item functioning for English language learners in math tests. Educational Assessment, 14(3), 160–179.

Massachusetts Department of Education. (2003). Massachusetts Comprehensive Assessment System: Test questions. Retrieved from http://www.doe.mass.edu/mcas/testitems.html

Massachusetts Department of Education. (2004a). Massachusetts Comprehensive Assessment System: Test questions. Retrieved from http://www.doe.mass.edu/mcas/testitems.html

Massachusetts Department of Education. (2004b). Designation of LEP students: School year 2003–2004. Retrieved from http://www.doe.mass.edu/ell/news04/0325lep.html

Massachusetts Department of Education. (2005). Massachusetts Comprehensive Assessment System: Test questions. Retrieved from http://www.doe.mass.edu/mcas/testitems.html

Massachusetts Department of Education. (2006). Science and technology/engineering curriculum framework. Malden, MA: Massachusetts Department of Education.

Noble, T., Kachchaf, R., Rosebery, A., Warren, B., O’Connor, M. C., & Wang, Y. (2014). Do linguistic features of science test items prevent English language learners from demonstrating their knowledge? Paper presented at the annual meeting of the National Association for Research in Science Teaching, Pittsburgh, PA.

Noble, T., & Suarez, C. (2010). Review of three elementary school science curricula and their relationships to two focal Massachusetts state standards (Technical Report Number 1). Cambridge, MA: Chèche Konnen Center. Retrieved from http://chechekonnen.terc.edu/publications.html

Noble, T., Suarez, C., Rosebery, A., O’Connor, M. C., Warren, B., & Hudicourt-Barnes, J. (2012). “I never thought of it as freezing”: How students answer questions on large-scale science tests and what they know about science. Journal of Research in Science Teaching, 49(6), 778–803.

No Child Left Behind (NCLB) Act of 2001, Pub. L. No. 107-110, § 115, Stat. 1425 (2002).

O’Connor, M. C. (2006). The implicit discourse genres of standardized testing: What verbal analogy items require of test-takers. In J. Cook-Gumperz (Ed.), The social construction of literacy (pp. 264–287). Cambridge, UK: Cambridge University Press.

Oliveri, M. E., Ercikan, K., & Zumbo, B. D. (2014). Effects of population heterogeneity on accuracy of DIF detection. Applied Measurement in Education, 27, 286–300.

Perna, L. W., & Thomas, S. L. (2009). Barriers to college opportunity: The unintended consequences of state-mandated testing. Educational Policy, 23(3), 451–479.

QSR International. (2008). NVivo (Version 8) [Computer software]. Cambridge, MA: QSR International.

Rosebery, A., O’Connor, M. C., Noble, T., Suarez, C., Hudicourt-Barnes, J., & Warren, B. (2012). Understanding children’s performance on multiple-choice items in achievement tests in science (Technical Report Number 2). Retrieved from http://chechekonnen.terc.edu/publications.html

Sato, E., Rabinowitz, S., Gallagher, C., & Huang, C.-W. (2010). Accommodations for English language learner students: The effect of linguistic modification of math test item sets (NCEE Report Number 2009-4079). Retrieved from http://ies.ed.gov/pubsearch/pubsinfo.asp?pubid=REL20094079

Schafer, W., Wang, J., & Wang, V. (2009). Validity in action: State assessment validity evidence for compliance with NCLB. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 173–194). Charlotte, NC: Information Age.

Sireci, S. G., Han, K. T., & Wells, C. S. (2008). Methods for evaluating the validity of test scores for English language learners. Educational Assessment, 13, 108–131.

Solano-Flores, G. (2014). Probabilistic approaches to examining linguistic features of test items and their effect on the performance of English language learners. Applied Measurement in Education, 27, 236–247.

Solano-Flores, G., & Nelson-Barber, S. (2001). On the cultural validity of science assessments. Journal of Research in Science Teaching, 38(5), 553–573.

Solano-Flores, G., & Trumbull, E. (2003). Examining language in context: The need for new research and practice paradigms in the testing of English-language learners. Educational Researcher, 32(2), 3–13.

Solórzano, R. W. (2008). High stakes testing: Issues, implications, and remedies for English language learners. Review of Educational Research, 78(2), 260–329.

WestEd & Council of Chief State School Officers. (2007). Science assessment and item specifications for the 2009 National Assessment of Educational Progress. Retrieved from http://www.nagb.org/publications/frameworks.htm

Wolf, M. K., & Leon, S. (2009). An investigation of the language demands in content assessments for English language learners. Educational Assessment, 14(3), 139–159.

Young, J. W. (2009). A framework for test validity research on content assessments taken by English language learners. Educational Assessment, 14(3–4), 122–138.

Young, J. W., Steinberg, J., Cline, F., Stone, E., Martiniello, M., Ling, G., & Cho, Y. (2010). Examining the validity of standards-based assessments for initially fluent students and former English language learners. Educational Assessment, 15(2), 87–106.
