
Empirical Validation of Reading Proficiency Guidelines

Ray Clifford
Brigham Young University

Troy L. Cox
Brigham Young University

Abstract: The validation of ability scales describing multidimensional skills is always challenging, but not impossible. This study applies a multistage, criterion‐referenced approach that uses a framework of aligned texts and reading tasks to explore the validity of the ACTFL and related reading proficiency guidelines. Rasch measurement and statistical analyses of data generated in three separate language studies confirm a significant difference in reading difficulty between the proficiency levels tested.

Key words: ACTFL Proficiency Guidelines, multistage assessment, proficient reading, scale validation, testing reading

Introduction

Language proficiency scales grew out of the pragmatic need to functionally describe what language learners can do in a real‐world context of language use. While many academics acknowledge that the productive skills such as speaking and writing can fall into a hierarchical order of difficulty, there has been less consensus with the receptive skills. Occasionally, real‐life examples support the idea that various texts can be ordered by text complexity and difficulty. For instance, a junk mail postcard received by one of the authors provided a real‐world example of texts that were written for different purposes and, commensurate with those purposes, used observably different writing styles. On the front of the card was printed, in a large, bold font, the message: “No cost. No obligation. You have definitely won.” On the other side of the card, printed in a miniscule font, was the clarification, “Should a legal adjudication be made that valid consideration was transferred or expended by offeree pursuant to the specific terms of this written offer, then in such event restitutory indemnification shall be accomplished hereunder upon written request.”

Ray Clifford (PhD, University of Minnesota) is Associate Dean, College of Humanities, and Director, Center for Language Studies, Brigham Young University, Provo, Utah.
Troy L. Cox (PhD, Brigham Young University) is Technology and Assessment Coordinator at the English Language Center, Brigham Young University, Provo, Utah.
Foreign Language Annals, Vol. 46, Iss. 1, pp. 45–61. © 2013 by American Council on the Teaching of Foreign Languages.
DOI: 10.1111/flan.12033


Although some researchers have asserted that there is no hierarchical order of difficulty for reading proficiency (Alderson, 2000; Brown & Hudson, 2002; Lee & Musumeci, 1988), scales describing levels of reading proficiency have functioned well in both government and academia. There are several possible explanations for the paucity of empirical evidence. First, the construct of reading proficiency may have been inadequately defined so that the data produced were not related to the level descriptions. Second, the item writers may not have been trained to select reading passages that align with the different levels of the scale and, equally important, did not write questions that aligned with the passages and the scale task descriptions. Finally, the underlying measurement theory used in assessing reading ability did not define a priori what unidimensional trait was being measured, and the research did not generate interval‐level data that would be appropriate for use with parametric statistics.

To investigate the validity of the scales, we asked the following research question:

To what extent do test items ascend in a hierarchy of difficulty levels when both the passage and question are based on the ACTFL Reading Proficiency Guidelines? The associated null hypothesis would be that there is no difference in mean difficulty between or among test items written to assess the Intermediate, Advanced, and Superior levels of reading proficiency.

Background

The following report on this research is divided into a number of steps. First, we review the reading proficiency guidelines and what they entail. Second, we review previous research. Third, we discuss the importance of choosing a measurement theory that aligns with the reading proficiency construct. Finally, we discuss how the guidelines can be operationalized for item writers through (1) carefully defining the construct of reading proficiency and (2) ensuring that for each item there is alignment between the targeted level, the selected passage, and reader task.

Reading Proficiency Guidelines

As with the speaking guidelines, the ACTFL Reading Proficiency Guidelines grew out of the Interagency Language Roundtable (ILR) guidelines that were established in the 1950s to create a “system that was objective, applicable to all languages and all Civil Service positions, and unrelated to any particular language curriculum” (Herzog, n.d., para. 3). While the scale originally encompassed language proficiency as a unitary construct, it was later divided into the four skill areas of reading, writing, listening, and speaking. The intent was to provide a scale that stakeholders in various branches of government, with little or no language training, could use in making personnel assignments. In addition to serving as the foundation for the ACTFL scales, the ILR scale was also the basis for the NATO Standardization Agreement (STANAG) 6001 scales (STANAG 6001, 2010). While the major levels of the three scales have remained constant, the sublevels are different. STANAG and ILR may apply a “plus” sublevel rating, whereas ACTFL divides major levels into three parts: low, mid, and high. The ACTFL guidelines were recently updated, and sublevels were defined for the Advanced level. The introduction to the new guidelines states:

The ACTFL Proficiency Guidelines 2012—Reading describe five major levels of proficiency: Distinguished, Superior, Advanced, Intermediate, and Novice. The description of each major level is representative of a specific range of abilities. Together these levels form a hierarchy in which each level subsumes all lower levels. … By describing the tasks that readers can perform with different types of texts and under different types of circumstances, the Reading Proficiency Guidelines describe how [well] readers understand written texts. (ACTFL, 2012, p. 20)

Previous Research on the Guidelines

Perhaps because the productive skills of speaking and writing have been more observable, less attention has been given to the testing of second language receptive skills than to the testing of second language speaking proficiency. While the ACTFL Speaking Guidelines have been operationally validated (Surface & Dierdorff, 2003), that is not yet the case with the ACTFL Reading Guidelines. In fact, previous studies and academic reviews have questioned the validity of both the ACTFL Reading Proficiency Guidelines and the ILR Language Proficiency Guidelines (2012) from which the ACTFL guidelines were extrapolated. Despite the success that federal agencies have had in testing reading proficiency using the ILR proficiency guidelines, researchers have not statistically validated the difficulty hierarchy posited by those related sets of proficiency guidelines.

Specifically, Alderson (2000) concluded that there is “no empirical evidence to validate the a priori definitions of levels” (p. 279; italics in original). Brown and Hudson (2002) stated that although the ACTFL guidelines have been frequently described as criterion‐referenced, there has been circularity in the descriptions of the texts and the reader’s ability. Bernhardt (2011) added the insight that the assessment of reading comprehension has been complicated by readers’ use of compensatory processing strategies. In tests of the reading abilities of Italian learners, Lee and Musumeci (1988) concluded that the text types described by Child (1987) did not form a hierarchy of increasing difficulty and that their students’ performance was not consistent with the hierarchy of reading skills described in the ACTFL proficiency guidelines. Despite these criticisms, the ILR language proficiency descriptions continue to be used in high‐stakes testing within the federal government, as well as the related STANAG 6001 proficiency descriptions used by NATO.

In the attempt to reconcile the apparent contradiction between real‐world practices and research results, three studies were particularly valuable. First, Lee (2001) pointed out that much of the debate about the feasibility of classifying texts into distinguishable categories may be attributed to the imprecision of the categorical definitions used. In a massive analysis of the types of language texts found in the British National Corpus, he found that professional linguists did not always agree on the definitions of text classification terms such as genre, text type, style, and register. He also noted that differences in interpretation were of little consequence, because when real‐world texts were categorized, the various classification categories were interrelated and were naturally co‐selected. Genres are often recognized by their typical text types, and texts selected based on text type are often predominantly from a particular genre.

Second, it was important to recognize that the challenge of dealing with multidimensional traits is not unique to language assessment. The field of cognitive science must also cope with the assessment of complex, multidimensional mental skills such as those encountered when attempting to accurately measure reading ability—an ability that appears to be composed of a constellation of skills that are employed differently by different readers as they read texts of various types, on various topics, for various purposes. Luecht (2003), a cognitive specialist, provided an example of how the assessment of language proficiency may be accomplished through the careful alignment of theoretical constructs, proficiency classifications, and assessment methods. His recommendation was to see if one can measure major steps in skill development instead of attempting to assess all the developmental profiles that one may encounter.

Third, test method has long been recognized as an important and often outcome‐determining variable that can confound research results (Clifford, 1981). In a more recent study, Rupp, Ferne, and Choi (2006) found that using multiple‐choice questions to assess reading comprehension could alter the construct being tested. Especially when the difficulty of the passage to be read exceeded the readers’ ability, the readers ceased reading and shifted to problem‐solving tactics as they attempted to deduce the most likely answer from the options presented.

Clearly, multidimensionality of language proficiency, the compensatory nature of reading skills, and the complexity of testing procedures have made the assessment of reading proficiency a daunting challenge in both first and second language research. Smith (2007) noted in the foreword to the sixth edition of Understanding Reading that contradictory models and explanations continue to flourish in the area of reading research.

Choice of Measurement Theory

The psychologist Kurt Lewin stated, “There is nothing more practical than a good theory” (1952, p. 169), yet many social scientists have not considered the theoretical basis of what they have measured and how they have measured it. First, we discuss the importance of defining the dimensionality of the trait. Then we discuss the importance of interval data when using parametric statistics to assess or describe a human trait.

Reading as a Unidimensional Construct

Most constructs in the social sciences are multidimensional, yet the instruments used to measure those constructs assume that the trait being investigated is unidimensional. For a trait to be unidimensional, the underlying assumption is that the ability one is measuring lies on a single axis like weight or length. In the case of length, it is possible to measure a child’s height as he or she progresses from being an infant to reaching adulthood and use a predetermined standard of measure (e.g., centimeters or inches) to know when any growth occurred. In the context of measuring reading, no “ruler” exists, so researchers must either create or choose a test to act as the standard of measure. Despite the fact that most test theories in use are based on the assumption of unidimensionality (Ip, 2010), many researchers create instruments without an a priori definition of how they are treating the trait in a unidimensional way and are unaware of the measurement implication of that omission.

For example, when discussing the size of a shirt, it is possible to discuss it multidimensionally, bidimensionally, and unidimensionally. Treating a shirt as a multidimensional object requires measuring neck size, arm length, torso width, and length. These measurements allow for a shirt to be made with a fit that is ideal for one individual, but that size or constellation of measurements is so personalized that it is usually not useful in generalizing to the broader population. Often, dress shirt sizes are treated bidimensionally by measuring two dimensions: neck size and arm length. Using the two measurements in tandem can still result in shirts that fit an individual fairly well, but there are well over 20 different neck‐sleeve combinations. These two measures reduce the complexity to a degree that is beneficial to the customer, but the sheer number of combinations can still be too complex for describing a population of shirt customers. Therefore, shirt size can be treated as a unidimensional object with an a priori definition of what the size construct is. As neck size increases, typically arm length and torso length increase as well, so shirt sizes have been operationalized with the labels “small,” “medium,” and “large.” The likelihood of a shirt fitting each individual well decreases when treated unidimensionally, yet the practicality of being able to generalize to broader populations can make this an attractive alternative. This shirt analogy lends itself nicely to the discussion of the proficiency scales as they reflect an amalgamation of traits that co‐occur at levels of increasing difficulty.

Reading requires many skills that interact, yet the theory upon which most research has been conducted has not addressed that complexity and has treated all combinations of questions, question types, and passages as tapping a unidimensional construct. However, if one is to treat the reading skill as unidimensional, it is necessary to carefully define the construct in a way that the contributing skills co‐occur in specific alignments as the trait increases in complexity. As with the shirt analogy, in which a shirt does not qualify as a size medium unless it has the minimum neck, sleeve, chest, and torso dimensions expected of a medium, a person does not qualify as an Advanced‐level reader unless he or she consistently meets all of the criteria described for that level. Thus to test someone at the Advanced proficiency level requires the use of test items in which the skills needed to complete the Advanced reading tasks co‐occur. For example, we could discuss individual components such as reading rate or knowledge of sight words, text characteristics, or reader task; however, the proficiency guidelines contain a unidimensional construct that is an amalgamation of the criteria that co‐occur at each specific level of the reading proficiency scales.

Measuring Unidimensional Constructs

In classical test theory, an examinee’s ability is estimated by the total number of test questions answered correctly, and an item’s difficulty is calculated by dividing the number of examinees who answer the item correctly by the total number of examinees. With this classical approach, both the ability score and the item difficulty index are dependent on which examinees take the test and on the difficulty of the items included in the test. A further concern is that for the scores to be considered interval data, the distance between any two adjacent scores must be equal, and any increase in total score should represent the same increase in ability regardless of where it falls within the range of the observed difficulties. Thus, if one examinee took a pretest and had a score of 10, and later took a posttest and had a score of 15, it would be assumed that the examinee gained in skill by five points. To be interval data, that ability increase of five should have the same meaning whether it is from two to seven or from 18 to 23, yet most educators would agree that the amount of growth between those scores is different. Most social scientists acknowledge that the data from their test scores are not truly interval data and are in fact ordinal, but some ignore this requirement and still use parametric statistics in answering their research questions. However, because each of the easiest and each of the most difficult items contributes the same value to the examinee’s total score, it is unlikely that the resulting scores have the properties of interval data.
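To make the nonlinearity concrete, the short sketch below (our own illustration, not part of the original studies) assumes a hypothetical 25-item test and converts raw scores to log-odds units with the standard logit transformation; equal five-point raw gains taken at different points in the score range do not correspond to equal gains on the log-odds scale.

```python
import math

def raw_score_to_logit(raw_score, n_items):
    """Log-odds of success for a raw score on a test of equally weighted items.

    Extreme scores (0 or n_items) have undefined log-odds and are avoided here.
    """
    p = raw_score / n_items              # proportion of items answered correctly
    return math.log(p / (1 - p))         # natural logarithm of the odds of success

N_ITEMS = 25                             # hypothetical test length

# Three raw gains of exactly five points each, echoing the article's example.
for pretest, posttest in [(2, 7), (10, 15), (18, 23)]:
    gain = raw_score_to_logit(posttest, N_ITEMS) - raw_score_to_logit(pretest, N_ITEMS)
    print(f"raw gain {pretest} -> {posttest}: logit gain = {gain:.2f}")

# The printed gains differ (roughly 1.50, 0.81, and 1.50 logits), even though
# every raw gain is the same five points, so raw scores are not interval data.
```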

A better approach is to use Rasch measurement. When a Rasch model of item response theory measurement is used, the raw scores are converted to interval‐level data, and the parameter estimates for both person ability and item difficulty have the quality of measurement invariance (Engelhard, 2008). That is, when using Rasch statistical analyses to measure a unitary construct, person ability estimates are the same regardless of the difficulty of the items that are presented to the examinees, and the item difficulty estimates are the same regardless of the ability levels of the examinees who respond to them. These properties are highly valuable when trying to validate a criterion‐based scale; they also address a major criticism of research in the human sciences that the data have been treated as interval data when they are more likely ordinal (Bond & Fox, 2007).

When Rasch calculations transform person ability and item difficulty estimates into interval data, the resulting measurement units are called logits. Logits (or log odds ratios) are the natural logarithm of odds ratios of success. By being transformed to a log odds ratio, the measures are now interval data, have additive properties, and can then be transformed back into probabilities (Linacre, 1991). Georg Rasch, the Danish mathematician who developed the measurement model, stated:

In simple terms, the principle is that a person having a greater ability than another person should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another means that for any person the probability of solving the second item is the greater one. (Rasch, 1960, p. 117)

With Rasch measurement, person ability and item difficulty are measured conjointly so that an examinee with a person ability estimate of a given logit value will have a 0.50 probability of correctly answering an item with a difficulty parameter of that same value. For instance, if an examinee has a person ability estimate of 1.00 logits and a question has an item difficulty parameter of 1.00 logits, the odds that that person will respond to that prompt correctly are 1 to 1, that is, a 50–50 chance of answering correctly. The result of this transformation is that if an examinee has a pretest logit of −0.50 and a posttest logit of 0.25, the trait ability has increased by 0.75 logits. As logits are interval, that 0.75 difference would indicate the same difference in ability, be it from −3.00 to −2.25 or from 1.75 to 2.50.
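The conjoint relationship between person ability and item difficulty can be written as the standard dichotomous Rasch model. The sketch below is a minimal illustration of that formula (the values are hypothetical): when ability equals difficulty the success probability is 0.50, and a 0.75-logit ability gain multiplies the odds of success by the same factor, exp(0.75), wherever it occurs on the scale.

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(correct) = exp(B - D) / (1 + exp(B - D)),
    where B is person ability and D is item difficulty, both in logits."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# Ability equal to difficulty gives 1:1 odds, i.e., a 0.50 probability of success.
print(rasch_probability(1.00, 1.00))                 # 0.5

# A 0.75-logit gain (e.g., -0.50 to 0.25) against a hypothetical item at 0.00 logits.
for ability in (-0.50, 0.25):
    p = rasch_probability(ability, 0.00)
    print(f"ability {ability:+.2f}: P = {p:.3f}, odds = {p / (1 - p):.3f}")

# The ratio of the two printed odds is exp(0.75), regardless of the item chosen.
```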

Finally, Rasch measurement provides additional tools to use in determining the reliability of test scores. Reliability is the ratio of the true variance to the observed variance. Unlike classical test theory that only reports a single reliability value across all of the items and examinees in a test setting (e.g., Cronbach’s alpha or Kuder‐Richardson 20), Rasch reliability reports the relative reproducibility of the test’s results. Furthermore, Rasch reliability provides those estimates for every facet that is being measured. When the reliability is close to 1.0, it indicates that the observed variance of what is being measured (person or item) is close to or nearly equivalent to the immeasurable true variance. Thus, if person reliability is close to 1.0, it means the differences in examinee scores are due to differences in examinee ability. If the item separation reliability is close to 1.0, it is an indication that differences in the item difficulty parameters are due to differences in item difficulty.
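As an illustration of the “true variance over observed variance” idea, the sketch below estimates a separation reliability from a set of measures and their standard errors, approximating true variance as observed variance minus mean error variance. This follows the general logic of Rasch separation reliability; it is our own simplified example and does not reproduce WINSTEPS's exact computation.

```python
import statistics

def separation_reliability(measures, standard_errors):
    """Ratio of (approximate) true variance to observed variance of the measures.

    True variance is approximated as the observed variance of the estimates minus
    the mean error variance, so reliability approaches 1.0 when measurement error
    is small relative to the spread of the measures.
    """
    observed_variance = statistics.pvariance(measures)
    error_variance = statistics.mean(se ** 2 for se in standard_errors)
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance

# Hypothetical item difficulty estimates (in logits) and their standard errors.
difficulties = [-1.3, -0.8, 0.1, 0.4, 1.2, 1.5]
errors = [0.15, 0.14, 0.12, 0.12, 0.16, 0.18]
print(round(separation_reliability(difficulties, errors), 2))
```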

Operationalizing the Guidelines for Item Writers

In order to create tests that could validate the scales for this study, items needed to be written that were both based on the guidelines and aligned with the construct’s unidimensional definition. In order to accomplish this, there needed to be a definition of proficient reading that spanned the entire scale, and item writers also needed training on how to select texts and write items that aligned with the scale.

Proficient Reading

Proficient reading was defined as the active, automatic, far‐transfer process of using one’s internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written (see Figure 1). This definition of proficient reading can be applied to each ACTFL proficiency level and allows the definition of multistage manifestations of the unidimensional trait that conform to the developmental theory upon which the guidelines are based.

FIGURE 1

Definition of Proficient Reading

Proficient reading: The active, automatic, far-transfer process of using one’s internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written.


Aligning Task, Condition, and Accuracy to the Scale

That definition of proficient reading was then applied to each level described in the proficiency guidelines to ascertain the tasks and conditions that would be assessed at each level. Following Luecht’s (2003) recommendation that one should define major stages or levels of progress along the continuum of language ability, each stage or step was defined as a separate performance standard with its own combination of aligned communication expectations. These specifications are summarized in Figure 2, where the author’s purpose, based on Child’s (1987) text modes, is aligned with the types of texts typically used to accomplish those tasks and where the reader’s task is aligned with the author’s purpose for creating the text. The Novice level is not included in the figure, as it is characterized by the reader’s inability to sustain the Intermediate level. The Distinguished level is included to provide a ceiling for the Superior level, but at this point there have been no requirements to test at that level.

As with the other skill modalities described in the ACTFL Proficiency Guidelines, the reading proficiency ratings are noncompensatory.

FIGURE 2

Overview of Purpose, Text, and Task Alignment by Level

Intermediate (1)
Author Purpose: Orient by communicating main ideas.
Text Type: Simple, short sentences with simple vocabulary. Often, sentences may be resequenced without changing the meaning of the text. Text organization is loose without much cohesion, but follows societal norms.
Reader Task: Orient oneself by identifying the main topics, ideas, or facts.

Advanced (2)
Author Purpose: Instruct or inform by communicating organized factual information.
Text Type: Connected factual discourse with compound and complex sentences dealing with factual information. Sentences are sequenced within cohesive paragraphs, but some paragraphs may be resequenced without changing the meaning of the text. The identity of the author is not important.
Reader Task: Understand not only the central facts, but also the supporting details such as temporal and causative relationships.

Superior (3)
Author Purpose: Evaluate situations, concepts, and conflicting ideas; present and support arguments and/or hypotheses with both factual and abstract reasoning, using language that is often accompanied by the appropriate use of wit, sarcasm, irony, or emotionally laden lexical choices.
Text Type: A multiple-paragraph block of discourse on a variety of unfamiliar or abstract subjects such as might be found in editorials, official correspondence, and professional writing. References may be made to previous paragraphs, external events, common cultural values, etc. The “voice” of the author is evident.
Reader Task: Learn by relating ideas and conceptual arguments. Comprehend the text’s literal and figurative meanings by reading both “the lines” and “between the lines” to recognize the author’s tone and infer the author’s intent.

Distinguished (4)
Author Purpose: Project lines of thought beyond the expected, connect previously unrelated ideas or concepts, and present complex ideas with nuanced precision and virtuosity—often with the goal of propelling the reader into the author's world of thought.
Text Type: Extended discourse that is tailored for the message being sent and for the intended audience. To achieve the desired tone and precision of expression, the author will often demonstrate the skillful use of low-frequency vocabulary, cultural and historical insights, and an understanding of the audience's shared experiences and values.
Reader Task: Read “beyond the lines” to understand the author’s sociolinguistic and cultural references, follow innovative turns of thought, and interpret the text in view of its wider cultural, societal, and political setting.


To qualify for a given rating, the individual must consistently meet all of the construct criteria for that level (Swender & Vicars, 2012, p. 5). In assessing reading proficiency, this requirement means that the reader must be able to consistently comprehend texts of the specified type for the purpose for which they were created. Stated another way, the reader must successfully accomplish those comprehension tasks that are aligned with the author’s purpose. Blended or nonaligned combinations of reader and author purpose are possible, and when they occur, they may be useful in assigning sublevel ratings. However, nonaligned combinations are not useful when assigning major‐level proficiency ratings, where consistent performance across the aligned factors for each level is expected. Thus, testing whether the reader can perform lower‐level reading tasks on higher‐level text passages provides insufficient information to assign a proficiency rating. For example, understanding the main idea (an Intermediate‐level task) of a Superior‐level text would not qualify the reader for a Superior or even an Advanced proficiency rating. Again, the expectation of the reading guidelines is that the author’s purpose and the reader’s task are congruent. If the author’s purpose is to narrate a story, then the reader’s purpose is to comprehend the details of that narration.
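One way to picture noncompensatory, criterion-referenced rating is sketched below. This is our own illustration, not ACTFL's or the authors' scoring procedure: the "consistently meets the criteria" judgment is reduced here to a single boolean per level, and a rating is awarded only for the highest level whose criteria, and all lower levels' criteria, are fully met.

```python
# Ordered from lowest to highest major level considered in the studies.
LEVELS = ["Intermediate", "Advanced", "Superior"]

def noncompensatory_rating(meets_criteria):
    """Return the highest level whose criteria, and all lower levels' criteria, are met.

    meets_criteria maps each level to True only when the reader consistently
    accomplishes the aligned reading tasks for that level; strength at a higher
    level cannot compensate for a gap at a lower one.
    """
    rating = None
    for level in LEVELS:
        if not meets_criteria.get(level, False):
            break                      # a gap at this level caps the rating here
        rating = level
    return rating

# A reader who handles Intermediate and Advanced tasks consistently, but not
# Superior ones, is rated Advanced; isolated Superior successes would not raise it.
print(noncompensatory_rating({"Intermediate": True, "Advanced": True, "Superior": False}))
```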

It should be noted that this restriction of the tested tasks to those that align with the author’s purpose distinguishes assessment practices from teaching practices, because proficiency testing places the focus on whether readers can comprehend texts rather than on the instructional focus of how readers comprehend texts at each of the tested levels.

Background Summary

The reading proficiency scales have been in use for a number of years as a practical way to describe overall language ability level, yet little research has been found to validate their use. The research that has been conducted seems to contradict the practical findings of those who use the guidelines as the basis for their tests. The reasons for the failure to validate operational experience could come from a number of factors, including a failure to understand the scale, a failure to define the construct unidimensionally, applying an inappropriate measurement theory, and not having a clear definition of proficient reading that spans all levels of the scale. In order to research the extent to which test items (where both the passage and the question are based on the ACTFL Reading Proficiency Guidelines) ascend in a hierarchy of difficulty levels, those factors would need to be controlled.

Method

To investigate the validity of the reading scales and to answer the research question about the extent to which test items carefully based on the ACTFL Reading Proficiency Guidelines ascend in a hierarchy of difficulty levels, reading test items targeted at the major levels of Intermediate and higher were created in three different languages (English, Chinese, and Spanish).

The item‐writing process included training item writers to create items that aligned with the scale. Once the items were developed, the test development team reviewed them for alignment with the targeted proficiency level and trialed them with a small representative sample of examinees. The final step was empirical testing of the items to determine whether their statistical difficulties clustered by level and were arrayed in the hypothesized order. The process was conducted first in English for reading tests requested by NATO and was later replicated in Chinese and Spanish. A description of each instrument, the number of questions targeting each major proficiency level, and the subjects who took the test are presented in the Results section below. Data analysis procedures were chosen based on the type of data obtained from the administrations of each test.

Training the Item Writers

To see if test items could be developed that aligned with the ACTFL/ILR/NATO reading proficiency guidelines, the following procedure was followed in the three different languages. First, item writers were taught about the construct of reading proficiency that would be applied. Second, they were presented with the proficiency scales and shown how the task, conditions, and accuracy criteria were manifest at each level. Third, they were taught to select appropriate texts for each level, to write questions that aligned the reader’s task with the author’s purpose, and to create plausible response distractors that aligned with each level’s expectations.

Internal Validation

Before the test items were administered, they went through another review process where the item writers were asked to ensure that the questions being asked were aligned with the author’s communicative purpose, that the reading passage was aligned with the typical characteristics of texts used for that purpose, and that the intended difficulties of that combination corresponded to a level in the proficiency scale. As a vehicle to structure that review, the item writers were asked to look at each item and to estimate each item’s:

• At‐level difficulty
• Difficulty for those with proficiency at one level lower than the item
• Difficulty for learners at one level higher than the item
• Discrimination index across levels

Any item judged to lack at‐level alignment was revised. The items were then added to the pool of items to be included in the study and subsequently administered to learners of varying ability levels.

Data Analysis Procedures

The results were then analyzed using the software program WINSTEPS (Linacre, 2012), which is based on Rasch measurement. As noted earlier, the data from Rasch measurement are interval‐level and can be analyzed using parametric statistics. Demonstrating that the proficiency levels in the guidelines represent a hierarchy of stages or steps of increasing difficulty requires that three conditions be met:

• The items designed to measure specific proficiency levels should cluster at their intended difficulty levels.
• The mean difficulty of each of those item clusters should be statistically different from the mean difficulty values of the other clusters of items.
• The mean values of those item clusters should be arrayed in a hierarchy of increasing difficulty.

After each test was analyzed for reliability through person and item separation statistics with Rasch measurement, the following procedures were used to test for these conditions. For the English and Chinese studies, a one‐way ANOVA was used to determine whether the item difficulty logits differed by intended proficiency level. The comparisons between levels were reported using a Bonferroni post hoc test (Larson‐Hall, 2010) that defined the mean difference, its 95% confidence interval (CI), and the p value for each comparison. Because the specifications for the Spanish test included only two proficiency levels, an independent samples t test was used with those data.

The null hypothesis was that there would be no difference in the mean difficulty between the groups. To gauge how strongly the intended proficiency level of the items influenced the differences in mean difficulty between the item clusters, the effect size was calculated and reported with Cohen’s d, which indicates the number of standard deviations that separate the means being compared.
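The analysis just described can be sketched as follows. This is not the authors' actual WINSTEPS/statistical workflow, only a minimal Python example using SciPy and invented logit values: a one-way ANOVA across three intended levels, pairwise comparisons with a simple Bonferroni correction, Cohen's d as the effect size, and a Welch t test for the two-level case.

```python
import math
import statistics
from scipy import stats

# Invented item difficulty estimates (in logits), grouped by intended level.
intermediate = [-1.9, -1.4, -1.2, -0.9, -0.6]
advanced = [-0.3, 0.0, 0.2, 0.4, 0.7]
superior = [1.0, 1.3, 1.5, 1.6, 1.9]

# One-way ANOVA: do mean difficulties differ across the intended levels?
f_stat, p_anova = stats.f_oneway(intermediate, advanced, superior)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

def cohens_d(group1, group2):
    """Effect size: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

# Pairwise contrasts with a simple Bonferroni correction
# (each p value is multiplied by the number of comparisons).
pairs = [("Intermediate vs. Advanced", intermediate, advanced),
         ("Advanced vs. Superior", advanced, superior),
         ("Intermediate vs. Superior", intermediate, superior)]
for label, g1, g2 in pairs:
    t, p = stats.ttest_ind(g1, g2)
    p_bonferroni = min(p * len(pairs), 1.0)
    print(f"{label}: t = {t:.2f}, Bonferroni p = {p_bonferroni:.4f}, "
          f"d = {cohens_d(g1, g2):.2f}")

# Two-level case (as in the Spanish study): Welch's t test, which does not
# assume equal group variances.
t_welch, p_welch = stats.ttest_ind(intermediate, advanced, equal_var=False)
print(f"Welch t test: t = {t_welch:.2f}, p = {p_welch:.4f}")
```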

Language Study Results

Items were created for tests in three languages: English, Chinese, and Spanish. The tests were created for different projects, and thus the structure of each of the tests was slightly different; however, the item creation process was consistent across all of the languages. These tests were then administered to different groups of examinees, and the Rasch measurement statistics were calculated for each language test.

English Study

The English test was originally created for NATO to assess whether personnel had sufficient English skills to succeed in a multinational work environment where English was the common language. The NATO version of the test was finished in 2009, and in 2010 it was administered to 182 NATO personnel from 12 different nations. The test was later administered to an even greater number of students in university settings. The test for this study consisted of three ACTFL proficiency levels: Intermediate with 24 items, Advanced with 21 items, and Superior with 19 items. Items were drawn at random, and each examinee took 20 of the 24 items at the Intermediate level, 20 of the 21 items at the Advanced level, and all 19 items at the Superior level (see Note 1). The test was administered to a total of 581 examinees from two distinct populations: personnel associated with NATO and students enrolled in U.S. universities.

The English test had a Rasch IRT person reliability of 0.87, indicating a relatively high level of internal consistency. This level of reliability confirmed that the examinees could be reliably divided into three or four statistically distinct groups (Linacre & Wright, 2009). The item reliability of 0.99 was very high and indicated that the items functioned at distinctly separate levels of difficulty (Linacre & Wright, 2009). The means of each of the intended difficulty levels progressed monotonically (see Table 1).
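The link between a reported reliability and the number of statistically distinct groups it can support can be traced with the usual Rasch separation formulas, separation G = sqrt(R / (1 − R)) and strata = (4G + 1) / 3. The sketch below applies those formulas to the person reliabilities reported in the three studies; it is our own illustration of the arithmetic, not output from the studies' analyses.

```python
import math

def separation_index(reliability):
    """Separation G: the 'true' spread of the measures relative to their average error."""
    return math.sqrt(reliability / (1 - reliability))

def strata(reliability):
    """Number of statistically distinct levels the measures can support: (4G + 1) / 3."""
    return (4 * separation_index(reliability) + 1) / 3

# Person reliabilities reported for the English, Chinese, and Spanish tests.
for r in (0.87, 0.83, 0.80):
    print(f"reliability {r:.2f}: separation = {separation_index(r):.2f}, "
          f"strata = {strata(r):.1f}")
```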

Comparisons using Bonferroni’s contrasts found statistical differences between the Intermediate and Advanced items (mean difference = −1.47 logits, a 95% CI between −1.83 and −1.10, and p < 0.001) and between the Advanced and Superior items (mean difference = −1.23 logits, a 95% CI between −1.61 and −0.84, and p < 0.001). Figure 3 shows the differences between the levels. Although some of the items between the levels had difficulty logits that did overlap, as a whole the items by level were statistically different. The null hypothesis that the difference in the means between the groups was zero could be rejected for all comparisons. The effect sizes between adjacent levels were very strong: for Intermediate items vs. Advanced items, Cohen’s d = 2.26, and for the Advanced items vs. Superior items, d = 2.09. Thus the intended proficiency level of the items had a strong effect on the empirical item difficulty level.

Chinese Study

This test was initially designed for students enrolled in U.S. universities studying Chinese as a foreign language. The subjects were 211 students from 10 different schools.

TABLE 1

Descriptive Statistics of English Item Difficulty by Intended Proficiency Level

Level          # of Items   Mean Logit Value   SD
Intermediate   24           −1.28              0.63
Advanced       21           0.19               0.69
Superior       19           1.42               0.46
Total items    64


The test had 70 items that were divided into the three proficiency levels: Intermediate, Advanced, and Superior. The items were administered adaptively in five‐item testlets, and a sufficient number of students answered questions at all levels to calculate comparative statistics. There were six testlets (a total of 30 items) for both the Intermediate level and the Advanced level, and two testlets (a total of 10 items) for the Superior level. Students needed to provide evidence of sustained performance at one level before they were presented testlets at a higher level. Students only answered a subset of the testlets based on their performance at each level, and the greatest number of items that any student encountered was 55. The minimum number of items any student would encounter was 15, and in that instance, all of the items were at the Intermediate level.
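The adaptive administration can be pictured as a simple routing rule, sketched below. The article does not specify the actual routing details or what counted as "sustained performance," so the three-testlets-per-stage design and the 70% threshold used here are assumptions for illustration only.

```python
import random

ITEMS_PER_TESTLET = 5
TESTLETS_AVAILABLE = {"Intermediate": 6, "Advanced": 6, "Superior": 2}
TESTLETS_PER_STAGE = 3      # assumed number of testlets administered per level
SUSTAINED_THRESHOLD = 0.70  # assumed proportion correct counted as "sustained"

def administer_testlet(p_correct):
    """Simulate one five-item testlet for an examinee with a given success rate."""
    return sum(random.random() < p_correct for _ in range(ITEMS_PER_TESTLET))

def run_multistage_test(success_rate_by_level):
    """Route an examinee upward level by level while performance is sustained."""
    items_taken = 0
    highest_sustained = None
    for level, available in TESTLETS_AVAILABLE.items():
        n_testlets = min(TESTLETS_PER_STAGE, available)
        correct = sum(administer_testlet(success_rate_by_level[level])
                      for _ in range(n_testlets))
        items_taken += n_testlets * ITEMS_PER_TESTLET
        if correct / (n_testlets * ITEMS_PER_TESTLET) < SUSTAINED_THRESHOLD:
            return highest_sustained, items_taken   # stop: level not sustained
        highest_sustained = level
    return highest_sustained, items_taken

# Example: strong at Intermediate, borderline at Advanced, weak at Superior.
print(run_multistage_test({"Intermediate": 0.90, "Advanced": 0.60, "Superior": 0.30}))
```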

The test had a fairly high internal consistency, with a Rasch person reliability of 0.83, indicating that the examinees could be reliably divided into two or three statistically distinct groups (Linacre & Wright, 2009). The item reliability of 0.92 was very high and indicated that the items functioned at distinctly separate levels of difficulty. The means of each of the intended difficulty levels progressed monotonically (see Table 2).

Comparisons using Bonferroni’s contrasts found statistically significant differences between the Intermediate and Advanced items (mean difference = −1.44, with the 95% CI between −2.35 and −0.53 and p < 0.01), and the Advanced and Superior items had a mean difference of −2.14, with a 95% CI between −3.42 and −0.86 and p < 0.001. Figure 4 shows the differences between the levels. Although the Intermediate‐level items had a slightly skewed distribution, the skew was not severe enough to make parametric statistics inappropriate. As with the English test, some of the items between the levels had difficulty logits that overlapped, but as a whole, the items by level were statistically different. The null hypothesis that the difference in the means between each of the groups was zero could be rejected.

FIGURE 3

Boxplot Distribution of English Item Difficulty by Intended Proficiency Level


The effect size among the three levels was very strong: for Intermediate items vs. Advanced items, d = 0.98, and for the Advanced items vs. Superior items, d = 2.50. Thus the intended proficiency level of the items had a strong effect on the empirical item difficulty level.

Spanish Study

This test was initially designed for students enrolled in U.S. universities studying Spanish as a foreign language and was designed to measure only through the Advanced level. The subjects were 550 students from four different schools. The test had 57 items that were divided into the two proficiency levels: Intermediate and Advanced. The test was administered in two forms (A and B) that each had 34 items, including 14 Intermediate‐level items and 20 Advanced‐level items. Each form shared five Intermediate items and five Advanced items for a total of 10 anchor items on both forms.

TABLE 2

Descriptive Statistics of Chinese Item Difficulty by Intended Proficiency Level

Level          # of Items   Mean Logit Value   SD
Intermediate   30           −1.22              1.93
Advanced       30           0.22               0.83
Superior       10           2.35               0.86
Total          70

FIGURE 4

Boxplot Distribution of Chinese Item Difficulty by Intended Proficiency Level


The test had a Rasch person reliability of 0.80, indicating that there was a relatively high level of internal consistency and that the examinees could be reliably divided into two to three statistically distinct groups. The item reliability of 0.98 was very high and indicated that the items functioned at distinctly separate levels of difficulty. The mean of the Advanced items was much higher than the mean of the Intermediate items (see Table 3).

An independent samples t test between the Intermediate and Advanced items was conducted to determine if the two groups of items differed in item difficulty. An examination of the data indicated that while the Intermediate items were normally distributed, the Advanced items were slightly skewed (see Figure 5); thus Welch’s procedure was used. The 95% CI for the difference in the means was between −2.20 and −1.50 (t = −9.99, p < 0.001, df = 38.9). The null hypothesis that the difference in the means was zero could be rejected, and the items’ intended proficiency level had a strong effect (d = 2.77) on the empirical item difficulty.

TABLE 3

Descriptive Statistics of Spanish Item Difficulty by Intended Proficiency Level

Level          N    Mean    SD
Intermediate   24   −1.07   0.77
Advanced       33   0.77    0.54
Total          57

FIGURE 5

Boxplot Distribution of Spanish Item Difficulty by Intended Proficiency Level


Language Studies Summary

This research explored the extent to which test items ascend in a hierarchy of difficulty levels when both the passage and question are based on the ACTFL Reading Proficiency Guidelines. In each of the three studies, the intended proficiency levels of the items resulted in statistically significantly different item difficulty levels. This means that we can reject the null hypothesis that there is no difference between the levels of the guidelines and can support the argument that the guidelines ascend in a hierarchy of difficulty.

It is noteworthy that no items were excluded from the analysis. In every test development process, items that do not function due to poor discrimination or malfunctioning distractors are excluded from the final tests. However, as this study was focused on the items and their intended alignment, poorly functioning items were not excluded. As would be expected when dealing with multiple‐choice questions, some of the items in these studies did not function as intended. In some instances, distractors were too attractive for the proficiency level for which the question was intended, which led to the item being more difficult than the proficiency scale warranted. In other instances, distractors were so much easier than the intended proficiency level that the examinees were drawn to the correct response even though the question and passage were beyond the students’ ability. This led to some items being easier than expected. To avoid any perception of bias, these items were not eliminated from our analyses. Had the malfunctioning items been removed from the analyses—as would routinely have been done in typical test development projects—the differences between the empirical difficulty levels would have been even greater.

For instance, with the Chinese test, if we were to eliminate 10 items from the Intermediate and 10 items from the Advanced item pools, the remaining data would have had even greater separation between the levels (see Table 4), with very little possibility of the items from the intended levels overlapping with each other (see Figure 6). A related observation is that given sufficient training and time, item writers can indeed write level‐specific items that match their targeted proficiency levels and empirically align with the hierarchical progression of proficiency scales.

Discussion

The results of this study support the difficulty hierarchy posited by the ACTFL, ILR, and related reading proficiency guidelines. It is not clear why previous analyses did not obtain these results, but any of the following factors may have obscured the relationships found in this study:

• Failure to align the task and conditions as described at each skill level
• Failure to apply noncompensatory, criterion‐referenced scoring
• Failure to include subjects with a full range of abilities
• Failure to convert the results to interval data before conducting comparative analyses

TABLE 4

Descriptive Statistics of Chinese Item Difficulty by Intended Proficiency Level

Level          N    Mean    SD
Intermediate   20   −2.07   1.81
Advanced       20   0.37    0.48
Superior       10   2.75    0.92
Total          70


For instance, it appears that in the research conducted by Lee and Musumeci (1988), questions of varying task levels were associated with each text sample and that level‐by‐level criterion scoring was not applied (p. 175). In addition, their Figure 6 (p. 179) indicated that while the researchers hypothesized that the students would improve a full proficiency level with every semester of language study, the percentage of correct responses by level would suggest that the only level where the students demonstrated sustained performance was the Intermediate level, or Level 1. If the higher‐level questions were beyond the students’ sustained abilities, their scores on those items might have represented the problem‐solving skills observed by Rupp et al. (2006)—and not the relative difficulty of the texts and/or the reading tasks presented.

Implications for Testers and Instructors

Testers wishing to assess curriculum‐independent, real‐world reading proficiency according to the ACTFL Proficiency Guidelines should recognize that blended task and text combinations drawn from different proficiency levels provide insufficient information to assign criterion‐referenced proficiency ratings. Therefore, they should treat each major level as a separate set of assessment criteria and ensure both that those criteria are aligned with the author purpose, text characteristics, and reader tasks described for that level and that noncompensatory, criterion‐referenced scoring criteria are used when assigning ratings.

The implications for instructors are different from those for testers. Whereas blended task and text combinations drawn from different proficiency levels are inappropriate for the assigning of criterion‐referenced proficiency ratings, those same blended combinations provide useful ramps and scaffolding that can help students progress toward higher proficiency levels.

FIGURE 6

Chinese Item Difficulty Means With 95% Confidence Interval by Intended Proficiency Level


Therefore, instructors desiring to raise the proficiency level of their students should begin by establishing the students’ base level of sustained ability and then use scaffolding techniques to incrementally introduce features from the next higher major proficiency level.

“Scaffolding” is the label that is often applied to the commonsense teaching practice of helping learners to understand and master difficult or complex concepts or skills by selectively teaching the components of the targeted new ability one piece at a time. Then, after the pieces are in place, learners are asked to integrate those skills to carry out a more complex and challenging task. Application of such a step‐by‐step process makes the learning tasks more manageable and is often the most effective way to move learners from the known (what they currently know or are able to do) to the unknown (what they yet need to know or do).

For example, a group of students may be able to understand the main idea of sentence‐length questions and answers about topics in their immediate surroundings, but they are not yet able to understand the details of real‐world communications such as newspaper articles. To be able to understand authentic news reports, those students will have to acquire the lexicon needed to describe real‐world events, learn the verb tenses needed to place events into correct temporal relationships, and develop the morphological sensitivities needed to comprehend the relationships described. Instructors will often teach each of these contributing skills separately and give achievement tests to assess students’ progress in each area. This teaching approach has clear pedagogical advantages, and the tests of the isolated skills can have diagnostic value. However, the ability to pass a test on the vocabulary used in a report of a particular current event does not prove that the student can read articles on that topic: Even the mastery of all the contributing skills does not guarantee that the learner has developed and integrated all of the abilities needed to efficiently read and comprehend newspaper articles. While each of these multiple contributing skills is necessary, they remain enabling skills rather than proof of general reading proficiency. Thus, there is a role both for classroom‐based progress tests, where the elements being tested are not fully aligned with the targeted proficiency level, and for culminating general proficiency tests, where author purpose, text type used, and reader task are all aligned.

From a proficiency perspective, the fact that a reader can get the main idea (an Intermediate‐level task) of a text generated for an Advanced communication purpose does not indicate Advanced reading ability. Therefore, instructors should periodically administer proficiency tests to evaluate the effectiveness of the instructional scaffolding techniques in promoting an increase in general reading proficiency. When the reader can accomplish Advanced comprehension tasks at the Advanced level, there is evidence of Advanced reading proficiency.

Future Research

The authors are proceeding to apply the lessons learned while validating the ACTFL Reading Proficiency Guidelines to create tests of reading proficiency in multiple languages. Inherent in that development process is the need to conduct theoretical research related to the assigning of sublevel proficiency scores, as well as practical research into the optimal design of computer‐adaptive tests of reading proficiency.

Note

1. There were originally 20 Superior items; however, one of the passages was eliminated because of changes in global politics that made it culturally inappropriate.

References

ACTFL. (2012). ACTFL Proficiency Guidelines 2012. White Plains, NY: Author.


Alderson, J. C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press.

Bernhardt, E. (2011). Understanding advanced second‐language reading. New York: Routledge.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.

Brown, J. D., & Hudson, T. (2002). Criterion‐referenced language testing. Cambridge, UK: Cambridge University Press.

Child, J. (1987). Language proficiency levels and the typology of texts. In H. Byrnes & M. Canale (Eds.), Defining and developing proficiency: Guidelines, implementations and concepts (pp. 97–106). Lincolnwood, IL: National Textbook.

Clifford, R. (1981). Convergent and discriminant validation of integrated and unitary language skills: The need for a research model. In A. Palmer, P. Groot, & G. A. Trosper (Eds.), The construct validation of tests of communicative competence (pp. 62–70). Washington, DC: Teachers of English to Speakers of Other Languages.

Engelhard, G., Jr. (2008). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement, 6, 155–189.

Herzog, M. (n.d.). How did the language proficiency scale get started? Retrieved December 15, 2012, from http://www.govtilr.org/Skills/IRL%20Scale%20History.htm

Interagency Language Roundtable (ILR). (2012). ILR language proficiency skill level descriptions. Retrieved April 2, 2013, from http://www.govtilr.org/skills/ilrscle1.htm

Ip, E. H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. The British Journal of Mathematical and Statistical Psychology, 63, 395–416.

Larson‐Hall, J. (2010). A guide to doing statistics in second language research using SPSS. New York: Routledge.

Lee, F. L., & Musumeci, D. (1988). On hierarchies of reading skills and text types. Modern Language Journal, 72, 173–185.

Lee, Y. W. D. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5, 37–72.

Lewin, K. (1952). Field theory in social science: Selected theoretical papers by Kurt Lewin. London: Tavistock.

Linacre, J. M. (1991). Log‐odds in Sherwood Forest. Rasch Measurement Transactions, 5, 162–163 (Electronic version). Retrieved April 8, 2013, from http://www.rasch.org/rmt/rmt53d.htm

Linacre, J. M. (2012). Winsteps® (Version 3.75.0) [Computer software]. Beaverton, Oregon: Winsteps.com. Retrieved January 1, 2012. Available from http://www.winsteps.com/

Linacre, J. M., & Wright, B. D. (2009). A user’s guide to WINSTEPS. Chicago: WINSTEPS.

Luecht, R. (2003). Multistage complexity in language proficiency assessment: A framework for aligning theoretical perspectives, test development, and psychometrics. Foreign Language Annals, 36, 527–535.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedogogiske Institut.

Rupp, A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple‐choice questions shapes the construct: A cognitive processing perspective. Language Testing, 23, 441–474.

Smith, G. (2007). Understanding reading: A psycholinguistic analysis of reading and learning to read (6th ed.). Mahwah, NJ: Lawrence Erlbaum.

STANAG 6001. (2010). NTG—Language proficiency levels (4th ed.). Retrieved April 8, 2013, from http://www.ncia.nato.int/Opportunities/BizOppRefDoc/Stanag_6001_vers_4.pdf

Surface, E., & Dierdorff, E. (2003). Reliability and the ACTFL oral proficiency interview: Reporting indices of interrater consistency and agreement for 19 languages. Foreign Language Annals, 36, 507–519.

Swender, E., & Vicars, R. (2012). ACTFL oral proficiency interview tester training manual. White Plains, NY: ACTFL.

Submitted December 26, 2012

Accepted January 18, 2013
