
Inference and Generalizability in Applied Linguistics


Language Learning and Language Teaching

The LL&LT monograph series publishes monographs as well as edited volumes on applied and methodological issues in the field of language pedagogy. The focus of the series is on subjects such as classroom discourse and interaction; language diversity in educational settings; bilingual education; language testing and language assessment; teaching methods and teaching performance; learning trajectories in second language acquisition; and written language learning in educational settings.

Series editors

Nina Spada
Ontario Institute for Studies in Education, University of Toronto

Jan H. Hulstijn
Department of Second Language Acquisition, University of Amsterdam

Volume 12

Inference and Generalizability in Applied Linguistics: Multiple perspectives
Edited by Micheline Chalhoub-Deville, Carol A. Chapelle and Patricia Duff

Inference and Generalizability in Applied Linguistics
Multiple perspectives

Edited by

Micheline Chalhoub-Deville
University of North Carolina, Greensboro

Carol A. Chapelle
Iowa State University

Patricia Duff
University of British Columbia

John Benjamins Publishing Company

Amsterdam / Philadelphia

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data

Inference and Generalizability in Applied Linguistics : Multiple perspectives / edited by Micheline Chalhoub-Deville, Carol A. Chapelle and Patricia A. Duff.

p. cm. (Language Learning and Language Teaching, issn 1569–9471; v. 12)

Includes bibliographical references and indexes.
1. Applied linguistics--Research. 2. Inference. 3. Reliability. I.

Chalhoub-Deville, Micheline. II. Chapelle, Carol. III. Duff, Patricia A. IV. Series.

P129.I455 2006
418.072--dc22 2005055573
isbn 90 272 1963 X (Hb; alk. paper)
isbn 90 272 1964 8 (Pb; alk. paper)

© 2006 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.

John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA


Table of contents

Drawing the line: The generalizability and limitations of research in applied linguistics 1

Micheline Chalhoub-Deville

I. Perspectives on inference and generalizability in applied linguistics

Old and new thoughts on test score variability: Implications for reliability and validity 9

Craig Deville and Micheline Chalhoub-Deville

Validity and values: Inferences and generalizability in language testing 27

Tim McNamara

L2 vocabulary acquisition theory: The role of inference, dependability and generalizability in assessment 47

Carol A. Chapelle

Beyond generalizability: Contextualization, complexity, and credibility in applied linguistics research 65

Patricia A. Duff

Verbal protocols: What does it mean for research to use speaking as a data collection tool? 97

Merrill Swain

Functional grammar: On the value and limitations of dependability, inference, and generalizability 115

Diane Larsen-Freeman

A conversation analytic perspective on the role of quantification and generalizability in second language acquisition 135

Numa Markee


II. Discussion

Generalizability: A journey into the nature of empirical research in applied linguistics 165

Lyle F. Bachman

Generalizability: What are we generalizing anyway? 209

Susan Gass

Negotiating methodological rich points in applied linguistics research: An ethnographer’s view 221

Nancy H. Hornberger

Author index 241

Subject index 245


Drawing the line

The generalizability and limitations of research in applied linguistics

Micheline Chalhoub-Deville
University of North Carolina, Greensboro

Research in applied linguistics investigates a range of complex issues and consequently it relies on equally complex approaches to research methodology. Methods for obtaining insights and evidence about language issues critically affect the outcomes of research and are therefore an important object for discussion and reflection. For a number of years, the joint sessions of the International Language Testing Association (ILTA) and the American Association for Applied Linguistics (AAAL) have provided a forum for such discussion and reflection. Many of the chapters in this volume originated in the session at the 2002 meeting of the AAAL in Salt Lake City. Following tradition for the joint ILTA-AAAL session, this one brought together researchers working in different areas of applied linguistics to share their views and debate issues.

The topic for the 2002 meeting, “Drawing the line: The generalizability and limitations of research in applied linguistics,” asked participants to explicate the perspectives underlying their approaches to data collection, analysis, interpretation of results, and generalizability of findings. In particular, participants were to discuss their views on the meaning and relevance of (1) dependability in data collection and analysis, (2) inferences made on the basis of observations, and (3) generalization of research results. Participants, most of whom have contributed to this volume, were asked to address the following questions with respect to their specific area of investigation:

1. How does research in your area address the dependability and generalizability of elicited judgments and observations?


2. How does research in your area deal with the appropriateness of inferences made based on observed learner performances?

3. How are issues of dependability, generalizability, and appropriateness of inferences dealt with in diverse research paradigms prevalent in your area?

The three terms dependability, generalizability, and inference are obviously central to these questions. Although the purpose of this volume is to define and discuss these terms from a variety of perspectives, I offer a brief orientation to each of the concepts as a point of departure. My own education, background, and biases in language assessment are readily evident in both the concepts selected for discussion and how I define them.

Dependability

Dependability or reliability, in a broad sense, refers to the consistencies of data, scores, or observations obtained using elicitation instruments, which can include a range of tools from standardized tests administered in educational settings to tasks completed by participants in a research study. From an assessment perspective, the random variation obtained from performance is called “error,” which limits the degree of reliability, or dependability, of the results obtained from the instrument. One of the preoccupations of researchers in assessment is to identify sources of error variance that might impinge on assessment results.

One such source of error is the elicitation method itself. As a researcher, how do I know that the method or instrument I am using to collect my data will produce dependable results? The fact that I, as a researcher, thought the instrument would render results deemed trustworthy does not necessarily mean that it will. Evidence should be provided to document the dependability of results obtained from any instrument used in our investigations.

A second potential source of error is the number of elicitations. There is a potential danger in getting information from only one observation. In principle, the more observations generated, the more confidence one can have in the dependability of results.
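This intuition is formalized in classical test theory by the Spearman–Brown prophecy formula, which projects how reliability changes as the number of parallel observations is scaled up. A minimal sketch in Python; the function name and sample figures are ours, for illustration only:

```python
def spearman_brown(rel_single: float, n: float) -> float:
    """Projected reliability when a measure built from one observation
    is lengthened to n parallel observations (Spearman-Brown formula)."""
    return n * rel_single / (1 + (n - 1) * rel_single)

# Doubling the observations raises a reliability of .50 to about .67;
# quadrupling it raises the projected reliability to .80.
```

The formula makes the principle above concrete: each added observation averages out some random error, so projected dependability rises, though with diminishing returns.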

A third potential source of error can be attributed to the influence of the interlocutor or observer involved in the elicitation. The rater/researcher judging the quality of a performance plays a significant role in the results obtained and therefore can introduce error. In summary, researchers need to identify and document all factors that play a salient role in data collection. The point of identifying these sources of variability is to minimize the error they produce and estimate their magnitude so that results can be interpreted and used appropriately.

Generalizability

Generalizability refers to the extent to which research results can justifiably be applied to a situation beyond the research setting. What are the characteristics of the domain, e.g., the learners and tasks, to which the results apply? The characteristics of the learners under investigation and the larger group to which we want to generalize or transfer our results are issues to consider when discussing generalizability. The comparability of learners in a study to those who were not included should be discussed in terms of the applicability of research results. The methods used for eliciting samples from language learners play a critical role in terms of generalizability because elicitation methods affect the type of data obtained. We, as researchers, need to be aware of instrument effect on the phenomenon under examination and therefore on the resulting observations; findings based on one method/task cannot be assumed to generalize to performance on similar/dissimilar tasks.

Inferences

In language assessment, inference refers to the connection that the researcher makes between observations of learners’ performance and interpretations about the meaning of that performance. The process of inference is not automatic or independently objective – it requires justification backed by evidence. Also critical to the construction of inferences is who is making these inferences and for what purpose. For example, we can argue that test developers and researchers are not the only ones who provide interpretations of scores; test takers and users also interpret meaning from scores. The link between data generated and the interpretations and intended uses requires making a case based on the convergence of theoretical arguments and empirical evidence from multiple sources.

The three concepts – dependability, generalizability, and inference – are dealt with implicitly or explicitly in any research undertaken in applied linguistics. As already indicated, however, one could make the case that the questions, as defined and posed here, are indicative of a particular research paradigm and/or perspective. One may argue that the concepts focused upon in these questions may not be considered particularly relevant from certain research perspectives, or they may be emphasized in different ways. Another consideration is the concern that these questions and concepts stand in opposition to each other. As such, what is the appropriate balance of these concepts? Where are we to draw the line on how much attention to pay to each of these concepts? This volume begins to explore how these concepts come into play differently across various research perspectives. Such cross-paradigm communication is ultimately in the best interest of applied linguistics as a discipline and will hopefully engender increased confidence in the claims made on the basis of our field’s research.

Chapters in this volume

The chapters in this volume are organized into two sections. The first section consists of chapters that address the three questions listed above – each chapter received comments from the editors and was revised. The second section contains discussion chapters. Discussants read and offered commentary on the chapters presented in the first section – discussants had a free hand to comment on and discuss these chapters.

The first chapter, “Old and new thoughts on test score variability: Implications for reliability and validity” by Craig Deville & Micheline Chalhoub-Deville, focuses on the meaning of variability when interpreting scores from an assessment. The second chapter, “Validity and values: Inferences and generalizability in language testing” by Tim McNamara provides extensive discussion of the three terms and issues based on work in language assessment. In the third chapter, Carol Chapelle applies the three concepts to the problem of theory development and assessment of the L2 lexicon in a chapter entitled “L2 vocabulary acquisition theory: The role of inference, dependability and generalizability in assessment.”

The fourth chapter, “Beyond generalizability: Contextualization, complexity, and credibility in applied linguistics research” by Patricia Duff provides a transition by explaining the concerns of quantitative researchers and contrasting these with the values and priorities of qualitative researchers. Next, Merrill Swain examines the qualitative methodology of verbal protocols as a means for eliciting SLA data in terms of data interpretation and generalization in her chapter, “Verbal protocols: What does it mean for research to use speaking as a data collection tool?” Diane Larsen-Freeman’s chapter, “Functional grammar: On the value and limitations of dependability, inference, and generalizability” explores dependability, generalization, and inference in the study of functional grammar. Finally, Numa Markee looks at how these terms apply in conversation analysis research with his chapter, “A conversation analytic perspective on the role of quantification and generalizability in second language acquisition.”

The three discussion chapters in the second section identify cross-paradigmatic issues and extend the discussion to larger issues in applied linguistics research. Lyle Bachman’s chapter, “Generalizability: A journey into the nature of empirical research in applied linguistics,” draws on perspectives from the philosophy of science to explain differing views across the chapters. Susan Gass explores the importance of generalizability for interpretations made about learning in her chapter, “Generalizability: What are we generalizing anyway?” Finally, Nancy Hornberger discusses the points of disagreement and synergy across the chapters in her contribution, “Negotiating methodological rich points in applied linguistics research: An ethnographer’s view.”

By moving the ideas raised at the AAAL conference venue into print, the editors hope that the “rich points,” as Hornberger put it, will appear more salient to researchers in applied linguistics and that these points will serve as an impetus for continued discussion and development in our profession.


Perspectives on inference and generalizability in applied linguistics


Old and new thoughts on test score variability

Implications for reliability and validity

Craig Deville and Micheline Chalhoub-Deville
University of North Carolina, Greensboro

The paper discusses the variability of test takers’ performances across different language tasks. We concentrate attention on this aspect of variability because inconsistent achievement by test takers across tasks has emerged as a significant threat to reliability and validity. The first section of the chapter addresses how language testers have traditionally conceptualized and measured variability, while the second part advocates an alternate way of thinking about the issue. Our intent should not be construed as a defense of or apology for the dominant paradigm employed by many language testers. Instead, we hope to problematize our standard thinking of concepts such as variability and error, concepts dear to the heart of conventional testers but sometimes anathema to those of a different epistemological bent.

Introduction

Language testers are typically interested in meaningfully measuring the second language (L2) proficiency of test takers and making inferences from the test scores and/or observed behaviors to a test taker’s ability to use language in a particular context or for an identified purpose. We language testers wish to make generalizations – sometimes across test takers, other times across items or tasks, and/or at still other times across occasions or contexts. Making inferences and generalizations involves us in issues of reliability and validity. Such a perspective and approach to testing is rather conventional, based on well-known psychometric principles. The approach has been labeled positivist (McNamara 2001). One can dispute the positivist label (Yu 2003), but McNamara’s argument as to language testers’ over-reliance on a particular paradigm of knowledge, research, and values is right on target.

The present authors carry a load of ‘positivist’ baggage and will therefore address issues of reliability and validity from that standpoint in the first part of this chapter. Our emphasis will be, from a measurement perspective, on the inferences and dependability/generalizability of L2 test scores across test items/tasks and, tangentially, extrapolations to real-life situations in which one uses their L2. Like others (e.g., Bachman & Palmer 1996; Chapelle 1999) we incorporate issues of reliability under the umbrella of test score validation and consider reliability evidence as a necessary component of a validity argument. The reader is thus encouraged to keep in mind that validating the interpretation and use of L2 test scores underlies any discussion of what constitutes reliability.

We will use the three terms reliability, dependability, and generalizability as synonymous unless specifically indicated otherwise. We use these terms to refer to the consistency of test scores across various facets of measurement such as items, tasks, raters, occasions, etc. We use the terms infer and inference to indicate extrapolation from a test taker’s performance on a test to her/his performance in a real-life situation. We will not discuss the representativeness or generalization of samples of test takers to larger populations. Although this is an important consideration when evaluating certain inferences about test scores, we see it as an issue related to sampling procedures and not directly to score reliability.

In this chapter our focus will be primarily on the variability of test takers’ performances across different language tasks. We choose to concentrate attention on this aspect of variability because inconsistent achievement by test takers across tasks has emerged as the most significant threat to the reliability and validity of test scores from performance assessments. The first section of the chapter addresses how language testers have traditionally conceptualized and measured variability. The second part advocates an alternate way of thinking about several of these issues, predominantly how we think about test score variability.

The concept of variability is critical to any discussion of reliability, and by extension, of validity. Our intent in this chapter should not be construed as a defense of or apology for the dominant paradigm employed by many language testers. Instead, we hope to problematize our standard thinking of concepts such as variability and error, concepts dear to the heart of conventional testers but sometimes anathema to those of a different epistemological bent.


In conclusion, the present chapter discusses common practices and considerations in the language testing field, including some concerns and limitations of standard practices, and suggests alternative research approaches that may enrich our understanding of the L2 construct. (The term L2 construct is used as a placeholder throughout the chapter to represent any language construct of interest to researchers.) The authors draw on literature outside of the applied linguistics domain, e.g., from the areas of general measurement and developmental psychology, in order to inject some different and fresh perspectives on the investigation of the L2 construct.

The roles of language testers

Language testers usually wear two hats. First and foremost they are testers. One might even say, applied testers. They are often commissioned to develop an L2 test because an organization or institution has a need to distinguish among examinees or candidates with respect to their L2 proficiency. The commissioner of the test is often the one(s) who largely determines the purpose and context of the assessment, the decisions to be made based on scores, and the numerous administrative and operational constraints on test development and research. Within this framework language testers have investigated numerous research questions and advanced the knowledge of the field with respect to the L2 construct. For the sake of the present discussion, however, the authors classify language testers as both applied and basic researchers and examine the respective orientations embodied in these two roles.

Sometimes, indeed rarely, a language tester might have the luxury of wearing her research hat without having to worry about the many stones in her pockets, pulling her under. To belabor the metaphor, the numerous stones or applied/pragmatic constraints on the tester prevent her from undertaking unencumbered construct research because the heavy realities just barely allow her to keep her head above water. Nevertheless, we language testers can and should think beyond the impediments of conventional testing and consider what we as researchers would like to investigate. But first, beginning with our applied tester’s hat, we will focus on some important notions that influence how we typically think about reliability, while threading considerations of validity throughout our discussion.


Language testers as testers

Measurement professionals – language testers included – are expected to adhere to agreed upon standards that inform sound and ethical test practices. These are commonly known as the Standards (see AERA, APA, & NCME 1999). The Standards, reflecting the fact that psychometrics, assessment practices, legal decisions, and social values are constantly evolving, have been revised and re-published approximately every decade since their debut in the 1950s. Each edition, however, has emphasized the importance of reliability as a necessary component when evaluating tests and testing practices. The most recent Standards (1999) state that:

The usefulness of behavioral measurements presupposes that individuals and groups exhibit some degree of stability in their behavior. However, successive samples of behavior from the same person are rarely identical in all pertinent respects. An individual’s performances, products, and responses to sets of test questions vary in their quality or character from one occasion to another, even under strictly controlled conditions. This variation is reflected in the examinee’s scores. (p. 25)

The assumption of a stable construct within individuals is a cornerstone when discussing traditional notions of reliability and validity. As testers, whose instruments are used to make decisions about test takers, we need to be concerned with how stable and/or how variable our test scores are.

Variability and error

The concepts of variability and error are crucial to any conventional discussion of reliability, so it is important to understand what we mean by these terms. Traditionally, second language acquisition (SLA) researchers (e.g., Ellis 1994; Tarone 1988) have examined variability of performance within an individual confronted with various language tasks in order to study task variables. On the other hand, language testers have examined variability of test scores across examinees so as to arrive at decisions with respect to the person’s language ability, as inferred from the test performance. Many of us who are involved with language testing and measurement have a somewhat ambivalent view of the variability we find in test takers’ scores. Assuming our interest is concerned with individual differences, we testers like to think that ‘good’ (i.e., relevant or useful) variability is what differentiates test takers from each other with respect to their language ability, or construct of interest. ‘Bad’ (or irrelevant, confounding) variability, on the other hand, is considered error, and it can discredit our faith in the test scores as trustworthy indicators of the differences among test takers. This thinking, in a nutshell, represents the conventional, psychometric approach to conceptualizing and analyzing the reliability of test scores. The more relevant variability we have relative to the irrelevant, the higher – and better – the reliability estimate. That is, we say a set of scores is reliable when we can ascertain that differences among examinees are due to individual differences in the L2 construct, e.g., ability to read academic texts, and not due to measurement error, e.g., test environment conditions at a particular setting.
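In classical test theory, this ratio of ‘good’ to total variability is precisely the definition of reliability: the proportion of observed-score variance attributable to true differences among examinees. A schematic sketch (the variance figures in the comment are invented for illustration):

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Reliability as relevant ('true') variance over total observed variance:
    rho = sigma^2_true / (sigma^2_true + sigma^2_error)."""
    return true_variance / (true_variance + error_variance)

# With 8 units of construct-relevant variance and 2 units of error,
# reliability is .80; doubling the error variance drops it to about .67.
```

The ratio makes the trade-off explicit: reliability improves either by increasing relevant variance (better discriminating tasks) or by reducing error (better controlled conditions, more consistent raters).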

One of the primary intended outcomes of administering a test to a group of examinees is our ability to separate the test takers based on their scores. We might separate them along some continuum or score scale, or we may separate them into categories, e.g., proficiency levels or pass-fail status. The point is that we use the test scores, in some fashion, to order our test takers so that we can infer which individuals are more proficient in the language, or in certain aspects of the language, and which are less so.

Many types of decisions are made based on test scores, i.e., on the inferences we make regarding the test taker’s language ability. We assign grades/marks, we decide which students gain admission to our universities, we place students into certain courses, we select certain candidates for jobs, etc. For such decisions affecting test takers to be considered fair, we must, among other things, be able to depend on the scores as accurate indicators of the test takers’ ability, especially when these decisions involve an inference of one examinee’s ability relative to another’s. A reliability estimate tells us something about how consistent or dependable our test scores are, and by extension, how much faith we can have in – how valid are our – subsequent inferences and decisions stemming from those scores.

As straightforward as the concept of reliability is, it becomes complicated when we attempt to define and explicate what comprises good and bad variability (and devise methods to quantify these). Much of the complication occurs because we have not adequately specified what constitutes good and bad variability. Too many test developers and researchers are willing to leave reliability to the psychometrician – or adopt whatever is made available by the psychometrician – who might simply calculate the prototypical reliability estimate, coefficient alpha. One issue here is that alpha may not be the appropriate estimate of reliability. But more importantly, the test developer/researcher and the psychometrician should consider how the assessment procedures might influence and/or determine both differences in the scores related to the construct and error in the test scores. For example, unless the different sources of test score variability are properly identified, as discussed in the following section, it is impossible to accurately quantify how much variance can be attributed to the L2 construct of interest.
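For readers who want to see what the prototypical estimate actually computes, here is a small illustrative implementation of coefficient alpha using only the Python standard library (the data layout and function name are ours, not from the chapter):

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per examinee,
    one column per item.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(scores[0])
    items = list(zip(*scores))                      # one tuple per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_vars / total_var)
```

Alpha rises when items covary (shared, construct-relevant variability) and falls when item-level noise dominates – exactly the good-versus-bad variance trade-off described above, which is why it is only trustworthy when the sources of score variability have been thought through.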

Conceptualizing reliability

Psychometricians have devoted extensive work over the decades to derive appropriate models for estimating reliability (see Feldt & Brennan 1989), and virtually all of these models are based on differing conceptualizations of what constitutes variability in test takers’ performances that can be attributed to the construct of interest, and what constitutes variability due to error. It is not our intent here to present different reliability models, but instead to draw attention to situations where language test developers should consider what might lead to variability in test takers’ scores.

As mentioned above, the most widely used indicator of reliability, i.e., alpha, may not always be the most appropriate estimate (Hogan, Benjamin, & Brezinski 2000). Depending on a number of factors with respect to how a given test is constructed (discussed below under the rubric of testlets), alpha may under- or over-estimate reliability. We language test developers should be aware of these factors where applicable and discuss them with the psychometrician so that reliability can be investigated and established appropriately. To illustrate this point, we consider reliability issues associated with testlets, which are quite common in language testing.

A testlet is a bundle of items whereby the items are linked because they belong to the same reading or listening passage, because they have the same format, among other reasons (Lee, Brennan, & Frisbie 2000). There are numerous approaches to estimating reliability when testlets are involved (e.g., Feldt 2002; Lee, Dunbar, & Frisbie 2001), but the point is that alpha is probably not an appropriate estimate because there is likely some dependency among items within bundles. Furthermore, of concern is not simply the use of a different estimation technique, but the consideration of how a testlet is defined with respect to the test content. For example, reading items may belong to a particular passage, but they might also be ‘bundled’ according to subskill (e.g., identifying author’s purpose), type of passage (e.g., fiction or narrative), or other content strata (Lee 2002). And, however we parcel our items, do we consider them to represent a random sample of possible, say reading passages – a very plausible approach – or do they represent a fixed selection, e.g., reading passages covering science topics only? These substantive considerations are likely spelled out in the test specifications and can provide important guidance for estimating reliability.


Very much related to the use of testlets is the use of multiple item formats within a test, especially when these vary in length and their contribution to a total score (Feldt & Charter 2003; Qualls 1995). It is not uncommon for a language assessment, especially classroom assessments and placement exams, to contain some selected response items (e.g., multiple-choice), short answer questions, and even an extended question. Moreover, widespread practice is to award a different number of points for the various test parts. For example, a university placement examination may consist of 10 multiple-choice reading questions and a writing task worth 20 points. With such tests, some consideration should be given to how the different sections likely contribute to variability in ‘true’ scores and error. With the above mentioned placement test, it is clear that the two sections very likely contribute differentially to the variability of test scores. We might, then, further deliberate whether separate scores should be given for each test section, or what a composite score is intended to reflect. Again, these are all important deliberations so that we can accurately estimate reliability.
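How the parts contribute can be reasoned about with the classical identity for the reliability of a weighted composite: the parts’ weighted error variances add, while the composite’s total variance also reflects how strongly the parts covary. A hypothetical sketch under the assumption of uncorrelated errors – not the Feldt & Charter procedure itself, and all names and figures are invented:

```python
def composite_reliability(weights, part_sds, part_rels, composite_var):
    """Reliability of a weighted composite, assuming uncorrelated errors:
    1 - sum_i (w_i * sd_i)^2 * (1 - rel_i) / var(composite)."""
    error_var = sum((w * sd) ** 2 * (1 - rel)
                    for w, sd, rel in zip(weights, part_sds, part_rels))
    return 1 - error_var / composite_var

# E.g., a multiple-choice section (sd 2, reliability .80) and a writing
# task (sd 3, rater reliability .90), equally weighted: the longer,
# more variable part dominates both the error and the composite variance.
```

The identity shows why the weighting deliberations above matter: a heavily weighted but unreliable section can drag down the composite even when the other section is highly dependable.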

Other test formats/methods we language testers employ are so-called integrative tasks (see Oller 1979), such as cloze tests, C-tests, recall protocols, etc., which entail different challenges when estimating reliability (Deville & Chalhoub-Deville 1993). A careful investigation into how an 'item' is defined for scoring purposes (e.g., exact-word cloze, or recalled words versus propositions) and into the interrelationship(s) of the items and/or item bundles is warranted. These issues have significant implications for how we measure reliability. Once again, the unquestioning adoption of traditional reliability estimates sidesteps critical aspects of test construction, tied to test content, which should inform us when we estimate test score reliability.

Craig Deville and Micheline Chalhoub-Deville

The foregoing description of testing practices demonstrates the interrelatedness of test construction and content with how we measure reliability. These practices and considerations stem from the applied work we language testers do as testers. Moreover, they are based on the traditional operationalization of reliability found in both the general measurement and language testing literature. Moss (1992, 1994), however, offering an interpretive or hermeneutic paradigm, challenges the notion that traditional reliability is the sine qua non for good measurement and hence for any claim to validity as well. In a similar vein, Nesselroade and Ghisletta (2000), who work in the field of developmental psychology, argue that the fundamental perspective that constructs are stable entities – a premise they dispute, but one that is presupposed in the 1999 Standards (see citation above) – has influenced how we practice measurement, indeed how we conceptualize reliability and validity. Reliability estimates based on test-retest correlations, a notion that forms the basis of classical test theory, obviously presume a static construct within individuals. Finally, within the SLA and language testing fields, Swain (1993) has also argued that traditional notions and practices of psychometrics may not be appropriate or serve us well when we define the language construct and performance in terms of co-construction (see below).
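The test-retest point can be illustrated with a small simulation (invented values, not from the chapter): the test-retest correlation drops when the construct itself changes between occasions, even though the measurement error is identical on both administrations, so the coefficient conflates unreliability with construct instability:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
time1 = rng.normal(size=n)                       # 'true' ability at occasion 1

# Static construct: occasion-2 ability is identical; only measurement error differs
static2 = time1
# Changing construct: ability itself shifts between occasions (stability 0.7)
drift2 = 0.7 * time1 + np.sqrt(1 - 0.7 ** 2) * rng.normal(size=n)

def noisy(true_scores):
    """Observed score = true score + fresh measurement error."""
    return true_scores + rng.normal(scale=0.5, size=n)

def r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

x1 = noisy(time1)
print("test-retest r, static construct:  ", round(r(x1, noisy(static2)), 2))
print("test-retest r, changing construct:", round(r(x1, noisy(drift2)), 2))
```

Under the static assumption the correlation estimates reliability; once the construct moves, the same statistic understates reliability or, read the other way, masks genuine developmental change as error.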

In the next section we will address some alternative ways we might think about several of these standard testing concepts. We will look at language testers as basic researchers, freed from some of the pragmatic constraints imposed by the applied framework.

Language testers as researchers

What are the issues that language testers as researchers might investigate if they needn't be so concerned with generalizing test scores? In this section we continue to examine aspects of variability and error conceptualization as they pertain primarily to tasks. We note the consistent findings in the L2 and the general measurement fields with regard to an interaction between persons, i.e., test takers, and tasks, and we consider the conventional interpretation of this variability as error/noise. We suggest that in our role as language testing researchers we can sidestep some of the usual thinking about variability and error and explore new ways of examining variability. We make a case that person x task variability is not to be relegated to nuisance or error variance but should be regarded as being at the heart of the study of the L2 construct. We then suggest how this conceptualization of interaction and variability can be examined within the context of a 'theory of context.' Being applied testers at heart, we maintain that this unconventional approach can inform both test score interpretation, i.e., what test takers' scores mean in regard to our construct definition, and test use, i.e., what decisions are made based on test takers' scores (Messick 1989).

Test method variability and the L2 construct

Many inferences about test takers are essentially generalizations/predictions about their performance in non-testing situations. We typically take a snapshot of a test taker's language ability based on the person's performance on a limited number and type of tasks. As practitioners, we would like to know that if we were to examine these same test takers again, we would obtain a consistent, replicable ranking of scores. We would like to be able to tell how reliable and/or generalizable the ranking of those scores is.


From both the language testing literature (Chalhoub-Deville 1995a, 1995b; Lee, Kantor, & Mollaun 2002; Shohamy 1984) and general measurement studies (Brennan 1992, 2001; Linn & Burton 1994; Shavelson, Baxter, & Gao 1993) we know that, even though our test items, tasks, or measurement methods purportedly represent the same content/construct, test takers perform differentially across these measures. This is typically referred to as a method or an interaction effect, meaning that test takers are not performing consistently across tasks (defined broadly as including anything from a response to a multiple-choice grammar item to performance on a complex language activity). As testers, we sometimes write this variability off as measurement noise or error because a strong interaction effect prevents us from ranking our examinees consistently across the measures. That is, we sometimes attribute this source of variability to error, which then affects our estimate of reliability. More importantly, however, evidence of a test method effect is difficult to interpret. Undoubtedly, a method/interaction effect is due in some part to error. Yet, the issue of person x task interaction undermines not only our confidence in the reliability of our measures, but in their validity as well. It may very well be that evidence of a test method effect is due to our inability to specify a homogeneous construct (Fiske 2002). If our construct definition is not concise and tight, we should not be surprised that diverse assessment tasks yield an inconsistent/variable picture of examinee performance. With this statement we turn our attention to how the L2 construct has been defined over the decades. Of particular interest here is to what extent L2 models of language proficiency represent a clearly defined construct, and what the relationship is between a person's language ability and task features (Chapelle 1998).
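The logic of treating the person x task interaction as error can be made concrete with a generalizability-theory sketch for a crossed persons x tasks design (a simulation with invented variance components, in the spirit of Brennan 2001, not an analysis from the chapter). The interaction component, confounded with error in this design, is what drags down the generalizability coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n_p, n_t = 200, 6

person = rng.normal(0, 1.0, size=(n_p, 1))   # person (object of measurement)
task = rng.normal(0, 0.5, size=(1, n_t))     # task difficulty
pxt = rng.normal(0, 0.8, size=(n_p, n_t))    # person x task interaction (+ error)
X = person + task + pxt                      # observed scores, persons x tasks

grand = X.mean()
ms_p = n_t * ((X.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_t = n_p * ((X.mean(axis=0) - grand) ** 2).sum() / (n_t - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_pt = (resid ** 2).sum() / ((n_p - 1) * (n_t - 1))

# Variance components from expected mean squares for a crossed p x t design
var_pt = ms_pt                   # interaction, confounded with error
var_p = (ms_p - ms_pt) / n_t
var_t = (ms_t - ms_pt) / n_p

g_coef = var_p / (var_p + var_pt / n_t)  # relative G coefficient over n_t tasks
print(f"sigma2_p={var_p:.2f}  sigma2_t={var_t:.2f}  sigma2_pt,e={var_pt:.2f}")
print(f"G coefficient ({n_t} tasks): {g_coef:.2f}")
```

Relegating sigma2_pt,e entirely to noise is precisely the conventional move the chapter questions: in this decomposition any substantively interesting person x task patterning is invisible, folded into a single undifferentiated error term.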

Chalhoub-Deville and Deville (2003), by examining texts on language testing published since Lado's (1961) seminal work, traced the development over the years of the field's construct definition of language proficiency. After Lado, some of the most influential works have been (in chronological order): Oller (1979), Canale and Swain (1980), Omaggio (1986), Bachman (1990) and Bachman and Palmer (1996), and McNamara (1996). Chalhoub-Deville and Deville argue that the construct has largely been defined according to a psycholinguistic and cognitive paradigm. Language proficiency has been viewed as an entity that resides in the heads of our test takers (or students or research sample) and is deemed homogeneous and static, permitting us to capture a measure of it. Only recently, in the Bachman model, has task been included as an important aspect of the construct representation of language use. (While Oller did promulgate the use of integrated tasks to tap into test takers' expectancy grammars, he never addressed variability in performance.) Although task is given some prominence in the Bachman model, it is separated from the abilities underlying language use. The separation is maintained to allow generalization of scores based on transferable abilities (see Parkes 2001).

In conclusion, similar to the conceptualization of the construct by measurement practitioners, we language testers have made the same assumptions about the L2 construct. The L2 construct is viewed as a stable and homogeneous set of ability components, while the task is represented as an independent or separate entity. Moreover, performance variability due to interaction between test takers and tasks is seen as a threat to our desire to generalize.

While recognizing the value of a psycholinguistic and cognitive view of language proficiency, we maintain that the consistent finding in the literature of person x task interaction compels us to reconsider the assumption of a homogeneous construct as well as the role of tasks in the representation of the construct. We need to explore other representations of the construct that accommodate findings in the field. Chalhoub-Deville and Deville (2003) advocate considering a perspective based on the role of social interaction and the co-construction of language and knowledge (Kramsch 1986; Johnson 2001; Swain 2001; Young 2000). While some might dismiss this perspective as simply a different wrinkle on mainstream cognitive models of language ability, Chalhoub-Deville (2003) makes a strong case that an "ability-in-language user-in-context" approach (as she calls it) to defining the language construct represents a fundamentally different view of language and has important implications for how we test for and research language ability.

Chalhoub-Deville (1997, 2003) has, for some time, urged language testers to consider context when doing validation work. She maintains that local theories or frameworks are more informative and useful than comprehensive, catch-all models for developing language tests and for theorizing about the construct. Recently she has incorporated thinking from other fields (e.g., aptitude theory) to strengthen and push her argument for the importance of context whenever one investigates the language construct. For her, the construct does not reside in the head of the test taker, but is bound inextricably to the interaction of the person and the task/context. Chalhoub-Deville (2003) concludes her paper by calling for a theory of context.

Variability and context

While the idea of a context-dependent representation of the construct might be new to the field of language testing (see, however, Douglas 2002), it has been discussed in the discipline of psychology for some time now (e.g., see Valsiner 1984, and the two edited works, Toward a Psychology of Situations, 1981, and Mind in Context, 1994). Magnusson (1981, 2002), whose work in Sweden in developmental psychology has fostered increased attention to contextual or environmental variables, writes:

Situations present, at different levels of specification, the information we handle, and they offer us the necessary feedback for building valid conceptions of the outer world as a basis for valid predictions about what will happen and what will be the outcome of our own behaviors. By assimilating new knowledge and new experiences in existing categories and by accommodating old categories and forming new ones, each individual develops a total, integrated system of mental structures and contents in a continuous interaction with the physical, social, and cultural environments. (1981:9)

Magnusson goes on to point out that behaviors take place in certain situations and that it is impossible to understand, explain, predict, or model behavior in isolation from the conditions present in the context.

It is noteworthy that the two gurus of validity theory, Cronbach (1971, 1989) and Messick (1989), seem to disagree in their respective views of context (Moss 1992). Messick seems to hold the view that context is not meaningful for score interpretation, only for generalizing to like contexts. He writes: "the intrusion of context raises issues more of generalizability than of interpretive validity. It gives testimony more to the ubiquity of interactions than to the fragility of score meaning" (p. 15). Messick all but relegates context to a nuisance factor.

In contrast, Cronbach (1989) maintains that the issue of the generalizability of context is tied to construct validation: "Any interpretation invokes constructs if it reaches beyond the specific, local, concrete situation that was observed" (p. 151). Cronbach unambiguously links generalizations across contexts to construct interpretation, echoing Chalhoub-Deville's (2003) insistence that context cannot be separated from construct. A model or framework that treats "ability-in-language user-in-context" as the unit of analysis is called for in both practical and theoretical validation work.

If we are to treat ability-in-language user-in-context as the subject of investigation, we need to be clear as to what we mean by this interaction. Snow (1994), in discussing the relations between person abilities and tasks/situations, lists four interpretations of, and research approaches to, interaction. The first he calls independent interaction, which interprets person and task variables independently from each other. Person and situation variables are measured separately and can be interpreted apart from one another. An example from language testing would be the ILR/ACTFL proficiency scales, where tasks, e.g., reading passages, are selected and administered to specific-ability test takers based on their supposed complexity.

The interdependent perspective also sees person and task variables separately, but they exist in relation to one another; e.g., task difficulty must be interpreted with respect to person ability. Bachman (2002) is a strong advocate of this viewpoint, stating that "difficulty does not reside in the task alone, but is relative to any given test-taker" (p. 462). Bachman makes this statement in arguing the fruitlessness of studying item/task difficulty factors independently from test takers, yet he stops short of making person-situation the unit of analysis. As stated above, Bachman advocates an approach that clearly distinguishes the two entities from one another.

Snow’s next two depictions of interaction differ from the first two in thatperson and task are now seen as inseparable. The third type of interaction islabeled reciprocal, and is characterized by person and task variables acting to-gether to change one another over time. The example Snow provides is a personworking on a task who changes or adopts new strategies to solve a problem.Reciprocal interaction would find supporters from the interactionalists andco-constructionists (Kramsch 1986; Johnson 2001; Swain 2001; Young 2000)mentioned above. To illustrate, suppose two test takers were given the task offinding an apartment agreeable to both of them based on descriptions of apart-ments in a local newspaper and specific likes and dislikes given each test taker.While the tester may have constructed the task with a particular outcome inmind, the test takers may negotiate, compromise, express additional desiredattributes of the apartment, and even disagree in the end. The test takers maydigress and discuss the pros and cons of city versus rural living. Thus the chal-lenge becomes how to interpret performances related to the intended constructand the unintended – but authentic and related – test behavior. In addition,testers must account for the interactive nature of the task, whereby the perfor-mances of the individual examinees are intertwined and the evaluation criteriamay have to be flexible.

Snow refers to the fourth approach as transaction, where person and situation are in constant reciprocal interaction. In such a system there is no cause and effect, and person-situation constructs exist only in the relations and actions between them. The focus of any scientific inquiry, from this standpoint, is on the relations or actions of person-in-situation systems. Again, the interactionalists and co-constructionists might be classified here. A viewpoint that sees the assessment of language for specific purposes as strictly context bound, yet dynamic and fluid within the particular context, would be transactional.

Snow classifies his own work in aptitude theory as representing both reciprocal interaction and transaction. Important to note is Snow's caution against taking the extreme version of transaction, which he calls "new situationism." This perspective claims that we can learn nothing by studying either person or situation variables, but only their relationship. In arguing against new situationism, Snow cites Allport, who, in 1955, wrote:

. . . there are some things about both organism and environment that can be studied and known about in advance of. . . [as well as after] their interrelationship, even though the fact that in their relationship they contribute something to each other is undeniable. What then, is this residual nature or property of the parties to a transaction? (p. 287, emphasis in original)

Snow concludes with a call for eclecticism, saying that each approach to person-situation interaction has some merit.

If ability-in-language user-in-context is to be the unit of analysis when studying language learning and testing, however, we should contemplate how to incorporate the reciprocal interaction and transaction approaches into our thinking and research. One obvious implication is that we consider focusing our research less on the effects of particular variables across individuals and more on the identification, description, and understanding of dynamic systems and processes within person x situation interactions. What aspects of the ability-in-language user-in-context are relevant, which are related, how strong are the relations, are they malleable, how do they change over time? These questions require us to look for patterns within persons, bringing us back to the need for the study of person x task variability mentioned at the outset of this section on language testers as researchers.

According to Magnusson (1981), we must make a distinction between general and differential effects of situations on behavior, i.e., for our purpose, the effects of context/task on language use. A general effect is the same across individuals, meaning that different contexts affect all test takers similarly, altering their behavior or language use in like ways. Differential effects, however, mean that:

[a]ccording to an interactional model the characteristic of an individual is in his/her unique pattern of stable and changing behaviors across situations of different character or, in terms of data, in his/her partly unique cross-situational profile for each of a number of relevant behaviors.

(p. 11, italics in original)
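Magnusson's distinction can be illustrated with a toy example (the scores are invented). The two data sets below have identical context (column) means; in the first, contexts shift both test takers equally (a general effect), while in the second each person shows a unique cross-situational profile (a differential effect). The person x context residuals isolate the difference:

```python
import numpy as np

# Rows: two test takers; columns: three contexts/tasks (invented scores)
# General effect: every context shifts both persons by the same amount
general = np.array([[70.0, 60.0, 80.0],
                    [50.0, 40.0, 60.0]])

# Differential effect: same column means, but person-specific profiles;
# the rank order of the two test takers depends on the context
differential = np.array([[80.0, 40.0, 80.0],
                         [40.0, 60.0, 60.0]])

def interaction_resid(X):
    """Residuals after removing person and context main effects."""
    grand = X.mean()
    return X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand

print(interaction_resid(general))       # all zeros: the profiles are parallel
print(interaction_resid(differential))  # nonzero: context interacts with person
```

An analysis of main effects alone would report the same context means for both data sets; only the residual pattern reveals the differential, within-person structure the interactional model treats as the phenomenon of interest.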


The language test researcher who analyzes the main, general effects of context variables across test takers and dismisses the differential effects of context as unwanted noise or error is assuming a stable, unchanging construct of language ability. Yet, as Chalhoub-Deville and Tarone (1996) argue:

. . . the nature of the language proficiency construct is not constant; different linguistic, functional, and creative proficiency components emerge when we investigate the proficiency construct in different contexts. (p. 5)

To conclude, we maintain that in our investigations as language test researchers we need to explore constructs not simply to see if they are sensitive to features of the context and task, but also to see whether their make-up changes dynamically as a communicative situation unfolds. So, in essence, we are likely looking at person x context x process interactions. Trying to unravel the enormous complexity of social – and statistical – interactions brings us full circle back to reliability and validity, i.e., to the question of what our test scores really mean.

Summary and conclusion

In this paper we have reviewed some of the basic concepts of conventional reliability theory and practice and attempted to provide a rationale as to why these considerations are important when test scores are used to make specific decisions about examinees. The traditional views of variability and error, and how these contribute to our ability to generalize our findings and make warranted inferences, are closely tied to the demand for consistency when evaluating test takers. In addition, we delineated some considerations with respect to how language tests are constructed – the kinds of items/tasks used, how these are interrelated, how these are scored – and what implications test construction has for determining variability in test scores, estimating reliability, and marshalling appropriate validity evidence.

Donning the cap of language test researchers, we provided some alternative ways of thinking about the basic concepts of reliability, i.e., score variability, and suggested how these might be adopted and adapted in pursuit of an expanded understanding of our construct. We call for the study of ability-in-language user-in-context, where the interaction of language user and context is the focus of study. In addition, we advocate increased attention to the construct of context, and suggest that person x task investigations might offer promise for examining the L2 construct.


Acknowledgements

The authors would like to thank Yong-Won Lee and Carol Chapelle for their insightful and helpful comments and suggestions. Any and all shortcomings in the ideas and arguments presented here, however, must be attributed to the authors.

References

Allport, F. H. (1955). Theories of perception and the concept of structure. New York: Wiley.
American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME]. (1999). Standards for educational and psychological testing. Washington, DC: Author.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OUP.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453–476.
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: OUP.
Brennan, R. L. (1992). The context of context effects. Applied Measurement in Education, 5, 225–264.
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1995a). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12, 16–33.
Chalhoub-Deville, M. (1995b). A contextualized approach to describing oral language proficiency. Language Learning, 45, 251–281.
Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks and test construction. Language Testing, 14, 3–22.
Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20, 369–383.
Chalhoub-Deville, M. & Deville, C. (2005). A look back at and forward to what language testers measure. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 815–832). Mahwah, NJ: Lawrence Erlbaum.
Chalhoub-Deville, M. & Tarone, E. (1996). What is the role of specific contexts in second-language acquisition, teaching, and testing? Unpublished manuscript.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Second language acquisition and language testing interfaces (pp. 32–70). Cambridge: CUP.
Chapelle, C. A. (1999). Validity in language assessment. In W. Grabe (Ed.), Annual Review of Applied Linguistics (pp. 254–272). Cambridge: CUP.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
Deville, C. & Chalhoub-Deville, M. (1993). Modified scoring, traditional item analysis and Sato's caution index used to investigate the reading recall protocol. Language Testing, 10, 117–132.
Douglas, D. (2002). Assessing language for specific purposes. Cambridge: CUP.
Ellis, R. (1994). The study of second language acquisition. Oxford: OUP.
Feldt, L. S. (2002). Estimating the internal consistency reliability of tests composed of testlets varying in length. Applied Measurement in Education, 15, 33–48.
Feldt, L. S. & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education and Macmillan.
Feldt, L. S. & Charter, R. A. (2003). Estimating the reliability of a test split into two parts of equal or unequal length. Psychological Methods, 8, 102–109.
Fiske, D. W. (2002). Validity for what? In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 169–178). Mahwah, NJ: Lawrence Erlbaum.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60, 523–561.
Johnson, M. (2001). The art of nonconversation: A re-examination of the validity of the oral proficiency interview. New Haven, CT: Yale University Press.
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern Language Journal, 70, 366–372.
Lado, R. (1961). Language testing: The construction and use of foreign language tests: A teacher's book. New York: McGraw-Hill.
Lee, G. (2002). The influence of several factors on reliability for complex reading comprehension tests. Journal of Educational Measurement, 39, 149–164.
Lee, G., Brennan, R. L., & Frisbie, D. A. (2000). Incorporating the testlet concept in test score analyses. Educational Measurement: Issues and Practice, 19, 9–15.
Lee, G., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing tests composed of testlets. Educational and Psychological Measurement, 61, 958–975.
Lee, Y.-W., Kantor, R., & Mollaun, P. (2002, April). Score dependability of the writing and speaking sections of New TOEFL. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), New Orleans, LA.
Linn, R. L. & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13(1), 5–8, 15.
Magnusson, D. (Ed.). (1981). Toward a psychology of situations. Hillsdale, NJ: Lawrence Erlbaum.
Magnusson, D. (1981). Wanted: A psychology of situations. In D. Magnusson (Ed.), Toward a psychology of situations: An interactional perspective (pp. 9–35). Hillsdale, NJ: Lawrence Erlbaum.
Magnusson, D. (2002). The individual as the organizing principle in psychological research: A holistic approach. In L. R. Bergman, R. B. Cairns, L.-G. Nilsson, & L. Nystedt (Eds.), Developmental science and the holistic approach (pp. 33–47). Mahwah, NJ: Lawrence Erlbaum.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
McNamara, T. F. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18, 333–349.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229–258.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5–12.
Nesselroade, J. R. & Ghisletta, P. (2000). Beyond static concepts in modeling behavior. In L. R. Bergman, R. B. Cairns, L.-G. Nilsson, & L. Nystedt (Eds.), Developmental science and the holistic approach (pp. 121–135). Mahwah, NJ: Lawrence Erlbaum.
Oller, J. W., Jr. (1979). Language tests at school: A pragmatic approach. London: Longman.
Omaggio, A. C. (1986). Teaching language in context: Proficiency-oriented instruction. Boston, MA: Heinle & Heinle.
Parkes, J. (2001). The role of transfer in the variability of performance assessment scores. Educational Assessment, 7, 143–164.
Qualls, A. L. (1995). Estimating the reliability of a test containing multiple item formats. Applied Measurement in Education, 8, 111–120.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215–232.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11, 99–123.
Snow, R. E. (1994). Abilities in academic tasks. In R. J. Sternberg & R. K. Wagner (Eds.), Mind in context (pp. 3–37). Cambridge: CUP.
Sternberg, R. J. & Wagner, R. K. (Eds.). (1994). Mind in context. Cambridge: CUP.
Swain, M. (1993). Second language testing and second language acquisition: Is there a conflict with traditional psychometrics? Language Testing, 10, 193–207.
Swain, M. (2001). Examining dialogue: Another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18, 275–302.
Tarone, E. (1988). Variation in interlanguage. London: Edward Arnold.
Valsiner, J. (1984). Two alternative epistemological frameworks in psychology: The typological and variational modes of thinking. The Journal of Mind and Behavior, 5, 449–470.
Young, R. F. (2000, March). Interactional competence: Challenges for validity. Paper presented at the annual meeting of the Language Testing Research Colloquium (LTRC), Vancouver, Canada.
Yu, C. H. (2003, April). Misconceived relationships between logical positivism and quantitative research. Paper presented at the annual meeting of the American Educational Research Association (AERA), Chicago, IL.

JB[v.20020404] Prn:13/12/2005; 15:05 F: LLLT1203.tex / p.1 (47-110)

Validity and values

Inferences and generalizability in language testing

Tim McNamara
The University of Melbourne

Language testing research is an increasingly divided field, as it responds to the paradigm shifts in broader applied linguistics research. On the one hand, language testing validation research places a fundamental emphasis on the generalizability of results and the appropriateness of inferences made based on observed learner performances. This involves a rigorous interrogation of the elicitation instruments, judgments, and observations used to make inferences about individual test takers. At the same time, input from non-measurement traditions is leading to the exploration of new insights into the limitations of such inferences, and to a greater understanding of the social values which imbue tests. This epistemological ferment is as much productive as problematic.

Introduction

In this chapter, I want to review recent work within language testing about the issue of generalizing from learner performances, that is, drawing inferences about the test-taker's ability beyond the immediate testing context. Of all the fields of applied linguistics, it is in language testing that questions of inference and generalizability are arguably of most fundamental concern. Language testing researchers (e.g. Bachman 1990; Bachman 2005; Chapelle 1998; Chapelle et al. 2004; McNamara 2003), drawing on work in educational measurement (especially Messick 1989; Mislevy 1996; Kane 1992, 2004), have long argued that test scores represent inferences or generalizations about individual test-takers, and discuss procedures for investigating their validity. There are two related aspects of these discussions of generalizability from test performance which I would like to address in this chapter. The first is the way in which we may assemble reasoned arguments and empirical evidence in support of the validity of inferences from test scores. The second is about the challenge to the procedures typically used for gathering such evidence presented by critiques of positivism within the social sciences and applied linguistics. These critiques require us to consider the values implied within test constructs, and the social and political context of the uses made of test scores. In this latter domain, the discussions within language testing mirror the broader 'paradigm debate' about the epistemology of research in other fields of applied linguistics.

Drawing generalizable inferences in language assessment

Language tests are procedures for generalizing. Scores on language tests are reports of performance in the test situation, but they only have meaning as inferences beyond that situation. The basic logic of generalizing from observed performance under test conditions to performance under non-test conditions, which is the real target of the assessment, can be set out as in Figure 1.

The first step is to identify the assessment target – the inferences we would like to draw about learners: what they know, what they can do, what performances they are capable of beyond the assessment setting. This target of test inferences is known as the criterion domain. This domain cannot be 'known' in any direct sense, but only surmised indirectly, 'through a glass, darkly', in an assessment version of Labov's Observer's Paradox, or Heisenberg's Uncertainty Principle.

Criterion domain             Construct                     Test
(target, unobservable)       (via theoretical model)       (observations/evidence)

domain of real-world         characterization of claimed   test design;
performance, knowledge,      essential features of         test performances;
capacity                     performance; theory of        item responses
                             domain

Inferences about the criterion domain are drawn from the test evidence, mediated by the construct.

Figure 1. Criterion, construct and test

Now, as daily experience proves, inferences and surmises can be rather wide of the mark. In order to constrain test inferences, careful attention is paid to thinking through exactly what it is that we want to infer, and developing procedures for gathering the information that forms the basis for test inferences. This involves two stages.

First, the criterion domain is modelled; this is the assessment construct. The construct is always in a sense provisional, and is subject to ongoing revision, as understanding of the domain, the nature of language, language use, and of learning, changes over time. The source of domain modelling for the purpose of assessment may be the curriculum, which itself constitutes a model, a view of the target domain, its structure and principal divisions and characteristics, and assessments may accordingly draw on curriculum constructs as the basis for interpreting the data of learner performance. But there are other sources, in psycholinguistics, in sociolinguistics, in discourse theory. Such constructs embody social, educational and political values, and are never neutral; they are constantly subject to intellectual contestation, and to social and political influence. I will say more about constructs in a moment.

Second, the construct informs the development of a procedure for principled observation of performance under known conditions. This is the assessment or, if it is done formally, the test. Ideally, the tester is now in a position to draw inferences on the basis of these observations either about the probable character of performance under non-test conditions, or about the candidate's standing in relation to a domain of knowledge and abilities of interest.

In summary, then, crucial to assessment is the distinction between the criterion domain (the target of test inferences) and the test (the evidence from which inferences are drawn). And because these inferences are necessarily indirect, they are mediated through test constructs, that is, modelling of the criterion domain in terms of what are held to be its essential features or characteristics.

Validating test score inferences

The inferences drawn in language tests, given their inevitable uncertainty, are open to challenge. Much research in language testing is motivated by a scepticism in principle about the meaningfulness and defensibility of the inferences we make based on test performance.1 This notion is the crucial feature of the classic validity framework of Messick (1989) (Figure 2), and of subsequent developments of it in the work of Mislevy (1996), Kane (1992, 2004) and others.

Messick distinguishes a number of facets of validity within a unified theory of validity. The first cell in the matrix emphasises the need to use reasoning and empirical evidence in support of the constructs on which test score interpretations are based. The second cell on the top line stresses the point that such inferences are never context free, and need to be considered in the context in which the assessment is to be used. Test score inferences need to be revalidated for every major context of use. For example, this should make us question the current use of scores on IELTS (a test of English for academic and general training purposes) in order to determine language proficiency in the context of immigration selection in Australia and other countries, a use for which IELTS has not to date been validated. The third and fourth cells take us into the realm of the values implicit in test constructs, and the social consequences of test use; we will discuss these below.

                      Test interpretation    Test use

Evidential basis      Construct validity     Construct validity + Relevance/utility
Consequential basis   Value implications     Social consequences

Figure 2. Facets of validity (Messick 1989:20)

Constructs and evidence as the basis for test score inferences

As stated above, reasoning about the construct is the first stage of test development and the basis of much critique of the validity of the inferences that are drawn from test performance. Usually, this reasoning draws on theories of language knowledge and language performance which are essentially cognitive, so that the inferences drawn are about the cognitive characteristics of test takers. But constructs can also be social in character, so that test performance may be the basis for drawing inferences about the social characteristics of test takers – their social identity. This is an important function of language tests, familiar since the shibboleth test of the Bible. Let me give you a recent example. For the past 15 years, and increasingly, the claims of those seeking refugee status in a number of countries including the U.K., many other European countries, Australia and New Zealand, have been determined in part on the basis of a language test. For example, one large group of refugee claimants recently arriving in Australia and other countries is composed of members of the Hazara minority from Afghanistan, who have been subject to persecution in Afghanistan on religious and ethnic grounds for many years. Hazaras speak a dialect of Dari, one of the two main languages of Afghanistan, itself close to the Farsi spoken in Iran. The applications of these claimants are routinely subject to query on the grounds that the individuals making the claims may not be from Afghanistan but from established Hazara communities in neighbouring countries such as Pakistan or Iran, where they are not subject to persecution and thus under international law have no right to refugee status when they arrive in another country. The refugees are interviewed by an immigration officer through an interpreter, and the tape of the interview is sent to so-called language experts employed by private companies (as it happens, based mainly in Sweden). On the basis of the language evidence on the tape, the 'experts' draw their inferences: they claim to be able to tell, based on features of lexical choice and pronunciation, whether the person is a Hazara from inside Afghanistan, or from Pakistan or Iran. Leaving aside for a moment the competence and the methods of the 'experts' involved, the validity of these inferences is rendered questionable by the problematic nature of the construct underlying the test. The issue is that the boundaries of the linguistic community of speakers of the Hazara dialect do not coincide with the political border between Afghanistan and Pakistan, and there is simply not the sociolinguistic information available to determine the issue accurately. Obviously, given the conditions in Afghanistan over the last thirty years, there has been no proper sociolinguistic work on the varieties concerned and their geographical distribution; and the disruption caused by the years of war, the refugee flows, the influence of teachers who are speakers of other varieties, and so on mean that sociolinguistically the situation is likely to be very fluid. We can thus object to the procedure, and to the quality of the inferences to which it leads, by arguing about the construct, as a group of sociolinguists, phoneticians and language testers have recently done (Eades et al. 2003).

Most language testing research is about the evidential basis for test score interpretation and use. In the case of the asylum seekers, for example, empirical evidence of the extent of false negative and false positive identifications has been gathered in the form of data on asylum seekers who have been rejected by countries to which they have been sent 'home' – 20% of cases, in one Swedish report. This empirical evidence thus confirms the earlier theoretical doubts on the validity of inferences drawn on the basis of the procedure. More typically, empirical evidence is sought from analysis of responses by candidates to test items. Does the evidence from these responses support the structural relationships predicated in the construct? Particularly helpful here is Messick's notion of construct-irrelevant variance. For example, if the chances of a score on an oral proficiency test depend in part on the luck of the draw in terms of which interviewer interacts with the candidate (Brown 2003), then part of the score variance will reflect characteristics of the interlocutor rather than the candidate, with subsequent threats to the generalizability of inferences about candidates. An alternative way of understanding this issue is to see it as involving construct definition. Candidate performance in this case involves joint action on the part of both candidate and interviewer, and cannot therefore be seen as a simple projection of the individual candidate's ability (McNamara 1997; Chalhoub-Deville 2003).
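The way interlocutor effects inflate score variance can be made concrete with a small simulation. This is a hedged sketch with entirely hypothetical numbers, not data from Brown's study: observed scores are modelled as candidate ability plus a systematic interviewer severity effect plus noise, and the interviewer component (the construct-irrelevant variance) pushes the observed score variance above the variance attributable to ability alone.

```python
import random
import statistics

random.seed(1)

# Hypothetical severity effects (in score units) for three interviewers;
# these are construct-irrelevant: they have nothing to do with the candidate.
severity = {"A": 0.5, "B": 0.0, "C": -0.5}

# Candidate abilities drawn from a standard normal distribution.
abilities = [random.gauss(0, 1) for _ in range(3000)]

scores = []
for ability in abilities:
    interviewer = random.choice(list(severity))
    # Observed score = ability + interviewer effect + residual noise.
    scores.append(ability + severity[interviewer] + random.gauss(0, 0.2))

# The score variance exceeds the ability variance by roughly the
# variance of the interviewer effects plus the residual variance.
print("ability variance:", round(statistics.variance(abilities), 2))
print("score variance:  ", round(statistics.variance(scores), 2))
```

With these (illustrative) settings, the gap between the two variances is the construct-irrelevant component that a generalizability analysis would try to isolate and estimate.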

Note that this investigation of the relationship between test construct and test performance cuts both ways: test data can not only be used to validate, in other words to question, the inferences that we draw based on test constructs, but can also be used to question the constructs on which our inferences have been based. In this way, just as SLA is a source of constructs for language testing, language testing is a site for research in second language acquisition. The study reported in Iwashita et al. (2001) is a case in point. The research involved attempting to build a task-based speaking test drawing on the framework proposed by Skehan (1998). The study demonstrated not only that it was not possible to build a test of speaking that would yield valid scores by using the framework in this context, but that the test data raised interesting questions about the model itself. In other words, in the attempt to validate score inferences from the test, validity evidence relevant to the construct itself was gathered.

Recently, other validity theorists have operationalized the aspects of Messick's validity framework discussed so far. This work has resulted in two accomplishments. First, in the work of Mislevy (1996), it stresses the logical steps necessary for validation work, and offers conceptual clarification of the design of validity studies. Second, in the work of Kane (1992, 2004), it establishes a more manageable process of validation, made necessary by the fact that the very complexity of the validation issues raised by Messick threatens to overwhelm test developers.

Mislevy (1996; Mislevy et al. 2002, 2003) proposes what he calls evidence-centred test design, in which he distinguishes the claims we wish to make about test-takers (the inferences we wish to be able to draw about them), what we would need to be able to observe about test takers in order to form a basis for drawing those inferences, and the conditions under which this evidence might be sought. This logical analysis of claims, evidence and task design then leads to the design of empirical procedures for gathering data from test performance and analysing it statistically in order ultimately to validate the claims.

Kane (1992, 2004; Kane et al. 1999) proposes the need to develop a validation argument, with carefully defined stages involving procedures for generalizing from the evidence of test performance to the inferences one wishes to draw about candidates, and for envisioning the main threats to the validity of those inferences in order to design appropriate studies to investigate such threats.

Generalization as reliability

The centrality of generalizability in language testing is underlined by the fact that we are not interested in the test performance for its own sake: it has no meaning or value in and of itself. It has meaning only as a basis for generalizing beyond the particular performance. Raters scoring an essay written for a language test are rarely interested in the content of the essay as such; this written communication between an anonymous candidate and an anonymous rater is a performance for a further purpose. It is a deliberate display of abilities on the part of the candidate in accordance with the rules for display set up by the tester so that the abilities can be assessed. In a speaking test, the interlocutor is not necessarily interested in what the candidate has to say in itself; or if they do happen to find it personally interesting, this is at best irrelevant, and at worst a source of construct-irrelevant variance! But there is one further specific issue of generalizability which is routinely addressed in the analysis of test scores, and that is the question of the repeatability of the inference drawn. This generally involves the need to establish the reliability of the procedure. In tests involving a series of discrete items, as commonly found in listening, reading, grammar or vocabulary tests, the response to each item can be considered as a separate performance, and investigation of reliability involves checking to see whether the inferences about candidates drawn from performances on individual items or sets of items support one another, that is, whether items ostensibly testing similar things yield similar inferences about the candidate. For tests involving sets of items, such as most grammar, vocabulary, listening or reading tests, this is done technically by looking at the discrimination of each item, and by aggregating item discrimination indices as a measure of overall reliability. In tests of productive performance such as tests of speaking and writing, scoring involves human judgement, and issues of repeatability here involve estimating the extent of agreement between judges – in other words, would the candidate get the same score next time with a different judge?2
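These two kinds of check can be sketched in a few lines of Python. The response matrix and rating data below are made up, and the particular indices chosen – corrected item-total correlation as a discrimination index, Cronbach's alpha as an aggregate internal-consistency estimate, and simple exact agreement between two raters – are standard illustrative choices, not procedures prescribed in this chapter:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical item responses: rows = candidates, columns = items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in responses]
n_items = len(responses[0])

# Discrimination: correlate each item with the total of the remaining items.
for i in range(n_items):
    item = [row[i] for row in responses]
    rest = [t - x for t, x in zip(totals, item)]
    print(f"item {i}: discrimination = {pearson(item, rest):.2f}")

# Cronbach's alpha: internal consistency across the item set.
item_vars = [statistics.pvariance([row[i] for row in responses])
             for i in range(n_items)]
alpha = n_items / (n_items - 1) * (1 - sum(item_vars) / statistics.pvariance(totals))
print(f"alpha = {alpha:.2f}")

# Repeatability with human judges: exact agreement between two raters.
rater1 = [4, 3, 5, 2, 4]
rater2 = [4, 3, 4, 2, 4]
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"exact agreement = {agreement:.2f}")  # 0.80
```

In operational settings, rater agreement would usually be estimated with chance-corrected indices (such as kappa) or many-facet Rasch measurement rather than raw agreement; the sketch only shows the logic of the repeatability question.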

Epistemology and research methodology in investigating the validity of test score inferences

Addressing the evidence-related issues outlined above typically involves statistical methods, which have become increasingly sophisticated; a helpful review of these developments is available in Bachman (2000). They include Generalizability Theory, Item Response Modelling, including Rasch modelling, Structural Equation Modelling and others. But the use of such psychometric techniques in language testing invites charges that language testing research is irredeemably positivist, and thereby unable to respond to the challenge of the 'paradigm debate' about the epistemology of research in the wider arena of the social sciences. The epistemological assumptions of positivism and post-positivism, critiques of them, and alternative research paradigms, are helpfully set out by Brian Lynch in his book Language Program Evaluation (Lynch 1996). Lynch frames this debate in part in terms of competing notions of the validity of inferences in research, and in particular whether, and if so how, we are to generalize from research data. Clearly, the very themes of this volume – issues of inferencing and generalizability – have been triggered by and are reflective of this wider debate. Within applied linguistics this debate has been evident in heated, even acrimonious exchanges over conflicting epistemologies for carrying out research on Second Language Acquisition (Block 1996; Gregg et al. 1997; Firth & Wagner 1997).
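At the core of the Rasch modelling mentioned above is a simple logistic relation: the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. A toy sketch (the ability and difficulty values are illustrative; operational analyses estimate these parameters with dedicated software):

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logits)."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# A candidate whose ability equals the item's difficulty has a 0.5 chance.
print(rasch_probability(0.0, 0.0))              # 0.5
# An abler candidate meeting an easier item: ability - difficulty = 2 logits.
print(round(rasch_probability(1.0, -1.0), 3))   # 0.881
```

The appeal of the model for validation work is that person and item parameters are separable, so misfitting items or judges can be identified against the model's expectations.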

Research on language testing has for some time similarly reflected this paradigm debate. Brian Lynch and Liz Hamp-Lyons (Lynch & Hamp-Lyons 1999; Hamp-Lyons & Lynch 1998) have examined the epistemological bases for research papers given at the Language Testing Research Colloquium over a number of years, and detected a trend away from hard positivism, although the field is still one of the most positivist in orientation of all areas of Applied Linguistics. One common way of interpreting the issues in the debate is in terms of competing research methodologies (quantitative or qualitative). A growing preference for qualitative research methods in applied linguistics is reflected in the changes over the years in the relative proportions of qualitatively and quantitatively based research papers in applied linguistics journals, as tracked by Lazaraton (2000) and others. This trend is also evident within language testing, as Lynch and Hamp-Lyons show (see also Lumley & Brown 2004).

The use of qualitative methodologies is found in many test validation contexts, for example the use of introspective methods in research on multimedia listening and viewing (Gruba 1999), listening (Buck 1991; Yi'an 1998), the rating of writing (Lumley 2000), and many others. A recent survey of discourse-based studies of oral language assessment (McNamara, Hill, & May 2002) identified the extent of qualitative research methods used in such research. Table 1 sets out the qualitative research methods used and the topics dealt with in these studies. Reference details for the research listed here are available in McNamara, Hill and May (2002).

Table 1. Qualitative research on the validation of oral tests

Research methods used        Topic                          Aspect

Activity Theory              Oral Proficiency Interview     Institutional character of event

Conversation Analysis        Interlocutor effects in        Characterisation of interlocutor skill
Conversation Analysis        traditional proficiency        Impact of gender
Discourse Analysis           interviews                     Impact of familiarity of interlocutor
                                                            Impact of native speaker status of interlocutor

Discourse Analysis           Interlocutor effects in        Impact of personality
Activity Theory              paired and group oral tests    Impact of proficiency

Discourse Analysis           Effect of Task                 Role play versus interview
Interviews/retrospective                                    'Authentic' interactions with native speakers
recall                                                      Monologue vs dialogue
                                                            Functions elicited
                                                            Information gap tasks
                                                            Task performance conditions (Skehan)

Conversation Analysis        Test taker characteristics     Impact of gender
Discourse Analysis                                          Impact of cultural background

Think aloud                  Rating scales and rater        Interpretation of criteria
Discourse Analysis           cognition                      Evolution of criteria
                                                            Development of rating scales

Discourse Analysis           Classroom assessment           Formative assessment

Discourse Analysis           The use of technology in       Semi-direct tests of speaking (SOPI)
Case studies                 speaking tests

As well, significant studies of the validity of oral language tests have used a combination of quantitative and qualitative methods. For example, O'Loughlin (2001) combines Rasch analysis, lexical density counts of candidate discourse, and case studies of individual candidates to establish his claim about the lack of comparability of two supposedly equivalent instruments, direct and semi-direct measures of oral proficiency. Iwashita et al. (2001) use a combination of Rasch analysis and careful study of candidate discourse to reach important conclusions about the applicability of Skehan's model of task performance conditions to the definition of dimensions of task difficulty in a speaking test. Brown (2003) combines Conversation Analysis and Rasch analysis of scores to illuminate the character of interlocutor behaviour in speaking tests and its impact on scores.
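One of the measures mentioned above, lexical density, is straightforward to compute: the proportion of content words among all word tokens in a stretch of discourse. The rough sketch below uses a small illustrative list of function words, not the classification scheme used by O'Loughlin:

```python
import re

# Small illustrative stop-word list; a real analysis would use a fuller
# classification of function words (and weight items, as some schemes do).
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "be", "i", "you", "he", "she", "it", "we",
    "they", "my", "your", "this", "that", "with", "for", "as", "do", "not",
}

def lexical_density(text):
    """Proportion of content (non-function) words among all word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)

print(round(lexical_density("The exam was difficult but I finished it"), 3))  # 0.375
```

The interest of the measure for comparability studies is that planned, written-like discourse tends to show higher density than unplanned speech, so systematic density differences between two test formats are evidence that the formats elicit different kinds of language.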

The choice of qualitative or quantitative research methods in validation studies of this kind does not appear, then, to involve a difficult epistemological choice. It is fair to characterize all such research aiming at providing evidence for or against the legitimacy of inferences from test scores, both quantitative and qualitative, as empiricist. This commonality is, I think, more significant than any differences or preference for qualitative over quantitative methods on a priori epistemological grounds.

Values and consequences in language tests

A more profound epistemological challenge within language testing research is associated with a deepening awareness of the social and political values embodied both in language constructs and in language testing practice. Let us return to Messick's validity matrix (Figure 2, above).

The bottom row of the matrix insists that all test constructs, and hence all interpretations of test scores, involve questions of value. Messick in other words sees all testing as fundamentally political in character – by that I mean that its constructs and practices are located in the arena of contestable values. This requires us to reveal and to defend the value implications of our test constructs (cell 3), and to consider the impact of tests, and in particular, the wanted and unwanted consequences of interpretations about test-takers in contexts of test use (cell 4). Messick (1989) locates these requirements in an extensive discussion of the epistemology of measurement research and practice, and demonstrates a deep awareness of the critiques of positivism. Moss (1998) similarly locates her concern for test consequences in a discussion of the work of Foucault, Bourdieu and other critics of the positivism of the social sciences.

The educational measurement field, like language testing, responded strongly to the idea that exploration of the consequences of the use of language test scores be seen as part of test score validation. While many figures endorsed this position (Shepard 1997; Linn 1997; Kane 2001), pointing out that validation of the use of test scores had been a central theme of the discussion of validity theory for over 40 years, others (Popham 1997; Mehrens 1997) objected on the grounds that it was not helpful to see this as part of validation research, which they wanted to restrict to investigation of the accuracy of test score interpretation. In some ways, however, this is a dispute more over wording than substance, as both Popham and Mehrens agree that the consequences of test use are a subject requiring investigation; they simply do not want to call it 'validation' research. In fact, Popham has recently published a book (Popham 2001) on what he sees as the damaging consequences of school and college testing in the United States.

The response of the field of language testing to Messick's notion of the value-laden character of assessment was slow at first. This was in part due to the fact that this aspect of the work was relatively underplayed in the influential original presentation of Messick's ideas in the work of Bachman (1990), who perhaps understandably emphasized the important implications for language testing research of the top two cells of the matrix.

In terms of the values implied in language test constructs, there is a growing critique of individualistic models of proficiency (McNamara 1997; Chalhoub-Deville & Deville 2004). A strong influence here has been work on discourse analysis which stresses the co-construction of performance (Jacoby & Ochs 1995:177):

One of the important implications for taking the position that everything is co-constructed through interaction is that it follows that there is a distributed responsibility among interlocutors for the creation of sequential coherence, identities, meaning, and events. This means that language, discourse and their effects cannot be considered deterministically preordained . . . by assumed constructs of individual competence. . .

This work has been most advanced in the area of oral language proficiency testing in face-to-face contexts, such as in the oral proficiency interview (e.g. Brown 2003) or paired-candidate interactions (e.g. Iwashita 1998). In each case, the performance of the interlocutor (interviewer, other candidate), as implicated in the performance of the candidate, poses difficult questions for assessment schemes which are framed in terms of reports on individual competence. In the words of Schegloff (1995:192):

It is some 15 years now since Charles Goodwin . . . gave a convincing demonstration of how the final form of a sentence in ordinary conversation had to be understood as an interactional product. . . Goodwin's account . . . serves. . . as a compelling call for the inclusion of the hearer in what were purported to be the speaker's processes.

Recent work has drawn attention to the potential of poststructuralist thought in understanding how apparently neutral language proficiency constructs are inevitably socially constructed and thus embody values and ideologies (McNamara 2001, 2006). It is worth noting here that the deconstruction of such test constructs applies no less to constructs in other fields of applied linguistics, notably second language acquisition.

There is also a growing realization that many language test constructs are explicitly political in character and hence not amenable to influences which are not political (Brindley 1998, 2001; McNamara 2006). Examples are the constructs embedded in competency-based language assessments used in adult vocational education and training as a direct result of government policy. In Australia, the ESL assessment of adult immigrants via the Certificate in Spoken and Written English (CSWE: Burrows 1993), which ostensibly is based loosely on a Hallidayan model of language use, is better understood as an expression of the outcomes-based and functionalist demands of current government policy on adult education and training. A more recent example is the extraordinary role within Europe (and beyond) of the Common European Framework for Languages (Council of Europe 2001). The political imposition of language testing constructs means that language test validation research will have an impact only to the extent that it is politicized. This is at odds with the liberal notion that academic research on language test validity will have an impact in and of itself, that is, through the sheer power of its reasoning. Thus it is possible that what is most significant about the Code of Ethics of the International Language Testing Association (ILTA) (ILTA 2000) is its political character, the power it may have institutionally to affect practice, rather than its inherent moral persuasiveness. ILTA itself is a potentially political force, although this potential is far from being maximized at present.

The relative impotence of scholars is not of course a new story. In a discussion of the role of Confucian thought in the Han Dynasty in China in the 3rd century BC, the late eminent Harvard historian John King Fairbank writes (Fairbank & Goldman 1998:63):

As W. T. de Bary (1991) points out, the Confucians did not try to establish 'any power base of their own. . . they faced the state, and whoever controlled it in the imperial court, as individual scholars. . . this institutional weakness, highly dependent condition, and extreme insecurity. . . marked the Confucians as ju ("softies") in the politics of imperial China.' They had to find patrons who could protect them. It was not easy to have an independent voice separate from the imperial establishment.

These explicitly political considerations bring us to the implications of the notion of test consequences. Returning to the example of the asylum seekers given earlier in this paper, a concern for language test consequences means that even were the procedure currently used to be carried out on the basis of a more adequate and defensible sociolinguistic construct, it would necessarily have serious consequences for those concerned, and would raise complex issues of ethics. The ethical responsibilities of language testers as an aspect of their professionalism are now a major topic of debate within the profession, for example in the work of Spolsky (1981, 1997), Hamp-Lyons (1997), Kunnan (1997, 2000), and particularly Davies (1997, 2004).

A more radical response to the issue of the consequences of test use involves direct political critique of the institutional character of assessment. Most significant here is the work of Elana Shohamy (1998, 2001a, 2001b; see also Lynch 2001), who, following Pennycook (2001), has introduced the term critical language testing to describe a research orientation committed to exposing undemocratic practices in assessment. This important work is still in its early stages. An excellent example of the potential for a kind of Foucauldian archival research on language tests is the account in Spolsky (1995) of the political and institutional forces surrounding the initial development and struggle for control of the Test of English as a Foreign Language (TOEFL).

A growing awareness of the largely managerial and institutional function of language assessment has been accompanied by an increasing sense of the neglect by assessment research of the needs of teachers and learners as they engage with the process of learning. This neglect is evident in the ongoing focus on high-stakes proficiency assessment as the main staple of academic research on language testing, for example as measured by papers in Language Testing or papers given at the Language Testing Research Colloquium (LTRC), although recent years have seen a welcome correction to this picture, with a special issue of Language Testing devoted to this subject, and a strong thematic focus on classroom assessment at the 2004 LTRC. Particularly problematic is the fact that the very notion of generalizability of assessment, so fundamental (as we have seen) to current discussions of validity, can sometimes compromise the potential of assessment to serve the needs of teachers and learners. The need for assessments to be comparable, the basis for reliability, one aspect (as we have seen) of generalizability, places constraints on classroom assessment, particularly in the area of communicative skill (speaking and writing), that most teachers cannot meet (Brindley 2000). Moreover, any attempt to meet such demands constrains the possibilities of assessment in the classroom. We urgently need to explore ways in which assessment concepts can be better used in classrooms to benefit learning and teaching (Leung 2004).

Tim McNamara

In conclusion, I think that language testing researchers should welcome and participate vigorously in the debates about epistemology and research paradigms as relevant to language testing. Engagement with debate of this kind is precisely the sort of interrogation encouraged by validity theory as envisaged by Messick in particular. It will be clear that I think that we should not be constrained from embracing the full range of research methods open to us, be they neo-positivist or otherwise, if they deepen (as I believe they do) our ability to understand the bases for inferences in language tests. The greater challenge, I believe, is to use a recognition of the value-laden and political character of language tests to re-think their deeper meaning and social function. But this will not be easy, if the work of some of Messick's heirs is an indication. Mislevy's otherwise brilliant work at no time engages with the issue of the consequences of test use. And while Kane does in principle embrace a concern for consequences, he does not develop a methodology for reflecting on them or investigating them other than through the general structure of the validation argument he proposes, which reduces the issue to a technical and empirical one. This narrow position is reflected in a recent paper by Bachman (2005), who, similarly concerned with the manageability of the validation task, proposes that test use consequences be investigated through a second, parallel validation argument modelled on that used to validate test score interpretation as it were outside a context of use. I would argue that the focus of our concern about the consequences of test score use and the values implied in testing practice needs to be much broader. Here I will end with the words of Moss (1998:11). In discussing the social impact of assessment practice, she says:

This perspective on the dialectical relationship between social reality and our representation of it has implications for understanding the crucial role of evidence about consequences in validity research. . . . The practices in which we engage help to construct the social reality we study. While this may not be apparent from the administration of any single test, over time the effects accumulate. . . . While this argument has some implications for validity theory as it relates to specific interpretations and uses of test scores, the import spills over . . . to encompass the general practice of testing. . . . The scope of the [validity] argument goes well beyond . . . test specific evaluation practices; it entails ongoing evaluation of the dialectical relationship between the products and practices of testing, writ large, and the social reality that is recursively represented and transformed.

Notes

1. It must be said that there are schools of thought in language testing which don't always reflect such scepticism; three come to mind:


Validity and values

– the believers in direct testing, such as many of the proponents of the ACTFL scales (Liskin-Gasparro 1984; Lowe 1985), and in general those advocating what Bachman (1990) calls the 'real life' approach;

– many of those promoting a scale and framework approach to language testing, such as the Council of Europe's Common European Framework of Reference for Languages (Council of Europe 2001), where concerns about empirical validation of the scales in question (North & Schneider 1998) are matched, if not surpassed, by efforts to negotiate their political acceptability (Trim 1997);

– the small but growing influence of systemic functional linguistics in language assessment (Matthiessen, Slade, & Macken 1992; Mincham 1995; South Australian Curriculum and Standards Authority 2002), which stresses the 'objectivity' of the assessment as a matter of principle rather than of empirical investigation.

2. The demand for reliability proves, however, to be problematic in certain key assessment contexts, particularly the classroom, a point to be addressed near the end of the paper.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OUP.

Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.

Block, D. (1996). Not so fast: Some thoughts on theory culling, relativism, accepted findings and the heart and soul of SLA. Applied Linguistics, 17(1), 63–83.

Brindley, G. (1998). Outcomes-based assessment and reporting in language programs: A review of the issues. Language Testing, 15, 45–85.

Brindley, G. (2000). Task difficulty and task generalisability in competency-based writing assessment. In G. Brindley (Ed.), Issues in immigrant English language assessment. Volume 1 (pp. 45–80). Sydney: National Centre for English Language Teaching and Research, Macquarie University.

Brindley, G. (2001). Outcomes-based assessment in practice: Some examples and emerging insights. Language Testing, 18(4), 393–407.

Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25.

Buck, G. (1991). The testing of listening comprehension: An introspective study. Language Testing, 8(1), 67–91.

Burrows, C. (1993). Assessment guidelines for the Certificate in Spoken and Written English. Sydney: New South Wales Adult Migrant Education Service.

Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–383.


Chalhoub-Deville, M. & Deville, C. (2004). A look back at and forward to what language testers measure. In T. McNamara, A. Brown, L. Grove, K. Hill, & N. Iwashita (Section Eds.), Second language testing and assessment, Part VI of E. Hinkel (Ed.), Handbook of research in second language teaching and learning. Mahwah, NJ: Lawrence Erlbaum.

Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). New York: CUP.

Chapelle, C. A., Enright, M. E., & Jamieson, J. (2004). Issues in developing a TOEFL validity argument. Paper presented at the Language Testing Research Colloquium, Temecula, CA, March.

Council of Europe (2001). A common European framework of reference for language learning. Cambridge: CUP.

Davies, A. (Ed.). (1997). Special issue: Ethics in language testing. Language Testing, 14(3).

Davies, A. (Ed.). (2004). Special issue: The ethics of language assessment. Language Assessment Quarterly, 2&3.

Eades, D., Fraser, H., Siegel, J., McNamara, T., & Baker, B. (2003). Linguistic identification in the determination of nationality: A preliminary report. Language Policy, 2(2), 179–199.

Fairbank, J. K. & Goldman, M. (1998). China: A new history (enlarged edition). Cambridge, MA and London: Belknap Press of Harvard University Press.

Firth, A. & Wagner, J. (1997). On discourse, communication, and (some) fundamental concepts in SLA research. Modern Language Journal, 81, 285–300.

Gregg, K. R., Long, M. H., Jordan, G., & Beretta, A. (1997). Rationality and its discontents in SLA. Applied Linguistics, 18(4), 538–558.

Gruba, P. (1999). The role of digital video media in second language listening comprehension. PhD Dissertation, University of Melbourne.

Hamp-Lyons, L. (1997). Ethics in language testing. In D. Corson (Series Ed.) & C. Clapham (Vol. Ed.), Language testing and assessment: Vol. 7, Encyclopedia of language and education (pp. 323–333). Dordrecht: Kluwer.

Hamp-Lyons, L. & Lynch, B. K. (1998). Perspectives on validity: A historical analysis of language testing conference abstracts. In A. J. Kunnan (Ed.), Validation in language assessment: Selected papers from the 17th Language Testing Research Colloquium, Long Beach (pp. 253–276). Mahwah, NJ: Lawrence Erlbaum.

International Language Testing Association (ILTA) (2000). Code of ethics for ILTA. Language Testing Update, 27, 14–22.

Iwashita, N. (1998). The validity of the paired interview in oral performance assessment. Melbourne Papers in Language Testing, 5(2), 51–65.

Iwashita, N., McNamara, T. F., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design. Language Learning, 51(3), 401–436.

Jacoby, S. & Ochs, E. (1995). Co-construction: An introduction. Research on Language and Social Interaction, 28(3), 171–183.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.


Kane, M. T. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2(3), 135–170.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.

Kunnan, A. (1997). Connecting fairness and validation. In A. Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current developments and alternatives in language assessment (pp. 85–105). Jyväskylä, Finland: University of Jyväskylä.

Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers of the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 1–10). Cambridge: UCLES & CUP.

Lazaraton, A. (2000). Current trends in research methodology and statistics in applied linguistics. TESOL Quarterly, 34(1), 175–181.

Leung, C. (2004). Classroom teacher assessment of second language development: Construct as practice. In T. McNamara, A. Brown, L. Grove, K. Hill, & N. Iwashita (Section Eds.), Second language testing and assessment, Part VI of E. Hinkel (Ed.), Handbook of research in second language teaching and learning. Mahwah, NJ: Lawrence Erlbaum.

Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16.

Liskin-Gasparro, J. E. (1984). The ACTFL proficiency guidelines: Gateway to testing and curriculum. Foreign Language Annals, 17(5), 475–489.

Lowe, P. (1985). The ILR proficiency scale as a synthesising research principle: The view from the mountain. In C. J. James (Ed.), Foreign language proficiency in the classroom and beyond (pp. 9–53). Lincolnwood, IL: National Textbook Company.

Lumley, T. (2000). The process of the assessment of writing performance: The rater's perspective. PhD Dissertation, The University of Melbourne.

Lumley, T. & Brown, A. (2004). Research methods in language testing. In T. McNamara, A. Brown, L. Grove, K. Hill, & N. Iwashita (Section Eds.), Second language testing and assessment, Part VI of E. Hinkel (Ed.), Handbook of research in second language teaching and learning. Mahwah, NJ: Lawrence Erlbaum.

Lynch, B. K. (1996). Language program evaluation. Cambridge: CUP.

Lynch, B. K. (2001). Rethinking assessment from a critical perspective. Language Testing, 18(4), 351–372.

Lynch, B. K. & Hamp-Lyons, L. (1999). Perspectives on research paradigms and validity: Tales from the Language Testing Research Colloquium. Melbourne Papers in Language Testing, 8(1), 57–93.

McNamara, T. F. (1997). 'Interaction' in second language performance assessment: Whose performance? Applied Linguistics, 18(4), 446–466.

McNamara, T. F. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349.

McNamara, T. F. (2006). Validity in language testing: The challenge of Sam Messick's legacy. Language Assessment Quarterly, 3(1).

McNamara, T. F., Hill, K., & May, L. (2002). Discourse and assessment. Annual Review of Applied Linguistics, 22, 221–242.


McNamara, T. F. (2003). Validity and reliability in the senior school curriculum: New takes on old questions. Invited presentation, Australasian Curriculum, Assessment and Certification Authorities (ACACA) 2003 National Conference, Adelaide, July.

Matthiessen, C. M. I. M., Slade, D., & Macken, M. (1992). Language in context: A new model for evaluating student writing. Linguistics in Education, 4(2), 173–195.

Mehrens, W. A. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16(2), 16–18.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.

Mincham, L. (1995). ESL student needs procedures: An approach to language assessment in primary and secondary school contexts. In G. Brindley (Ed.), Language assessment in action (pp. 65–91). Sydney: National Centre for English Language Teaching and Research, Macquarie University.

Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379–416.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19(4), 477–496.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of assessment arguments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.

Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6–12.

North, B. & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217–263.

O'Loughlin, K. (2001). The equivalence of direct and semi-direct speaking tests. Studies in Language Testing, 13. Cambridge: UCLES/CUP.

Pennycook, A. (2001). Critical applied linguistics: A critical introduction. Mahwah, NJ: Lawrence Erlbaum.

Popham, W. J. (1997). Consequential validity: Right concern – wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.

Popham, W. J. (2001). The truth about testing. Alexandria, VA: Association for Supervision and Curriculum Development.

Schegloff, E. A. (1995). Discourse as an interactional achievement III: The omnirelevance of action. Research on Language and Social Interaction, 28(3), 185–211.

Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8, 13, 24.

Shohamy, E. (1998). Critical language testing and beyond. Studies in Educational Evaluation, 24, 331–345.

Shohamy, E. (2001a). The power of tests: A critical perspective on the uses of language tests. London: Pearson.

Shohamy, E. (2001b). Democratic assessment as an alternative. Language Testing, 18(4), 373–391.

Skehan, P. (1998). A cognitive approach to language learning. Oxford: OUP.

South Australian Curriculum and Standards Authority (2002). South Australian Curriculum, Standards and Accountability Framework (SACSAF) English as a Second Language (ESL) Scope and Scales. Adelaide, SA: Author.


Spolsky, B. (1981). Some ethical questions about language testing. In C. Klein-Braley & D. Stevenson (Eds.), Practice and problems in language testing (pp. 5–21). Frankfurt: Peter Lang.

Spolsky, B. (1995). Measured words. Oxford: OUP.

Spolsky, B. (1997). The ethics of language gatekeeping tests: What have we learned in a hundred years? Language Testing, 14(3), 242–247.

Trim, J. L. M. (1997). The proposed Common European Framework for the description of language learning, teaching and assessment. In A. Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current developments and alternatives in language assessment (pp. 415–421). Jyväskylä: University of Jyväskylä and University of Tampere.

Yi'an, W. (1998). What do tests of listening comprehension test? A retrospection study of EFL test-takers performing a multiple-choice task. Language Testing, 15(1), 21–44.


L2 vocabulary acquisition theory

The role of inference, dependability and generalizability in assessment

Carol A. Chapelle
Iowa State University

Vocabulary acquisition is discussed from an assessment perspective, which provides the starting point for examining the concepts of inference, dependability, and generalizability. It argues that the construct of vocabulary needs to be defined in L2 vocabulary research if inference, dependability and generalizability are to have meaning. Inference in this context is clarified through the distinction drawn between a construct inference linking the theoretical and operational definitions underlying assessment, and a theory inference linking results from one study to the theoretical construct framework from which construct definitions are derived. Based on these concepts, I suggest that progress seems most evident in recent work on vocabulary that links acquisition, assessment, and instruction.

The title of my paper might give away my perspective. It reflects the idea that assessment could play a role in the idiosyncratic, gradual, and individual process of L2 vocabulary acquisition and therefore perhaps identifies me as a reliability-checking, truth-seeking logical positivist.1 From this perspective, I would argue that many of the applied problems in language teaching and assessment can best be addressed by theory that is seen as true enough to be useful. I started looking at L2 vocabulary acquisition theory many years ago to develop a theory-based set of principles for evaluating constructed vocabulary responses on language tests, and have since come back to vocabulary acquisition theory for evaluating materials intended to promote incidental vocabulary learning. Both of these applied problems would be more likely to find solutions from a theory that at least specifies what L2 vocabulary knowledge2 consists of. L2 acquisition theory should also hypothesize how such knowledge is acquired and what constitutes more or less vocabulary knowledge, but such hypotheses need to be described within a theoretical construct framework for L2 vocabulary knowledge.

When I started looking into these issues, L2 vocabulary acquisition was characterized as an area without an accumulated body of findings from studies "concerned with establishment of a theory of the lexicon" (Gass 1988:92). Aspects of the lexicon had been studied in detail by psycholinguists (e.g., Aitchison 1987; Sternberger & MacWhinney 1988) as well as L2 researchers concerned with teaching (Nation 1990), assessment (Perkins & Linnville 1987), and acquisition (Ard & Gass 1987). However, because of the sparseness of the research and the vastly different concerns of the researchers, the work at this time did not add up to coherent perspectives that one might draw on for practice. More recently, Singleton (1999) noted that ". . . enough 'good' research has been published on the L2 mental lexicon in recent times to warrant a substantial review of the studies . . ." (p. 5). Another recent publication offers a healthy number of practice-oriented studies with implications for L2 vocabulary teaching (Coady & Huckin 1997). As lexicogrammatical patterns have recently taken a central position in much of the study of second language acquisition, it is timely to reconsider to what extent findings might ultimately add up to the type of useful theory that might guide practice.

Even if every study contributes individual pieces to the store of professional knowledge about vocabulary acquisition, the larger quantity of recent studies would not necessarily imply progress toward developing a theory of L2 vocabulary acquisition. Contributing to an empirically supported theory of the L2 lexicon would require researchers to interpret research results in view of the inferences underlying observed performance on research tasks as well as its dependability and generalizability – in other words, the assessment basis for the research. I will begin by explaining why I believe that vocabulary acquisition theory-for-practice rests on principles of assessment, and then, from this perspective, I explain the connection between assessment and the central concepts of this volume – inference, dependability, and generalizability. I will argue that in theory-related vocabulary assessment, inference, dependability and generalizability need to be viewed in relation to the construct definition that underlies the study of vocabulary acquisition, and the construct framework from which construct definitions are derived. Based on these concepts, I identify three directions in which progress seems evident in developing theory-for-practice in L2 vocabulary acquisition.


1. The assessment basis for vocabulary theory

Carter and McCarthy (1988) opened their book with the types of questions about vocabulary acquisition for which students and teachers would like answers. For example, "How many words provide a working vocabulary in a foreign language?" and "What are the best ways of retaining new words?" (pp. 1–2). Defensible answers to these questions, like the problems I mentioned above about scoring constructed vocabulary responses and evaluating tasks for incidental vocabulary acquisition, would require a means for the researcher to determine whether or not a learner knows a given set of words.

The first question would require a researcher to identify learners who were able to perform in a criterion context, such as studying at an English-medium university, serving in a restaurant where English is used as the medium of communication, or writing reports summarizing marketing information about corn in English. Having identified the appropriate language users, the researcher would have to assess their vocabulary size by constructing and administering an assessment that could be demonstrated to be appropriate for estimating vocabulary size (e.g., Nation 1993). The second question would require an investigation of learners who used different strategies for learning new words. Two groups of learners might be taught different learning strategies or given different learning materials. Alternatively, a within-group design would allow the researcher to gather observations of strategies that learners used for individual words and show the connection between particular strategies and word knowledge (Plass, Chun, Mayer, & Leutner 1998; Nassaji 2003). Regardless of how the learning processes for retaining new words were determined, however, the ultimate results of the research would depend on assessment of which words had been learned at a given point in time.
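The band-sampling logic behind size estimation of this kind can be sketched in a few lines. This is only an illustration of the general procedure, not any particular published instrument; the band size, item counts, scores, and the function name are all invented for the example.

```python
# Hypothetical sketch of frequency-band vocabulary size estimation.
# All numbers here are invented for illustration.

BAND_SIZE = 1000  # each band covers the next 1,000 most frequent word families

def estimate_vocab_size(band_results):
    """Estimate vocabulary size from (correct, sampled) counts per band.

    The proportion of sampled items answered correctly in a band is taken
    as the proportion of that band's words the learner knows; summing the
    implied counts across bands gives the size estimate."""
    return round(sum(correct / sampled * BAND_SIZE
                     for correct, sampled in band_results))

# A learner answers 18/18, 15/18, 9/18, and 3/18 items in the first four bands:
print(estimate_vocab_size([(18, 18), (15, 18), (9, 18), (3, 18)]))  # prints 2500
```

The sketch makes the inferential step visible: the estimate is a projection from a small sample of items onto a much larger population of words, which is exactly why the sampling and the test itself have to be demonstrably appropriate.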

Robust answers to neither of these questions could come from a single study; rather, evidence would accumulate gradually over time through multiple studies targeting the same issues and building upon one another. If one study assesses vocabulary through a multiple-choice, four-alternative test of semantic recognition knowledge, and the next study assesses vocabulary knowledge through the use of a C-test, for example, how should the results of these two studies be interpreted to address the question about successful strategies? If individual research studies are to contribute to the type of theory-for-practice that Carter and McCarthy were requesting, each needs to be interpreted in a way that appropriately informs an existing knowledge base about L2 vocabulary acquisition.


Assessment in any individual study, as well as interpretation of results across studies, relies on the researcher's understanding of the inferential processes inherent in assessment. In the first example, appropriate interpretation of an assessment of vocabulary size requires the researcher to recognize that the learners' scores on the vocabulary size test are not themselves the learners' vocabulary size. Rather, vocabulary size is inferred from a summary of observed performance on the test. Similarly, in the second example, the learners' results on the vocabulary posttest are not equivalent to the learners' knowledge; rather, the learners' knowledge is inferred from results. In the third example, the results of the two tests do not address exactly the same aspects of vocabulary knowledge. Even though each test could be argued to indicate something about vocabulary knowledge, the two measure different aspects of vocabulary knowledge. In each of these cases, an interpretation of results in a way that potentially contributes to vocabulary theory requires appropriate interpretations of what the scores on the vocabulary assessments mean. In none of the examples could the vocabulary scores be interpreted to mean the level of knowledge on all aspects of vocabulary that applied linguists would consider to be part of vocabulary knowledge. Instead, the scores obtained in research require the researcher to make inferences about vocabulary knowledge.
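The gap between a score and the knowledge it is taken to indicate can be made concrete with a small simulation (my illustration, not from the chapter): the learner's knowledge is held fixed, yet each sampled test yields a somewhat different score, so size has to be inferred from performance rather than read off it. The pool size, test length, and seed are arbitrary.

```python
# Illustrative simulation: a learner's vocabulary knowledge is fixed, but each
# test administration samples different words, so the observed score varies
# and knowledge must be inferred rather than equated with any one score.
import random

random.seed(1)

KNOWN = set(range(600))   # the learner truly knows 600 of a 1,000-word pool
POOL = list(range(1000))

def observed_score(n_items=40):
    """Number of known words among n_items randomly sampled from the pool."""
    return sum(w in KNOWN for w in random.sample(POOL, n_items))

scores = [observed_score() for _ in range(5)]
estimates = [s / 40 * 1000 for s in scores]  # each score implies a size estimate
print(scores)     # five administrations, five (generally different) scores
print(estimates)  # none of these *is* the learner's knowledge of 600 words
```

The point of the sketch is the one made in the paragraph above: each administration yields an estimate that scatters around the fixed underlying knowledge, which is precisely the dependability problem that a single observed score cannot settle.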

2. Inference in assessment

Inference in assessment refers to the logical connection that the researcher draws between observed performance and what the performance means. The process of inference has been an important part of the conceptual underpinnings of educational and psychological measurement throughout the modern history of the field (e.g., Cronbach & Meehl 1955; Messick 1989). Agreement among different schools of thought has not always existed on the kind of inferences made on the basis of scores, but researchers do agree that the score (or other summary of performance on a single occasion) is seldom in itself what the researcher is interested in. Instead, a gap exists between the score and the interpretation of interest. Recent work in assessment has helped to clarify the multifaceted inferences that are associated with the interpretation of test performance (Kane 1992; Kane, Crooks, & Cohen 1999; Kane 2001). Whereas in the past the types of inferences associated with assessment were tied to the school of thought of the researcher, today it is recognized that multiple types of inferences are associated with scores, and it is therefore up to the researcher to specify the types of inferences that are central to the interpretations that the researcher wishes to make. In vocabulary acquisition research, at least two types – a construct inference and a theory inference – are critical for making sense of research results in a way that holds potential for informing a vocabulary acquisition theory-for-practice, and additional inferences are also relevant.

Construct inference

Virtually any researcher investigating vocabulary acquisition theory draws an inference between the observed performance summary and the construct of vocabulary that the performance is intended to signify. Figure 1 illustrates the inferential relationship between the observed performance on the vocabulary measure and the construct definition assumed as the basis for the test. For example, the pioneering studies of the L2 lexicon conducted by Paul Meara in the 1970s and 1980s began with the aim of developing a more general understanding of the L2 lexicon than what could be obtained by examining error in a post hoc fashion or by probing the depth of knowledge of the idiomatic possibilities of a single lexical item. He attempted what he called a "more fundamentalist account of what interlanguage is", suggesting that researchers consider "a very crude model of what a lexicon might look like, and use this to ask some simple questions about learners' lexicons" (1984:231). He was attempting to get away from reliance on the observed data to make explicit the object of real interest – the more general theory of the L2 lexicon.

Figure 1. The inferential link between construct definition and observed performance

Meara (1984) defined the construct underlying performance on a word association test as the organization of the L2 lexicon, suggesting the use of experimental data such as those obtained from word association tests to support hypotheses about lexicon organization. In a discussion of that paper published in the same volume, Michael Sharwood Smith argued that "a move to make research more fundamentalist by adopting the methods and techniques of experimental psycholinguistics will only be worthwhile if accompanied by a serious consideration of the theoretical underpinnings" and he pointed out that "experimental data cannot of themselves inform us about the nature of the learner's current mental lexicon" (1984:239). In other words, Sharwood Smith was asking Meara to justify the inference that he was assuming between the construct of vocabulary organization and the performance on the word association tests. Methods for justification of this type of inference are the central research objective in language assessment.

The idea of the inferential link between the observed performance on a vocabulary measure and the construct of vocabulary knowledge would be readily accepted by most positivist, theory-seeking vocabulary acquisition researchers today. In other words, the fundamental idea that Meara suggested about investigating the mental lexicon is the central agenda of L2 vocabulary researchers. However, the purest positivist would insist that the construct definition would have to precede and strongly influence the selection of the test, and would question the extent to which this prescribed order was followed in studies using the word association tests for the study of lexicon organization. On this issue, I would adopt a more neo-positivist position, which would see construct definition and test development as an iterative process, constrained by professional knowledge and operational realities.

Theory inference

A less acknowledged inference is the one linking the theory underlying the vocabulary construct of the test (e.g., vocabulary organization) with a broader theoretical construct framework of vocabulary knowledge. As illustrated in Figure 2, the construct that any given test is intended to measure should typically be considered to be linked through inference to some aspects of a more complete theoretical construct framework. In research and practice, the theoretical construct framework for vocabulary knowledge is recognized to encompass multiple aspects of knowledge. For example, Nation's (2001:27) practice-oriented perspective includes in the framework of word knowledge the receptive and productive knowledge of (1) structural word knowledge (including spoken and written forms, and word parts), (2) knowledge of word meaning (including meaning-form relations, concepts and referents, and associations), and (3) word use (including grammatical functions, collocations, and constraints on use).

Figure 2. The inferential links between construct definition and observed performance, and between construct definition and the theoretical construct framework

From an assessment perspective, a theoretical construct framework would also include aspects such as vocabulary size, organization, and depth, i.e., other aspects that have been assessed as a means of detecting vocabulary acquisition (Read & Chapelle 2001).

It should be evident from these examples of dimensions of vocabulary knowledge that might comprise the construct framework that a single assessment or even multiple assessments would not succeed in assessing all dimensions of the framework within a single study or set of studies. No research has ever attempted to do so. Rather, the construct definition, which can encompass one or more aspects from the framework, offers a more realistic target for one or more assessments in vocabulary research.

The idea of a vocabulary construct framework is related to, but not the same as, the framework for communicative language ability (Bachman 1990; Bachman & Palmer 1996) that has been productive in conceptualizing and interpreting language tests. The framework for communicative language ability contains "vocabulary" as a component within "grammatical knowledge," which in turn is one aspect of language knowledge that comes into play along with other factors in language use. The theoretical framework for vocabulary takes vocabulary as the starting point, detailing parameters that pertain to deployment of vocabulary knowledge, such as size, knowledge of morphological characteristics, and vocabulary strategies.


Carol A. Chapelle

In positivist, measurement terms, constructs defined within the theoretical vocabulary framework would be within the nomothetic span (Cronbach & Meehl 1955; Embretson 1983) of constructs derived from the communicative language ability framework, but the specific relationships and their strengths would depend on the specific constructs of interest. For example, a construct such as writing ability for formal business-like letters might be defined through the relevant language knowledge (i.e., the relevant organizational and pragmatic competence) and strategic competence (e.g., assessment of audience) from a communicative language ability construct framework. A vocabulary construct such as knowledge of morphology might be hypothesized to be strongly related to the writing construct because of the need for this knowledge in deployment of precise and correct word forms in such writing. A vocabulary construct such as size might be hypothesized to be less related to the writing construct because during writing the vocabulary that the examinee needs to handle is chosen by the examinee.

The distinction between the construct definition and the theoretical framework for vocabulary has important implications in SLA research. First, if the construct is defined narrowly as something that can actually be assessed by one or multiple assessments, the researchers' responsibility for defining the construct of interest and justifying the inference can be taken as something that can be accomplished, or at least addressed. The idea of justifying inferences from one or multiple measures to the entire theoretical framework is overwhelming, and an assessment held to that standard will inevitably fail. Second, since scores on an assessment are not isomorphic with vocabulary knowledge, the theoretical relevance of any test results needs to be interpreted in view of the two inferential steps that link observed performance to a carefully delineated construct and ultimately to a theoretical framework. Third, since the theoretical framework is larger and more complex than any single construct investigated by one or multiple measures in a study, it follows that the study of all facets of the framework requires the use of a variety of types of assessments. This implication is contrary to proposals that all vocabulary researchers adopt the same methods of measurement in vocabulary acquisition research to allow for comparable results. What is needed is not a single method of measurement but defensible inferences to appropriate constructs and a single framework.

Inference, dependability, and generalizability

Recent perspectives from assessment (e.g., Kane 2001) present the ideas of inference, dependability, and generalizability as closely related. According to relevant perspectives in assessment, the construct and theory inferences described above would be only two types of potentially many inferences associated with observed performance; an additional type of inference is the generalization that the researcher typically draws from performance on the test or research task to performance on other similar tasks. The idea is that when performance is observed on one task, it is assumed that this performance is a good representative of what would be displayed, on average, across a large number of similar tasks, as illustrated in Figure 3. In other words, the researcher is inferring similar performance across tasks. When put this way, the generalization inference encompasses the same idea as dependability. That is, the inference that assumes that performance on the test task generalizes to performance on other similar tasks also assumes that performance is dependable across tasks. Generalization in applied linguistics is also used to refer to yet another inference, which more recently has been called "extrapolation." This refers to the inference linking performance on the task (or collection of similar tasks) to performance that would be displayed beyond the task(s).

Figure 3. Construct, generalization, and extrapolation inferences embedded in the interpretation of performance on assessments

Both the generalization inference (dependability) and the extrapolation inference are important for L2 acquisition theory for a number of reasons; in the interest of conciseness, however, I will focus on these inferences for the most part in terms of how they are related to the construct that the vocabulary assessment is intended to measure. In my view, the construct definition is central to the various assessment-based inferences associated with vocabulary assessment in L2 vocabulary research.


Construct definition as central

Construct definition is central to the inferential process of assessment because of its relationship with each of the inferences required for interpretation of performance, as summarized in Table 1. Construct definition is central to the construct inference because it provides a description of the intended construct against which the test design and test performance can be evaluated. The process of justifying the construct inference is construct validation, which encompasses a range of quantitative and qualitative methods for investigating the extent to which the construct inferences are justified. Interpretation of results relative to a broader theoretical framework, i.e., the theory inference, requires identification of the specific components of the overall theoretical construct framework that the test is intended to measure. The construct definition therefore serves as intermediary between observed performance and the theoretical construct framework, about which evidence is needed for L2 vocabulary acquisition theory. The link between the construct and the theoretical framework can be justified only insofar as both are well defined. Because of the limited recognition of this inference in L2 vocabulary research, methods for justifying its appropriateness need to be explored.

The generalization inference is connected less directly to the construct definition, but in L2 vocabulary acquisition research an important connection exists because the construct definition includes the range of knowledge that is important for specifying the range of tasks that form the relevant domain of tasks for generalization. The more narrowly and precisely the construct is defined, the more likely it will be that the researcher can define and sample from the relevant universe of tasks. Dependability is another way of stating the same issue. Evidence that performance can be generalized to the appropriate domain suggests that data are sufficiently stable and free of error to trust them as evidence about a theoretical construct. In other words, dependability refers to having enough data and good enough data. Whether or not enough data have been yielded from performance depends on the scope of the construct definition. A construct narrowly defined as "vocabulary for reading American menus" might require fewer observations to obtain a stable sample than a more broadly defined construct such as "vocabulary ability for academic study in English-speaking North America". Similarly, whether the data are good enough depends in large part on what the data are supposed to reflect. Bachman (1990) put this concept very clearly: "The concerns of reliability and validity can ... be seen as leading to two complementary objectives ...: (1) to minimize the effects of measurement error, and (2) to maximize the effects of the language abilities that we want to measure" (1990:161). In later work, Bachman and Palmer (1996) refer to "measurement error" as "unmotivated variation." This defines the issue even more clearly: in order to distinguish error or unmotivated results from construct-relevant or motivated ones, a clear construct definition is essential.

Table 1. Connections of four types of inferences to construct definition

Construct: Provides a description of the intended construct against which the test design and test performance can be evaluated.

Theory: Identifies the specific components of the overall theoretical construct framework that the test is intended to measure relative to the whole.

Generalization (Dependability): Includes the range of knowledge that is important for specifying the range of tasks that form the relevant domain of tasks for generalization.

Extrapolation: Includes the type of knowledge that should clarify the domain to which the results should extrapolate.
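The notion that dependability means "enough data and good enough data" can be given a concrete numerical face. As an illustrative sketch (not drawn from the chapter), Cronbach's alpha estimates how consistently a set of similar tasks ranks the same learners; the learner-by-task scores below are hypothetical:

```python
# Illustrative only: estimating the dependability of scores across a set of
# similar vocabulary tasks with Cronbach's alpha. All data are invented.

def cronbach_alpha(scores):
    """scores: one list per learner, one score per task."""
    n_tasks = len(scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Variance of each task's scores across learners
    task_vars = [variance([row[t] for row in scores]) for t in range(n_tasks)]
    # Variance of learners' total scores across all tasks
    total_var = variance([sum(row) for row in scores])
    return n_tasks / (n_tasks - 1) * (1 - sum(task_vars) / total_var)

# Four learners' scores on five similar vocabulary tasks (hypothetical data)
scores = [
    [8, 7, 9, 8, 7],
    [5, 6, 5, 4, 6],
    [9, 9, 8, 9, 9],
    [3, 4, 3, 4, 3],
]
print(round(cronbach_alpha(scores), 2))  # → 0.98
```

With fewer tasks, or with tasks tapping different constructs, the coefficient drops, which is the numerical counterpart of the point made above: how many observations are "enough" depends on the scope of the construct definition.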

The extrapolation inference is related to construct definition because the way the construct is defined should help to clarify the domain to which the results should extrapolate. Again, the issue is the scope of the construct definition. For example, a construct stated as "vocabulary organization" does not imply any particular domain of extrapolation. Instead, if the construct definition refers to vocabulary organization as it is structured during particular activities (Votaw 1992), the general specification of the activities offers guidance to specify the appropriate domain of extrapolation. Another example would be the contrast between the construct of vocabulary size in general as I have referred to it above and the more specific receptive vocabulary size for university studies in Dutch universities (Hazenberg & Hulstijn 1996).

Progress for vocabulary acquisition research

Having described the problem of developing L2 vocabulary acquisition theory in terms of the inferences that researchers need to be able to make from observed performance data, I can now assess the extent to which the more plentiful recent work in this area can be seen as representing progress. Three areas of progress are apparent.


Probing construct definition

In view of the central role of construct definition for all inferences, the explicit discussion of vocabulary knowledge as a construct seems to mark progress. Meara's suggestion to look at a "crude model of what a lexicon might look like" called for considering vocabulary as a construct. Telchrow (1982) distinguished two constructs – receptive and productive vocabulary ability – a distinction that has continued to be useful in the work of Laufer (1998), which looks at the development of these different but related constructs. Other research attempted to expand on the construct that had been defined primarily through the dimensions of size and mental organization, and with lists characterizing word knowledge. Ard and Gass (1987), for example, explored the syntactic dimensions of the construct, and Read (1993, 1998) attempted to develop a measurable construct of vocabulary depth. Chapelle (1998) made explicit three perspectives from which a construct definition for vocabulary can be developed: a trait definition is specified as general knowledge intended to be relevant across many domains, a behaviorist definition is stated in terms of the contexts in which performance is expected to be evident, and an interactionalist definition includes the relevant knowledge delimited by the contexts of interest. Rather than expanding on or detailing a composite vocabulary construct, these represent three different stances that can be taken in defining vocabulary as a construct.

In today's work, researchers have not gone so far as to describe the construct they are investigating in terms of the three approaches, but nevertheless, it seems that we have passed the time when vocabulary ability was a fuzzy concept for which some evidence could be gained from any learner performance that happened to be on hand. Instead, researchers are investigating specifically defined aspects of the construct such as productive knowledge of derivative morphology (Schmitt & Zimmerman 2002) with tests that are argued to support valid construct inferences. In like fashion, some of the vocabulary tests that have been developed target a specific construct such as depth of vocabulary knowledge (Wesche & Paribakht 1996). These steps forward might eventually benefit L2 vocabulary theory through increased clarity in specifying the construct definition underlying a specific test as distinct from the theoretical construct framework.


Validation studies

Perhaps the most obvious sign of progress in L2 vocabulary research is the increase in validation studies of measures that are used in both L2 teaching and research. A validation study focuses specifically on producing evidence for the appropriateness of the inferences made from assessment scores. Two such validation studies were reported recently in Language Testing concerning vocabulary measures that have been in use for many years in SLA research. One study found reason to question the inferences associated with the Yes/No vocabulary test (Beeckmans, Eyckmans, Janssens, Dufranne, & Van de Velde 2001), and another presents a range of validity evidence concerning the Vocabulary Levels Test (Schmitt, Schmitt, & Clapham 2001). These two examples may mark the beginning of a positive trend toward empirical validation of the vocabulary tests used in L2 studies – a process which is not altogether new (e.g., Cronbach 1942, 1943; Feifel & Lorge 1950), but which has not been undertaken for measures used in L2 vocabulary studies, which have tended to rely on researchers' judgments of what tests might measure (e.g., Kruse, Pankhurst, & Sharwood Smith 1987; Laufer 1990; Palmberg 1987; Singleton & Little 1991). Although researchers' judgments are not irrelevant, they should not constitute the entire validity argument.

Validation studies hopefully mark a beginning for the accumulation of validity evidence that can be developed into more thorough validity arguments along the lines suggested by researchers in educational measurement (Messick 1989; Kane 2001). Such an argument needs to accumulate evidence about a test or test method from multiple sources to provide evidence about the construct underlying test performance, as illustrated by Chapelle (1994). Such an argument allows the researcher to draw on the relevant theoretical and empirical rationales from linguistics (e.g., Anshen & Aronoff 1988; Wray 2002) and cognitive psychology (e.g., Schwanenflugel & Shoben 1985; Tyler & Wessels 1983) to develop a construct definition with appropriate depth and to weigh the justification of links between the construct definition and empirical results. Developing a validity argument for L2 vocabulary in this way forces the researcher to state an explicit L2 vocabulary construct definition.

Salience of L2 vocabulary assessment

One of the most encouraging signs of progress in the work on vocabulary acquisition is the link being made among those working in teaching, research, and assessment, as demonstrated in two recent books. In Paul Nation's (2001) Learning Vocabulary in Another Language, L2 vocabulary acquisition is surveyed, particularly as it pertains to language teaching. The book contains tests that have been developed and used in the classroom and for research, and it makes reference to the validation research on the tests (e.g., Schmitt, Schmitt, & Clapham 2001). The book reviews some issues in vocabulary assessment and explicitly discusses limitations of vocabulary tests with statements such as the following: "Direct tests of vocabulary size ... do not show whether learners are able to make use of the vocabulary they know; and they do not measure learners' control of essential language learning strategies like guessing from context, dictionary use and direct vocabulary learning" (Nation 2001:382).

Read's (2000) book, Assessing Vocabulary, summarizes the state of the field with respect to construct definition while showing the need for a clear construct definition to guide development and evaluation of vocabulary tests. Read's approach to vocabulary assessment takes a step toward helping vocabulary researchers understand the links between test design and construct definition. He identifies characteristics of vocabulary measures that help to define any given measure in terms that can be used for comparison. He uses a framework including three dimensions (i.e., discrete/embedded, selective/comprehensive, and context-independent/context-dependent), and then uses these dimensions to describe a variety of vocabulary measures. Vocabulary assessments can therefore be discussed within a common framework. Having a means of distinguishing among different types of measures is an essential first step toward avoiding the confusion ensuing from attempts to compare results from two different L2 vocabulary tests measuring different constructs.

The next step in the process of understanding inferences, generalizability, and dependability is to relate Read's assessment framework to the perspectives of construct definition. For example, the discrete/selective/context-independent vocabulary tests that have been used for so many years tend to be appropriate for making inferences about a trait-type construct definition. As vocabulary assessment moves closer to the ideals of other areas of language assessment with embedded, comprehensive, and/or context-dependent measures, the construct about which inferences are made moves away from the trait type toward a more interactionalist type. Read and Chapelle (2001) attempt a preliminary analysis of some vocabulary assessments along these lines. It remains to be seen if this approach is useful for vocabulary researchers attempting to address construct and theory inferences in L2 vocabulary research. But in the meantime, combining perspectives from acquisition, teaching, and assessment as they pertain to the issues of construct definition and validation, Read is able to demonstrate, as Schoonen puts it in his review of Read's book in Language Testing, "that there is much more to vocabulary assessment than the traditional multiple choice test" (Schoonen 2001:118).

Conclusion

The landscape in L2 vocabulary research is radically different today than it was 15 years ago. The three areas of progress I mentioned center on inferences that depend on the construct definition underlying a test, and in particular the construct inference has been the focus of much attention. In contrast, much less progress is apparent in studying the theory inference. Huckin and Haynes (1993), for example, suggest that "researchers would do well ... to try to assemble converging evidence from multiple methods. Single methods tend to yield one-dimensional perspectives, which are simply inadequate for a subject as complex as second language reading and vocabulary learning" (p. 297). In addition to this good measurement advice, advice is needed about how data from multiple methods should be interpreted. Do multiple methods of measurement speak to the construct inference, to the theory inference, or to both? The idea that (at least) two inferences are needed to link observed performance data to a theoretical construct framework has not been recognized explicitly in the L2 vocabulary research literature. Such recognition rests on a clear specification of the inferences that the researcher makes to explain observed performance.

Notes

1. Assessment does not have to be conducted from a logical positivist's perspective; however, most readers would associate assessment with logical positivism. Moreover, the conceptual apparatus of inference, dependability and generalizability was developed within this perspective. See Lynch (2003) for discussion of logical positivist vs. interpretivist perspectives on assessment.

2. I am using the term "knowledge" throughout because this is the term that is typically used in L2 vocabulary research. In fact, in most applied settings, what would be of interest might more arguably be called "ability," which includes both knowledge and the processes for putting the knowledge to use, by analogy with "communicative language ability" as defined by Bachman (1990).


References

Aitchison, J. (1987). Words in the mind: An introduction to the mental lexicon. New York: Basil Blackwell.

Anshen, F. & Aronoff, M. (1988). Producing morphologically complex words. Linguistics, 26, 641–655.

Ard, J. & Gass, S. (1987). Lexical constraints on syntactic acquisition. Studies in Second Language Acquisition, 9, 233–351.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OUP.

Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: OUP.

Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H. (2001). Examining the Yes/No vocabulary test: Some methodological issues in theory and practice. Language Testing, 18(3), 235–274.

Carter, R. & McCarthy, M. (1988). Vocabulary and language teaching. London: Longman.

Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Second language acquisition and language testing interfaces (pp. 32–70). Cambridge: CUP.

Chapelle, C. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10(2), 157–187.

Coady, J. & Huckin, T. (Eds.). (1997). Second language vocabulary acquisition. Cambridge: CUP.

Cronbach, L. J. (1942). Analysis of techniques for diagnostic vocabulary testing. Journal of Educational Research, 36, 206–217.

Cronbach, L. J. (1943). Measuring knowledge of precise word meaning. Journal of Educational Research, 36, 528–534.

Cronbach, L. J. & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197.

Feifel, H. & Lorge, I. (1950). Qualitative differences in the vocabulary responses of children. Journal of Educational Psychology, 41, 1–18.

Gass, S. M. (1988). L2 vocabulary acquisition. Annual Review of Applied Linguistics, 9, 92–106.

Hazenberg, S. & Hulstijn, J. H. (1996). Defining a minimal receptive second language vocabulary for non-native university students: An empirical investigation. Applied Linguistics, 17(2), 145–163.

Huckin, T. & Haynes, M. (1993). Summary and future directions. In T. Huckin, M. Haynes, & J. Coady (Eds.), Second language reading and vocabulary learning (pp. 289–298). Norwood, NJ: Ablex.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342.

Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.


Kruse, H., Pankhurst, J., & Sharwood Smith, M. (1987). A multiple word association probe in second language acquisition research. Studies in Second Language Acquisition, 9, 141–154.

Laufer, B. (1998). The development of passive and active vocabulary in a second language. Applied Linguistics, 19(2), 255–271.

Laufer, B. (1990). "Sequence" and "Order" in the development of L2 lexis: Some evidence from lexical confusions. Applied Linguistics, 11(3), 281–296.

Lynch, B. (2003). Language assessment and program evaluation. New York: Columbia University Press.

Meara, P. (1984). The study of lexis in interlanguage. In A. Davies, C. Criper, & A. P. R. Howatt (Eds.), Interlanguage (pp. 225–240). Edinburgh: Edinburgh University Press.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Nassaji, H. (2003). L2 vocabulary learning from context: Strategies, knowledge sources, and their relationship with success in L2 lexical inferencing. TESOL Quarterly, 37(4), 645–670.

Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: CUP.

Nation, I. S. P. (1993). Using dictionaries to estimate vocabulary size: Essential but rarely followed procedures. Language Testing, 10, 27–40.

Nation, I. S. P. (1990). Teaching and learning vocabulary in another language. New York, NY: Heinle and Heinle.

Palmberg, R. (1987). Patterns of vocabulary development in foreign language learners. Studies in Second Language Acquisition, 9, 201–220.

Perkins, K. & Linnville, S. (1987). A construct definition study of a standardized ESL vocabulary test. Language Testing, 4(2), 126–141.

Plass, J. L., Chun, D. M., Mayer, R. E., & Leutner, D. (1998). Supporting visual and verbal learning preferences in a second-language multimedia learning environment. Journal of Educational Psychology, 90(1), 25–36.

Read, J. (2000). Assessing vocabulary. Cambridge: CUP.

Read, J. (1998). Validating a test to measure depth of vocabulary knowledge. In A. Kunnan (Ed.), Validation in language assessment (pp. 41–60). Mahwah, NJ: Lawrence Erlbaum.

Read, J. (1993). The development of a new measure of L2 vocabulary knowledge. Language Testing, 10, 355–371.

Read, J. & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18, 1–32.

Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behavior of two new versions of the Vocabulary Levels Test. Language Testing, 18(1), 55–88.

Schmitt, N. & Zimmerman, C. (2002). Derivative word forms: What do learners know? TESOL Quarterly, 36(2), 145–171.

Schoonen, R. (2001). [Book review.] Language Testing, 18(1), 118–125.

Schwanenflugel, P. L. & Shoben, E. J. (1985). The influence of sentence constraint on the scope of facilitation for upcoming words. Journal of Memory and Language, 24, 232–252.

Sharwood Smith, M. (1984). Discussion. In A. Davies, C. Criper, & A. P. R. Howatt (Eds.), Interlanguage (pp. 236–239). Edinburgh: Edinburgh University Press.


Singleton, D. (1999). Exploring the mental lexicon. Cambridge: CUP.

Singleton, D. & Little, D. (1991). The second language lexicon: Some evidence from university-level learners of French and German. Second Language Research, 7, 62–81.

Sternberger, J. P. & MacWhinney, B. (1988). Are inflected forms stored in the lexicon? In M. Hammond & M. Noonan (Eds.), Theoretical morphology: Approaches in modern linguistics (pp. 101–116). San Diego, CA: Academic Press.

Telchrow, J. M. (1982). A survey of receptive versus productive vocabulary. Interlanguage Studies Bulletin, 6, 3–33.

Tyler, L. K. & Wessels, J. (1983). Quantifying contextual contributions to word recognition processes. Perception and Psychophysics, 34, 409–420.

Votaw, M. C. (1992). A functional view of bilingual lexicosemantic organization. In R. J. Harris (Ed.), Cognitive processing in bilinguals (pp. 299–321). New York, NY: Elsevier Science Publishers.

Wesche, M. & Paribakht, T. S. (1996). Assessing second language vocabulary knowledge: Depth vs. breadth. Canadian Modern Language Review, 53, 13–39.

Wray, A. (2002). Formulaic language and the lexicon. Cambridge: CUP.


Beyond generalizability

Contextualization, complexity, and credibility in applied linguistics research

Patricia A. Duff
University of British Columbia

This chapter discusses the issues of inference and generalizability primarily in qualitative applied linguistics research. The following themes are also explored in relation to generalizability (or transferability) in case studies, ethnographic classroom research, and other forms of qualitative research: (1) seeking contextualization; (2) understanding the complexity of behaviors, systems, and beliefs through triangulation and thick description; and (3) establishing the credibility of interpretations drawn from research and their importance for applied linguistic theorizing more generally. I conclude that both quantitative and qualitative applied linguistic research should seek to maximize the awareness and importance of contextualization, complexity, and credibility in studies, as well as analytic or naturalistic generalizability, as we engage in rigorous, systematic, and meaningful forms of inquiry and interpretation.

Introduction

Research in education and the social sciences has long been concerned with the basis for inferences and conclusions drawn from empirical studies, and applied linguistics is no exception. With the emergence of qualitative, mixed-method, and innovative new approaches to research in recent decades, issues connected with validity in research and testing have received renewed attention from applied linguists, not only about how to design and interpret one's own studies in order to make legitimate claims, but also about how to interpret others' research or the tools and products of research, such as tests, typologies, and so on. One particular area of concern is the nature and scope of insights that can be generated from qualitative research such as case study, ethnography, narrative inquiry, conversation analysis, and so on, as well as within more familiar quantitative research paradigms or inquiry traditions, especially when small sample sizes are involved. Seeking and reaching consensus regarding criteria for conducting and evaluating high-quality research within each tradition has therefore become a priority of late (e.g., Chapelle & Duff 2003; Edge & Richards 1998; Lazaraton 2003).

This chapter discusses inference and generalizability as they apply primarily to qualitative applied linguistic research, while conceding that the terms qualitative and quantitative are overstated binaries when describing contemporary so-called "new-paradigm" research designs. First, I define inference and generalizability as they are often understood and used in quantitative research and consider their relevance to qualitative research. I then explore such themes as contextualization, complexity, and credibility in relation to generalizability in most qualitative research as fundamental ways of broaching validity. Case studies and ethnographies in second language acquisition (SLA) and second language education (SLE) are selected to illustrate these principles. I conclude that both quantitative and qualitative applied linguistic research should seek to maximize the awareness and importance of contextualization, complexity, and credibility as well as analytic or naturalistic generalizability (however these concepts are taken up) as we promote rigorous, systematic, and meaningful forms of inquiry and as we consider the inferences and interpretations that can be drawn from our research.

Inference

The aim of research is to generate new insights and knowledge – in other words, to make various kinds of inferences based on observations. Quantitative research often stresses the importance of inference. In that paradigm, inference may refer to the degree to which we infer from people's observable behaviors aspects of their underlying competence or knowledge systems, or it may refer to the nature of claims that are inferred from various kinds of evidence (e.g., about the relationships among variables). Inference, unlike generalizability, is a term that is seldom addressed explicitly or in a technical sense in meta-methodological discussions of qualitative research, as a perusal of the subject indexes of many qualitative research methods textbooks reveals (e.g., Crabtree & Miller 1999; Creswell 1998; Denzin & Lincoln 2000; Eisner & Peshkin 1990; Holliday 2002; Merriam 1998; Miles & Huberman 1994; Neuman 1994; Silverman 2000). Even in quantitative or mixed-methods


Beyond generalizability

textbooks, the term is mainly used in conjunction with particular types of statistics.1 Inferential statistics, such as t-tests or analysis of variance, make it possible to draw certain kinds of conclusions about data in relation to research questions (e.g., causality), and especially about the relationship between the sample data and the characteristics of the larger population (Brown & Rogers 2002; Fraenkel & Wallen 1996; Gall, Borg, & Gall 1996; Palys 1997). Inference, therefore, is closely related to the notion of generalizability: namely, whether results are generalizable to a larger group or to theoretical principles, that is, whether we can infer generality. In some respects then, inference and generalization can refer to the same process, since generalization is a kind of inference. However, inference refers to a broader cognitive process connected with logical reasoning, generality being just one sort of reasoning. Thus, although inference is a concept not typically discussed at length – or even in passing – in most qualitative research, it is a form of reasoning that is implicit in the presentation and interpretation of research within any paradigm. For example, qualitative research in language education seeks to draw inferences about such topics as the values, conditions, linguistic and sociocultural knowledge, and personal experience that underpin observable behaviors in order to understand how and why people behave or interact in certain ways, how they interpret their behaviors and situations, how learning proceeds, what instructional processes are deemed effective, the characteristics of learning cultures, the attributes of certain kinds of learners and teachers, and so on. Instead of inference, most qualitative methodology textbooks discuss processes of interpretation, a kind of inference, a search for patterns and understandings, which is central to the meaning-making in qualitative research.
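As a minimal, purely illustrative sketch of the kind of inferential statistic mentioned above (a two-sample t-test reasoning from sample data toward a larger population), the following computes Welch's t statistic with the Python standard library. The data, group names, and function name are all invented for illustration and do not come from the chapter.

```python
# Illustrative sketch only: Welch's two-sample t statistic, an example of
# the "inferential statistics" (t-tests) discussed in the text for drawing
# conclusions about a larger population from sample data.
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error of the difference between the two sample means:
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical test scores from two groups of instructed learners:
group_a = [72, 85, 78, 90, 66, 81]
group_b = [64, 70, 59, 73, 68, 61]
t_stat = welch_t(group_a, group_b)
# A large |t| suggests the observed group difference is unlikely to reflect
# sampling variation alone; a p-value would then be read from the
# t distribution with the appropriate degrees of freedom.
```

The statistic itself says nothing about the qualitative concerns the chapter raises (contextualization, task ecology, and so on); it only quantifies how surprising the sample difference would be under a no-difference assumption.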

Generalizability in quantitative vs. qualitative research: Problematics and possibilities

Generalizability, a crucial concept in positivist (generally quantitative) experimental research, aims to establish the relevance, significance, and external validity of findings for situations or people beyond the immediate research project. That is, it is part of the process of establishing the nature of inferences that can be made about the findings and their applicability to the larger population and to different environmental conditions and to theory more universally. Generalizability (or generality as some describe it, e.g., Krathwohl 1993), while typically discussed in connection with inferences about populations (whether the observations about sampled individuals can be generalized


to others in the same population), can also involve the ability to generalize (effects) to "treatments, measures, study designs, and procedures other than those used in a given study," according to Krathwohl (p. 735). Sampling procedures are one of several elements (e.g., research design) that affect the kinds of inferences that can be drawn about generality. It is commonly accepted that quantitative research, with appropriate sampling (random selection, large numbers, etc.), research design (e.g., counterbalancing of treatments, ideally with a control group, pre-post measures, and careful testing and coding procedures), and inferential statistics where appropriate, has the potential to yield generalizable results.

However, carefully controlled research may nevertheless provide inadequate contextualization of the study, the participants, the tasks/treatments, and so on, and therefore may be less easily generalized than might otherwise be the case. As Gall et al. (1996) point out, "generalizing research findings from [an] experimentally accessible population to a target population is risky" (p. 474). The two populations must be compared for crucial similarities. In addition, "if the treatment effects can be obtained only under a limited set of conditions or only by the original researcher, the experimental findings are said to have low ecological validity" (p. 45), and thus low external validity as well. Safeguards must be in place both in designing and carrying out the experiments and also in reporting the results. Gall et al. (1996) provide a list of factors associated with external validity in experiments that researchers need to take into account (see Table 1), listed under the headings of population validity and ecological validity (after Bracht & Glass 1968).

Key sociocultural variables such as institutional context, first language (L1) background or the relationship between first and second language/culture, and other characteristics of the sample/population, the tasks, and even the relationship between the researcher and those researched might be underspecified or omitted, because of space limitations or because the variables are not considered important or central to the study. The generalizability of findings to wider populations and contexts can inadvertently be reduced as a result – regardless of the claims made by the researchers about generality. In a field as international, interlingual, and intercultural as SLA or SLE, the sociocultural, educational, and linguistic contexts of research are of great importance (both macro-level social contexts and micro-level discursive/task contexts, Duff 1995). To date, most of the published research conducted in TESOL, for example, has taken place in American university or college programs with students within a particular range of proficiency and educational preparedness that may or may not be easily generalized to other types of programs (e.g., with children) or


Table 1. Generalizability and quantitative vs. qualitative research

Research designs/methods

Quantitative:
– quasi-/experiments, correlations, surveys, regressions, factor analyses, etc.

Qualitative:
– case study, ethnography, conversation analysis (CA), other micro-discourse analyses, interview research, document analysis, or some combination of these

View of generalizability (external validity)

Quantitative:
– emphasis on external validity (e.g. in experiments; Gall et al. 1996:466):
  1. population validity
  2. ecological validity, with attention to: explicit description of experimental treatment; "representative design" (reflects real-life environments and natural characteristics of learners); multiple-treatment interference; Hawthorne effect; novelty and disruption effects; experimenter effect; pretest sensitization; posttest sensitization; interaction of history and treatment effects; measurement of dependent variable; interaction of time of measurement and treatment effects
– Krathwohl (1993): explanation generality, translation generality, demonstrated generality, restrictive explanations (conditions) eliminated, replicable result

Qualitative:
– limited relevance, in traditional views, to the goals of qualitative research; validity is described instead in terms of "transferability," catalytic and ecological validity, credibility, dependability, confirmability, etc.; "validity" and "reliability" are often not compartmentalized (Gall et al. 1996)
– some (positivist) qualitative researchers believe that through the aggregation of qualitative studies from multiple sites (e.g. case studies through a case survey method, through "meta-ethnography" or cross-case "translation," and the comparative method) generalizations may be warranted; others see generalizability in terms of "fit" between one study/situation and another, based on thick description (Schofield 1990)

Types of generalization

Quantitative:
– to populations, settings, conditions, treatments
– to theories/models (hypotheses)

Qualitative:
– to populations (similar cases/contexts; also known as case-to-case translation), Lincoln & Guba (1985), based on similarities across situations/contexts, and enhanced by the representativeness of sites/cases/situations studied
– to theories/models (this is the more common application; also referred to as "analytic generalizability," Firestone 1993)

Strengths

Quantitative:
– multiple occurrences of phenomenon/effects in relatively controlled setting; quantification helps establish clear trends and relationships; statistical inferences or extrapolation of results to defined wider population possible (especially with random or probability sampling)
– normally based on data from large sample populations
– internal, external validity and other types of validity often demonstrated convincingly; attention paid to reliability of coding, testing, etc.

Qualitative:
– great potential for rich contextualization, accounting for complexity of social/linguistic phenomena
– potential to resonate with readers because of accessibility, description, narrative genre and examples
– in critical/feminist research, the potential to move readers to action to create change; i.e. "catalytic validity" (Lather 1991)
– more opportunities to document ecological or internal validity than in much quantitative research; but less potential or desire to make claims of external validity
– based on theoretical or purposive sampling
– transferability of findings is determined not only by researcher but also by reader and by the typicality, representativeness or depth of description of case/situation
– focus on multiple meanings and interpretations

Ways of strengthening further

Quantitative:
– using ecologically valid tasks, settings, procedures; addressing underlying constraints on generalizability
– documenting contexts, sampling, procedures, instruments, etc. in as much detail as possible

Qualitative:
– having an aggregation of multiple-case or multiple-site studies; triangulation
– thick description
– include participants' own judgment of generalizability/representativeness or typicality (Hammersley 1992); corroboration from other studies (cf. Maxwell 1996), in addition to observations

Potential problems

Quantitative:
– over-emphasis on external validity in some cases, at the expense of internal validity (e.g., task-based research, with strangers, or using artificial languages, contrived tasks): How far can findings about interaction patterns be generalized to "normal/natural" learning conditions?
– failure to replicate or follow up on studies with different populations and in different contexts may lead to de facto generalization
– too often, the research focuses on only the analysts' analysis/perspectives and not those of participants or other observers (although coding is usually strengthened by getting 2nd raters, using operational definitions/constructs, etc.)
– statistical inferences may be based on inappropriate statistical procedures for types of data involved (cf. Hatch & Lazaraton 1991)

Qualitative:
– over-emphasis on possibly atypical, critical, extreme, ideal, unique, or pathological cases, rather than typical or representative cases (e.g., Genie, Alberto, Wes; i.e. in terms of fossilization, critical period studies, exceptionally good or ineffective language learners)
– "telling cases" may not always be highly representative cases; they may be very helpful in providing insights into SLA but conditions or insights may not apply broadly to others; e.g., autobiographical cases of metalinguistically sophisticated learners (Schmidt & Frota 1996) who may be atypical
– tendency to generalize widely to theory (acculturation model, notice-the-gap principle), despite disclaimers, with few similar case studies conducted (i.e. in lieu of replication, multiple-case-studies)
– observations may lack relevance to field more widely, outside of immediate context
– need for enduring key themes, constructions, or relationships among factors that will be helpful to other researchers in other contexts; going beyond particularistic local observations to more general trends
– rigorous evidence or authentication may seem to be lacking (Edge & Richards 1998) but goal should be to "interpret and ... offer a [context-specific] understanding" (p. 350)
– see Lazaraton's (2003) discussion of "criteriology" and qualitative social research

in countries with very different educational systems, cultures, histories, and economies (e.g., in EFL regions).

The more controlled and laboratory-like the SLA studies (e.g., Hulstijn & DeKeyser 1997), often using very contrived tasks (even involving artificial languages or invented images in some cases in order to control for prior knowledge), the less generalizable the findings are, in my view, either to larger populations from which samples are drawn or to broader understandings of language teaching, learning, or use in classrooms or other naturally occurring settings. That is, it may be difficult and unwise to generalize from behaviors of unfamiliar pairs of interlocutors doing unclassroom-like research tasks under laboratory-like conditions to how language learners, as familiar classmates with their own history of interacting with one another and undertaking tasks, would do classroom tasks or engage with interlocutors in natural, non-experimental settings.2 Furthermore, while the research may speak to issues of how they would engage in one particular type of task, it does not shed light on how curriculum can be developed linking such tasks in meaningful, educationally sound ways. Most SLA studies continue to examine L2 learners of English or other European languages and much less often European-L1 learners of non-European L2s (e.g., English learners of Chinese or Arabic; see Duff & Li 2004, on this point). They have also privileged a small set of fairly basic tasks associated with communicative language teaching (e.g., spatial "spot-the-difference" or "plant the garden" tasks), and have not explored other instructional contexts such as EFL instruction to the same extent, and far too seldom investigate interactions or language development over an extended period of time. These issues, in my view, also reduce the generalizability and utility of the findings of such studies, no matter how rigorously they are conducted. The onus is on researchers in quantitative studies to convincingly demonstrate the external validity of their findings (if that is their objective), rather than take it for granted that generalizability is possible in quantitative research but categorically impossible in qualitative research.

Table 1 captures some basic differences between quantitative vs. qualitative research, especially in relation to generalizability and the strengths and weaknesses inherent in each paradigm with respect to validity concerns, more broadly. Here I summarize some of the most commonly cited differences. Whereas quantitative research emphasizes both internal and external validity (or generalizability), in addition to reliability, qualitative research, especially postpositivist, naturalistic, interpretive studies, typically emphasizes elements associated with a combination of internal validity and reliability (to borrow terms from quantitative research). Internal validity in quantitative research,


like its counterpart in qualitative research, is related to the credibility of results and interpretations, as a function of the conceptual foundations and the evidence that is provided. As Krathwohl (1993) puts it, internal validity in quantitative research is related to the study's "conceptual evidence linking the variables," supported by "empirical evidence linking the variables" – demonstrated results, the elimination of alternative explanations, and judgments of the overall credibility of the results (p. 271). In qualitative research, internal validity is addressed by means of contextualization; thick description; holistic, inductive analysis; triangulation (or "crystallization," to use a more multifaceted metaphor; Richardson 1994); prolonged engagement; ecological validity of tasks; and a recognition of the complex and dynamic interactions that may exist among factors; as well as the need for the credibility or trustworthiness of observations and interpretations (Davis 1995; Watson-Gegeo 1988). Thick description, one of the most touted strengths of case study and ethnography (but not of conversation analysis or certain other kinds of qualitative inquiry), may draw on the following sources of information: documentation, archival records, interviews, direct observations, participant-observation, and physical artifacts (Yin 2003). Gall, Gall, and Borg (2003) suggest that a suitably thick description of research participants and concepts allows "readers of a case study report [to] determine the generalizability of findings to their particular situation or to other situations" (p. 466). The aim is to understand and accurately represent people's experiences and the meanings they have constructed, whether as learners, immigrants, teachers, administrators, or members of a particular culture.

Qualitative research in education, according to Schofield (1990), first began to address issues of generalizability because of large-scale, primarily quantitative, multi-method program evaluation research in the 1980s and 1990s that incorporated significant qualitative components and yet, given the overarching quantitative structure, still framed discussions in terms of generalizability. She observed that a general "rapprochement" of the two major paradigms since then, emphasizing their complementarity as opposed to fundamental incompatibility, has further prompted researchers to examine reliability and validity or their proxies. Although it is often said that qualitative research is neither interested in nor able to achieve generalizability or to generate causal models or explanations, there are in fact many diverging opinions on this issue, as we will see in what follows (Bogdan & Biklen 1992).

In applied linguistics, in general, qualitative research seeks to produce an in-depth exploration of one or more sociocultural, educational, or linguistic phenomena and, in some cases, of participants' and researchers' own positionality and perceptions with respect to the phenomena (Davis 1995). Rather than understand a phenomenon in terms of its component parts, the goal is to understand the whole as the sum or interaction of the parts (Merriam 1998). Generalizability to larger populations, in the traditional positivist sense, or prediction is not the goal. As Stake (2000) observes with respect to case studies, "the search for particularity [in a case, or a biography] competes with the search for generalizability" (p. 439).

Indeed, for many qualitative researchers, the term generalizability itself is considered a throw-back to another era, paradigm, ethos, and set of terminology (or discourse) in research. Schofield (1990) stated it as follows: "[t]he major factor contributing to the disregard of the issue of generalizability in the qualitative methodological literature appears to be a widely shared view that it is unimportant, unachievable, or both" (p. 202). She attributes this disinterest in part to the origins of much (ethnographic) qualitative research in cultural anthropology, which studies other cultures for their intrinsic value: for revealing the multiple but highly localized ways in which humans live. Cronbach (1975, cited in Merriam 1998) suggested that social science – and not just qualitative – research should not seek generalizability anyway: "When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion" (Merriam 1998:209). Note that the replacement of the term "generalization" with "working hypotheses" by some qualitative researchers is not universally accepted, though, because it downplays the affective, interactive, emergent nature of perspective-sharing in qualitative research and again uses terminology associated with quantitative research (Merriam 1998).

Instead, the assumption is that a thorough exploration of the phenomenon/a in one or more carefully described contexts – of naturalistic or instructed L2 learners with various attributes, classrooms implementing a new educational approach, or diverse learners integrated within one learning community – will be of interest to others who may conduct research of a similar nature elsewhere. Other readers may simply seek the vicarious experience and insights gleaned from gaining access to individuals and sites they might otherwise not have access to. Stake (2000 and elsewhere) refers to the learning and enrichment that proceeds in this way as "naturalistic generalization" – learning from others' experiences. Adler and Adler (1994) suggest that researchers' writing or reporting style itself contributes a great deal to the impact of qualitative research on readers and their perceptions of its credibility and authenticity and that researchers should strive to achieve what they call "verisimilitude" in their


writing – "a style of writing that draws the reader so closely into the subjects' worlds that these can be palpably felt" (p. 381).

Another term commonly replacing generalizability in the qualitative literature is transferability – of hypotheses, principles or findings (Lincoln & Guba 1985). Transferability (also called comparability) assigns the responsibility to readers to determine whether there is a congruence, fit, or connection between one study context, in all its complexity, and their own context, rather than have the original researchers make that assumption for them. Still, some qualitative researchers find the concept of transferability to be too similar in focus to generalizability to be a useful departure from traditional views, and feel that it makes too much of the need for similarity or congruence of studies (e.g., Donmoyer 1990). They feel that difference, in addition to similarity, helps sharpen and enrich people's understandings of how general principles operate within a field beyond what the notion of transferability suggests. Also, rather than seeking "the correct interpretation," they would aim to broaden the repertoire of possible interpretations and narratives of human experience. Qualitative research, in this view, provides access to rich data about others' experience that can facilitate understandings of one's own as well as others' contexts and lives, both through similarities and differences across settings or cases.

In summary, in both quantitative and qualitative research, it is important to establish the basis for one's findings and interpretations, but in the latter, strictly causal inferences, statistical inferences or universal "laws" are not normally sought or warranted. Also, the nature of the ontological "truth" claims sought or made differs: in quantitative research it is assumed that there is an external truth or reality to be discovered, whereas most qualitative research (even positivist case study methodology as advocated by Yin 2003) would assume that there are multiple possible "truths" to be uncovered or (co-)constructed, which may not always converge. However, qualitative research is not only intended to be of limited, highly local significance without addressing issues of broader relevance and meaning. Much qualitative research does seek to provide generalizations at an abstract conceptual or theoretical level, what Firestone (1993) refers to as analytic generalization, well beyond the details of cases (e.g., related to models of SLA or SLE). Some qualitative research, furthermore, seeks to provide causal explanations about observations (Miles & Huberman 1994), in addition to thick description. Thus, generalizability and the nature of explanations that research can support have been problematized in both qualitative and quantitative research in our field.


Inference and generalizability in qualitative research: Case study and ethnography

In case studies and classroom ethnographies, it is necessary for researchers to acknowledge the delimited or context- (or culture-) bounded nature of their observations and findings while at the same time providing rich, detailed descriptions of the sites, participants and other objects of inquiry. Ordinarily, a small number of main themes or different participant profiles emerging from iterative data analysis are discussed, along with complex interactions among variables or factors linked with observed behaviors or situations. The observations are typically contextualized both within socio-educational settings and within the larger theoretical research literature and the set of issues that motivated the study. If done well, the metaphors, models, themes, and constructs that emerge may influence and inform subsequent research.

Most case studies and ethnographies are not characterized or evaluated in terms of their generalizability, but rather in terms of more social, ethical, textual, and data-analytic concepts such as: contextualization or contextual completeness, thick description; prolonged engagement (long-term observation), complexity, triangulation, multivocality (multiple voices or perspectives), credibility, relevance, plausibility, (researcher) positionality and reflexivity, trustworthiness, authenticity, usefulness, and chains of evidence. Most of these terms relate to practices of providing convincing evidence and arguments, from a variety of sources, to support interpretations (or inferences). Altheide and Johnson (1994) call these aspects of "interpretive validity." I discuss these concepts in the context of some recent SLA/SLE research below and review ways in which some researchers recommend that generalizability or transferability be enhanced in qualitative inquiry.

Contextualization, complexity, and credibility

Contextualization

A crucial aspect of qualitative research in applied linguistics is understanding and documenting the research context (Johnson 1992): that includes the larger sociopolitical or historical context (where relevant), the participants and their interests, the tasks or instructional practices used and participants' understandings or views of these, in some cases, and how the research itself, whether inside a classroom or in a research office of some kind, creates a special sociolinguistic context, system, or ecology that is temporally as well as socially and discursively situated (van Lier 1988, 1997). Understanding learning communities as ecologies or organic systems is increasingly being stressed now (Kramsch 2002), as is the importance of local contexts, including discourse contexts, and cultures of language learning and use (Breen 1985; Duranti & Goodwin 1992; Holliday 1994). Even in non-ethnographic inquiry, task-based research can be analyzed with a view to understanding the kinds of contexts that are being created by tasks and the impact of those contexts on the generation and interpretation of findings over time (Coughlan & Duff 1994; Duff 1993b).

In my studies examining educational systems in transition as a result of changing social/linguistic demographics, L2 policies, curricula, and so on, documenting these larger contexts in which more focused observations are situated and understanding macro-micro interfaces to a greater extent has been key (Duff 1995; Duff & Early 1996). By that I mean seeing how the larger sociopolitical structure not only influences and mirrors, but is also constituted in, the events and interactions in everyday classrooms.

For example, my ethnographic classroom research in Hungary (Duff 1993c, 1995, 1996, 1997) featured three diverse sites (schools) as cases in order to examine issues connected with an innovative dual-language (English immersion/bilingual) public education system: one in the capital city, one in a small resort town, and the last in a relatively distant provincial capital. By selecting heterogeneous contexts, in terms of geography, administration, human and other resources, student characteristics, and so on, it was possible to compare and contrast conditions, learning processes, and outcomes across the three instances of implementing the new model of education rather than take the findings from the most accessible and best-resourced site (in the capital city) as typical or representative. This cross-case comparison of different schools and multiple teachers and students within each school gave me a better understanding of what was or was not general in the implementation of dual-language education and in classroom discourse practices specifically and revealed how each school was dealing with sociopolitical and educational change as well as curricular change. It also necessitated analyzing the macroscopic social, political, historical, and educational contexts in which school reforms, including the introduction of English teaching and immersion and a more pro-Western curriculum and approach to teaching, were being introduced. At the same time, the local contexts and classroom discourse practices (e.g., in connection with recitation and assessment activities) were evidence of precisely the kinds of ambivalence, struggle, and transformation that were witnessed at the national level and region, in the wake of Hungary's restored political autonomy, the dissolution of the USSR, and democratization in Central Europe.

JB[v.20020404] Prn:13/12/2005; 15:09 F: LLLT1205.tex / p.14 (832-879)

Patricia A. Duff

Complexity

The theme of complexity in human behavior and interactions is foregrounded in an essay by Donmoyer (1990) on generalizability in case studies. He gives an account of a debate a few decades ago in psychology regarding whether all human behavior can be studied and described in terms of regularities in causes and predictable effects. According to Donmoyer, the eminent psychology scholar Cronbach, who had once advocated for ultimate generalizability, finally "concluded that human action is constructed, not caused, and that to expect Newton-like generalizations describing human action, as Thorndike did, is to engage in a process akin to 'waiting for Godot'" (Cronbach 1982, cited in Donmoyer 1990:178). That is to say, the complex interactions that underlie human behavior need to be understood as co-constructed, unpredictable phenomena within society and within individuals. Capturing this complexity has long been a common theme and pursuit in qualitative research.

Larsen-Freeman (1997, 2002) has in recent years examined complexity in research in the natural and social sciences generally, and in applied linguistics in particular. Drawing on chaos and complexity theory (C/CT), a perspective originally developed to explain phenomena in the physical sciences, she has attempted to apply principles of C/CT to second language acquisition, to explain "mechanisms of acquisition, [the] definition of learning, the instability and stability of interlanguage, differential success, and the effect of instruction" (p. 152). Natural systems, she reports, are now understood by many scientists to be "dynamic, complex, nonlinear, chaotic, unpredictable, sensitive to initial conditions, open, self-organizing, feedback sensitive, and adaptive" (1997:142). They change over time, usually consist of many component parts or agents that must be analyzed in light of the whole system, and are highly interactive, contingent, and interdependent, which is why complex systems are often unpredictable and their behaviors and properties are emergent (Larsen-Freeman 1997). Her own interests are principally grammar, language, and the processes of learning and using language. Qualitative research in applied linguistics is not restricted to these domains, of course, although much second-language research (both quantitative and qualitative) focuses on them. If the diffusion of innovation were the object of inquiry (e.g., in language education curriculum reforms, in diachronic linguistic change, in the brain's functioning and organization, or in code-switching practices), then a complexity analysis would include the change processes, agents, and incremental (or dramatic, non-incremental) developments themselves, operating at the micro-level in a bottom-up fashion, but viewed also within the context of macro-level or top-down influences.


Beyond generalizability

Attempting to document highly complex social and linguistic phenomena in applied linguistics and also arrive at more general observations about trends, interrelationships among factors, and outcomes of interest to the field can be a difficult balancing act. A long, overly complicated story is difficult to tell and perhaps even more difficult for readers to transfer to other contexts or to relate to existing theories or models of learning, performance, or institutional change more generally. In short, it becomes hard to see the proverbial forest for the trees, and for the other elements and interactions foregrounded in the ecosystem. In case studies, ethnographies, and other types of mainstream qualitative research, thick description and triangulation, combined with data matrix displays, networks, and cognitive maps that show the interaction and clustering among dimensions, can help readers appreciate the complexity of the case(s) while at the same time situating the observations within a coherent and accessible conceptual framework (Miles & Huberman 1994).
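The kind of cross-case matrix display that Miles and Huberman describe can be illustrated with a minimal computational sketch. The Python example below is purely hypothetical: the site labels, theme labels, and segment counts are invented for illustration and are not data from any of the studies discussed here.

```python
# Hypothetical case-by-theme matrix in the spirit of Miles & Huberman's
# data displays: rows are cases (sites), columns are recurring coded themes,
# and cells are counts of coded data segments. All labels and counts invented.
cases = ["Budapest school", "Resort-town school", "Provincial school"]
themes = ["recitation", "assessment", "curricular change"]
counts = {
    ("Budapest school", "recitation"): 14,
    ("Budapest school", "assessment"): 9,
    ("Budapest school", "curricular change"): 6,
    ("Resort-town school", "recitation"): 11,
    ("Resort-town school", "assessment"): 4,
    ("Resort-town school", "curricular change"): 8,
    ("Provincial school", "recitation"): 7,
    ("Provincial school", "assessment"): 5,
    ("Provincial school", "curricular change"): 10,
}

def display_matrix(cases, themes, counts):
    """Return a simple text matrix supporting cross-case comparison."""
    header = f"{'':22}" + "".join(f"{t:>20}" for t in themes)
    rows = [header]
    for c in cases:
        rows.append(f"{c:22}" + "".join(f"{counts[(c, t)]:>20}" for t in themes))
    return "\n".join(rows)

print(display_matrix(cases, themes, counts))
```

Such a display is, of course, only a scaffold: the clustering it reveals still has to be interpreted against the thick description of each site.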

Credibility

In my ethnographic classroom research in high school classrooms in Canada (Duff 2001, 2002b, 2004), I describe the changing demographics of urban areas and schools in Vancouver, primarily as a result of immigration, and the impact of those changes on schooling. I tried to enhance the credibility of my findings regarding the experiences of immigrant non-native English speakers (NNESs) in mainstream content classes in the following ways. I selected two teachers, one male and one female, both deemed to be good models by their principal and peers. Both taught the same social studies course, one that is required for all Grade 10 students. The course, furthermore, is one of the first that newly mainstreamed NNESs must take. Both courses had a number of NNESs in them (one more than the other, though), and the lessons were regularly observed and audio/video-recorded over an extended period of time (from six months to most of the academic year). Teachers and students (both NNESs and native speakers) were interviewed about their perceptions and experiences regarding students' participation in classroom discussions and in the school community, and about the aspects of classroom discourse and content that proved most challenging. Other teachers and administrators in the school were also interviewed about these themes, and relevant documents (e.g., an official accreditation report, students' writing, the course textbook) were consulted. Data analysis involved an ongoing examination of all the transcribed data for salient, recurring themes and examples of representative classroom interaction patterns and topics, and comparisons across the two classes for similarities and differences across themes. One sort of focal activity type was selected for analysis and comparison across lessons and courses: namely, teacher-fronted class discussions of current events and recently viewed educational films.

Missing from the data analysis to date, however, is a comprehensive treatment of the entire, rather vast data set, which is unwieldy for a single journal article or book chapter. Instead, I have conducted theme-driven analyses of illustrative excerpts or focal students, described with as much contextual information as possible, and selected precisely because of their typicality (which an analysis of the entire dataset and prolonged engagement in the field permits) and perceived theoretical import. Would another researcher draw the same inferences as I have from examining the same dataset – e.g., about the role and functions of pop culture in classroom discourse, or about the non-overt participation of NNESs in mainstream courses? Not necessarily, but quite possibly they would. They would certainly note the silence of NNESs in class discussions from an analysis of their participation patterns, although their interpretations and explanations might differ. In fact, readers are able to draw their own conclusions if enough data is presented to them in the body of an article or appendix. Would others choose to focus on the same themes that I have selected and that emerged from the data? Again, not necessarily. The choice of themes related to integration and participation in discourse and to challenging aspects of academic discourse, the questions posed during interviews, and the focal speech events or activities singled out for analysis across lessons or courses might well have been different. However, in such analyses the authenticity, credibility, and trustworthiness of the analysis and report are also a product of the amount and type of data presented to support the findings and the provision of counter-examples, if any.

As Bromley (1986) argues, in the context of case study research,

A particularly subtle source of error in case studies is the absence of information and ideas. The possibilities that no-one thought of, and the facts that were not known, must have invalidated numerous case-studies, simply because people's attention tends to be concentrated on the information actually presented. It is important in case-studies and in problem-solving generally to go "beyond the information given" to "what might be" the case. Naturally, such speculations need to be backed up by reasoned argument and by a search for relevant evidence. (p. 238)

There is always the possibility, then, that other crucial pieces of information – or alternate explanations – have been overlooked. One way of addressing this latter concern is by conducting not only a triangulation of data and methods, but also a triangulation of theory and researchers – that is, interpreting the same observations or data from multiple theoretical or disciplinary standpoints or inviting other researchers to analyze the same dataset (see, e.g., Koschmann 1999, for an example of this in the analysis of discourse and interaction in a short medical tutorial). Naturally, in this kind of triangulation, datasets (e.g., transcripts, video clips) shared for analysis already represent a certain theoretical perspective and pre-selection; thus they cannot be construed as theory-neutral objects either.

Capturing participants’ (or emic) perspectives to bolster credibility

One technique that is often discussed in qualitative research as a way of ensuring the authenticity or credibility of interpretations – and also a way of triangulating or verifying different perspectives and interpretations – is to conduct "member checks": to have the teachers or students involved read written reports before they are published and then incorporate their feedback or corrections, or to consult with them during the analysis. I have only done member checks informally in the past, by means of regular, sometimes fleeting conversations before or after classes; furthermore, my interaction with past research participants after data collection has ended has tended to be infrequent at best, simply because they or I have moved on to other endeavors, and teachers and students tend to be very busy. In some of my studies, there has been some deliberate follow-up at a later point, though (e.g., Duff, Wong, & Early 2000; Duff 2003), and in Hungary, I hired many of the students to transcribe classroom data for me, so I got their perspectives through discussions of the transcripts they had done.

There is great potential value in interviewing participants about their perspectives (either in their L1 or L2) before, during, or after a study – especially minority students who, in my Canadian high school study, tended not to speak during class discussions and who otherwise provided little analyzable classroom data apart from silence or nonverbal behaviors (Duff 2002b). Interviewing them provided insights into their English proficiency and their experiences both in and out of class. Obtaining teachers' perspectives about why they do what they do, and understanding their dilemmas, their histories using, say, a certain teaching approach or task, or their perspectives on students' comprehension, performance, and so on, can yield rich insights (Duff & Li 2004). In short, the triangulation of research methods, data, and participant perspectives (including my own analytic perspectives) to shed light on classroom phenomena provides a more multidimensional, richer image or composite – and thus a more systemic understanding – than an individual snapshot or series of snapshots taken from a distance would.

However, obtaining participants’ perspectives may be appropriate for somequalitative studies or for certain kinds analyses (e.g., task-based behavior)but of limited usefulness when analyzing learners’ metalinguistic awarenessor use of particular forms (e.g., asking Mandarin speakers about their useof a particular aspect marker, le, or asking people about their L1-L2 code-switching), depending on participants’ educational background, proficiency,age, and self-awareness. The ability to ascertain participants’ perspectives de-pends very much on their L2 competence and ability to report on things intheir L2, unless the researcher understands their L1 or an interpreter is present.It also depends on participants’ metalinguistic or metacognitive awareness and,in the end, their accounts are, like many other forms of data, undeniably socialand narrative constructions.

Enhancing generalizability: Representativeness or typicality of cases selected

One of the key ways in which case studies and ethnographies attempt to approach generalizability, if indeed that is sought – or at least to avoid the criticism of unique or idiosyncratic findings whose impact on knowledge more broadly may be difficult to assess or assert – is to choose typical or representative sites or participants to study, and not just those that are most convenient or easily accessible and whose representativeness is unclear. This is not a goal shared by all qualitative researchers, though, who may choose to study people and sites already known to them or to which they have access; most accessible, of course, are researchers themselves, through introspective (e.g., diary/narrative) studies, memoirs, and auto-ethnographies (Schumann 1997).

Here I provide an example of case selection from my own case study research and the issue of typicality. I loosely modeled my longitudinal SLA study of a Cambodian immigrant to Canada learning English (Duff 1993a) after an influential longitudinal study conducted by Huebner (1983) of a Hmong-speaking Laotian immigrant to Hawaii. I examined syntactic features in the interlanguage of my research participant that had parallels in Huebner's study. Both cases were refugees of a similar age and social class who came from somewhat similar, topic-comment Southeast Asian languages. And, while similarities emerged in how they expressed existentials in English, based on their first language structures (e.g., in camp have/has many soldier instead of there were many soldiers in the camp), in neither case were the learners selected because they were typical, let alone "prototypical," of learners from their source communities. To do so would have required some prior or subsequent sampling from learners in the source communities, a concurrent cross-sectional study involving other Cambodians or Laotians or Hmong speakers, some kind of census data about the communities involved, or a multiple case study addressing the issue of representativeness or variation. Rather, both subjects were samples of convenience – willing, personable, and available research participants whose English was developmentally intriguing.

Thus, it would be impossible to claim that other (or all) Hmong or Cambodian learners would have proceeded in English with exactly the same strategies for topic marking or existential constructions, for example; but the two studies taken together, plus some interlanguage studies of Chinese and Japanese students who proceeded similarly (in part because of their similarly configured topic-comment L1s), suggested some patterns or potential repertoires that learners from those backgrounds are likely to avail themselves of (Duff 1993a). Only with additional cases or supplementary studies (not necessarily longitudinal or as in-depth) with learners from the same backgrounds would it be possible to assert that specific interlanguage structures (e.g., the use of have vs. has as the default generic existential verb, or the use of isa as a topic marker) were shared by all Cambodian or Hmong ESL learners. The spirit of the "performance analysis" SLA studies of the day, moreover, was to see what learners were capable of doing, sometimes in highly creative and unique ways, with the linguistic and cognitive resources at their disposal, and not just how they fell short of target norms (Larsen-Freeman & Long 1991).

Furthermore, the typicality of a case within one context (e.g., a Cambodian learner in Canada) may not equate with typicality in another (e.g., a Cambodian learner in Cambodia), and not all case study or ethnographic work in our field sets out to deal with typical or "normal" learners, teachers, or programs in any event. Some are targeted for research precisely because of their atypicality, uniqueness, resilience, or even pathology in some instances; they are purposefully and opportunistically selected because of the insights they are expected to generate about the possibilities of language learning and use. (See Peräkylä (1997) for a discussion of "possibilities" in connection with generalizability in conversation analysis.) For example, Genie was a famous case discussed in both first and second language acquisition from the late 1970s (Curtiss 1977), when her situation of abject neglect was first discovered and a team of researchers set out to study and help her. Genie, who had been deprived of normal human language and interaction for most of the first 13 years of her life, was seen to be a test case for the critical period hypothesis: evidence that she could learn language (English, her L1) successfully even after the onset of puberty, it was asserted, might discredit the critical period hypothesis for language acquisition. In the end, Genie's language development (e.g., morphology and syntax) was quite modest, and her lack of targetlike proficiency in English was difficult to explain precisely because of her extreme atypicality; but one explanation given was the existence of a critical period for language acquisition – which she had passed. The inferences drawn about Genie in connection with this hypothesis were perhaps not completely warranted, especially considering her highly atypical social-psychological history, which involved far more than just the presence or absence of language learning opportunities: it also deprived her of normal human attachments, basic opportunities for cognitive stimulation and development, and so on. However, had Genie been able to achieve something approximating "normal" native proficiency in English despite her early deprivation, the inferences and generalizations drawn from the study would have been far more powerful and also remarkable.

Other atypical cases and situations that have been explored include highly successful (or "talented") and highly unsuccessful language learners (e.g., Ioup 1989; Ioup, Boustagui, El Tigi, & Moselle 1994). The reason for selecting cases along a continuum of experience, whether extreme cases, critical cases, or typical cases, is to explore the range of human (linguistic) possibilities along a particular dimension. Yet despite researchers' frequent disclaimers and cautionary notes about the generalizability of their case studies, the theoretical findings from seminal early case studies in SLA (e.g., those in the 1970s and 1980s by Schumann, Schmidt, Hatch, and colleagues; e.g., Hatch 1978), as well as more recent influential studies examining gender, ethnicity, and identity in SLA (e.g., Norton 2000; Norton & Toohey 2001), have nevertheless achieved fairly wide generalization within the field, giving rise to important new understandings of second language learning. For example, generalizations have been made about the Acculturation Model (for and against) in relation to SLA, the impact of noticing gaps on SLA, fossilization, language loss, and the role of identity, power, and motivation/investment in SLA. In fact, some of the most influential generalizations and models of SLA have originated from just a small handful of case studies.

In fairness to the work and its legacy, this kind of theoretical (over)generalization from cases sampled by convenience has not only occurred in hallmark qualitative studies. Because of the lack of replication in many kinds of quantitative SLA or classroom-oriented studies as well, findings from small (e.g., quasi-experimental or otherwise) one-off studies, or even larger studies conducted with a particular population, are similarly taken as proof that, for example, gender (or task familiarity, partner familiarity, ethnicity, same-L1 status, etc.) does or does not make a difference in language learning or use, that learners do not generally learn from one another's errors, and so on; or, in more abstract terms, that variable x has such-and-such an effect on variable or population y (or, when generalized, on possibly all language learners). In part, this extension of findings in the absence of widespread evidence is a symptom of a young field that may seek originality or novelty in studies, more so than the robustness, durability, or replicability of findings with different pools of subjects/participants and contexts, on which basis they could support theoretical conclusions previously drawn.

Another method for enhancing the potential generalizability as well as credibility of qualitative research is to conduct multisite or multiple-case studies, not all of which may have the same attributes. Schofield (1990) asserts that "a finding emerging from the study of several very heterogeneous sites would be more robust and thus more likely to be useful in understanding various other sites than one emerging from the study of several very similar sites" (p. 212). We could also replace "sites" with "people." If we found similar developmental trends across three Cambodian learners – the original subject referred to above (a refugee with interrupted education), an instructed university student, and a child learner – it would add to the robustness of the original observations or suggest alternate developmental pathways. On this same theme, in lieu of multiple cases or larger surveys of general trends, researchers may look for an aggregation of case studies (comparable to a meta-analysis of quantitative studies) to corroborate findings across studies. Stake (2000) refers to these as collective case studies (of either a similar or different nature). Researchers may also include participants' own judgments about generalizability or representativeness to assist them with interpretations about typicality (Hammersley 1992). Finally, longitudinal studies with data collected from multiple sources or task types also have the potential to strengthen the inferences that can be drawn about learning, because the developmental pathways (consistent, incremental, erratic, or very dynamic) can be shown, as well as interactions in the acquisition of several interrelated structures over time (Huebner 1983).

The approach taken by Harklau (1994), Leki (1995), McKay and Wong (1996), Toohey (2000), and Willett (1995) in their longitudinal ethnographic classroom-based or classroom-oriented studies in TESOL/applied linguistics was to select three to five cases for in-depth analysis within the context of classrooms with diverse (e.g., immigrant, minority, local) learners (see Duff in press, for details). Sometimes focal cases in qualitative research are also analyzed against the backdrop of a larger set of participants and data (e.g., Kouritzen 1999; Li 1998). The authors may marshal various kinds of evidence (e.g., excerpts from interviews or classroom interactions) to support their claims. For example, Harklau's article included 21 short excerpts from interview data, taken primarily from students, to support her observations, organized around the themes of spoken language use in the [mainstream] classroom, spoken language use in the ESL classroom, written language use in the mainstream, written language use in the ESL classroom, structure and goals of instruction, explicit language instruction, and socializing functions of schooling. Leki (1995), on the other hand, focused on the challenges faced by three graduate and two undergraduate international (visa) students from Europe and Asia in their first semester at an American university. Of interest were the English writing requirements in their disciplinary courses across the curriculum and the students' coping strategies as newcomers to the local academic culture. The data presented include well-rounded profiles of the five focal students (the cases), followed by a description and discussion of ten general themes (strategies) that surfaced across the five students' experiences, as well as differences among them. Nine short quotations or excerpts from the students' interviews, journals, or assignments were included from the corpus of transcribed data. Neither study included excerpts of classroom discourse.

Most of my graduate students conducting case studies or ethnographies similarly select 4–6 focal participants for study in one or more sites (Duff & Uchida 1997; Morita 2002, 2004; Kobayashi 2003). Choosing six initially means that if there is attrition among participants, there will likely be 3–4 cases remaining, providing multiple exemplars of the phenomenon under investigation. Almost all have collected data through extensive and sustained classroom observations, interviews, document analysis, artifacts (e.g., term papers, PowerPoint presentations), researcher fieldnotes or journals, and sometimes participants' journals or logs as well. The greater the number of participants, cases, or sites, however, the less possible it is to provide an in-depth description and contextualization of each one, taking fully into account the complexity of interactions, the perspectives of the participants, and so on. However, having more than one focal case can provide interesting contrasts or corroboration across cases. The cases are carefully selected from the larger set of potential candidates, based on oral proficiency scores, gender, age, length of time in a program, university major, or other such variables deemed important in relation to the research questions. Exceptional cases, just like counter-examples to findings, need to be explained.
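The logic of selecting focal cases from a larger candidate pool along a key variable can be sketched computationally. The following Python fragment is a hypothetical illustration only: the participant IDs and proficiency scores are invented, and real case selection would of course weigh multiple variables and practical considerations, not a single score.

```python
# Hypothetical illustration of case selection: rank a candidate pool on one
# selection variable (an invented oral-proficiency score) and take evenly
# spaced cases, so the focal sample spans the range of that variable rather
# than defaulting to the most convenient participants.
candidates = [
    {"id": f"P{i:02d}", "proficiency": score}
    for i, score in enumerate(
        [55, 62, 48, 71, 80, 66, 59, 74, 52, 68, 77, 61], start=1
    )
]

def select_focal_cases(pool, n=6, key="proficiency"):
    """Return n cases spread across the range of the selection variable."""
    ranked = sorted(pool, key=lambda p: p[key])
    step = max(1, len(ranked) // n)
    return ranked[::step][:n]

focal = select_focal_cases(candidates)
print([p["proficiency"] for p in focal])  # → [48, 55, 61, 66, 71, 77]
```

Starting with six cases, as noted above, leaves a margin for attrition while still preserving exemplars across the range.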


Although Polio and Duff (1994) was neither an ethnography nor a case study in the core sense of being an in-depth analysis of one context, by conducting a qualitative analysis of instructional language practices (in terms of L1 vs. L2 use) across first-year college-level courses in 13 different languages (rather than just one or two), many of which were typologically unrelated (e.g., Spanish, Korean, Polish, Hebrew), we were able to determine both the range of L1 use in L2 teaching at one particular university and some similarities across the courses with regard to the functions of L1 use, regardless of the L2. Although we couldn't generalize the descriptive statistics about L1-L2 use from this university, the heterogeneity of the sample provided a good foundation for the possible transferability of the findings to other universities or sites – and certainly raised questions for future research as to (1) how much L1 vs. L2 is used in foreign language classrooms; (2) what the functions of L1 vs. L2 are in classroom discourse; and (3) how to increase L2 use – which we argued was a worthy and theoretically and pedagogically defensible goal. Subsequently, certain other researchers have questioned the premise (that more L2 use is better; e.g., Cook 2001), but that has not hampered the transferability of the findings regarding pervasive L1 use in certain foreign language courses and the reasons for it.
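The descriptive side of such a study – the range of L1 use across heterogeneous courses – can be illustrated with a minimal sketch. The course labels and proportions below are invented for illustration and are not Polio and Duff's (1994) actual figures.

```python
# Hypothetical per-course proportions of class time conducted in the L1.
# All figures are invented, not data from Polio & Duff (1994).
l1_proportion = {
    "Spanish": 0.10,
    "Korean": 0.45,
    "Polish": 0.30,
    "Hebrew": 0.90,
}

def summarize(props):
    """Return the minimum, maximum, and mean L1-use proportion."""
    values = sorted(props.values())
    mean = sum(values) / len(values)
    return values[0], values[-1], mean

low, high, mean = summarize(l1_proportion)
print(f"L1 use ranged from {low:.0%} to {high:.0%} (mean {mean:.0%})")
```

As the chapter notes, even where such descriptive statistics cannot be generalized beyond one university, a wide range across a heterogeneous sample supports the possible transferability of the qualitative findings about the functions of L1 use.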

In this section, I have discussed ways of trying to establish typicality, representativeness, or even maximum variation across cases, although many researchers choose their case study or ethnography sites and participants more opportunistically. Stake (1995, 2000), a case study methodologist, differentiates between what he calls intrinsic and instrumental case studies (categories that he concedes are best seen on a continuum) and does not insist on seeking representativeness in all case studies. He also claims that studies differ in the claims they may wish to make regarding generalizability:

I call a study an intrinsic case study if it is undertaken because, first and last, the researcher wants better understanding of this particular case. Here, it is not undertaken primarily because the case represents other cases or because it illustrates a particular trait or problem, but because, in all its peculiarity and ordinariness, this case itself is of interest. . . . The purpose is not to come to understand some abstract construct or generic phenomenon, such as literacy or teenage drug use or what a school principal does. The purpose is not theory building – although at other times the researcher may do just that.

. . . I call it instrumental case study if a particular case is examined mainly to provide insight into an issue or to draw a generalization. The case is of secondary interest, it plays a supporting role, and it facilitates our understanding of something else. The case is still looked at in depth, its contexts scrutinized, its ordinary activities detailed, but all because this helps the researcher to pursue the external interest. The case may be seen as typical of other cases or not. . . (Stake 2000:437)

The notion of instrumental case study is related to the concept of analytic generalization (generalization to theory, not to populations), which I describe in the next section. Stake also suggests that "[e]ven intrinsic case study can be seen as a small step toward grand generalization. . ., especially in the case that runs counter to the existing rule" (p. 439). A good example of this is Schmidt's (1983) analysis of Wes, a Japanese artist living in Hawaii whose English did not develop very well despite his high degree of acculturation in the local English-speaking American community. I do not believe that this research subject was selected for his intrinsic interest value alone but, regardless, Schmidt used this well-known case as a way of refuting (if not "falsifying") Schumann's Acculturation Model; the model, based largely on Schumann's case study of a Costa Rican immigrant to America named "Alberto," posited that acculturation is a major causal factor in successful SLA (Schumann 1978).

Analytic generalization

My SLE research has examined the linguistic and/or interactional behavior of students in a variety of types of classrooms. Increasingly, I have framed the work in terms of language socialization, involving the ethnography of communication and/or case studies, and have added poststructural interpretations of data in some cases, although the exact combination of methods and the thematic focus depends on the research questions being addressed (Duff 2002a). Studying a similar theoretical process (language socialization) across very different contexts (in dual-language schools in Hungary, in heterogeneous Canadian high schools and vocational settings) provides a level of generalization as well, about the construct or process in question (Duff 2003). The generalizability, then, is not to populations but to theoretical models, often captured in simple diagrams, which also take into account the complexity of second language learning and the multiple possible outcomes that exist. However, Bromley (1986) notes that even producing diagrams capturing processes, which is a form of data or theme reduction as well as representation, provides a level of abstraction and generality beyond the details of the local case(s) (see also Miles & Huberman 1994, for illustrations of the multiple ways in which data can be visually presented). Such models and heuristics must be backed up with logical reasoning and evidence that warrant the inferences that are drawn by the researcher or may be drawn by others (Bromley 1986).

Conclusion

In this chapter, I have presented some basic contrasts between the way the terms inference and generalizability are operationalized – or contested – in quantitative research and qualitative research, underscoring the "multiple research perspectives" promised in the subtitle of this book. I noted that many qualitative researchers have rejected generalizability in the traditional sense as an achievable or desirable objective of research, although others with a more positivist bent are more committed to it (e.g., Yin 2003; Miles & Huberman 1994). Qualitative researchers embracing generalizability suggest that careful sampling for representativeness, conducting multisite and multiple-case studies, providing careful chains of reasoning (inference and interpretation), and explaining aberrant data or cases enhance the generality of the findings of studies. On the other hand, those rejecting traditional notions of generalizability highlight elements more associated with internal validity and reliability (infrequently using those terms), and effective reasoning and writing instead: for example, providing thick description, credible evidence, thorough data analysis, and appropriate representation of contexts and data so that readers can learn from others' experiences and draw their own conclusions (or inferences) about transferability and relevance. I then provided examples of how my own and others' case studies and ethnographic research have either broached generalizability or instead sought to foreground the contextualization, complexity, and credibility of studies of language education and learning.

A final comment relates to reporting in applied linguistics research. No matter how generalizable a study might be, the results of research are often inaccessible or incomprehensible to others because of the discourse associated with the inquiry tradition. If readers are unable to easily comprehend the results and understand the chains of reasoning, the new knowledge or insights will not "travel" or be transferable to new settings. Instead, the findings may only resonate with a small number of like-minded researchers and thus may have less impact on the field than intended or deserved. One of the potential benefits of qualitative research, on its own or in combination with quantitative studies, is that it can provide concrete, situated instances of an abstract phenomenon, which, when done well, may contribute meaningfully to theory-building and to knowledge in the field. Qualitative research has its own share of challenges (and diversity, as Lazaraton 2003 argues in her comparison of ethnography and conversation analysis), foremost perhaps being the difficulty of coming up with a powerful, elegant, and relatively straightforward description of the complex relationships among the many factors observed in cases or sites of interest, while at the same time demonstrating that the interpretations are solidly grounded in empirical data and observation. As new approaches to research in applied linguistics emerge and evolve, the criteria, genres, and possibilities for reporting and evaluating research will undoubtedly change as well, and new perspectives on inference and generalizability will result (Duff 2002a). In addition, ideally more multi-method, multi-site and multinational/multilingual studies of SL acquisition and education phenomena will be conducted to demonstrate the generality of the conditions and findings of research in critical areas.

Patricia A. Duff

Notes

In testing, there are particular statistical procedures connected with an area known as Generalizability Theory as well.

Another nagging problem with many small-scale quantitative studies in SLA/SLE is that our means of quantifying and drawing statistical inferences is often less than ideal. There is a lack of robust, truly appropriate statistical procedures for the kinds of comparisons of frequency/nominal data typical in discourse analyses (with repeated measures) of the sort often used in task-based research, for example. As Hatch and Lazaraton (1991) and Lazaraton (2000) point out, in various analyses of linguistic production and interaction, multiple t-tests, ANOVAs, chi-square statistics, and factor analyses are often used inappropriately; assumptions underlying the statistics may not be met, or tests are used in ways for which they weren't intended mathematically – e.g., by conducting multiple t-tests, by using powerful parametric tests with nonparametric data, or by using nonparametric tests with the wrong kind of nonparametric data, especially for within-subject comparisons of nominal data, such as in comparisons of linguistic constructions of one type produced by research participants compared with other types of constructions the same people use when performing one or more different tasks. All of these difficulties may inadvertently compromise the inferences and generalizations claimed by researchers.
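As a hypothetical illustration of one of these pitfalls (not an example from the chapter), the sketch below computes how the familywise error rate grows when several uncorrected significance tests are run on the same data, and what a simple Bonferroni adjustment does; the function names and the alpha = .05 convention are my own assumptions, not anything specified by the authors cited.

```python
# Hypothetical sketch: why running many uncorrected tests (e.g., multiple
# t-tests on the same transcripts) inflates the chance of at least one
# spurious "significant" result (a Type I error).

def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test significance level that keeps the familywise rate <= alpha."""
    return alpha / k

if __name__ == "__main__":
    alpha, k = 0.05, 10  # e.g., ten pairwise comparisons in one study
    print(f"Uncorrected familywise error rate for {k} tests: "
          f"{familywise_error_rate(alpha, k):.3f}")
    print(f"Bonferroni per-test alpha: {bonferroni_alpha(alpha, k):.4f}")
    print(f"Corrected familywise rate (upper bound): "
          f"{familywise_error_rate(bonferroni_alpha(alpha, k), k):.3f}")
```

For the within-subject comparisons of nominal data mentioned above, tests designed for paired categorical data (e.g., McNemar's test) would be the more appropriate family; this sketch only illustrates the multiple-comparison half of the problem.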

References

Adler, P. A. & Adler, P. (1994). Observational techniques. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 377–392). Thousand Oaks, CA: Sage.


Altheide, D. L. & Johnson, J. M. (1994). Criteria for assessing interpretive validity in qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 485–499). Thousand Oaks, CA: Sage.
Bogdan, R. C. & Biklen, S. K. (1992). Qualitative research for education (2nd ed.). Boston: Allyn and Bacon.
Bracht, G. H. & Glass, G. V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437–478.
Breen, M. (1985). The social context for language learning – A neglected situation? Studies in Second Language Acquisition, 7, 135–158.
Bromley, D. B. (1986). The case-study method in psychology and related disciplines. New York, NY: John Wiley & Sons.
Brown, J. D. & Rogers, T. S. (2002). Doing second language research. Oxford: OUP.
Chapelle, C. & Duff, P. (Eds.). (2003). Some guidelines for conducting quantitative and qualitative research in TESOL. TESOL Quarterly, 37, 157–178.
Cook, V. (2001). Using the first language in the classroom. Canadian Modern Language Review, 57, 402–423.
Coughlan, P. & Duff, P. (1994). Same task, different activities: Analysis of a SLA [second language acquisition] task from an Activity Theory perspective. In J. Lantolf & G. Appel (Eds.), Vygotskian perspectives on second language research (pp. 173–193). New Jersey: Ablex.
Crabtree, B. & Miller, W. (Eds.). (1999). Doing qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Creswell, J. (1998). Qualitative inquiry and research design: Choosing among five traditions. Thousand Oaks, CA: Sage.
Curtiss, S. (1977). Genie: A psycholinguistic study of a modern-day "wild child." New York: Academic Press.
Davis, K. (1995). Qualitative theory and methods in applied linguistics research. TESOL Quarterly, 29, 427–453.
Denzin, N. & Lincoln, Y. (Eds.). (2000). Handbook of qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Donmoyer, R. (1990). Generalizability and the single-case study. In E. Eisner & A. Peshkin (Eds.), Qualitative inquiry in education: The continuing debate (pp. 201–232). New York, NY: Teachers College Press.
Duff, P. (1993a). Syntax, semantics, and SLA: The convergence of possessive and existential constructions. Studies in Second Language Acquisition, 15, 1–34.
Duff, P. (1993b). Tasks and interlanguage performance: An SLA [second language acquisition] research perspective. In G. Crookes & S. Gass (Eds.), Tasks in language learning: Integrating theory and practice (pp. 57–95). Clevedon, UK: Multilingual Matters.
Duff, P. (1993c). Changing times, changing minds: Language socialization in Hungarian-English schools. PhD Dissertation, University of California, Los Angeles.
Duff, P. (1995). An ethnography of communication in immersion classrooms in Hungary. TESOL Quarterly, 29, 505–537.


Duff, P. (1996). Different languages, different practices: Socialization of discourse competence in dual-language school classrooms in Hungary. In K. Bailey & D. Nunan (Eds.), Voices from the language classroom: Qualitative research in second language acquisition (pp. 407–433). New York: CUP.
Duff, P. (1997). Immersion in Hungary: An EFL experiment. In R. K. Johnson & M. Swain (Eds.), Immersion education: International perspectives (pp. 19–43). New York, NY: CUP.
Duff, P. (2001). Language, literacy, content and (pop) culture: Challenges for ESL students in mainstream courses. Canadian Modern Language Review, 58, 103–132.
Duff, P. (2002a). Research approaches in applied linguistics. In R. Kaplan (Ed.), The Oxford handbook of applied linguistics (pp. 13–23). Oxford: OUP.
Duff, P. (2002b). The discursive co-construction of knowledge, identity, and difference: An ethnography of communication in the high school mainstream. Applied Linguistics, 23, 289–322.
Duff, P. (2003, March). New directions and issues in second language socialization research. Plenary talk presented at the annual meeting of the American Association for Applied Linguistics, Arlington, Virginia.
Duff, P. (2004). Intertextuality and hybrid discourses: The infusion of pop culture in educational discourse. Linguistics and Education, 14(3–4), 231–276.
Duff, P. (in press). Qualitative approaches to second language classroom research. In J. Cummins & C. Davison (Eds.), Handbook of English language teaching. Dordrecht: Kluwer.
Duff, P. & Early, M. (1996). Problematics of classroom research across sociopolitical contexts. In S. Gass & J. Schachter (Eds.), Second language classroom research: Issues and opportunities (pp. 1–30). Hillsdale, NJ: Lawrence Erlbaum.
Duff, P. & Li, D. (2004). Issues in Mandarin language instruction: Theory, research, and practice. System, 32, 443–456.
Duff, P. & Uchida, Y. (1997). The negotiation of teachers' sociocultural identities and practices in postsecondary EFL classrooms. TESOL Quarterly, 31, 451–486.
Duff, P., Wong, P., & Early, M. (2000). Learning language for work and life: The linguistic socialization of immigrant Canadians seeking careers in healthcare. Canadian Modern Language Review, 57, 9–57.
Duranti, A. & Goodwin, C. (Eds.). (1992). Rethinking context: Language as an interactive phenomenon. Cambridge: CUP.
Edge, J. & Richards, K. (1998). May I see your warrant, please?: Justifying outcomes in qualitative research. Applied Linguistics, 19, 334–356.
Eisner, E. & Peshkin, A. (1990). Qualitative inquiry in education: The continuing debate. New York, NY: Teachers College Press.
Firestone, W. A. (1993). Alternative arguments for generalizing from data as applied to qualitative research. Educational Researcher, 22, 16–23.
Fraenkel, J. R. & Wallen, N. E. (1996). How to design and evaluate research in education (3rd ed.). New York: McGraw Hill.
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research (6th ed.). White Plains, NY: Longman.
Gall, M. D., Gall, J. P., & Borg, W. T. (2003). Educational research (7th ed.). White Plains, NY: Pearson Education.


Hammersley, M. (1992). Some reflections on ethnography and validity. Qualitative Studies in Education, 5, 195–204.
Harklau, L. (1994). ESL versus mainstream classes: Contrasting L2 learning environments. TESOL Quarterly, 28, 241–272.
Hatch, E. (Ed.). (1978). Second language acquisition. Rowley, MA: Newbury House.
Hatch, E. & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston, MA: Heinle and Heinle.
Holliday, A. (1994). Appropriate methodology and social context. Cambridge: CUP.
Holliday, A. (2002). Doing and writing qualitative research. Thousand Oaks, CA: Sage.
Huebner, T. (1983). A longitudinal analysis of the acquisition of English. Ann Arbor, MI: Karoma Publishers.
Hulstijn, J. & DeKeyser, R. (Eds.). (1997). Testing SLA theory in the research laboratory. Special issue, Studies in Second Language Acquisition, 19, 2.
Ioup, G. (1989). Immigrant children who have failed to acquire native English. In S. Gass, C. Madden, D. Preston, & L. Selinker (Eds.), Variation in second language acquisition: Psycholinguistic issues (pp. 160–175). Clevedon, UK: Multilingual Matters.
Ioup, G., Boustagui, E., El Tigi, M., & Moselle, M. (1994). Re-examining the critical period hypothesis: A case study of successful adult second language acquisition in a naturalistic environment. Studies in Second Language Acquisition, 16, 73–98.
Johnson, D. M. (1992). Approaches to research in second language learning. New York, NY: Longman.
Kobayashi, M. (2003). The role of peer support in ESL students' accomplishment of oral academic tasks. Canadian Modern Language Review, 59, 337–368.
Kouritzin, S. (1999). Face[t]s of first language loss. Mahwah, NJ: Lawrence Erlbaum.
Koschmann, T. (1999). Meaning making. Special issue of Discourse Processes, 27(2).
Krathwohl, D. (1993). Methods of educational and social science research. White Plains, NY: Longman.
Kramsch, C. (Ed.). (2002). Language acquisition and language socialization: Ecological perspectives. New York, NY: Continuum.
Larsen-Freeman, D. (2002). Language acquisition and use from a chaos/complexity theory perspective. In C. Kramsch (Ed.), Language acquisition and language socialization (pp. 33–46). New York, NY: Continuum.
Larsen-Freeman, D. & Long, M. H. (1991). An introduction to second language acquisition research. New York, NY: Longman.
Lather, P. (1991). Getting smart: Feminist research and pedagogy within the postmodern. New York: Routledge.
Lazaraton, A. (2000). Current trends in research methodology and statistics in applied linguistics. TESOL Quarterly, 34, 175–181.
Lazaraton, A. (2003). Evaluating criteria for qualitative research in applied linguistics: Whose criteria and whose research? Modern Language Journal, 87, 1–12.
Leki, I. (1995). Coping strategies of ESL students in writing tasks across the curriculum. TESOL Quarterly, 29, 235–260.
Li, D. (1998). Expressing needs and wants in a second language: An ethnographic study of Chinese immigrant women's requesting behavior. Unpublished doctoral dissertation, Teachers College, Columbia University, New York.


Lincoln, Y. & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Maxwell, J. A. (1996). Qualitative research design: An interactive approach. Thousand Oaks, CA: Sage.
McKay, S. L. & Wong, S. C. (1996). Multiple discourses, multiple identities: Investment and agency in second-language learning among Chinese adolescent immigrant students. Harvard Educational Review, 66, 577–608.
Merriam, S. (1998). Qualitative research and case study applications in education (2nd ed.). San Francisco, CA: Jossey-Bass.
Miles, M. & Huberman, A. M. (1994). Qualitative data analysis (2nd ed.). Thousand Oaks, CA: Sage.
Morita, N. (2002). Negotiating participation in second language academic communities: A study of identity, agency, and transformation. PhD Dissertation, University of British Columbia.
Morita, N. (2004). Negotiating participation and identity in second language academic communities. TESOL Quarterly, 38, 573–603.
Neuman, W. L. (1994). Social research methods: Qualitative and quantitative approaches (2nd ed.). Boston: Allyn & Bacon.
Norton, B. (2000). Identity and language learning: Gender, ethnicity and educational change. London: Pearson Education.
Norton, B. & Toohey, K. (2001). Changing perspectives on good language learners. TESOL Quarterly, 35, 307–322.
Palys, T. (1997). Research decisions: Quantitative and qualitative perspectives (2nd ed.). Toronto: Harcourt, Brace, Jovanovich.
Peräkylä, A. (1997). Reliability and validity in research based on tapes and transcripts. In D. Silverman (Ed.), Qualitative research: Theory, method and practice. Thousand Oaks, CA: Sage.
Polio, C. & Duff, P. (1994). Teachers' language use in university foreign language classrooms: A qualitative analysis of English and target language alternation. Modern Language Journal, 78, 313–326.
Richardson, L. (1994). Writing: A method of inquiry. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 516–529). Thousand Oaks, CA: Sage.
Schmidt, R. (1983). Interaction, acculturation and the acquisition of communicative competence. In N. Wolfson & E. Judd (Eds.), Sociolinguistics and language acquisition (pp. 137–174). Rowley, MA: Newbury House.
Schmidt, R. & Frota, S. (1986). Developing basic conversational ability in a second language: A case study of an adult learner of Portuguese. In R. Day (Ed.), Talking to learn: Conversation in second language acquisition (pp. 237–326). Rowley, MA: Newbury House.
Schofield, J. W. (1990). Increasing the generalizability of qualitative research. In E. Eisner & A. Peshkin (Eds.), Qualitative inquiry in education: The continuing debate (pp. 201–232). New York, NY: Teachers College Press.
Schumann, J. (1978). The pidginization process: A model for second language acquisition. Rowley, MA: Newbury House.
Schumann, J. (1997). The neurobiology of affect in language. Malden, MA: Blackwell.
Silverman, D. (2000). Doing qualitative research. Thousand Oaks, CA: Sage.


Stake, R. (1995). The art of case study research. Thousand Oaks, CA: Sage.
Stake, R. (2000). Case studies. In N. Denzin & Y. Lincoln (Eds.), Handbook of qualitative research (2nd ed., pp. 435–454). Thousand Oaks, CA: Sage.
Toohey, K. (2000). Learning English at school: Identity, social relations and classroom practice. Clevedon, UK: Multilingual Matters.
van Lier, L. (1988). The classroom and the language learner. New York, NY: Longman.
van Lier, L. (1997). Observation from an ecological perspective. TESOL Quarterly, 31, 783–787.
Watson-Gegeo, K. (1988). Ethnography in ESL: Defining the essentials. TESOL Quarterly, 22, 575–592.
Willett, J. (1995). Becoming first graders in an L2: An ethnographic study of language socialization. TESOL Quarterly, 29, 473–504.
Yin, R. (2003). Case study research: Design and methods (3rd ed.). Thousand Oaks, CA: Sage.

JB[v.20020404] Prn:13/12/2005; 10:34 F: LLLT1206.tex / p.1 (46-106)

Verbal protocols

What does it mean for research to use speaking as a data collection tool?*

Merrill Swain
The Ontario Institute for Studies in Education of the University of Toronto

What do verbal protocols represent? The response to this question differs depending on the theoretical perspective one takes. I will examine the answer to this question from an information processing perspective and from a sociocultural theory of mind perspective. The different assumptions underlying the two theories lead to different interpretations of verbal protocol data. Research evidence is provided that suggests that verbal protocols, particularly stimulated recalls, are a source of learning. This suggests that verbal protocols cannot be used neutrally as a method of collecting data in second language acquisition research, but instead they need to be considered as part of the "treatment" when making claims about learning and development. Verbal protocols are not just "brain dumps"; rather, they are a process of comprehending and reshaping experience – they are part of what constitutes development and learning.

In psychological research, we are interested in determining the processes individuals use when they engage in an activity. One source of information about those processes is the individual him/herself – who tells the researcher about what he or she is thinking during or after the activity. The data generated by this methodology are referred to as verbal protocols. The purpose of any data collection is to draw inferences from the data collected. Inference, in the context of the present chapter, deals with the type of interpretation that researchers can make with regard to verbal protocol data. In this chapter, I intend to discuss the different inferences that are drawn by information processing theorists and sociocultural (theory of mind) theorists from verbal protocols. My intent is to question inferences that are made within the more traditional cognitive paradigm, arguing for alternative interpretations of verbal protocol data based on sociocultural research and theory. I also consider the extent to which we can generalize from inferences drawn from local and situated contexts (Chalhoub-Deville 2003). The ultimate purpose of this chapter is to argue that we, as researchers, can no longer think of verbal protocols as a "neutral" methodology – that is, as a methodology that has no impact on our findings.

Gass and Mackey (2000) define verbal protocols as the data one gets "by asking individuals to vocalize what is going through their minds as they are solving a problem or performing a task" (p. 13). Verbal protocols can take the form of concurrent "think alouds", where individuals say what is going through their mind while they are in the process of solving the problem or performing the task (e.g., Cumming 1990). Or, verbal protocols can take the form of some sort of retrospective introspection, for example, a stimulated recall (e.g., Mackey 2002; Swain & Lapkin 2002). In a stimulated recall, individuals are provided with a stimulus which constitutes a bit of their past behaviour. For example, individuals may be shown a clip of a video in which they appear and are asked to talk about what was going through their minds at that particular time (e.g., Swain & Lapkin in press).

The specific question I wish to address in this chapter is: What do these verbal protocols represent? This is followed by a section on relevant research. The implications for research of the different interpretations of verbalizing (speaking) are considered at the end of the chapter. Overall, I attempt to answer what it means for researchers in second language learning to use speaking as a data collection tool.

What do verbal protocols represent?

I examine this question from the perspective of two different theories of human cognition: information processing theory (Ericsson & Simon 1993, 1998) as representative of recent (last three decades) thinking in cognitive science, and a sociocultural theory of mind (e.g., Lantolf 2000; Vygotsky 1978; Wertsch 1985, 1991; Wertsch & Tulviste 1996). The issue which underlies the debate is no less than the relationship between language and thought – an issue that "has been little discussed in recent decades, since many have thought the issue to be closed" (Carruthers & Boucher 1998b:2) because of the dominance of the computer metaphor for the mind in the cognitive sciences (see, e.g., Searle 2002 for a critique of the metaphor). In this view, language is "but an input and output module for central cognition" (Carruthers & Boucher 1998b:2). This view, however, is being challenged, even within the cognitive sciences, as indicated by a recent book edited by Carruthers and Boucher (1998a), "Language and Thought: Interdisciplinary Themes." Also, this view is being challenged by those influenced by Vygotsky (1978) and his colleagues and students (e.g., Gal'perin 1969; Luria 1973), whose writings have only reached North American second language theoreticians and researchers in the last decade or so (e.g., Lantolf 2003; Lantolf & Appel 1994).

Carruthers and Boucher (1998b) suggest that there are roughly two opposing camps among those who are interested in the place of language in cognition. There are those who see "the exclusive function and purpose of language to be the communication of thought, where thought itself is largely independent of the means of its transmission from mind to mind" (p. 1). Information processing theory falls into this camp. Alternatively, there are those who see language as "crucially implicated in human thinking... that language itself is constitutively involved in [some kinds of thinking]" (p. 1). Language is not simply a vehicle for communication, but plays critical roles in creating, transforming, and augmenting higher mental processes. Sociocultural theory of mind falls into this second camp. The different assumptions underlying the two perspectives – information processing theory and sociocultural theory – lead to different interpretations of the data elicited in verbal protocols (Smagorinsky 1998).

According to information processing theory (Ericsson & Simon 1993), think alouds are a report of the (oral) contents of short-term memory, and represent a trace of the cognitive processes that people attend to while doing a task. In a stimulated recall, as Gass and Mackey (2000) point out, "the use of and access to memory structures is enhanced, if not guaranteed, by a prompt that aids in the recall of information" (p. 17). In both cases the assumption is that verbal protocols provide data for investigating cognition directly from memory. According to Ericsson and Simon (1993:222), the verbalization "is a direct encoding of the heeded thought and reflects its structure". The data tell us "what information [individuals] are attending to while performing their tasks, and by revealing this information, can provide an orderly picture of the exact way in which the tasks are being performed: the strategies employed, the inferences drawn from information, ..." (p. 220). In this way, verbal protocols provide the evidence from which models of human cognitive processing are generated.

Ericsson and Simon (1993) make a distinction between instructions to participants to verbalize thoughts per se, what they refer to as Type 1 and Type 2 verbalization, and instructions to verbalize specific information, such as reasons and explanations (Type 3 verbalization). Type 1 and Type 2 verbalizations do not, they claim, change the sequencing of the cognitive processes, but the time to carry them out may be longer as a result of the verbalizing. As for Type 3 verbalization, they summarize their review of the research literature as follows: "...directing subjects to engage in specific thought activities with associated overt verbalization changes the cognitive processes and thus alters concurrent and retrospective performance" (p. xix). They continue that "...the effects of directing verbalization do not involve any magical influences but can be understood in terms of the changes induced in the associated cognitive process by the instructions" (p. xix; my italics). In other words, Ericsson and Simon understand the instructions as causal, not the verbalization.

...in the review of studies comparing different instructions to verbalize, we found substantial evidence that differences in performance were induced by telling the subject how to verbalize. In order to verbalize the information called for by the instructions, instead of the information he would normally have attended to, he had to change his thought processes.

(Ericsson & Simon 1993:107)

As Vygotsky (1986:218) asked, however, "Does language only reflect thought (memory) or can it change thought (memory)?" Vygotsky believed that "thought is not merely expressed in words; it comes into existence through them" and that "thought undergoes many changes as it turns into speech: it finds its reality and form" (p. 219). "The process of rendering thinking into speech is not simply a matter of memory retrieval, but a process through which thinking reaches a new level of articulation" (Smagorinsky 1998:172–173). Ideas are crystallized and sharpened, and inconsistencies become more obvious. Smagorinsky (2001) makes clear the implication of this position: "If thinking becomes rearticulated through the process of speech, then the protocol is not simply representative of meaning. It is, rather, an agent in the production of meaning" (p. 240). (See also Vygotsky 1997.)

In a sociocultural theory of mind, verbalization is conceived of as a tool that enables changes in cognition. Speech serves to mediate cognition. Initially an exterior source of physical and mental regulation, speech takes on these regulatory functions for the self. One's own speech (through a process of internalization) comes to regulate, organize, and focus an individual's own mental activities (e.g., Luria 1959, 1973; Sokolov 1972). Clark (1998) refers to this role of language in human cognition as "attention and resource allocation" (p. 172): speech helps us to focus our attention, monitor and control our behaviour.

Another way in which language intersects with the activities of the mind is that it allows ideas to be retained and held up for inspection by the self and others; it allows ideas to move between people. Such movement allows for "the communal construction of extremely delicate and difficult intellectual trajectories and progressions... moreover, the sheer number of intellectual niches available within a linguistically linked community provides a stunning matrix of possible inter-agent trajectories" (Clark 1998:172).1 This is what Vygotsky (1978) meant by proposing that the source of learning is social; and what Salomon (1993) and others mean by "distributed cognition".2

Vygotsky (1987), Barnes (1992), Wells (1999), and others argue that speech can serve as a means of development by reshaping experience. It serves as a vehicle "through which thinking is articulated, transformed into an artifactual form, and [as such] is then available as a source of further reflection" (Smagorinsky 1998:172), as an object about which questions can be raised and answers can be explored with others or with the self. Language is data, and with language we are able to manipulate ideas, re-organize them, reshape them, transform them, and construct new ones. "The process of linguistic formulation creates the stable structure to which subsequent thinkings attach" (Clark 1998:177). As Vygotsky (1987) argued, language is a tool which permits our mind to engage in a variety of new cognitive operations and manipulations. "It enables us, for example, to pick out different elements of complex thoughts and to scrutinise each in turn. It enables us to 'stabilise' very abstract ideas in working memory.3 And it enables us to inspect and criticize our own reasoning in ways that no other representational modality allows" (Clark 1998:178).

This being so, verbal protocols – which mediate the articulation of cognition – have the power to influence cognition. They exert this influence in three ways. First, the process of verbalization itself transforms thought, drawing attention to some aspects of the environment and not others, solidifying meaning, and creating an observable artifact. Secondly, as an observable artifact, it can be reflected upon, questioned, manipulated, and restructured. And thirdly, internalization of this now differently understood externalized artifact may occur. What this implies is that verbal protocols not only potentially transform thinking, focussing it in highly specific ways, but also are the sources of changes in cognition. In other words, speech mediates learning and development.

Some relevant research

In recent research, we have been attempting to demonstrate that speaking me-diates second language learning. Our initial work in this area with grade 7 and8 French immersion students has shown that speaking in the form of dialogue

JB[v.20020404] Prn:13/12/2005; 10:34 F: LLLT1206.tex / p.6 (306-404)

Merrill Swain

Table 1. Overview of Swain and Lapkin’s (2002) study

Stage 1 – Writing (Pretest), Week 1, Tuesday: Students work together in pairs to write a story in French.
Stage 2 – Reformulation, Week 1, Wednesday: A native speaker (NS) of French reformulates the story.
Stage 3 – Noticing, Week 1, Thursday: Students notice differences between their own story and the reformulated version of their story. This is video-taped.
Stage 4 – Stimulated Recall, Week 2, Monday: The video of the noticing session is shown and students are asked to verbalize what they were thinking at the time the video was made.
Stage 5 – Posttests, Week 2, Thursday: Students individually rewrite their story.
Stage 6 – Interviews, Week 2, Friday: Students are interviewed individually about the various stages of the study.

mediates second language learning (e.g., Swain & Lapkin 1998). We have called the particular form of dialogue we have been investigating “collaborative dialogue”.4 We have shown that through collaborative dialogue, learners come to know what they do not know or know only partially about language, focus their attention on aspects of language that are problematic for them, raise questions about those problematic aspects of language and respond to those questions (formulate and test hypotheses), and, in so doing, consolidate their existing knowledge or create knowledge that is new for them.

In our more recent work (e.g., Swain & Lapkin 2002, in press), we have added stimulated recalls to our research procedures to try to understand learning processes better from the learners’ perspectives. We have also incorporated a pretest/posttest design. This is shown in Table 1. Between the pretest and posttest, students (Grade 7 French immersion students) examine a reformulation of a story they have written and are asked to notice what differences there are between the story they wrote and the reformulated version of their story. While the students are engaged in noticing the differences, they are video-taped. Next, the students see themselves noticing the differences between their own story and its reformulated version; the tape is stopped each time the students notice something, and they are asked to tell what they were thinking at the time.

An example of what happens is shown in Table 2. In their story, Nina and Dara, two grade 7 French immersion students, had written “...elle s’endore sans bruit” – meaning “she fell asleep without a sound” – and the reformulator


Verbal protocols

changed this to “...elle s’endore dans le silence” – meaning “she fell asleep in the silence”, which altered the meaning of Nina and Dara’s original story. Nina and Dara’s version puts the emphasis on how the girl in their story falls asleep, that is, without a sound; and the reformulator’s version highlights the state of the room, which is silent. Nina and Dara noticed this change, and during the stimulated recall, Nina articulates the difference in meaning. She says: “I think sans bruit is more, she, she fell asleep and she didn’t make any noise. But silence is like everything around her is silent.” Here she puts into words the difference in meaning between their version and the reformulator’s version, and when they later individually rewrite their story, although they make use of the reformulator’s word, “silence”, they cleverly manage to preserve their original meaning, Dara by using “silencieusement” and Nina by using “en silence” – both meaning “silently”. Later, when interviewed, Nina makes the more general point about the feedback that they received through the reformulation, “...some of them [the reformulations], they seemed like they changed the story sort of and it wasn’t really ours.” – which explains why and how they used this specific aspect of the reformulator’s feedback.

The key point here is that by verbalizing the difference in meaning between the two versions, the students were able to accommodate the feedback they received, yet preserve their own meaning – a complex cognitive task. Their final written versions (posttests) would seem to have been affected by both the reformulation itself, and a clear articulation of the differences between their meaning and the one that they felt was being imposed on them. It seems likely that what the students said affected their posttest results. However, our research design did not let us separate out the effect of the feedback from that of the stimulated recall – something that will need to be done to understand the effect of such verbalization.

In a recently completed Ph.D. study, because of the vagaries of doing classroom research in real time, Nabei (2002) separated the effects of the feedback (oral recasts from the teacher) from the stimulated recall. An overview of her study is shown in Table 3. As shown in Table 3, Nabei videotaped the teacher-student interaction in an EFL college class in Japan and took note of all the recasts that the teacher made of student utterances. Based on the specific recasts that occurred, she developed test items that she then administered to the students. After that, she held a stimulated recall session with each student individually in which she showed each student each recast episode and asked them what they were thinking at the time. But, due to time constraints, it was not always possible to show each and every recast episode. This cycle of videoing the class interaction, developing and administering posttest items based on


Table 2. Sans bruit/dans le silence: Nina and Dara

Stage 1 – Writing (Pretest), Week 1, Tuesday: Nina and Dara write: ...elle s’endore sans bruit.*
Stage 2 – Reformulation, Week 1, Wednesday: Reformulator writes: ...elle se rendort dans le silence.
Stage 3 – Noticing, Week 1, Thursday: D: (reading from the reformulated version) Se rendort en silence. N: Qu’est-ce qu’on a mis? D: Sans bruit. N: Okay.
Stage 4 – Stimulated Recall, Week 2, Monday: N: I think sans bruit is more, she, she fell asleep and she didn’t make any noise. But silence is like everything around her is silent.
Stage 5 – Posttests, Week 2, Thursday: Nina writes: ...elle se rendore en silence. Dara writes: ...elle s’endore silencieusement.
Stage 6 – Interviews, Week 2, Friday: N: Some of them (the reformulations), they seemed like they changed the story sort of and it wasn’t really ours.

* italics = what students and reformulator wrote

Table 3. Overview of Nabei’s (2002) study

the specific recasts provided by the teacher, and then conducting stimulated recall sessions was repeated weekly for 6 weeks. Three weeks after completing these 6 cycles of data collection, Nabei gave the students the same test items again – constituting the second posttest – in order to determine if the immediate learning from the recasts was maintained over time. Because some recast episodes were not shown to the students in their stimulated recall sessions, in this second posttest some of the items were ones where students had spoken about the episode from which the test item was constructed, while for other items, no such stimulated recall took place.


Table 4. Percentage correct items on posttests

            Items with no associated     Items with associated
            stimulated recall            stimulated recall
Posttest 1  68% (19/28)                  57% (78/138)
Posttest 2  44% (12/28)                  64% (88/138)

The results are shown in Table 4. In the first posttest, given prior to any stimulated recall, there are two findings. On the items for which there was never any associated stimulated recall, the average correct score was 68%, whereas on the items for which there was later related stimulated recall data, the average correct score was 57%. This suggests that the items where the learners provided a stimulated recall protocol may have been those that were more difficult for them. The reason I suggest this is that, in the second posttest, on the items where the learners had provided a stimulated recall protocol, their average correct responses went from 57% to 64%; whereas on the items where a stimulated recall had not taken place, the learners’ average correct responses went from 68% to 44%. This suggests that the stimulated recalls not only helped the students to maintain over time what they had learned from the recast feedback in class, but also to further develop their knowledge.

Adams (2003) replicated the Swain and Lapkin (2002) study with university students of Spanish using a research design which made it possible for her to separate the effects of task repetition alone (students only wrote the pretest and the posttest), noticing (students, after writing the pretest, compared their writing to that of a reformulated version, then wrote the posttest), and stimulated recall (students wrote the pretest, noticed the differences between their writing and the reformulation of it, and then, immediately after the noticing session, recalled what they were thinking at the time of their noticing, stimulated by listening to a recording of their noticing session). The posttest score for each learner was calculated as the proportion of reformulations that were incorporated in a more target-like form to the total number of reformulations. Both the Noticing Group and the Noticing + Stimulated Recall Group significantly outperformed the Task Repetition Group. In a further analysis, when the proportion of more target-like incorporated reformulations to reformulations that the learners had reported noticing was calculated, the Noticing + Stimulated Recall Group significantly outperformed the Noticing Group. These findings suggest that noticing the feedback provided by the reformulation had an effect on the final scores students obtained,


and that the stimulated recall had an impact above and beyond that of noticingthe feedback.
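
Adams’s posttest score is simple arithmetic: the proportion of target-like incorporations to the total number of reformulations a learner received. The following is a minimal sketch with hypothetical counts; the function name and numbers are illustrative, not data from the study:

```python
def posttest_score(targetlike_incorporations, total_reformulations):
    """Proportion of reformulations a learner incorporated in a more
    target-like form, following the scoring scheme described above."""
    return targetlike_incorporations / total_reformulations

# Hypothetical learner: offered 18 reformulations, incorporated 12
# of them in a more target-like form.
score = posttest_score(12, 18)
print(round(score, 2))  # 0.67
```

The further analysis in Adams’s study simply changes the denominator from all reformulations to only those the learner had reported noticing.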

What these results show is that speaking, in the form of a stimulated recall, positively affected the performance of language learners. Information processing theory might claim the results are explained by the participants having had more “time on task”: during that time on task, they were given an additional exposure to the information about the correct response and they attended to that information, strengthening their memory traces. For example, Ericsson and Simon (1993) reported on a study of vocabulary learning (Crutcher 1990) in which half the items were followed by retrospective reports, and the findings showed that retention was better for those items that called for retrospective reports during learning. “This finding was expected, as the retrospective reports involve an additional retrieval of the memory trace linking the vocabulary pair and hence serve as an additional rehearsal and strengthening of the memory trace” (Ericsson & Simon 1993:xxi).

Sociocultural theory claims the results find their source in the verbalization itself. Speaking was not just a report of thought (memory); it shaped and brought thought into existence.

In other educational domains such as mathematics and science, language has been shown to mediate the learning of conceptual content (e.g., Newman, Griffin, & Cole 1989). The Russian developmental psychologist Talyzina (1981) demonstrated in her research the critical importance of language in the formation of basic geometrical concepts. Talyzina’s research was conducted within the theoretical framework of Gal’perin (1969). With Nikolayeva, Talyzina conducted a series of teaching experiments (reported in Talyzina 1981). The series of experiments dealt with the development of basic geometrical concepts such as straight lines, perpendicular lines, and angles.

Three stages were thought to be important in the transformation of material forms of activity to mental forms of activity:5 a material (or materialized) action stage; an external speech stage; and a final mental action stage. In the first stage, students are involved in activities with real (material) objects, spatial models or drawings (materialized objects) associated with the concepts being developed. Speech serves primarily as a means of drawing attention to phenomena in the environment (p. 112). In the second stage, speech “becomes an independent embodiment of the entire process, including both the task and the action” (p. 112). This was instructionally operationalized by having students formulate verbally what they carried out in practice (i.e., materially) – a kind of ongoing think-aloud verbalization.6 And in the final mental action stage, speech is reduced and automated, becoming inaccessible to self-observation


(p. 113). At this stage, students are able to solve geometrical problems without the aid of material (or materialized) objects or externalized speech.

In one of the series of instructional studies conducted by Talyzina and her colleagues, the second stage – the external speech stage – was omitted. The students in the study were average-performing grade-five students in Russia. The performance of students for whom the external speech stage was omitted was compared to that of other students who received instruction related to all three stages. The researchers concluded that the omission of the external speech stage substantially inhibited the transformation of the material activity into a mental one. They suggest this is because verbalization helps the process of abstracting essential properties from nonessential ones, a process that is necessary for an action to be translated into a conceptual form (p. 127). Stated otherwise, verbalization mediates the internalization of external activity.

Holunga (1994) conducted a study concerned with second language learning that has many parallels to those carried out by Talyzina and her colleagues. Holunga’s research involved adults who were advanced second language learners of English. The study was set up to investigate the effects of metacognitive strategy training on the oral accuracy of verb forms. The metacognitive strategies taught in her study were predicting, planning, monitoring and evaluating (Brown & Palincsar 1981). What is particularly interesting in the present context is that one group of her learners was instructed, as a means of implementing the strategies, to talk them through as they carried out communicative tasks in pairs. This group was labelled the metacognitive with verbalization, or MV, group. Test results of this MV group were compared to those of a second group which was also taught the same metacognitive strategies, and which carried out the same communicative tasks in pairs. However, the latter group was not instructed to talk about the metacognitive strategies as they implemented them. This group was called the metacognitive without verbalization, or M, group. A third group of students, included as a comparison group (C group), was also provided with language instruction about the same target items, verbs. Their instruction provided opportunities for oral language practice through the same communicative tasks completed by the other students, but the students in this group were not taught metacognitive strategies. Nor were they required to verbalize their problem-solving strategies. Each group of students in Holunga’s study received a total of 15 hours of instruction divided into 10 lessons. Each lesson included teacher-led instruction plus communicative tasks to be done in pairs.

The students in this study were tested individually, first by being asked a series of discrete-item questions in an interview-like format, and secondly


by being asked three open-ended questions in which learners would give their opinions, tell a story and imagine a situation. The questions were designed to elicit specific verb forms concerning tense, aspect, conditionals and modals, and were scored for the accuracy of their use. A pretest, posttest and delayed posttest were given. The delayed posttest was administered four weeks after the posttest.

The data were analyzed statistically as four separate tests. The analyses revealed that the MV group made significant gains from pre- to posttests in all four tests; the M group made significant gains in only the discrete-item questions. And the C group showed no improvement on any of the four tests. Furthermore, both the MV and M groups’ level of performance at the posttest level was maintained through to the delayed posttests four weeks later. A second set of analyses indicated that both experimental groups performed better than the comparison group on all four tests. Furthermore, the MV group’s performance was superior to that of the M group.

In summary, although those students who were taught metacognitive strategies improved the accuracy of their verb use relative to a comparison group that received no such instruction, students who were taught to verbalize those strategies were considerably more successful in using verbs accurately.

Interpreting these findings through the lens of Talyzina’s theoretical account suggests that for the MV group, external speech mediated their language learning. Verbalization helped them to become aware of their problems, predict their linguistic needs, set goals for themselves, monitor their own language use, and evaluate their overall success. Their verbalization of strategic behaviour served to guide them through communicative tasks, allowing them to focus not only on “saying”, but on “what they said”. In so doing, relevant content (i.e., the artifact that speech produces) was provided that could be further explored and considered. Test results suggest that their collaborative efforts, mediated by dialogue, supported their internalization of correct grammatical forms. (See also Huang 2004; Negueruela 2003; Swain 2005.)

The studies reviewed above suggest that verbalization (speaking), particularly the verbalization that takes place as one reflects (the “saying”) on the artifact created by speech (the “said”), plays a significant role in the development and learning that was demonstrated to have taken place.


Implications

What are the implications for research of the different interpretations of verbalizing (speaking) made by information processing and sociocultural theorists? The inferences one anticipates drawing from verbal protocols are not dissimilar across these two theoretical perspectives. Both aim to develop claims about the higher mental processes participants make use of in carrying out a specific task, e.g., solving a mathematical problem, a logical reasoning problem, or a language problem. Information processing researchers use verbal protocols to develop and test “detailed information processing models of cognition, models that can often be formalized in computer programming languages and analyzed by computer simulation” (Ericsson & Simon 1993:220). Sociocultural theorists also use verbal protocols to discover mental processes underlying task performance7 (Wertsch 1980; Donato & Lantolf 1990; Swain 2000, 2001). Both research agendas try to explain how and why people think and act: information processing by prediction (projecting into the future based on current behaviour); sociocultural theory by genetic analysis (analysis of the process(es) being formed).

Information processing theorists view verbal reports within the limited constraints of individual task performance, seeking to identify similarities within or across group behavior, whereas sociocultural theories take a broader perspective on such data, attempting to explain them in reference to long-term personal histories. Looking at verbal reports from this broader perspective, temporally and circumstantially, people are interacting and changing with all they say and think, regulating themselves and the world around them. Verbal reports as indications of what people are attending to as they try to complete a short task make people seem static and disembodied from their long-term individual development and their social relations, and focused just on the goals associated with that task. Both theories are concerned with learning, but the extent of the perspective each adopts is different (Cumming, personal communication, November 25th, 2003).

The heart of the matter lies in what is considered to be the relationship between thought and language. For information processing theorists, the two are the same, and verbal protocols are a direct encoding of the heeded thought (Ericsson & Simon 1993). For sociocultural theorists, thought is mediated by the cultural artifacts of our situated being. One of the most important cultural artifacts is language. Through the process of speaking – the articulation and completion of thought – our attention may be refocussed, the boundaries of


thought may be expanded or limited, new ideas may be created, etc. In other words, verbalization changes thought, leading to development and learning.

Returning now to the original question, what do verbal protocols represent? Do they represent cognitive “dumps”? Or are they, instead, part of the process of cognitive change, that is, of learning and development? Information processing theory supports the former view. The research I presented in this chapter suggests that the latter view is a strong possibility. It is certainly a matter that needs to be closely studied. If, as I have suggested, speaking and cognitive change can be closely allied, then this needs to be taken account of in any study which makes use of verbal protocols. Verbal protocols cannot be used neutrally as a method of collecting data; instead, they need to be considered as part of the “treatment” when making claims about learning and development. Research tools such as think alouds and stimulated recalls should be understood as part of the learning process, not just as a medium of data collection (Smagorinsky 1998; Swain 2005). Think alouds and stimulated recalls are not, as some would have it, “brain dumps”; rather they are a process of comprehending and reshaping experience – they are part of what constitutes development and learning.

Notes

* I would like to thank Micheline Chalhoub-Deville, Louis Chen, Alister Cumming, David Ishii, Penny Kinnear, Jim Lantolf, Toshiyo Nabei, Sharon Lapkin and Harry Swain for reading earlier draft(s) of this chapter. Their comments led me to infer more and generalize less.

1. Clark is essentially a connectionist. In this regard, it is interesting how closely many of his claims (1998) echo the words of Vygotsky.

2. Even in the case of a person acting alone, cognition is still distributed because of internalization. This is not a point forcefully made in Salomon (1993). (Personal communication, Lantolf, December, 2003.)

3. Clark (1998) suggests that “for certain very abstract concepts, the only route to successful learning may go via the provision of linguistic glosses. Concepts such as charity, extortion and black hole seem pitched too far from perceptual facts to be learnable without exposure to linguistically formulated theories. Language may thus enable us to comprehend equivalence classes that would otherwise lie forever outside our intellectual horizons” (p. 170).

4. Collaborative dialogue is dialogue in which speakers are engaged in problem-solving and knowledge-building – in this case, solving linguistic problems and building knowledge about language.


5. I.e., internalization – conversion of objective to idealized activity (Gal’perin 1969).

6. The difference is that in a think aloud the individual is asked to say what they personally are doing, but in Talyzina’s studies the individual’s speech is supposed to reflect the conceptual understanding of the process provided by the instructor and materialized in some form or other.

7. What is going on in speaking is a genetic process, and so we should be able to improve this process by modifying the things that people say about what it is they are doing (Gal’perin 1967). Gal’perin argues that through conceptualization, materialization, verbalization and internalization we can hasten development.

References

Adams, R. (2003). L2 output, reformulation and noticing: Implications for IL development. Language Teaching Research, 7, 347–376.

Barnes, D. (1992). From communication to curriculum (2nd ed.). Portsmouth, NH: Heinemann.

Brown, A. & Palincsar, A. (1981). Inducing strategic learning from texts by means of informed, self-controlled training. Topics in Learning and Learning Disabilities, 2, 1–17.

Carruthers, P. & Boucher, J. (Eds.). (1998a). Language and thought: Interdisciplinary themes. Cambridge: CUP.

Carruthers, P. & Boucher, J. (1998b). Introduction: Opening up options. In P. Carruthers & J. Boucher (Eds.), Language and thought: Interdisciplinary themes (pp. 1–18). Cambridge: CUP.

Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20, 369–383.

Clark, A. (1998). Magic words: How language augments human computation. In P. Carruthers & J. Boucher (Eds.), Language and thought: Interdisciplinary themes (pp. 162–183). Cambridge: CUP.

Crutcher, R. J. (1990). The role of mediation in knowledge acquisition and retention: Learning foreign vocabulary using the keyword method. Tech Report No. 90–10. Boulder: University of Colorado, Institute of Cognitive Science.

Cumming, A. (1990). Metalinguistic and ideational thinking in second language composing. Written Communication, 7, 482–511.

Donato, R. & Lantolf, J. P. (1990). Dialogic origins of L2 monitoring. Pragmatics and Language Learning, 1, 83–97.

Ericsson, K. A. & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (2nd ed.). Cambridge, MA: The MIT Press.

Ericsson, K. A. & Simon, H. A. (1998). How to study thinking in everyday life: Contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, and Activity, 5, 178–186.

Gal’perin, P. Ya. (1967). On the notion of internalization. Soviet Psychology, 5, 28–33.

Gal’perin, P. Ya. (1969). Stages in the development of mental acts. In M. Cole & I. Maltzman (Eds.), A handbook of contemporary Soviet psychology. New York, NY: Basic Books.

Gass, S. M. & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah, NJ: Lawrence Erlbaum.

Holunga, S. (1994). The effect of metacognitive strategy training with verbalization on the oral accuracy of adult second language learners. PhD Dissertation, University of Toronto (OISE), Toronto.

Huang, L. (2004). A little bit goes a long way: The effects of raising awareness of strategy use on advanced adult second language learners’ strategy use and oral production. PhD Dissertation, University of Toronto (OISE), Toronto.

Lantolf, J. (Ed.). (2000). Sociocultural theory and second language learning. Oxford: OUP.

Lantolf, J. (2003). Intrapersonal communication and internalization in the second language classroom. In A. Kozulin, V. Ageev, S. Miller, & B. Gindis (Eds.), Vygotsky’s educational theory in cultural context (pp. 349–370). Cambridge: CUP.

Lantolf, J. & Appel, G. (Eds.). (1994). Vygotskian approaches to second language research. Norwood, NJ: Ablex.

Luria, A. R. (1959). The directive function of speech in development and dissolution. Word, 15, 341–352.

Luria, A. R. (1973). The working brain. New York, NY: Basic Books.

Mackey, A. (2002). Beyond production: Learners’ perceptions about interactional processes. International Journal of Educational Research, 37, 379–394.

Nabei, T. (2002). Recasts in classroom interaction: A teacher’s intention, learners’ awareness, and second language learning. PhD Dissertation, University of Toronto (OISE), Toronto.

Negueruela, E. (2003). A sociocultural approach to teaching and researching second languages: Systemic-theoretical instruction and second language development. PhD Dissertation, Pennsylvania State University, State College, PA.

Newman, D., Griffin, P., & Cole, M. (1989). The construction zone: Working for cognitive change in school. Cambridge: CUP.

Salomon, G. (Ed.). (1993). Distributed cognitions: Psychological and educational considerations. Cambridge: CUP.

Searle, J. R. (2002). Consciousness and language. Cambridge: CUP.

Smagorinsky, P. (1998). Thinking and speech and protocol analysis. Mind, Culture, and Activity, 5, 157–177.

Smagorinsky, P. (2001). Rethinking protocol analysis from a cultural perspective. Annual Review of Applied Linguistics, 21, 233–245.

Sokolov, A. (1972). Inner speech and thought. New York, NY: Plenum Press.

Swain, M. (2000). The output hypothesis and beyond: Mediating acquisition through collaborative dialogue. In J. P. Lantolf (Ed.), Sociocultural theory and second language learning (pp. 97–114). Oxford: OUP.

Swain, M. (2001). Examining dialogue: Another approach to content specification and to validating inferences drawn from test scores. Language Testing, 18, 319–346.

Swain, M. (2005). The output hypothesis: Theory and research. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 471–484). Mahwah, NJ: Lawrence Erlbaum.

Swain, M. & Lapkin, S. (1998). Interaction and second language learning: Two adolescent French immersion students working together. Modern Language Journal, 82, 320–337.

Swain, M. & Lapkin, S. (2002). Talking it through: Two French immersion learners’ response to reformulation. International Journal of Educational Research, 37, 285–304.

Swain, M. & Lapkin, S. (in press). “Oh, I get it now!” From production to comprehension in second language learning. In D. M. Brinton & O. Kagan (Eds.), Heritage language acquisition: A new field emerging. Mahwah, NJ: Lawrence Erlbaum.

Talyzina, N. (1981). The psychology of learning. Moscow: Progress Press.

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes (M. Cole, V. John-Steiner, S. Scribner, & E. Souberman, Eds.). Cambridge, MA: Harvard University Press.

Vygotsky, L. S. (1986). Thought and language. Cambridge, MA: The MIT Press.

Vygotsky, L. S. (1987). The collected works of L. S. Vygotsky. Volume 1: Thinking and speaking. New York, NY: Plenum Press.

Vygotsky, L. S. (1997). The collected works of L. S. Vygotsky. Volume 4: The history of the development of higher mental functions. New York, NY: Plenum Press.

Wells, G. (1999). Dialogic inquiry: Toward a sociocultural practice and theory of education. Cambridge: CUP.

Wertsch, J. V. (1980). The significance of dialogue in Vygotsky’s account of social, egocentric, and inner speech. Contemporary Educational Psychology, 5, 150–162.

Wertsch, J. V. (1985). Vygotsky and the social formation of mind. Cambridge, MA: Harvard University Press.

Wertsch, J. (1991). Voices of the mind: A sociocultural approach to mediated action. Cambridge, MA: Harvard University Press.

Wertsch, J. & Tulviste, P. (1996). L. S. Vygotsky and contemporary developmental psychology. In H. Daniels (Ed.), An introduction to Vygotsky (pp. 53–74). London: Routledge.


Functional grammar

On the value and limitations of dependability, inference, and generalizability

Diane Larsen-Freeman
University of Michigan

In this paper, I address the themes of this volume from a functional grammar perspective. I explain the importance of being able to make inferences about the form, meaning, and use of grammar structures that rest on dependable data. Multiple data sources increase the chances that the data are dependable, i.e., consistent from one context to the next.

However, even by using complementary data sources, it is still the case that inferences can only be partial and provisional. In addition, it is impossible to define an optimal level of generalizability for our inferences apart from the purpose to which the generalization is to be put and the particular audience for which it is intended. Then, too, for certain applied linguistic purposes, optimal levels of generalizability cannot be foreordained, but, instead, will have to be negotiated.

Introduction

I bring a functional grammar perspective to bear on the way people mobilize their grammatical resources in order to make meaning, to maintain the flow of information, to manage interpersonal relationships, and to position themselves socio-politically, among other things. While the findings of any linguistic investigation must always be provisional and partial, for reasons I discuss below, it is important nevertheless to attempt to make claims generalizable, within limits. Certainly I would want any grammatical explanation to apply beyond the specific data from which it was inferred. If claims were limited to a single data set, they would not be very useful in accomplishing the purposes to which I, as an applied linguist, want to put them, which are:

– to inform the identification of the language acquisition/learning challenge of language students (what is the nature and challenge of that which is being learned/acquired?)

– to better understand the various processes contributing to, or interfering with, meeting the learning/acquisition challenge

– to adopt pedagogical strategies, to design materials, and to educate teachers, all informed by an understanding of the learning challenge and learning processes.

I wrote “inform the identification of the learning/acquisition challenge,” not simply “identify,” because the nature of the challenge, the processes, and the strategies depend as much on who the learners are as on what it is that they are learning. In other words, the learning challenge is not purely a linguistic one. For this reason, I will illustrate how the two foci – the “what” of the object of learning and the “who” of the learner – come together later on in this chapter.

In order to study the “what” or functional grammar, my students and I have used contextual analysis (Celce-Murcia 1980), a research methodology that employs complementary procedures to investigate the patterned use of grammatical structures. In this chapter, I first discuss the object of study – what I seek to construct knowledge about – then I turn to the methodology that is used to create the knowledge. Following this, I expand upon my assertion above – that any claims the methodology yields must be provisional and partial. Finally, I consider the difficult issue of what the optimal level of generalizability for applied linguistics research ought to be.

The object and purpose of the research

The research I describe in this chapter seeks to address three questions:

– How is a particular linguistic structure formed, i.e., its morphosyntax and its phonology, when the latter is grammatically relevant? The answer to this question not only includes paradigmatic grammatical relations, but also the syntagmatic relationship to items that precede or follow it in the discourse.

– What does it mean, i.e., its semantics? Another way to pose this question is to ask what the “essential” or referential meaning of the structure is, beyond its use in a particular context.

– When/why is it used, i.e., what are the salient pragmatic factors governing its use in particular contexts? Why is this form used and not another form that has a similar meaning? To answer this question, one must examine the role that the structure under investigation plays in the overall discourse. For example:

Does it initiate, terminate, or continue episodes?
Does it contribute to information flow, thematic structure, discourse coherence or cohesion?
Does it signal affective factors or represent socio-interactional factors that are present in the context?
Etc.

In applied linguistics, often a binary distinction is made between form and function or between form and meaning, rather than the ternary one I have just proposed. I think that binary distinctions overlook an important dimension to learning and using grammar, which I try to account for with a tripartite framework. For example, students of English can learn that -ed is used as a past tense form to mark a completed action, but knowing both form and meaning will be insufficient in situations of use where speakers are forced minimally into choosing between using the present perfect or the past for this same meaning,

I have recently returned from Cyprus.
I recently returned from Cyprus.

which, in American English, are both permissible sentences. The choice here presumably depends on one’s orientation to the event (see, for example, Larsen-Freeman, Kuehn, & Haccius 2002). My approach thus rests on a fundamental linguistic principle: the linguistic system is holistic, a choice exists among linguistic forms, and when one form is chosen over another, there are always semantic or pragmatic consequences.

To illustrate the framework in which this research is conducted, consider the case of the existential there in English.

How is it formed?
Existential there is a single, invariant, free morpheme, which occupies the subject position in an English sentence. The logical subject, the one that governs the verb, follows the verb, which is frequently a form of be. Following the logical subject is often, though not always, a locative adverbial. Existential there is unstressed, and therefore phonologically reduced, which is one of the ways to distinguish it from the pro-adverbial there.

What does it mean?
It is called the existential there because it establishes a mental space in which some entity exists or is to be located (Celce-Murcia & Larsen-Freeman 1999). It, therefore, has a different (although, of course, related – see Lakoff 1987) meaning from the pro-adverbial there, which is used deictically with physical space.

When or why is it used?
It has a presentational function; it brings an element into awareness (Langacker 1991). By filling the subject slot with there, a speaker can put an entire proposition in the end-focus position in a sentence, a spot reserved for new information. In this way, its function is to establish or to maintain the thematic focus.

This three-dimensional analysis is the product of an inferential process linguists and applied linguists undertake based on the language performance of speakers of English. For applied linguistics purposes, as I wrote earlier, not only the “what,” but also the “who” must be considered in order to inform the identification of the learning challenge. One implication of this assertion is that a parallel inferential process should be conducted with language learner data. Inferences based on the language performance of both native speakers and learners are always subject to review, of course, either on their own terms or as additional data are gathered.

A brief, selective treatment of the literature on English language learner performance, taking into account only one learner factor – the native language of learners – suggests that speakers of topic-comment languages, Japanese, for example, transfer the topic-comment structure of their native language to their English interlanguage, thus avoiding the use of there where speakers of English would use it. In other words, instead of saying

There are 27 students in Taro’s school.

They say

*Taro’s school is 27 students.
*Taro’s school students are 27.
*In Taro’s school students are 27.

(Examples from Sasaki 1990, cited in Celce-Murcia & Larsen-Freeman 1999)

Japanese speakers of English also make use of the English verb have, which allows them to preserve the topic-comment word order of Japanese. While the form that results from the application of this strategy is not inaccurate, it could be pragmatically inappropriate, depending on the context.

?Taro’s school has 27 students.

Another problem for speakers of topic-comment languages is the formation of “pseudo-relatives” (Yip 1995). For example, Chinese students use the existential there in a way that makes it appear that they have difficulty with English relative clauses (Schachter & Rutherford 1979:3). They say

*There were a lot of events happen in my country.

However, Rutherford (1983) infers that such ungrammatical learner utterances do not stem from failure to use a relative pronoun or a participle form, but rather can follow from learners’ incorrectly perceiving there to be a functional equivalence between there and a topic introducer in Mandarin Chinese.

Finally, from the questions they ask, we know that most ESL/EFL students are perplexed by the fact that two different sentence constructions are possible in English, which appear to have the same meaning (although, of course, they have entirely different uses).

There are 27 students in Taro’s school.
27 students are in Taro’s school.

Inferring from the students’ linguistic behavior and their questions, and invoking the challenge principle (Larsen-Freeman 1991, 2003), we are left to conclude that for these students, it is the use of existential there that affords the greatest long-term learning challenge. Of course, students need to learn its form and its meaning as well, but since it is impossible to teach everything about a given structure, let alone to teach it at one point in time, our conclusion tells us where to direct students’ attention in an attempt to be maximally effective with the limited instructional time we have available. It will also inform the choice of pedagogical strategies. For example, our analysis suggests that showing students a picture and asking them to make sentences about what they see using there is not an effective strategy for addressing the learning challenge most students will confront, which is to learn to use there to introduce new information. This is because if the students and teacher are looking at the same picture, students would be using there to make sentences about known information, and such a teaching strategy would therefore mislead students about the use of there.

In terms of teacher education, teacher learners need to be able to answer the three questions about the form, the meaning, and the use of there. Furthermore, in order to take advantage of learners’ cognitive abilities, teachers might think in terms of reasons, not rules (Larsen-Freeman 2000). For example, the well-known rule that states that the logical subject, which follows the verb in sentences with existential there, is indefinite can easily be explained not as an arbitrary rule of form, but as a reason: the logical subject is indefinite because it conveys new information.

I have gone to some length to illustrate what it is that I do and the purposes to which the results are put. It is time now to be explicit about the general themes of this volume. For the research program that I am describing in this chapter, generalizability is desirable. Researchers using this approach seek to be able to make inferences – such as the function of X is Y, or X collocates with Y, or X means Y – based on performance data that are dependable, i.e., that are consistent from one instance to the next.

A research methodology for functional grammar

Contextual analysis is a research methodology that is used to examine written and spoken data in an attempt to account for the form, the meaning, and the use of linguistic forms in texts. To my knowledge, the research has only examined native English speaker texts, but there is nothing in the methodology itself that would prohibit its being used to analyze other varieties of English, English as a lingua franca, or other languages. Indeed, given that many learners do not aspire to native English speaker performance, it would no doubt be desirable to have it apply to whichever language or language variety is to be learned.

There are five steps to the methodology (based on Celce-Murcia 1980, 1990):

1. Choose a linguistic structure to examine. One’s choice may be informed by observation of usage data, or by being unhappy with simplistic pedagogical “rules of thumb,” or by seeing one’s language students struggle with a particular structure, etc.

2. Review the literature. Read published linguistic and pedagogical accounts that others have written about the structure chosen.

3. Make use of intuitions. Intuitional data might consist of grammaticality judgments (if form is the issue) or of contrasting two made-up sentences or texts if meaning or use is the issue (cf. She isn’t much fun. She isn’t a lot of fun).

4. Observe the structure as it is used in natural written and spoken discourse of native speakers. Try to determine why a particular structure is used in a particular context and why it is used rather than some other structure with a similar meaning.

5. Test hypotheses generated from steps 2–4 by eliciting judgments or data samples from other speakers.

It can be seen, then, that because there are three primary, complementary data sources – intuitional, observational, and elicited – contextual analysis seeks to maximize the dependability of linguistic data from which patterns can be observed and for which generalizable explanations can then be inferred. I would like to turn next to examining the three data sources and what each contributes to this goal.

Intuitional data

Using speaker intuitions about constructed examples is a time-honored practice in linguistics research. In fact, since the rejection of behaviorism, introspecting about linguistic structures has been a most favored heuristic of linguists (Gass & Mackey 2000:10). Indeed, Gass and Mackey quote Bard, Robertson, and Sorace (1996:32) as stating that “For many linguists, intuitions about the grammaticality of sentences comprise the primary source of evidence for or against their hypotheses.”

However, intuitions can be unreliable or undependable when looking for typical patterns because humans tend to notice the unusual more than the typical (Biber, Conrad, & Reppen 1998), and because inferences based on one linguist’s intuitions may not be very generalizable for the purposes for which we would like to use our findings.

Further, although

. . . speaker intuitions about constructed examples are an invaluable tool, their use requires at least the following: an acceptance and appreciation of the cline of acceptability and interspeaker variability that is typically associated with such examples; an understanding of the nature of “deviance” from linguistic norms; and, most generally, some serious reflection on what such judgments actually tell us. But even with such judicious use, intuitions about constructed data cannot be treated as the sole, or even primary, source of evidence as to the nature and properties of the linguistic system.

(Barlow & Kemmer 2000:XV)

So although contextual analysis makes use of intuitional data, for the sake of dependability and generalizability, it complements them with data from two other sources.

Observational data

The source of the observational usage data varies depending on the structure being investigated. It would be appropriate, for instance, to examine conversational transcripts if one were interested in investigating English tag questions. On the other hand, if one wanted to look at the use of the passive voice in scientific writing, then a different source of data would clearly be warranted (Celce-Murcia 1980).

However, modality and genre of usage data are only two factors that are known to affect linguistic choice. Many other factors have been found to influence the meaning or use of linguistic forms, such as the setting, the relationship of speaker and listener or reader and writer, their characteristics (gender, age, educational level), the register of the discourse, whether speech is planned or unplanned, monologic or dialogic, etc. Ideally, for maximum dependability these factors should be systematically addressed as well.

This leads us to a critical problem when relying on observational data. Until recently, it was very difficult to collect sufficient examples to permit systematic sampling of factors that might affect the form, the meaning or the use of particular structures. As it was, researchers would pore over published transcripts of oral data or various types of written data searching for, and hand-counting, instances of the target structure. Given the tedious and time-consuming nature of the observational data collection process, it was clearly not realistic to expect that many of the factors known to affect linguistic choice could be systematically addressed. This limitation compromised the dependability of the data and limited the generalizability of any claims based on them.

More recently, the availability of large computer databases and tools has provided an alternative to such tedious procedures and non-representative data. These resources significantly facilitate the study of the patterned ways in which speakers use the grammatical resources of a language.

To briefly illustrate this point requires returning to the discussion concerning the existential there. In his search of a corpus of 450,000 words of spoken text and 300,000 words of written texts of modern, educated, British English speaker usage, Breivik (1981) was able to find very few instances of the type of sentence I gave earlier, where the logical subject with an indefinite determiner occupies subject position (27 students are in Taro’s school.). One example he did find, A headphone bar is also on the first floor, was in a text describing what one would find on various floors of an electronic equipment store.

The first floor houses the real heart of the store – the hi-fi departments – but there is much else here also. . . A headphone bar is also on the first floor.

This example is helpful for it contains both types of sentence we are interested in – sentences with and without the existential there used for a presentational function. From Breivik’s findings, exemplified in this text, we might infer that the paucity of sentences such as the final one in this excerpt can be attributed to the fact that they require that a scene already be established or presupposed to which they then contribute details. This requirement contrasts with the sentence that precedes it, which contains the existential there, which has no such requirement, and which here presumably asserts the existence of “much else,” thus setting the scene for the sentence without there. The frequency distribution of these two sentences in the corpus would be helpful information for our applied linguistic purposes, as it would provide support for inferring a presentational function for existential there.

Such inferences must always be provisional, however, subject to refutation as other data are considered, especially when the data include texts from more diverse speakers. While representativeness of a corpus is critical if we are seeking a comprehensive description of a structure, we also have to accept that any absolute comprehensiveness is impossible to achieve. Thus, inductive research, of the sort I have been describing, always seeks to corroborate, expand upon, or to refute that which has preceded it.

To illustrate the tenuous nature of linguistic explanations, consider the work of Kim (1995). Kim was interested in speakers’ motivation for using WH-clefts in English. Traditional accounts held that WH-clefts exist to put special focus on new information in the predicate. For example,

What that amounts to is that they don’t keep comparable books.

However, from Kim’s examination of spoken data, he was led to conclude that the traditional explanation was incomplete. Only three examples in his data supported this function uniquely. Many of the remaining 73 examples highlighted a speaker-oriented interactional function of WH-clefts, such as a speaker’s marking a topic shift for a listener

What you are saying reminds me of. . .

While in this example the WH-cleft still contrasts old and new information, the following example, drawn from Kim’s data, does not highlight the contrast between the types of information so much as it displays speaker affect

What I love is when they’re talking about something. . .

Kim’s investigation of data from face-to-face and telephone conversations, augmented by talk radio broadcasts and group therapy sessions, was important for expanding the inventory of uses for WH-clefts. However, Kim’s data were limited to 73 tokens of such clefts. By contrast, computerized corpora provide access to many more instances of a target structure. For example, a preliminary search of MICASE (Michigan Corpus of Spoken Academic Discourse), a 1.8 million-word database, developed, maintained, and made freely available to the research community by the English Language Institute of the University of Michigan, produced a list of 1052 potential WH-clefts. Admittedly this list contains some interlopers masquerading as clefts, which would have to be eliminated through a refined search (Larsen-Freeman 2004). Nevertheless, even a cursory analysis points to an additional function of WH-clefts in spoken academic discourse, that of prospectively or retrospectively framing instruction (John Swales, personal communication)

What we want to do today is. . .

a function not identified in previous research.

This observation itself points to a potential complication in the use of corpora to advance our research agenda. The larger number of instances of a target structure, which corpora give access to, increases the dependability or stability of observed patterns, but may also reveal a multiplicity of “lower level” findings. Each new corpus, having its own special features, may illustrate more specialized functions for a given structure or more variation in the structure itself. Of course, this is no reason not to continue to look for new functions or structural variation – just a caution that even if it were possible to account comprehensively for all the forms, meanings, and uses of a given structure, generating such an exhaustive list would not necessarily meet the needs of applied linguists. If, as applied linguists, we want findings that will usefully inform general, rather than narrowly focused specific purpose, SLA, pedagogy, and teacher education, a long list of forms, meanings, and functions for any one target structure would only be a starting point. The list would still have to be winnowed and shaped in order to manage its contribution to language learning (Widdowson 2000).
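A low-level search of the kind just described, one that also turns up interlopers masquerading as clefts, can be sketched in a few lines. The mini-corpus and the regular expression below are illustrative assumptions on my part, not MICASE or its actual search tools:

```python
import re

# A minimal, "low-level" concordance search: find candidate WH-clefts
# ("what ... is/was ...") in running text. The mini-corpus and pattern
# are invented for illustration.
text = (
    "What we want to do today is review the homework. "
    "I asked what time it was. "
    "What I love is when they're talking about something."
)

# Candidate clefts: a clause starting with "what" that contains a form of "be".
pattern = re.compile(r"\bwhat\b[^.?!]*?\b(?:is|was)\b[^.?!]*[.?!]", re.IGNORECASE)
hits = pattern.findall(text)

for h in hits:
    print(h)
# The middle hit ("what time it was.") is an embedded question, not a cleft --
# exactly the sort of interloper a refined search would have to weed out.
```

The false positive among the three matches illustrates why raw counts from such a search must still be inspected by hand.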

Of course, an advantage to starting with a large number of tokens of the target structure, which computer corpora of a certain size make available, is that statistical tests can be used. Such tests can help us to assess the likelihood of whether or not the observed difference or relationship could be due to chance (e.g., using the chi-square test) and then secondly to measure the strength of the relationship between the two variables (e.g., using an association coefficient to say whether or not there is a strong association, such as a collocation, between the two forms). To cite an example (I thank N. Ellis for bringing it to my attention), a chi-square test can provide evidence that charges occurs after bring more often than randomly, but it cannot tell us how strongly related the two words are even though the result may be statistically significant. By then testing the strength of the association using the uncertainty coefficient, one could find an estimate of the relative reduction of uncertainty for predicting that given the word bring, the word charges will follow. With an uncertainty coefficient close to 1, there is an indication that a strong association (collocation) has been found and generalizability claims are possible (G. Demetriou, [email protected]). Alternatively, one can calculate an MI (mutual information) score, which, if greater than 2, shows a substantial association between the two words (Kennedy 2003). Given a large enough data set, we could also consider the influence of multiple factors at the same time. For example, we could contrast the influence of the matrix verb (which Kim found noteworthy) with the type of speech event.
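The two-step procedure just described, a significance test followed by an association measure, can be sketched for the bring/charges example. The frequency counts below are invented purely for illustration; only the formulas follow standard practice (MI as the base-2 log of observed over expected co-occurrence, chi-square computed on the 2×2 contingency table):

```python
import math

# Hypothetical corpus counts, invented for illustration.
N = 1_000_000    # total word tokens in the corpus
f_bring = 400    # frequency of "bring"
f_charges = 250  # frequency of "charges"
f_pair = 40      # frequency of "charges" immediately after "bring"

# Mutual information: log2 of observed vs. chance co-occurrence.
expected = f_bring * f_charges / N
mi = math.log2(f_pair / expected)

# Chi-square on the 2x2 contingency table (the pair vs. everything else).
obs = [[f_pair, f_bring - f_pair],
       [f_charges - f_pair, N - f_bring - f_charges + f_pair]]
rows = [sum(r) for r in obs]
cols = [sum(c) for c in zip(*obs)]
chi2 = sum((obs[i][j] - rows[i] * cols[j] / N) ** 2 / (rows[i] * cols[j] / N)
           for i in range(2) for j in range(2))

print(f"MI = {mi:.2f} (greater than 2 suggests a collocation)")
print(f"chi-square = {chi2:.1f}")
```

With these invented counts the MI score comfortably exceeds 2, so bring + charges would be flagged as a collocation; the chi-square value, compared against the usual critical value of 3.84, confirms that the co-occurrence is unlikely to be chance.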

Thus, linguistic corpora can be enormously helpful in conducting contextual analyses by providing abundant examples of the target structure, thereby increasing the dependability or stability of inferred patterns across language samples, and, in tandem with statistics, may permit greater generalizability. However, at this time, concordancing programs are limited to “low-level” searches of the sort that I have just illustrated with bring and charges. It is not possible to investigate complex grammatical constructions unless you have programming skills (Biber, Conrad, & Reppen 1998:15) or access to an appropriately tagged corpus. Furthermore, corpus searches simply yield numbers of instances. Interpretation of the instances in order to identify patterns in the data is still necessary. Numbers do not guarantee insight. After all, sometimes a skillful microanalysis of a single text may yield significant insights into the form, the meaning, and the use of the target structure (see the Markee chapter in this volume).

To underscore this point, let me discuss some work by Biber, Conrad and Reppen (1998), who investigated “nearly equivalent” that-clauses and to-clauses from a lexico-grammatical perspective using approximately 4 million words of academic prose from the Longman-Lancaster Corpus and approximately 5 million words of conversation from the British National Corpus. Using computer programs to count which verbs occurred with each complement type, the researchers show that completely different sets of verbs most commonly control that-clauses versus to-clauses. That-clauses are linked with three matrix verbs – think, say and know – although believe, mean, and tell also commonly take that-clause complements. According to the researchers, their typical use is in reporting what was thought, felt, or said. In contrast, the researchers found that the most common verbs occurring with to-clauses come from a wider semantic domain. Two verbs are particularly common with to-clauses expressing desire (want and like), while a third is used to express effort (try).
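The kind of counting Biber, Conrad and Reppen automated can be sketched in miniature. The sentences and verb list below are invented for illustration and bear no relation to the Longman-Lancaster or BNC data:

```python
import re
from collections import Counter

# A toy version of the complement-type tally: count which matrix verbs
# precede that-clauses vs. to-clauses. Text and verb list are invented.
text = ("I think that it will rain. She said that he left. "
        "They want to leave. He tried to call. We know that it works.")

verbs = ["think", "said", "know", "want", "like", "tried"]
that_counts, to_counts = Counter(), Counter()
for verb in verbs:
    that_counts[verb] = len(re.findall(rf"\b{verb} that\b", text))
    to_counts[verb] = len(re.findall(rf"\b{verb} to\b", text))

print("that-clauses:", {v: n for v, n in that_counts.items() if n})
print("to-clauses:  ", {v: n for v, n in to_counts.items() if n})
```

Even in this toy sample the split the researchers report emerges: the mental-state and reporting verbs pattern with that-clauses, while the desire and effort verbs pattern with to-clauses.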

While these observations (and others that they go on to make) are helpful, they were not unknown before the use of computerized corpora. Years ago, Givón (1980) discussed the correlation between the matrix verb and the complement it takes. He observed that the main verbs (such as assume, imagine, know, understand, say) that take ordinary tensed that-clause complements mostly denote mental states or attitudes regarding the truth of the proposition in the complement clause. Thus, if someone were to say

I think that it will rain today.

the speaker is saying something about the nature of his/her belief, and its relative strength. Givón puts such verbs into the cognition-utterance category (Williams in Celce-Murcia & Larsen-Freeman 1999). Furthermore, Bolinger (1968) pointed out still earlier that verbs taking infinitive complements encode future unfulfilled projections, which it would seem the matrix verbs want, like, and try that Biber et al. point to do. Thus, while corpus linguistics offers a means for testing as well as for discovering linguistic principles, some insightful inferences have long endured and will likely continue to do so.

Another limitation of observational data, whether part of a computer corpus or not, is that they are attested data. Such data show how the linguistic code has been performed at some point in time; they do not show the potential of the code (Widdowson 1990). They also do not show what cannot occur, or what is aberrant if it does occur (Howard Williams, personal communication). For after all, spontaneously produced utterances provide only part of the picture (Gass & Mackey 2000). If one wants to obtain information about why speakers use grammar in the way that they do, it is essential to determine what learners think is possible in the language and what is not – to understand the limits of the system. All that is directly observable is what a learner produces in writing or speech. Such data do not show what is underlying the observable linguistic behavior. To get at this, elicited data are used.

Elicited data

It could be said that the procedures that I have described in this chapter so far fall into the hypothesis-formation category. Typically, by surveying the literature, consulting one’s intuitions, and examining observational data consisting of oral and/or written discourse, researchers are able to come to some tentative answers to the questions posed. However, these still must be subject to tests designed to tap speaker intuitions.

Eliciting data permits one to narrow one’s hypothesis space by carefully manipulating the factors that one wants to test. For example, Williams (1996) tested the hypothesis that nevertheless is more restricted in its use than however. While however can be used almost generically wherever attention is drawn to difference, nevertheless requires a situation where one is led to expect one thing but finds something different to be true (Celce-Murcia & Larsen-Freeman 1999). In order to test this inference, Williams designed a questionnaire that asked 30 native speakers of English to indicate their acceptance of nevertheless or however in a set of sentences he constructed. From their responses, Williams found strong acceptance for nevertheless only for certain sentences, such as the first one in the following pair:

John has always been a top math student.
______________ he failed calculus this quarter. (83 percent acceptance)
John has always been a top math student.
______________ he failed history this quarter. (20 percent acceptance)

What this suggests is that nevertheless is more restricted in its use than however, an observation supported by the fact that within the MICASE corpus, there were 214 instances of however in spoken academic discourse, but only 13 instances of nevertheless. More in-depth analysis of the actual instances would have to be undertaken, of course, to see if the distinction that is reported above holds. Then, too, the elicitation instrument itself should be validated and its results subject to statistical tests to ensure that any inferences that would be drawn are supported.
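As a sketch of the sort of statistical check suggested above, the two reported acceptance rates can be compared with a two-proportion z-test. The choice of test and the reconstructed counts are assumptions on my part, derived only from the percentages reported for Williams's 30 judges:

```python
import math

# Counts reconstructed from the reported 83% and 20% acceptance rates
# among 30 native-speaker judges; the z-test is a standard check, not
# necessarily the test Williams himself applied.
n = 30
x1 = round(0.83 * n)  # acceptances of "nevertheless" before "failed calculus"
x2 = round(0.20 * n)  # acceptances of "nevertheless" before "failed history"

# Two-proportion z-test with a pooled estimate.
p_pool = (x1 + x2) / (2 * n)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (x1 / n - x2 / n) / se

print(f"z = {z:.2f}")  # |z| > 1.96 suggests the difference is not chance (p < .05)
```

Even with only 30 judges, a gap this large comfortably clears the conventional threshold, which is consistent with treating the calculus/history contrast as a real restriction on nevertheless rather than sampling noise.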

On the provisionality and partiality of linguistic inferences

Even by using three complementary data sources, by assuring the validity of the instruments, and by subjecting the results that they yield to the proper statistical tests, it is still the case that any linguistic inference must be provisional for the reason stated earlier: Any inference achieved through an inductive process is always subject to refinement by a counterexample. In addition to being provisional, any linguistic inference is also partial. There are several reasons for this. First of all, language usage is not homogeneous. What will be elicited on a given occasion with a given instrument varies with co-text and context, and it is impossible to be comprehensive in accounting for all the possible permutations of these. Moreover, even the largest possible database is selective. Not every instance of language use has been recorded and is computer-searchable. Then, too, we cannot with any assurance draw conclusions about the grammar of an individual from usage facts about communities (Newmeyer 2003). Modal tendencies can be revealed, but the decision to conform or to innovate is always left up to the individual. Finally, language is constantly on the move, constantly changing, and that part which is pre-systematic is likely to be overlooked.

To expand upon this last point, it could be said that all extant methods, contextual analysis included, give us only a record of attested usage at one point in time. They cannot tell us what transpired in the language up until this point, nor where it is destined. While this may seem obvious and forgivable, as the system exhibits stability as well as mutability (Givón 1999), we often underestimate the latter. We therefore miss the perpetually changing, perpetually dynamic nature of language (Larsen-Freeman 1997). As I have argued elsewhere, applied linguistics would no doubt be well served by thinking of language and its learning in terms of dynamic systems, abundant examples of which occur in the natural world (Larsen-Freeman 2003). When we conceive of language as a static system, we contribute to "the inert knowledge problem" (Whitehead 1929, in Larsen-Freeman 2003). We teach language (grammar) as if it were a static system, and so it becomes one for language learners, who find that they cannot apply what they have learned to novel situations, which is surely the ultimate test of whether something has been usefully acquired.

Optimal level of generalizability

In this chapter, I have staked out the position that generalizability is desirable – that what we would like is to have claims that apply beyond particular instances. While this is true enough, it is important to recognize that there are limits to generalizability as well, at least for applied purposes. What we should be seeking is an optimal level of generalizability, a level admittedly difficult to define apart from the purpose to which it is to be put. For example, linguists have pointed out that the -ed morpheme occurs in a number of environments; it is not simply a past tense marker. It occurs, for instance, as a past participle, thus in perfective aspect and passive voice as well, in backshifted verbs in the complement clauses of reported speech, as a participial adjective, and in imaginative conditionals, etc. Inferring a meaning underlying all these instances, Knowles (1979) suggests that -ed signals remoteness. While this is an enlightening linguistic observation, a generalization this broad may not be especially helpful for learners of English, or only so for learners who have achieved a certain level of proficiency.

The price for increased generalization is increased abstraction. While some linguists pursue the broadest possible generalizations, which are therefore necessarily abstract (for example, Chomsky's 1995 effort to account for the structure of all the world's languages with a minimal number of principles), it is difficult to see how such abstract principles are useful for all the purposes to which applied linguists wish to put them.

Conversely, of course, linguistic principles inferred from speakers' performance can be undergeneralized. With so many factors affecting linguistic performance, one may have to resort to statements such as the following: "Structure X is used by female Standard English speakers (ages 45–50) in their conclusions to commencement addresses delivered at major research universities in North America during the years 2000–2005." While the need for such particularizability may be exaggerated, the point is that just as generalizations can be too broad for applied linguistic purposes, they can also be too narrow. Whereas parsimony (few broad generalizations) or, alternatively, comprehensiveness (many narrow generalizations) may be desirable from a theoretical perspective, from an applied perspective it is hard to see the value of either extreme. The optimal level of generalizability of a linguistic inference is difficult to define with any precision. What we are left with is "not so much as to become too abstract, not so little as to put an unnecessary learning burden on the student."

Perhaps we need to take a lesson from Marr (1982). Marr has pointed out for vision that the same object may be represented at various levels of detail. As a consequence, one cannot simply talk about a perceived object possessing some property. Instead, one must talk about whether, given a certain level of detail, it is seen to have this property. By the same token, from an applied linguistics perspective, one must always inquire as to the purpose to which an explanation is to be put in order to determine the level of detail at which generalizability should be aimed.

Of course, the optimal level can never be entirely foreordained. It is in the nature of teaching and learning that the optimal level potentially arises out of negotiations between the teacher and students, in a way that past understanding is taken into account and built upon. Still, for materials developers and others who must prepare decontextualized explanations for explicit teaching purposes, questions about the optimal level of generality persist.

To make the point another way, and to introduce another concern, this one having to do with the unit of analysis, let me digress for a moment to consider the fact that an innovation of computer corpora is that they can be searched to reveal the patterns that can be associated with particular lexical items. An example from Hunston and Francis (2000) involves the noun matter. It turns out that matter is often preceded by an indefinite article and followed by the preposition of and a gerund: for example, a matter of developing skills, a matter of learning a body of information, a matter of becoming able to . . . There is, therefore, little point in treating matter as a single lexical item that can be slotted into a general grammar of English. Rather, the word matter comes with attendant phraseology.
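The pattern Hunston and Francis describe is concrete enough to search for mechanically. The sketch below uses a regular expression to pull "a matter of + -ing" sequences out of running text; the pattern and the sample sentences are illustrative assumptions, not drawn from their corpus or their method.

```python
import re

# Find "a matter of <gerund>" sequences, the phraseology Hunston and Francis
# (2000) report for the noun "matter". The sample text is invented for
# illustration; a real study would run this over corpus files.
PATTERN = re.compile(r"\ba matter of (\w+ing)\b", re.IGNORECASE)

sample = ("Success is a matter of developing skills. "
          "For many, it is a matter of learning a body of information, "
          "or simply a matter of practice.")

matches = PATTERN.findall(sample)
print(matches)   # ['developing', 'learning']
```

Note that the third clause ("a matter of practice") falls outside the pattern, which is exactly the kind of near-miss that makes purely mechanical searches a starting point rather than an analysis.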

While this is of interest, particularly if one is pursuing a lexical approach to language acquisition and language pedagogy, it is hard to imagine constructing a coherent research agenda or a comprehensive syllabus based on single lexical items and phrases unless they are somehow restricted to high-frequency lexicogrammatical units or to units that are broader in scope. Therefore, in this chapter, I have appealed to traditional linguistic categories: verb tense/aspect, tag questions, passive voice, existential there, and WH-clefts. While these are broader and commonly recognized, and therefore units of convenience, they are not necessarily psychologically real for learners. It is not clear that language students segment the target language in the same way. If not, then whatever can be inferred from native speaker usage data might not be particularly illuminating when analyzing interlanguage data. As has been pointed out before, it may well be that learners' emic perceptions of the target depart radically from any etic linguistic categories.


Conclusion

I have argued in this chapter that researchers in functional grammar should work to ensure that their inferences are drawn from dependable data. One way to do this is to utilize complementary data sources. However, because inferences based on the data are arrived at inductively, they are always subject to counterexamples. Further, it is highly likely that counterexamples will arise in language performance data, since the data that are examined are always influenced by a number of speaker and contextual factors, are always a subset of the whole, and are always changing.

I have also made a case for the need for generalizability of linguistic inferences for second language acquisition, language pedagogy, and teacher education purposes. In order for inferences to be useful for applied linguistic purposes, they have to address the "who," not only the "what," and they must be generalized at a level that is optimal for the purpose of the explanation. In addition, for pedagogical purposes, the level of generalizability should be negotiated between teacher and students so as to be neither too abstract nor too particularized for a given group of learners. Finally, I have pointed out that truly useful data would be those that have both psychological and linguistic validity for speakers and learners of a language.

References

Bard, E., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72(1), 32–68.
Barlow, M., & Kemmer, S. (Eds.). (2000). Usage based models of language. Stanford, CA: CSLI Publications.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: CUP.
Bolinger, D. (1968). Entailment and the meaning of structures. Glossa, 2(2), 119–127.
Breivik, L. (1981). On the interpretation of existential there. Language, 57(1), 1–25.
Celce-Murcia, M. (1980). Contextual analysis of English: Application to TESL. In D. Larsen-Freeman (Ed.), Discourse analysis in second language research (pp. 41–55). Rowley, MA: Newbury House.
Celce-Murcia, M. (1990). Data-based language analysis and TESL. In J. E. Alatis (Ed.), Proceedings of Georgetown University roundtable on languages and linguistics 1990 (pp. 245–259). Washington, DC: Georgetown University Press.
Celce-Murcia, M., & Larsen-Freeman, D. (1999). The grammar book: An ESL/EFL teacher's course (2nd ed.). Boston: Heinle & Heinle.
Chomsky, N. (1995). The minimalist program. Cambridge, MA: The MIT Press.
Gass, S., & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah, NJ: Lawrence Erlbaum.
Givón, T. (1980). The binding hierarchy and the typology of complements. Studies in Language, 4(3), 333–377.
Givón, T. (1999). Generativity and variation: The notion of "rule of grammar" revisited. In B. MacWhinney (Ed.), The emergence of language (pp. 81–114). Mahwah, NJ: Lawrence Erlbaum.
Hunston, S., & Francis, G. (2000). Pattern grammar. Amsterdam: John Benjamins.
Kennedy, G. (2003). Amplifier collocations in the British National Corpus: Implications for English language teaching. TESOL Quarterly, 37(3), 467–487.
Kim, K.-H. (1995). WH-clefts and left-dislocation in English conversation: Cases of topicalization. In P. Downing & M. Noonan (Eds.), Word order in discourse (pp. 245–296). Amsterdam: John Benjamins.
Knowles, P. (1979). Predicate markers: A new look at the English predicate system. Cross Currents, VI(2), 21–36.
Lakoff, G. (1987). Women, fire, and dangerous things. Chicago, IL: Chicago University Press.
Langacker, R. (1991). Foundations of cognitive grammar. Volume 2. Stanford, CA: Stanford University Press.
Larsen-Freeman, D. (1991). Teaching grammar. In M. Celce-Murcia (Ed.), Teaching English as a second or foreign language (2nd ed., pp. 279–296). New York, NY: Newbury House/HarperCollins.
Larsen-Freeman, D. (1997). Chaos/complexity science and second language acquisition. Applied Linguistics, 18(2), 141–165.
Larsen-Freeman, D. (2000). Grammar: Rules and reasons working together. ESL/EFL Magazine, January/February, 10–12.
Larsen-Freeman, D. (2003). Teaching language: From grammar to grammaring. Boston, MA: Heinle & Heinle.
Larsen-Freeman, D. (2004). Enhancing contextual analysis through the use of linguistic corpora. In J. Frodesen & C. Holten (Eds.), The power of context in language teaching and learning. Boston, MA: Heinle & Heinle.
Larsen-Freeman, D., Kuehn, T., & Haccius, M. (2002). Helping students make appropriate English verb tense-aspect choices. TESOL Journal, 11(4), 3–9.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York, NY: W. H. Freeman and Company.
Michigan Corpus of Spoken Academic English. English Language Institute, University of Michigan, Ann Arbor. Website: www.hti.umich.edu/m/micase
Newmeyer, F. (2003). Grammar is grammar and usage is usage. Presidential address delivered at the Annual Meeting of the Linguistic Society of America, Atlanta, January.
Rutherford, W. (1983). Language typology and language transfer. In S. Gass & L. Selinker (Eds.), Language transfer in language learning (pp. 358–370). Rowley, MA: Newbury House Publishers.
Sasaki, M. (1990). Topic prominence in Japanese EFL students' existential constructions. Language Learning, 40(3), 333–368.
Schachter, J., & Rutherford, W. (1979). Discourse function and language transfer. Working Papers in Bilingualism, 19, 1–12.
Whitehead, A. N. (1929). The aims of education. New York, NY: MacMillan.
Widdowson, H. (1990). Aspects of language teaching. Oxford: OUP.
Williams, H. (1996). An analysis of conjunctive adverbial expressions in English. PhD dissertation, University of California, Los Angeles.
Yip, V. (1995). Interlanguage and learnability: From Chinese to English. Amsterdam: John Benjamins.


A conversation analytic perspective on the role of quantification and generalizability in second language acquisition

Numa Markee
University of Illinois at Urbana-Champaign

This chapter develops a methodological critique of quantitative, experimental approaches to input and interaction in mainstream, cognitive SLA from the qualitative perspective of ethnomethodological conversation-analysis-for-second-language-acquisition (CA-for-SLA). The chapter illustrates the substantive and methodological insights that may be gained from using a single, deviant case analysis approach to understand how language learning behavior is organized. This analysis also highlights issues such as the role of inferencing in the interpretation of data and problematizes the extent to which mainstream SLA studies are in a position to make valid generalizations about the function and organization of repair in language learning.

Introduction

I begin by reviewing the literature on the Interaction Hypothesis in second language acquisition (SLA) studies and then develop a methodological critique of experimental approaches to SLA on the basis of insights drawn from the field of conversation analysis (CA). The empirical analysis that follows, of a learner's use of a rare, possibly unique, example of a particular type of Counter-Question in SL classroom talk, highlights some of the substantive and methodological insights that may be gained from using a single case analysis approach to "conversation-analysis-for-second-language-acquisition" (CA-for-SLA) (Markee 2005). This analysis also highlights issues such as the role of inferencing in the interpretation of data and problematizes the extent to which SLA studies are in a position to make valid generalizations about the function and organization of repair in language learning. I conclude by outlining the prospects for a grounded experimental approach to the study of repair in SLA.

The interactionist hypothesis: An overview

This chapter aims to offer a social constructivist (and hopefully constructive) critique of how we conceptualize, and do, SLA research on social interaction. Let me first acknowledge the depth, breadth, and dynamism of what – for want of a better description – I will call "mainstream" SLA studies during the past 25 years. For present purposes, the story begins with Hatch (1978), who argued that "one learns how to do conversation, one learns how to interact verbally and out of this interaction syntactic structures are developed" (p. 404). This remarkable statement of the Discourse Hypothesis (see also the work of Peck 1978, 1980; Sato 1986, 1988) continues to influence SLA studies to this day. More specifically, drawing on Hatch's ideas, Krashen (1980, 1981, 1982, 1985) developed the first theory of SLA (Monitor Theory), whose most important and enduring tenet was the Input Hypothesis. That is, Krashen suggested that learners learn SLs by being exposed to language that is slightly beyond their current level of competence – so-called "comprehensible input," or "i+1."

In Krashen's work, i+1 is a static concept: it washes over learners, who pick up new language from contextual clues. However, in Long's work on the Interaction Hypothesis, comprehensible input is something that learners actively have to get for themselves (1980, 1981, 1983a, 1983b, 1985a). They do this by initiating a variety of conversational repairs with their native speaker (NS) or non-native speaker (NNS) interlocutors (Long 1983a, 1983b; Varonis & Gass 1985a, 1985b). Repair categories include comprehension checks, clarification requests, confirmation checks, and a variety of less commonly used categories, such as verifications of meaning, definition requests, and expressions of lexical uncertainty (Porter 1986). The function of repairs is to make initially incomprehensible talk progressively more understandable to learners as they attempt to negotiate meaning (Pica, Doughty, & Young 1986).

Repairs occur more frequently in NS-NNS talk than in native speaker-native speaker (NS-NS) interaction (Ellis 1985; Long 1980, 1981b, 1983b; Pica & Doughty 1985), and more frequently still in non-native speaker-non-native speaker (NNS-NNS) talk (Long & Porter 1985). Furthermore, two-way tasks promote more repairs than one-way tasks (Doughty & Pica 1986; Long 1980; Pica 1987). Convergent tasks also trigger more repairs than divergent tasks (Duff 1986). Moreover, unfamiliar tasks promote more repairs than familiar tasks, and unfamiliar interlocutors repair their talk more often than familiar conversational partners do (Gass & Varonis 1984). In addition, more repairs occur in groups made up of people from different L1 backgrounds (Varonis & Gass 1985b). Mixed-gender groups engage in more repairs than same-gender groups (Gass & Varonis 1986; Pica, Holliday, Lewis, & Morgenthaler 1989; see also Long 1989, 1990; Long & Porter 1985; Pica 1992 for reviews). And more repairs occur in groups made up of mixed proficiency levels than in groups in which learners are of the same proficiency (Yule & McDonald 1990). Note also that classroom-oriented research by Foster (1998) suggests that the occurrence of repairs may not be a function of task type so much as of whether learners are put into pairs rather than small groups. According to this latter scenario, there is a great deal of variation at the individual level in whether repairs are initiated at all. When repair initiations do occur, they seem to occur more frequently in dyads than in small groups.
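Frequency contrasts like those reviewed above (e.g., repairs in dyads vs. small groups) are typically evaluated with a contingency-table chi-square. The sketch below shows the arithmetic on a 2x2 table; the counts are invented for illustration and come from none of the studies cited.

```python
# 2x2 chi-square test of independence, the usual way repair-frequency
# contrasts (e.g., dyads vs. small groups) are evaluated.
# All counts below are INVENTED for illustration only.

def chi_square_2x2(a, b, c, d):
    """Rows: condition (dyad/group); columns: turns with/without repair.
    Standard 2x2 formula, no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical: dyads: 40 repair turns / 160 other; groups: 20 / 180.
chi2 = chi_square_2x2(40, 160, 20, 180)
print(round(chi2, 2))   # ~7.84
print(chi2 > 3.84)      # exceeds the .05 critical value (df = 1) -> True
```

As Schegloff's (1993) critique, discussed below, makes clear, such a test is only as meaningful as the decisions about what counts as a "repair" and what counts as an opportunity for one.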

The opportunity to plan also seems to lead to greater negotiation (Crookes 1989; Foster & Skehan 1996; Mehnert 1998). Furthermore, task complexity, structure, and processing load all have an impact on learners' performance (P. Robinson 1995; Skehan & Foster 1997). This body of research, combined with related work on language produced during small group work as opposed to lockstep work (Bygate 1988; Pica & Doughty 1985; Long, Adams, McLean, & Castaños 1977), has culminated in various empirically based proposals for task-based language teaching (Long 1985b, 1989, 1991; Long & Crookes 1992, 1993; Nunan 1993; Pica, Kanagy, & Falodun 1993).

A key extension of the Input Hypothesis has been proposed by Swain (1985), who argues that learners also need to produce comprehensible output in order to move on from merely getting the semantic gist of what is being said to producing new language that is syntactically analyzed. Krashen's rebuttal of the output hypothesis notwithstanding (Krashen 1989), this position has received considerable empirical support in recent years (see Carroll & Swain 1993; Gass & Varonis 1994; Kowal & Swain 1994; Pica 1987, 1992; Pica, Holliday, Lewis, & Morgenthaler 1989; Shehadeh 1999; Swain & Lapkin 1995). Comprehensible output is currently thought to serve three main functions: (1) it promotes the "noticing" of new linguistic forms by learners; (2) it enables learners to test hypotheses about how the SL works; and (3) it also serves the metalinguistic function of allowing learners to control and internalize linguistic knowledge (Swain 1995).

In its latest version, the Interaction Hypothesis (IH) focuses on the role of attention, awareness, a focus on form, and the function of negative feedback in SLA:


It is proposed that environmental contributions to acquisition are mediated by selective attention and the learner's developing L2 processing capacity, and that these resources are brought together most usefully, though not exclusively, during negotiation for meaning. Negative feedback obtained through negotiation work or elsewhere may be facilitative of L2 development, at least for vocabulary, morphology, and language-specific syntax, and essential for learning certain specifiable L1-L2 contrasts. (Long 1996:414; emphasis in the original)

More specifically, contrary to the position adopted by Krashen (1985, 1989) and VanPatten (1988) that learning is a sub-conscious process, Kormos (2001), Schmidt (1990, 1993, 1994), and Schmidt and Frota (1986) suggest that adult SL learners must consciously notice or "apperceive" (Gass 1988, 1997) new language forms in the input in order for them to become available for learning. Support for this position is provided by theoretical and empirical studies on enhanced input (Doughty 1991; Sharwood-Smith 1991, 1993). This work suggests that subjects who focus on both form and meaning do better than learners who focus on isolated grammatical forms. Support for this position is also provided by classroom research (see, for example, Harley 1989; Lightbown & Spada 1990; Tomasello & Herron 1988; White, Spada, Lightbown, & Ranta 1991) on the relative merits of what Long (1988, 1991) has called a focus on form (= learners paying attention to linguistic form in the process of engaging in meaning-oriented talk) rather than forms (= learners working on language as a decontextualized system; see also Doughty & Williams 1998; Muranoi 2000; Spada 1997; Williams 1999).

The role of negative feedback as a possible factor in second language (SL) learning has also become an important issue in SLA studies. Arguing from a Universal Grammar (UG) perspective, Gregg (1984, 1993, 1996) and White (1987, 1989, 1991) have cast doubt on the theoretical importance of i+1 in SLA. Indeed, White suggests that what learners need is not so much comprehensible input as incomprehensible input, and that positive evidence alone cannot serve as the sole means of destabilizing learners' interlanguage. Citing the example of different adverb placement rules in French and English, White (1991) argues that NSs of French learning English as a SL will not encounter any input in English that specifically prohibits verb-adverb-direct object strings (for example, "Je bois toujours du café" = literal translation: "I drink always coffee" = standard English: "I always drink coffee"). SL learners thus need negative evidence to tell them that such a construction does not work in English (see also Birdsong 1989).

Research on the function and efficacy of negative feedback began with work on the effects of error correction on learner output (Brock, Crookes, Day, & Long 1986; Chaudron 1977, 1987, 1988; Crookes & Rulon 1988; Lightbown & Spada 1990; Salica 1981; Spada & Lightbown 1993; White 1991; White, Spada, Lightbown, & Ranta 1991). Later work has tended to focus on the relative effectiveness of different types of implicit and explicit negative feedback, particularly recasts and reformulations of the input (Doughty & Varela 1998; Gass 1997; Long, Inagaki, & Ortega 1998; Lyster 1998; Lyster & Ranta 1997; Mackey & Philp 1998; Nicholas, Lightbown, & Spada 2001; Oliver 1995, 1998, 2000). While the jury is still out on the precise role played by negative feedback in SLA, it is likely that such feedback facilitates SL learning, and it may also be necessary for learning some L2 structures (Long 1996).

A critique of mainstream work on SLA

As I have already noted, the body of research spawned by the IH is impressive. And yet . . . there are various epistemological, methodological, and substantive issues that remain unresolved. For present purposes, I couch my discussion of these issues in terms of Schegloff's (1993) critique of attempts to quantify social interaction.

Despite the noteworthy contributions of Hawkins (1985) and of Swain and her associates to the formulation of the IH, qualitative research is dramatically under-represented in SLA studies. The scarcity of CA studies is especially noticeable here, because the concept of repair is borrowed from CA (see Jefferson 1987; Schegloff 1979, 1991, 1992, 1997a, 1997b, 2000; Schegloff, Jefferson, & Sacks 1977). However, this situation is now beginning to change. Two early studies by Gaskill (1980) and Schwartz (1980) on SL repair have been followed up in the last few years by a spate of CA-for-SLA work (see Firth & Wagner 1997; Kasper 2002; Kasper & Ross 2001; Markee 1994, 1995, 2000, 2003, 2004a, 2004b, 2005a, 2005b; Mori 2002; Seedhouse 1997, 1999; Wagner 1996; Willey 2001). Other writers have also used CA techniques – for example, van Lier (1988) or, more recently, Lazaraton (2003, 2004), who labels her work "microanalysis" – but do not claim to be doing CA per se. CA techniques have also been used by researchers who frame their work in terms of sociocultural theory (Ohta 2001a, 2001b), systemic grammar (Young & Nguyen 2002), or, potentially, variationist approaches to SLA (Tarone & Liu 1995). Finally, the work of Koshik (2002a, 2002b, 2003), Lerner (1995), McHoul (1978, 1990), and Olsher (2001) on the structure of classroom discourse is also relevant to CA-for-SLA research.


These developments are an encouraging step in the right direction. But this body of research is still dwarfed, both in size and influence, by a longer, better established, and, above all, predominantly experimental tradition in mainstream SLA. Now, CA-for-SLA does not have an inalienable right to a seat at the SLA table: it has to demonstrate that its insights are relevant and useful to SLA studies. Some of the pertinent issues have already been vigorously debated (see the exchanges between Firth & Wagner 1997 on the one hand, and Gass 1998, 2001, Kasper 1997, and Long 1997, 1998 on the other). Here, I review the issues in terms of the numerator, denominator, significance, and domain problems identified by Schegloff (1993), all of which affect the possibility of meaningful quantification of social interaction in SLA studies.

The domain problem

It is a fundamental tenet of the IH that "free conversation is notoriously poor as a context for driving interlanguage development . . . in contrast, tasks that orient participants to shared goals and involve them in some work or activity produce more negotiation work" (Long 1996:448). Let me now restate this claim in CA-for-SLA terms to highlight certain problems: (1) ordinary conversation and institutional talk (specifically, classroom talk) are observably different speech exchange systems; (2) classroom talk provides better structural opportunities for SLA to occur than ordinary conversation does, because the repair practices that participants orient to as they do classroom talk provide qualitatively better opportunities for learners to notice new forms and to negotiate meaning; (3) classroom talk provides a greater number of opportunities than ordinary conversation does for learners to engage in such noticing and negotiation. I fully agree with the first of these propositions and have called for just such a research agenda (Markee 2000). CA originally focused on the study of ordinary conversation, which may be glossed as "casual, social talk that routinely occurs between friends and acquaintances, either face-to-face or on the telephone" (Markee 2000:24). Over time, it has also come to encompass the study of institutional talk such as news, medical, courtroom, and classroom talk (see, for example, Boden & Zimmerman 1991; Button 1991; Clayman & Heritage 2002; Drew & Heritage 1992; Heath 1989; Heritage & Roth 1995; McHoul 1978, 1990; J. D. Robinson 1998; Stivers 2001). CA clearly possesses both the methodological tools and the expertise necessary to explicate how speech exchange systems differ from one another.

I am also fascinated by Propositions 2 and 3. However, mainstream SLA and CA-for-SLA both have to confront seven unresolved problems if the implications of these three propositions are to be sustained: (1) Despite the vast amount of experimental research that has been done on the role and function of negotiation in mainstream SLA, few benchmark qualitative studies exist that systematically compare and contrast the sequential, turn-taking, and repair practices of (SL) ordinary conversation with those of (SL) classroom talk (Liddicoate 1997; however, see also Kasper 2002). (2) This lack of benchmarks is highly problematic for mainstream SLA, because Proposition 1 is the theoretical foundation on which Propositions 2 and 3 rest. (3) At the moment, mainstream SLA research does not adequately distinguish between ordinary conversation and institutional talk, nor among different institutional varieties of talk (see, for example, Varonis & Gass 1985b). (4) Mainstream SLA studies seem curiously uninterested in documenting and analyzing the observable learning consequences of specific conversational acts by learners. Thus, while the theoretical construct of i+1 may eventually yield interesting insights about SLA as an abstract, aggregated, group phenomenon, we are still rarely offered analyses that show how individual learners actually do comprehensible input in real time, and how such input leads first to understanding and then to learning that is instantiated as comprehensible output, if only in the short term (for an example of such research, see Markee 1994, 2000). (5) Until we have a better qualitative understanding of how these speech exchange systems (i.e., domains) are organized, continuing attempts to quantify talk-in-interaction and to generalize from these data will inevitably be premature. To be valid and reliable, experimental work must be properly grounded in prior analyses that explicate how that particular piece of talk was produced at that particular moment in that particular speech event to achieve that particular action. (6) Whatever results of the IH are eventually sustained, current claims concerning the role of repair in language learning are likely to be generalizable only to the domain of instructed SLA, not to SLA as a whole. (7) Finally, note that the idea encapsulated in Proposition 1 goes far beyond issues of quantification. It also invokes one of the most vexing controversies in SLA studies today.

I am referring, of course, to the question of how psycholinguistic ques-tions of language learning intersect with sociolinguistic aspects of languageuse. As Long (1997, 1998) correctly notes in his responses to Firth and Wagner(1997), social cognitive researchers working within the framework of the IHhave conducted a great deal of research on the role of conversational repairin facilitating negotiation. However, in the face of radical, sociolinguistically-inspired critiques of their work by Firth and Wagner (1997) and others, thereseems to be a tendency among some prominent social cognitivists to backpedal

JB[v.20020404] Prn:13/12/2005; 11:52 F: LLLT1208.tex / p.8 (389-447)

Numa Markee

[Figure 1 here: a tree diagram in which Second Language Studies divides into SLA and SL Use, with the lower-level branches Universals, Transfer, Interaction, Speech acts, and Communication strategies.]

Figure 1. A characterization of research in "SLA" (Gass 1998:88)

from the logical implications of their earlier work and to retreat into a world of psycholinguistic isolationism. For example, Gass asserts that:

. . . it is true that in order to examine [changes in linguistic knowledge], one must consider language use in context. But in some sense this is trivial; the emphasis in input and interaction studies is on the language used and not on the act of communication. (Gass 1998:84; emphasis in the original)

This position maintains an untenable distinction between psycholinguistic and sociolinguistic views of language. Of course, Gass is sensitive to such a criticism and softens her position by saying that "views of language that consider language as a social phenomenon and views of language that consider language to reside in the individual do not necessarily have to be incompatible" (Gass 1998:88). Unfortunately, this turns out to be little more than a pro forma disclaimer, because Gass specifies the relationship between studies in SLA and SL use as shown in Figure 1.

Gass finds some of Firth and Wagner's arguments regarding the scope of SLA puzzling, but her attempts to compartmentalize issues of language use and language acquisition illustrated in Figure 1 are equally odd. An empirically grounded understanding of how learners' interlanguage knowledge (as this is reflected in and through their talk-in-interaction) progresses from A to B, and what "events promote or hinder such progress" (Kasper 1997:310), cannot be dismissed as a "trivial" issue. It is a crucial foundation for the IH. If this means that advocates of the IH have to accept that language acquisition and use are indivisible components of the SLA enterprise, then this is not to be seen as a threat to the disciplinary integrity of SLA studies. It is a consequence of the IH's own theoretical interests in social interaction as a resource for SLA. I return to these issues in the empirical and concluding sections of this chapter.

A conversation analytic perspective

The significance problem

A key assumption of experimental research is that the nature or strength of relationships or claimed differences between treatment effects, etc. must be significant, in the technical, statistical sense of this word. That is, in order to develop a viable mathematical model of hypothesized relationships between independent and dependent variables, we must be able to find enough aggregated instances of a given behavior to be confident that a finding is robust. However, as Schegloff (1993) points out, this etic, or researcher's, perspective on knowledge construction is not the only way of conceptualizing the relevance of analytical findings. As Schegloff (1993:101) memorably declares, "one is also a number" (emphasis in the original). What he means by this is that, from an emic, or participant's, perspective:

The best evidence that some practice of talk-in-interaction does, or can do, some claimed action, for example, is that some recipient on some occasion shows himself or herself to have understood it, most commonly by so treating it in the ensuing moments of the interaction, and most commonly of all, next. Even if no quantitative evidence can be mustered for a linkage between that practice and that resultant "effect," the treatment of the linkage as relevant – by the parties on that occasion, on which it was manifested – remains.

(Schegloff 1993:101, emphasis in the original)

This position has at least two profound implications for the way in which CA research is carried out. First, the warrant for any analytic claims that are made about how ordinary conversation and institutional talk are organized must be located in the local context of participants' talk, and explicated in terms of what members understand each other to be doing at the time that they are doing it. Thus, CA work is always based in the first instance on the exhaustive micro-analysis of single cases of interaction. Second, analyses of collections (= "aggregates of single instances" of, say, second position repairs; see Schegloff 1993:102) may also be carried out. However, since the goal of analysts is to develop a member's understanding of participants' behaviors, there can be no such thing as an "outlier" that may be discarded as unrepresentative of group norms. In order to account for apparently deviant behaviors, deviant case analysis is used to provide a complete account of a phenomenon. The best-known example of this technique is Schegloff's (1968) analysis of sequencing in conversational openings on the telephone. His first analysis accounted for 499 out of 500 cases in the collection. However, Schegloff reanalyzed the entire corpus to produce an account that covered all 500 cases of this phenomenon. I return to these issues in the empirical section of this chapter.

These issues are generally not well understood in mainstream SLA. For example, the fact that most CA research does not attempt to present data in a quantified form is seen as an annoying, almost irrational quirk by some writers (see Long 1997), rather than as the product of a principled epistemological stance. Furthermore, in the rush to quantify data that has characterized mainstream SLA studies from its inception, the fact that experimental research in SLA is not without its own problems has largely been ignored (see Markee 2000: Chapter 2).

First, the functional categories of repair used by mainstream SLA research (comprehension checks, clarification requests, confirmation checks, recasts, etc.) are frequently so ambiguous or decontextualized that it is often not clear whether a particular fragment of talk actually constitutes, say, a comprehension check or a completely different category of repair (Ohta 2001b). The ambiguous nature of these categories is problematic from an emic perspective because it is not clear that members orient to these categories as distinct constructs. And from an etic perspective, the decontextualization of these categories is even more problematic, since experimentation requires that categories used for coding be discrete in order for the subsequent analysis to be meaningful. Varonis and Gass (1985b) have acknowledged this problem (see also how Oliver 1998, 2000 deals with double-coding issues), but this admission does not seem to have dampened the enthusiasm with which these categories are still employed in experimental studies.1

Second, mainstream SLA’s preoccupation with quantifying SL data as thedefault mode of analysis has prompted Aston (1986) to point out that a quanti-tative, “more the merrier” approach to investigating SL repair fails to acknowl-edge that there may be considerable negative social consequences for memberswho engage in excessive repair of their interlocutors’ talk.

The denominator problem

Schegloff (1993) explains what the denominator problem is by critiquing a quasi-experimental study that sought to quantify the notion of sociability in terms of how many times per minute subjects laughed during the experiment. After noting that laughter in naturally occurring talk-in-interaction is always a responsive phenomenon, whose quality and placement in the ongoing talk therefore matter in terms of members' assessments of whether such laughter is affectively appropriate or not, he points out:

If one wants to assess how much someone laughed, to compare it with other laughter by that person or by others, then a denominator will be needed that is analytically relevant to what is to be counted because it is organizationally related to it in the conduct of interaction. And minutes are not.

(Schegloff 1993:104, emphasis in the original)

This suggests that work on repair in SLA that quantifies how often repairs occur without specifying a relevant denominator (for example, the number of repairs that occur per task type) is likely to be premature. This is because the specification of task as a denominator requires prior grounded research to establish whether, for example, one-way and two-way tasks constitute distinct domains. To date, this research has not been carried out in mainstream SLA studies. I return to this issue in the conclusion to this chapter.

The numerator problem

The numerator problem has to do with the raw frequency of a particular behavior in talk. So, to continue our repair example, quantifying the number of repairs that occur in a speech event or aggregation of speech events to find out whether this number reaches statistical significance does not tell us anything about how repair is achieved by participants in and through talk. If quantified data are to be meaningful, we need to understand that the observable absence or rarity of specific types of repair is as pertinent to a comprehensive analysis of repair as their expected presence or frequency in talk. We are in fact required to develop the notion of an "environment of relevant possible occurrence" (Schegloff 1993:106). So, we must have a detailed sequential understanding of where and when an analytic warrant exists for saying that a repair is present, absent, frequent, or rare in a given piece of talk. Again, we can only do this by grounding our analyses of an individual case in the practices that distinguish one speech exchange system from another. I illustrate how to do this in the following section.

Conversation analysis: An empirical example

In the empirical analysis that follows, as called for by Kasper (1997), I propose to show that CA can explicate how certain events in classroom talk can hinder, or at least delay, acquisitionally relevant talk. Furthermore, to connect again with the theme of CA's stance on quantification and generalizability examined

earlier in this paper, I make this argument by using deviant case analysis to interpret one student's use of a rare, possibly unique, example of a Counter-Question – that is, a type of question that is inserted, normally by teachers but in this case by the student, between the first and second pair parts of a Question-Answer adjacency pair. Of course, from an experimental perspective, such data would never be used to make generalizations about SLA. But from a CA point of view, it is the very rarity of the example that allows us to develop a deeper strategic understanding of the different rights and responsibilities of members in teacher-student speech exchange systems. It is these deeper strategic insights about the structural organization of talk that are potentially generalizable, not the specific tactics of individual participants in a particular conversation.

Fragment 1 below (see Markee 1995 for a full analysis) comes from a 50-minute undergraduate university ESL class that was video- and audio-taped in 1990. The class, which was discussing the issue of German reunification, was taught by an experienced NS English teacher. The methodology used by the teacher involved task-based interactions mediated through small group work, which provided learners with opportunities to focus on form on an as-needed basis.

During dyadic talk between L9 and L11 that occurred before the interaction reproduced below, L11 had had trouble on two separate occasions trying to work out what the phrase "you pretend to pay us and we pretend to work" means. Note that this prior talk was constructed as a speech exchange system that approximated in many ways the locally organized turn-taking and repair practices of ordinary conversation (Sacks, Schegloff, & Jefferson 1974). That is, although the topics of L9's and L11's talk were preset by comprehension questions in the materials, neither participant had a pre-allocated right to take or assign turns or to initiate repairs. However, when we join the interaction at line 09 of Fragment 1, where L9 seeks the teacher's (= Jane's) help, we are witnessing the beginning of L9 and L11's third attempt to resolve this problem (see Appendix 1 for transcription conventions).

Fragment 1
((L9 and L11 are looking down at their reading materials. L9 is holding the pages of his materials in his right hand, and L11 is leaning his head on his left hand.))
01 L9:  [((L9 leans toward L11))]
02 L9:  can [we call jane maybe, ((unintelligible)).
03 L9:      [X . . . . . . . . . . . . . . . . . X
04      (0.3)
05 L11: myeah.

06      [((L9 and L11 both look up toward the front of the class. L9 holds
07      his chin in his right hand in a thinking posture, and L11 rests his left
08      hand on his left thigh))]
09 L9:  Q [nt jane?]
10 T:   uh huh?
11      (-------[---) ((T's shadow on the floor indicates she is beginning to
12      move toward L9 and L11))
13 L11: [X
14 L9:  Q [your input plea[h huh [huh] huh] ((L9 smiles as he begins to
15      laugh and looks down during
16      the last two laughter tokens))
17 T:   [huh]
18 L11: [h huh [huh]huh] [huh. huh
19 L11:
20      ((T appears in the camera shot. She is approaching L9 and L11))
21 L11:
22 L9:  Q ·hhhhh there is this e:::h ((diminishing volume through "e::::h"))
23      ((L9 looks down at his reading materials))
24      (0.6) some sort of an idiom
25      ((T arrives next to L9 and L11))
26      Q you pretend to pay us [and we pretend to work
27      [((T leans down and reaches with her right hand
28      to identify where the problem phrase is in L9's
29      reading text))
30      [((L9 and L11 both look up at T))
31 T:   CQ(D) [ok. [what do you think that could be:]
32      [((T stands up straight as she speaks.))
33      CQ(D) (0.3) do you have any idea? ((T is looking at L11. She holds the open
34      palm of her right hand up and makes a slight beating gesture as she
35      says the word "you".))
36 L9:  [X ((L9 looks at L11))
37 L11: CQ → [do you- do you know what the word pretend, (.) means,]
38      [((L11 raises his left hand to touch his nose and beats out
39      the stress on the words "know" and "pretend". His hand then
40      comes to a resting, thinking posture position, supporting his chin.
41      As she listens to L11, T holds her chin in the fingers of her right hand
42      in a thinking posture.))
43      (1.0)
44 T:   CQ do [I know what the word pretend means]
45      [((T leans forward slightly toward L11 and touches her chest with
46      her right hand as she says the word "I". In addition, her intonation
47      steadily rises through the turn, ending on a high note. Meanwhile,

48      L9 briefly looks at T as she says the word "I" and then moves his
49      gaze back onto L11.]))
50 L11: A [yeah- I- I /dawt/-]
51      [((L11 does a circling gesture in front of him with his left hand, which
52      ends with his open left hand resting on his chest as he says the second
53      "I"))]
54 L11: [I don't know that see]
55      [((L11 shakes his head slightly four times))]
56 T:   CQ(D) oh ok °who-° °do-° ((T looks quickly to her right and then her left at
57      the rest of the class.)) does anybody know what the word pretend
58      means. ((T is speaking to the whole class.))

At lines 09, 11, and 14–26, L9 calls the teacher over and identifies the phrase "you pretend to pay us and we pretend to work" as an idiom that he and L11 do not understand. At lines 31 and 33, instead of answering this question directly, T does a Counter-Question (CQ) turn that is formulated as a display (D) question. This verbal behavior (along with the accompanying gestural behaviors at lines 33–35 that clearly nominate L11 as next speaker) may be analyzed as a move that puts T back in sequential position not only to ask questions to which there is a known answer but also to comment on students' answers in the commenting C slot of Question-Answer-Comment (QAC) sequences (see McHoul 1978 and Markee 1995 on QAC sequences, and also the earlier work of Mehan 1979 and Sinclair & Coulthard 1975 on Initiation-Response-Feedback sequences in classroom talk). Thus, by initiating this CQ(D) turn at lines 31 and 33, T asserts her instructor's right to control the pedagogical agenda by orienting to the practices of classroom talk, not ordinary conversation. The prototypical trajectory of CQ(D) sequences may therefore be summarized as follows:

Owner of the turn:  L  →  T      →  L   T
Turn type:          Q  →  CQ(D)  →  A   C

Figure 2. Trajectory of CQ(D) sequences (Markee 1995:75)

But, at line 37, L11 asks whether T knows what the word "pretend" means. Now, L11 does not stress the word "you" here. He may be trying to tell T that it is he, not L9, who does not understand the meaning of this word, a piece of information which T does not know, since she has not participated in L9 and L11's prior dyadic talk. Whatever L11 may have been trying to accomplish at line 37, T reacts at line 44 by treating L11's turn as a CQ turn that challenges her status as a NS English teacher. She accomplishes this by indicating that there

is potentially some trouble at line 43, where she pauses for a full second. She then does a CQ turn of her own at line 44 that rhetorically demands recognition of her status as a NS teacher as the preferred response (note T's use of the heavy contrastive stress on the word "I", the rising intonation through the turn at line 44, and the accompanying, highly emphatic, visual deixis of the hand gesture at line 45). At line 50, L11 begins by reiterating his question by saying "yeah" (a dispreferred, or marked, response). However, after some initial perturbations (note the cut-offs in the first part of this turn), he displays a new understanding of what the teacher is doing in her previous turn and begins to repair his social relationship with T. His first attempt still comes out rather garbled: "I- I /dawt/-" is not only marked by hesitations and cut-offs, but the word phonetically transcribed as /dawt/ may be a first attempt to say "I don't know that", which seems to "come out wrong" under the communicative stress of the moment. In the completion of this turn at line 54, L11 repairs this first attempt by saying "I don't know that see" and achieves the preferred response of acknowledging that he does not know what the word "pretend" means and that he therefore needs T's expert help.

At line 56, T first acknowledges her new understanding and acceptance of L11's clarification by using the change of state token "oh ok" (Heritage 1984) and then, as in lines 31 and 33, moves on in the second part of her turn to redirect the question to other interlocutors, in this case, the rest of the class. This Q turn is again done as a D question, and the rest of the talk (not reproduced here) runs off in the canonical QAC order. The actual trajectory of Fragment 1 can thus be diagrammed as follows:

Owner of the turn:  L   T       L    T    L   T
Turn type:          Q   CQ(D)   CQ   CQ   A   Q(D) . . .

Figure 3. Trajectory of Fragment 1

From a CA-for-SLA perspective, this excerpt is of considerable analytical interest. Although L11's behavior is unique in the eight classes and approximately 26 hours of classroom recordings that form my database, the analysis summarized in Figure 3 is not a counter-example to the analysis of the underlying practices that govern the speech exchange system extrapolated from the prototypical sequential trajectory shown in Figure 2. In fact, this deviant case analysis demonstrates that teachers have pre-allocated rights to do specific turns (Q, CQ(D) and A) in QAC sequences and that learners do not have the right to do turns that may be interpreted as CQ(D) talk. Furthermore, if

a student unexpectedly fails to orient to the sequentially located turn-taking conventions and repair practices of classroom talk, this transgression is an act which the teacher may censure through the use of a rhetorical CQ(D) turn that other-initiates repair.

Note also that how participants resolve the ambiguity of which speech exchange system they are orienting to at that particular moment in time as they negotiate that particular repair is a complex issue. For example, the use of the change of state token "oh ok" by T at line 56 is notable because we know that this conversational object is characteristic of ordinary conversation, not institutional talk (Heritage 1984). This is because teachers rarely ask learners true information questions (Long & Sato 1983).2 Furthermore, learners tend to ask teachers very few questions compared to the number of questions that instructors ask students (Dillon 1981, 1988; White & Lightbown 1984).

Here, T asks L11 at line 44 a question that is fishing for the preferred answer that T does know what the word "pretend" means. But the way L11 actually does this at lines 50 and 54 is by formulating his answer in terms of his ignorance concerning what "pretend" means, not the teacher's. T then responds at line 56 by saying "oh ok", thus exploiting a practice of ordinary conversation that hearably treats L11's answer as new information that she had not understood before. This verbal tactic enables T to end the confrontation and to align with L11's attempt to repair his social relationship with her.

Conclusion

I conclude this chapter with the following six observations. First, we should note that the disciplinary preference for explanation over interpretation and inference that has characterized applied linguistic and SLA research over the last 30 years is in the process of changing. For example, in their influential article reopening the research agenda on SL motivation, Crookes and Schmidt (1991) called for more qualitative as well as quantitative research on these issues. Dörnyei (2000) and Dörnyei and Csizér (2002), whose own work is heavily experimental, have issued similar calls for more situated, process-oriented accounts of motivation. And McGroarty (1998) has gone even further, claiming that social constructivist approaches to theory building about language learning and use may provide the most interesting sources of insight for future applied linguistic research. Thus, the issues discussed here are embedded in a larger discussion of the potential value of quantification in such research.

Second, the empirical analysis of Fragment 1 responds to Kasper's (1997) call for CA-for-SLA to show how social events promote or, as in this case, hinder interlanguage development. As we have seen, the social relationship between T and L11 had to be repaired before further language learning-oriented talk could continue. The tactical face-saving work done in Fragment 1 illustrates the larger strategic fact that classrooms are social environments as much as they are learning places. This conclusion also demonstrates how CA-for-SLA can empirically confront Gass' (1998) theoretical claim that language acquisition and language use are distinct aspects of second language studies. The clear implication of this analysis is that they are so closely intertwined as to be theoretically inseparable.

Third, despite the great importance that CA attaches to single case analysis, it is important to understand that CA does not a priori deny the value of a quantitative approach to social interaction. What I have argued in this chapter is that we must develop a rigorous, qualitative understanding of how SL learning and use are done by participants in order to motivate grounded quantitative research (for recent examples of such grounded experimental work in CA, see Heritage & Stivers 1999; Stivers 2001).

Fourth, in an ideal world, this grounding work would have been done prior to embarking on a significant program of experimental research. From a purely pragmatic point of view, we are clearly well beyond the point of being able to observe any so-called canonical order of doing qualitative research first and then following up with quantitative research (see also Crookes & Schmidt 1991 on this point). Thus, under present circumstances, CA-for-SLA research has to take on the unusual epistemological function of confirming, not just generating, hypotheses about SL learning and use (Markee 2000).

Fifth, we must ultimately develop both interpretive and predictive explanations of SLA processes that are coherent in their own terms and that are also properly informed by each other. Indeed, as Schegloff (1993) suggests, the social construction of repair is a type of behavior that could in the long term potentially benefit from follow-up experimental studies that have been properly grounded in qualitative CA work. But we are not there yet and, in terms of what the future holds for SLA studies, mainstream SLA researchers who use quantified data can expect to be asked with increasing insistence – as should qualitative researchers also; see Edge and Richards (1998) – to "show their warrant" for making claims X, Y or Z.

Finally, I accept that the development of a CA-for-SLA agenda is controversial, in that it broadens the scope of mainstream SLA research, challenges conventional notions of the "proper" relationships between qualitative and

quantitative approaches to scholarship, and also forces us to rethink what and how we generalize from data. At the same time, I wish to argue that the results of such a respecification of our field will ultimately strengthen SLA studies, not weaken their fundamental disciplinary integrity. Proponents of different versions of SLA should certainly continue to engage in vigorous debate about the strengths and limitations of the research traditions within which we work. Ultimately, however, we are all concerned with explicating the same complex phenomenon of how and why SLs are learned, and it is in this cooperative spirit that the arguments developed in this paper have been offered.

Notes

1. In other words, this is a problem that cannot be solved by better inter-rater/coder reliability procedures.

2. Duff (personal communication, January 31, 2005) suggests that this is perhaps too strong a statement. While the use of true information questions was certainly rare in the early 1980s, there is increasing evidence that language teachers now ask many more such questions than they did 25 years ago.

Appendix 1: Transcription conventions

CA transcription conventions (based on Atkinson & Heritage 1984b).

Identity of speakers
T:    teacher
L1:   identified learner (Learner 1)
L:    unidentified learner
L3?:  probably Learner 3
LL:   several or all learners simultaneously

Simultaneous utterances
L1: [yes                 simultaneous, overlapping talk by two speakers
L2: [yeh

L1: [huh? [oh] I see]    simultaneous, overlapping talk by three (or more) speakers
L2:       [what]
L3:       [I dont get it]

Contiguous utterances
=  (a) turn continues at the next identical symbol on the next line
   (b) if inserted at the end of one speaker's turn and the beginning of the next speaker's adjacent turn, it indicates that there is no gap at all between the two turns

Intervals within and between utterances
(0.3) = a pause of 0.3 second
(1.0) = a pause of one second

Characteristics of speech delivery
?        rising intonation, not necessarily a question
!        strong emphasis, with falling intonation
yes.     a period indicates falling (final) intonation
so,      a comma indicates low-rising intonation suggesting continuation
go:::d   one or more colons indicate lengthening of the preceding sound; each additional colon represents a lengthening of one beat
no-      a hyphen indicates an abrupt cut-off, with level pitch
because  underlined type indicates marked stress
SYLVIA   capitals indicate increased volume
°the next thing°  degree signs indicate decreased volume
.hhh     in-drawn breath
hhh      laughter tokens

Commentary in the transcript
((coughs))          verbal description of actions noted in the transcript, including non-verbal actions
((unintelligible))  indicates a stretch of talk that is unintelligible to the analyst
(radio)             single parentheses indicate unclear or probable item

Eye gaze phenomena
The moment at which eye gaze is coordinated with speech is marked by an X and the duration of the eye gaze is indicated by a continuous line. Thus in the example below, the moment at which L11's eye gaze falls on L9 in line 412 coincides with the beginning of his turn at line 413.

         [X ((L11's eye gaze is now directed at L9))
412 L11: → ≈ [°you can call me and then [I can say you] the the address
413 L9:  →   [((L9 nods 4 times))]

Eye gaze transition is shown by commas:
[X , ,

The moment at which there ceases to be eye contact (as when a participant looks down or away from his/her interlocutor) is shown by periods:
[X . . .

Other transcription symbols
co[l]al  brackets indicate phonetic transcription
→        an arrow in the margin of a transcript draws attention to a particular phenomenon the analyst wishes to discuss.

References

Aston, G. (1986). Trouble-shooting in interaction with learners: The more the merrier?Applied Linguistics, 7, 128–143.

Atkinson, J. M. & Heritage, J. (Eds.). (1984). Structures of social action. Cambridge: CUP.Boden, D. & Zimmerman, D. H. (Eds.). (1991). Talk and social structure. Cambridge: Polity

Press.Brock, C., Crookes, G., Day, R., & Long, M. H. (1986). The differential effects of corrective

feedback in native speaker/non-native speaker conversation. In R. R. Day (Ed.), Talkingto learn: Conversation in second language acquisition (pp. 229–236). Cambridge, MA:Newbury House.

Button, G. (Ed.). (1991). Ethnomethodology and the human sciences. Cambridge: CUP.
Bygate, M. (1988). Units of oral expression and language learning in small group interaction. Applied Linguistics, 9, 59–82.
Carroll, S. & Swain, M. (1993). Explicit and implicit feedback: An empirical study of the learning of linguistic generalizations. Studies in Second Language Acquisition, 15, 357–386.
Chaudron, C. (1977). A descriptive model of discourse in the corrective treatment of learners’ errors. Language Learning, 27, 29–46.
Chaudron, C. (1987). The role of error correction in second language teaching. In B. K. Das (Ed.), Patterns of classroom interaction in South East Asia (pp. 17–50). Singapore: SEAMEO-RELC Regional Language Centre.
Chaudron, C. (1988). Second language classrooms: Research on teaching and learning. Cambridge: CUP.
Clayman, S. & Heritage, J. (2002). Questioning Presidents: Journalistic deference and adversarialness in the press conferences of Eisenhower and Reagan. Journal of Communication, 52, 749–777.
Crookes, G. (1989). Planning and interlanguage variation. Studies in Second Language Acquisition, 11, 367–383.
Crookes, G. & Rulon, K. A. (1988). Topic and feedback in native speaker/non-native speaker conversation. TESOL Quarterly, 22, 675–681.
Crookes, G. & Schmidt, R. W. (1991). Motivation: Reopening the research agenda. Language Learning, 41, 469–512.
Dillon, J. T. (1981). Duration of response to teacher questions and statements. Contemporary Educational Psychology, 6, 1–11.
Dillon, J. T. (1988). The remedial status of student questioning. Journal of Curriculum Studies, 20, 197–210.
Dörnyei, Z. (2000). Motivation. London: Longman.
Dörnyei, Z. & Csizér, K. (2002). Some dynamics of language attitudes and motivation: Results of a longitudinal nationwide survey. Applied Linguistics, 23, 421–462.
Doughty, C. (1991). Second language instruction does make a difference: Evidence from an empirical study of SL relativization. Studies in Second Language Acquisition, 13, 431–469.
Doughty, C. & Pica, T. (1986). “Information gap” tasks: An aid to second language acquisition? TESOL Quarterly, 20, 305–325.

JB[v.20020404] Prn:13/12/2005; 11:52 F: LLLT1208.tex / p.21 (1159-1257)

A conversation analytic perspective

Doughty, C. & Varela, C. (1998). Communicative focus on form. In C. Doughty & J. Williams (Eds.), Focus on form in classroom language acquisition. Cambridge: CUP.
Doughty, C. & Williams, J. (Eds.). (1998). Focus on form in classroom language acquisition. Cambridge: CUP.
Drew, P. & Heritage, J. (Eds.). (1992a). Talk at work: Interaction in institutional settings. Cambridge: CUP.
Duff, P. A. (1986). Another look at interlanguage talk: Taking task to task. In R. Day (Ed.), Talking to learn (pp. 147–181). Rowley, MA: Newbury House.
Edge, J. & Richards, K. (1998). May I see your warrant please? Justifying outcomes in qualitative research. Applied Linguistics, 19, 334–356.
Ellis, R. (1985). Teacher-pupil interaction in second language development. In S. M. Gass & C. Madden (Eds.), Input in second language acquisition (pp. 69–85). Rowley, MA: Newbury House.
Firth, A. & Wagner, J. (1997). On discourse, communication, and (some) fundamental concepts in SLA research. The Modern Language Journal, 81, 285–300.
Foster, P. (1998). A classroom perspective on the negotiation of meaning. Applied Linguistics, 19, 1–23.
Foster, P. & Skehan, P. (1996). The influence of planning on performance in task-based learning. Studies in Second Language Acquisition, 18, 299–324.
Gaskill, W. (1980). Correction in native speaker – non-native speaker conversation. In D. Larsen-Freeman (Ed.), Discourse analysis in second language research (pp. 125–137). Rowley, MA: Newbury House.
Gass, S. M. (1988). Integrating research areas: A framework for second language studies. Applied Linguistics, 9, 198–217.
Gass, S. M. (1997). Input, interaction and the second language learner. Mahwah, NJ: Lawrence Erlbaum.
Gass, S. M. (1998). Apples and oranges: Or, why apples are not oranges and don’t need to be. A response to Firth and Wagner. The Modern Language Journal, 82, 83–90.
Gass, S. M. (2001). SLA: A 2001 space odyssey. Plenary address to the Annual AAAL Conference, St. Louis, MO, March 2001.
Gass, S. M. & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of non-native speech. Language Learning, 34, 65–89.
Gass, S. M. & Varonis, E. M. (1986). Sex differences in non-native speaker/native speaker interaction. In R. R. Day (Ed.), Talking to learn: Conversation in second language acquisition (pp. 327–351). Rowley, MA: Newbury House.
Gass, S. M. & Varonis, E. M. (1994). Input, interaction and second language production. Studies in Second Language Acquisition, 16, 283–302.
Gregg, K. R. (1984). Krashen’s monitor and Occam’s razor. Applied Linguistics, 5, 79–100.
Gregg, K. R. (1993). Taking explanation seriously; or, let a couple of flowers bloom. Applied Linguistics, 14, 276–294.
Gregg, K. R. (1996). The logical and developmental problems of second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of second language acquisition (pp. 49–81). New York: Academic Press.
Harley, B. (1989). Functional grammar in French immersion: A classroom experiment. Applied Linguistics, 10, 331–359.

Numa Markee

Hatch, E. M. (1978). Discourse analysis and second language acquisition. In E. Hatch (Ed.), Second language acquisition: A book of readings (pp. 402–435). Rowley, MA: Newbury House.
Hawkins, B. (1985). Is an ‘appropriate’ response always so appropriate? In S. M. Gass & C. Madden (Eds.), Input in second language acquisition (pp. 162–178). Rowley, MA: Newbury House.
Heath, C. (1989). Pain talk: The expression of suffering in the medical consultation. Social Psychology Quarterly, 52, 113–125.
Heritage, J. (1984). A change-of-state token and aspects of its sequential placement. In J. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 299–345). Cambridge: CUP & Paris: Editions de la Maison des Sciences de l’Homme.
Heritage, J. & Roth, A. L. (1995). Grammar and institution: Questions and questioning in the broadcast news interview. Research on Language and Social Interaction, 28, 1–60.
Jefferson, G. (1987). On exposed and embedded correction in conversation. In G. Button & R. E. Lee (Eds.), Talk and social interaction (pp. 86–100). Clevedon: Multilingual Matters.
Kasper, G. (1997). “A” stands for acquisition: A response to Firth and Wagner. The Modern Language Journal, 81, 307–312.
Kasper, G. (2002). Conversation analysis as an approach to second language acquisition: Old wine in new bottles? Invited talk, SLATE speaker series, University of Illinois at Urbana-Champaign, March 13, 2002.
Kasper, G. & Ross, S. (2001). Is drinking a hobby, I wonder: Other-initiated repair in language proficiency interviews. Paper presented at the Annual AAAL Conference, St. Louis, MO, March 2001.
Kormos, J. (2001). The role of attention in monitoring second language production. Language Learning, 50, 343–384.
Koshik, I. (2002a). Designedly incomplete utterances: A pedagogical practice for eliciting knowledge displays in error correction sequences. Research on Language and Social Interaction, 35(3), 277–309.
Koshik, I. (2002b). A conversation analytic study of yes/no questions which imply reversed polarity assertions. Journal of Pragmatics, 34, 1851–1877.
Koshik, I. (2003). Wh questions used as challenges. Discourse Studies, 5, 51–77.
Kowal, M. & Swain, M. (1994). Using collaborative production tasks to promote students’ language awareness. Language Awareness, 3, 73–93.
Krashen, S. D. (1980). The input hypothesis. In J. E. Alatis (Ed.), Current issues in bilingual education (pp. 168–180). Washington, DC: Georgetown University Press.
Krashen, S. D. (1981). Second language acquisition and second language learning. Oxford: Pergamon.
Krashen, S. D. (1982). Principles and practice in second language acquisition. Oxford: Pergamon.
Krashen, S. D. (1985). The input hypothesis. London: Longman.
Krashen, S. D. (1989). We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis. Modern Language Journal, 73, 440–464.

Lazaraton, A. (2003). Incidental displays of cultural knowledge in the nonnative-English-speaking teacher’s classroom. TESOL Quarterly, 37, 213–245.
Lazaraton, A. (2004). Gesture and speech in the vocabulary explanations of one ESL teacher: A microanalytic inquiry. Language Learning, 54(1), 79–117.
Lerner, G. H. (1995). Turn design and the organization of participation in instructional activities. Discourse Processes, 19, 111–131.
Liddicoat, A. (1997). Interaction, social structure, and second language use: A response to Firth and Wagner. The Modern Language Journal, 81, 313–317.
Lightbown, P. M. & Spada, N. (1990). Focus-on-form and corrective feedback in communicative language teaching: Effects on second language learning. Studies in Second Language Acquisition, 12, 429–448.
Long, M. H. (1980). Input, interaction and second language acquisition. PhD dissertation, University of California at Los Angeles.
Long, M. H. (1981). Input, interaction and second language acquisition. In H. Winitz (Ed.), Native language and foreign language acquisition. Annals of the New York Academy of Sciences, 379, 259–278.
Long, M. H. (1983a). Native speaker/non-native speaker conversation and the negotiation of comprehensible input. Applied Linguistics, 4, 126–141.
Long, M. H. (1983b). Linguistic and conversational adjustments to non-native speakers. Studies in Second Language Acquisition, 5, 177–193.
Long, M. H. (1985a). Input and second language acquisition theory. In S. Gass & C. Madden (Eds.), Input in second language acquisition (pp. 377–393). Rowley, MA: Newbury House.
Long, M. H. (1985b). A role for instruction in second language acquisition: Task-based language training. In K. Hyltenstam & M. Pienemann (Eds.), Modelling and assessing second language acquisition (pp. 77–99). Clevedon: Multilingual Matters.
Long, M. H. (1989). Task, group, and task-group interactions. University of Hawai’i Working Papers in ESL, 8, 1–26.
Long, M. H. (1990). The least a second language acquisition theory needs to explain. TESOL Quarterly, 24, 649–666.
Long, M. H. (1991). Focus on form: A design feature in language teaching methodology. In K. de Bot, R. B. Ginsberg, & C. Kramsch (Eds.), Foreign language research in perspective (pp. 39–51). Amsterdam: John Benjamins.
Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of second language acquisition (pp. 414–468). New York, NY: Academic Press.
Long, M. H. (1997). Construct validity in SLA research: A response to Firth and Wagner. The Modern Language Journal, 81, 318–323.
Long, M. H. (1998). SLA: Breaking the siege. Plenary address, PacSLRF 3, Aoyama Gakuin University, Tokyo, Japan, March 1998. University of Hawai’i Working Papers in ESL, 17(1), 79–129.
Long, M. H., Adams, L., McLean, M., & Castaños, F. (1976). Doing things with words – Verbal interaction in lockstep and small group classroom situations. In J. F. Fanselow & R. Crymes (Eds.), On TESOL ’76 (pp. 137–153). Washington, DC: TESOL.

Long, M. H. & Crookes, G. (1992). Three approaches to task-based syllabus design. TESOL Quarterly, 26, 27–56.
Long, M. H. & Crookes, G. (1993). Units of analysis in syllabus design: The case for task. In G. Crookes & S. M. Gass (Eds.), Tasks in a pedagogical context: Integrating theory and practice (pp. 9–54). Clevedon: Multilingual Matters.
Long, M. H., Inagaki, S., & Ortega, L. (1998). The role of implicit negative feedback in SLA: Models and recasts in Japanese and Spanish. The Modern Language Journal, 82, 357–371.
Long, M. H. & Porter, P. A. (1985). Group work, interlanguage talk and second language acquisition. TESOL Quarterly, 19, 207–228.
Long, M. H. & Sato, C. (1983). Classroom foreigner talk discourse: Forms and functions of teachers’ questions. In H. W. Seliger & M. H. Long (Eds.), Classroom-oriented research in second language acquisition (pp. 268–285). Rowley, MA: Newbury House.
Lyster, R. (1998). Negotiation of form, recasts, and explicit correction in relation to error types and learner repair in immersion classrooms. Language Learning, 48, 183–218.
Lyster, R. & Ranta, L. (1997). Corrective feedback and learner uptake: Negotiation of form in communicative classrooms. Studies in Second Language Acquisition, 19, 37–66.
Mackey, A. & Philp, J. (1998). Conversational interaction and second-language development: Recasts, responses and red herrings? The Modern Language Journal, 82, 338–356.
Markee, N. (1994). Toward an ethnomethodological respecification of second language acquisition studies. In E. Tarone, S. Gass, & A. Cohen (Eds.), Research methodology in second language acquisition (pp. 89–116). Hillsdale, NJ: Lawrence Erlbaum.
Markee, N. (1995). Teachers’ answers to students’ questions: Problematizing the issue of making meaning. Issues in Applied Linguistics, 6, 63–92.
Markee, N. (2000). Conversation analysis. Mahwah, NJ: Lawrence Erlbaum.
Markee, N. (2003). Qualitative research guidelines (conversation analysis). In C. Chapelle & P. Duff (Eds.), Some guidelines for conducting quantitative and qualitative research in TESOL. TESOL Quarterly, 37, 169–172.
Markee, N. (2004). Zones of interactional transition. The Modern Language Journal, 88, 583–596.
Markee, N. (2005a). Conversation analysis for second language acquisition. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 355–374). Mahwah, NJ: Lawrence Erlbaum.
Markee, N. (2005b). The organization of off-task talk in second language classrooms. In K. Richards & P. Seedhouse (Eds.), Applying conversation analysis (pp. 197–213). Basingstoke: Palgrave Macmillan.
McGroarty, M. (1998). Constructive and constructivist challenges for applied linguistics. Language Learning, 48, 591–622.
McHoul, A. (1978). The organization of turns at formal talk in the classroom. Language in Society, 7, 183–213.
McHoul, A. (1990). The organization of repair in classroom talk. Language in Society, 19, 349–377.
Mehan, H. (1979). Learning lessons: Social organization in the classroom. Cambridge, MA: Harvard University Press.

Mehnert, U. (1998). The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition, 20, 83–108.
Mori, J. (2002). Task design, plan, and development of talk-in-interaction: An analysis of a small group activity in a Japanese language classroom. Applied Linguistics, 23, 323–347.
Muranoi, H. (2000). Focus on form through interaction enhancement: Integrating formal instruction into a communicative task in EFL classrooms. Language Learning, 50, 617–673.
Nicholas, H., Lightbown, P., & Spada, N. (2001). Recasts as feedback to language learners. Language Learning, 51, 719–758.
Nunan, D. (1993). Task-based syllabus design: Selecting, grading and sequencing tasks. In G. Crookes & S. M. Gass (Eds.), Tasks in a pedagogical context: Integrating theory and practice (pp. 55–68). Clevedon: Multilingual Matters.
Ohta, A. (2001a). Second language acquisition processes in the classroom. Mahwah, NJ: Lawrence Erlbaum.
Ohta, A. (2001b). Confirmation checks: A conversation analytic reanalysis. Paper presented as part of the colloquium on Unpacking negotiation: A conversation analytic perspective on L2 interactional competence, Annual AAAL Conference, St. Louis, MO, March 2001.
Oliver, R. (1995). Negative feedback in child NS-NNS conversation. Studies in Second Language Acquisition, 17, 459–481.
Oliver, R. (1998). Negotiation of meaning in child interactions. Modern Language Journal, 82, 372–386.
Oliver, R. (2000). Age differences in negotiation and feedback in classroom and pair work. Language Learning, 50, 119–151.
Olsher, D. (2001). Negotiating plans of action in small group interaction. Paper presented as part of the colloquium on Unpacking negotiation: A conversation analytic perspective on L2 interactional competence, Annual AAAL Conference, St. Louis, MO, March 2001.
Peck, S. (1978). Child-child discourse in second language acquisition. In E. Hatch (Ed.), Second language acquisition: A book of readings (pp. 383–400). Rowley, MA: Newbury House.
Peck, S. (1980). Language play in child second language acquisition. In D. Larsen-Freeman (Ed.), Discourse analysis in second language research (pp. 154–164). Rowley, MA: Newbury House.
Pica, T. (1987). Second language acquisition, social interaction and the classroom. Applied Linguistics, 8, 3–21.
Pica, T. (1992). The textual outcomes of native speaker/non-native speaker negotiation: What do they reveal about second language learning? In C. Kramsch & S. McConnell-Ginet (Eds.), Text in context: Crossdisciplinary perspectives on language study (pp. 198–237). Lexington, MA: D. C. Heath.
Pica, T. & Doughty, C. (1985). The role of group work in classroom second language acquisition. Studies in Second Language Acquisition, 7, 233–248.
Pica, T., Doughty, C., & Young, R. (1986). Making input comprehensible: Do interactional modifications help? IRAL, 72, 1–25.
Pica, T., Holliday, A., Lewis, L., & Morgenthaler, L. (1989). Comprehensible output as an outcome of linguistic demands on the learner. Studies in Second Language Acquisition, 11, 63–90.

Pica, T., Kanagy, R., & Falodun, J. (1993). Choosing and using communication tasks for second language instruction. In G. Crookes & S. M. Gass (Eds.), Tasks and language learning: Integrating theory and practice (pp. 9–34). Clevedon: Multilingual Matters.
Porter, P. (1986). How learners talk to each other: Input and interaction in task-centered discussions. In R. R. Day (Ed.), Talking to learn: Conversation in second language acquisition (pp. 200–222). Rowley, MA: Newbury House.
Robinson, J. D. (1998). Getting down to business: Talk, gaze and body orientation during openings of doctor-patient consultations. Human Communication Research, 25, 97–123.
Robinson, P. J. (1995). Attention, memory and the ‘noticing’ hypothesis. Language Learning, 45, 283–331.
Roger, D. & Bull, P. (1988). Introduction. In D. Roger & P. Bull (Eds.), Conversation (pp. 21–47). Clevedon: Multilingual Matters.
Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking in conversation. Language, 50, 696–735.
Salica, C. (1981). Testing a model of corrective feedback. MA thesis, University of California at Los Angeles.
Sato, C. (1986). Conversation and interlanguage development: Rethinking the connection. In R. R. Day (Ed.), Talking to learn (pp. 23–45). Rowley, MA: Newbury House.
Sato, C. (1988). Origins of complex syntax in interlanguage development. Studies in Second Language Acquisition, 10, 371–395.
Schegloff, E. A. (1968). Sequencing in conversational openings. American Anthropologist, 70, 1075–1095.
Schegloff, E. A. (1979). The relevance of repair to syntax-for-conversation. In T. Givón (Ed.), Syntax and semantics, Volume 12: Discourse and syntax (pp. 261–286). New York, NY: Academic Press.
Schegloff, E. A. (1991). Conversation analysis and socially shared cognition. In L. B. Resnick, J. M. Levine, & S. D. Teasley (Eds.), Socially shared cognition (pp. 150–171). Washington, DC: American Psychological Association.
Schegloff, E. A. (1992). Repair after next turn: The last structurally provided defense of intersubjectivity in conversation. American Journal of Sociology, 97, 1295–1345.
Schegloff, E. A. (1993). Reflections on quantification in the study of conversation. Research on Language and Social Interaction, 26, 99–128.
Schegloff, E. A. (1997a). Third turn repair. In G. R. Guy, C. Feagin, D. Schiffrin, & J. Baugh (Eds.), Towards a social science of language: Papers in honor of William Labov. Volume 2: Social interaction and discourse structures (pp. 31–40). Amsterdam: John Benjamins.
Schegloff, E. A. (1997b). Practices and actions: Boundary cases of other-initiated repair. Discourse Processes, 23, 499–545.
Schegloff, E. A. (2000). When “others” initiate repair. Applied Linguistics, 21, 205–243.
Schegloff, E. A. & Sacks, H. (1973). Opening up closings. Semiotica, 8, 289–327.
Schegloff, E., Jefferson, G., & Sacks, H. (1977). The preference for self-correction in the organization of repair in conversation. Language, 53, 361–382.
Schegloff, E. A., Koshik, I., Jacoby, S., & Olsher, D. (2002). Conversation analysis and applied linguistics. In M. McGroarty (Ed.), Annual Review of Applied Linguistics, 22: Discourse and dialogue (pp. 3–31).

Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 17–46.
Schmidt, R. (1993). Awareness and second language acquisition. Annual Review of Applied Linguistics, 13, 206–226.
Schmidt, R. (1994). Deconstructing consciousness in search of useful definitions for applied linguistics. AILA Review, 11, 11–26.
Schmidt, R. & Frota, S. (1986). Developing basic conversational ability in a second language: A case study of an adult learner. In R. R. Day (Ed.), Talking to learn (pp. 237–326). Rowley, MA: Newbury House.
Schwartz, J. (1980). The negotiation for meaning: Repair in conversations between second language learners of English. In D. Larsen-Freeman (Ed.), Discourse analysis in second language research (pp. 138–153). Rowley, MA: Newbury House.
Seedhouse, P. (1997). The case of the missing “No”: The relationship between pedagogy and interaction. Language Learning, 47, 547–583.
Seedhouse, P. (1999). The relationship between context and the organization of repair in the L2 classroom. IRAL, 37, 59–80.
Sharwood-Smith, M. (1991). Speaking to many minds: On the relevance of different types of language information for the L2 learner. Second Language Research, 7, 118–132.
Sharwood-Smith, M. (1993). Input enhancement in instructed SLA: Theoretical bases. Studies in Second Language Acquisition, 15, 165–179.
Shehadeh, A. (1999). Non-native speakers’ production of modified comprehensible output and second language learning. Language Learning, 49, 627–675.
Sinclair, J. & Coulthard, M. (1975). Towards an analysis of discourse. Oxford: OUP.
Skehan, P. & Foster, P. (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1, 185–221.
Spada, N. (1997). Form-focused instruction and second language acquisition: A review of classroom and laboratory research. Language Teaching, 30, 73–87.
Spada, N. & Lightbown, P. (1993). Instruction and the development of questions in L2 classrooms. Studies in Second Language Acquisition, 15, 205–224.
Stivers, T. (2001). Negotiating who presents the problem: Next speaker selection in pediatric encounters. Journal of Communication, 51, 252–282.
Swain, M. (1985). Communicative competence: Some roles of comprehensible input and comprehensible output in its development. In S. Gass & C. Madden (Eds.), Input in second language acquisition (pp. 235–253). Rowley, MA: Newbury House.
Swain, M. (1995). Three functions of output in second language learning. In G. Cook & B. Seidlhofer (Eds.), Principle and practice in applied linguistics: Studies in honour of H. G. Widdowson (pp. 125–144). Oxford: OUP.
Swain, M. & Lapkin, S. (1995). Problems in output and the cognitive processes they generate: A step towards second language learning. Applied Linguistics, 16, 371–391.
Tarone, E. & Liu, G. (1995). Situational context, variation, and second language acquisition. In G. Cook & B. Seidlhofer (Eds.), Principle and practice in applied linguistics: Studies in honour of H. G. Widdowson (pp. 107–124). Oxford: OUP.
Tomasello, M. & Herron, C. (1988). Down the garden path: Inducing and correcting overgeneralization errors in the foreign language classroom. Applied Psycholinguistics, 9, 237–246.

VanPatten, B. (1988). How juries get hung: Problems with the evidence for a focus on form in teaching. Language Learning, 38, 243–260.
Varonis, E. M. & Gass, S. M. (1985a). Miscommunication in native/nonnative conversation. Language in Society, 14, 327–343.
Varonis, E. M. & Gass, S. M. (1985b). Non-native/non-native conversations: A model for negotiation of meaning. Applied Linguistics, 6, 71–90.
Wagner, J. (1996). Foreign language acquisition through interaction: A critical review of research on conversational adjustments. Journal of Pragmatics, 23, 215–235.
White, J. & Lightbown, P. M. (1984). Asking and answering in ESL classrooms. Canadian Modern Language Review, 40, 228–244.
White, L. (1987). Against comprehensible input: The input hypothesis and the development of second language competence. Applied Linguistics, 8, 95–110.
White, L. (1989). Universal grammar and second language acquisition. Amsterdam: John Benjamins.
White, L. (1991). Adverb placement in second language acquisition: Some effects of positive and negative evidence in the classroom. Second Language Research, 7, 133–161.
White, L., Spada, N., Lightbown, P., & Ranta, L. (1991). Input enhancement and L2 question formation. Applied Linguistics, 12, 416–432.
Willey, B. (2001). Examining a “communications strategy” from a conversation analytic perspective: Eliciting help from native speakers inside and outside of word search sequences. MA thesis, University of Illinois at Urbana-Champaign.
Williams, J. (1999). Learner-generated attention to form. Language Learning, 49, 583–625.
Yule, G. & McDonald, D. (1990). Resolving referential conflicts in L2 interaction: The effect of proficiency and interactive role. Language Learning, 40, 539–556.

Discussion

Generalizability

A journey into the nature of empirical research in applied linguistics*

Lyle F. Bachman
UCLA

In this response I discuss generalizability and some issues this raises about research in applied linguistics. I describe generalizability in terms of inferential links from an observation to a report, to an interpretation, to the uses of that interpretation. In considering how to articulate, and support with evidence, these links, three aspects of generalizability need to be addressed: consistency, meaningfulness and consequences.

Five dimensions of research can influence the ways in which generalizability is addressed by different researchers: (1) the researcher, (2) the entity of interest, (3) the context, (4) the observation and report, and (5) the researcher’s interpretation. Consideration of these dimensions can help capture the richness and complexity of the varied approaches to empirical research in applied linguistics.

A “research use argument” can guide research design and provide the justification for the interpretations and uses of research. Adopting an epistemology of argumentation would move us away from one that is driven by a preoccupation with our differences toward one that admits a variety of research methodologies. It is clear that the use of multiple approaches in research is both feasible and desirable, in that it expands the methodological tools that are at the researcher’s disposal, thereby increasing the likelihood that the insights gained from the research will constitute genuine advances in our knowledge.

Introduction

After having read these very thoughtful and stimulating chapters, I must say that my first reaction was one of being overwhelmed at the enormity of what we applied linguists are presuming to accomplish. When I consider the conceptual

conundrums over which we cogitate and the methodological mazes through which we meander, I’m inclined to advise my students to go into some simpler endeavor, something less complex and relatively straightforward, like rocket science. After all, launching an electronic explorer on a trajectory to rendezvous with a distant planet in 25 years’ time is a piece of cake compared to identifying the specific learning challenges for a given language learner, determining what kinds of language use activities will provide the most effective interactions for him or her, how a teacher can best implement these, and then assessing how much language that learner has learned after a program of instruction.

The questions that the chapter authors were asked to address had to do with generalizability of results, appropriateness of inferences, and dependability in the research process, and the chapters in this volume consider these notions from a variety of research perspectives within applied linguistics. Superficially, one might conclude that these notions are essentially the same across these research approaches, simply playing themselves out differently at the levels of methodology and analysis. However, probing more deeply, the consideration of these questions forces us to delve into the epistemological underpinnings of our various approaches to research; indeed, they go to the very core of our world views as researchers.

There are perhaps two propositions about which we all seem to agree. First, empirical research is aimed at the creation or construction of knowledge. Chapelle (this volume), for example, argues that “every study contributes individual pieces to the store of professional knowledge” (p. 48), while Duff (this volume) states that the “aim of research is to generate new insights and knowledge” (p. 66). Second, there seems to be general agreement that we create this knowledge by observing phenomena. Thus Duff (this volume) states that “research in education and the social sciences has long been concerned with the basis for inferences and conclusions drawn from empirical studies” (p. 65), while Swain (this volume) speaks of drawing “inferences from the data collected” (p. 97). Similarly, with respect to language assessment, McNamara (this volume) states that “it is fair to characterize all such research aiming at providing evidence for or against the legitimacy of inferences from test scores, both quantitative and qualitative, as empiricist” (p. 166). However, despite these points of agreement, the chapters in this volume demonstrate that there are considerable differences in what counts as “knowledge”, in how we go about creating it, and in the values and assumptions that underlie our different approaches to conducting empirical research.

The authors of the chapters in this volume address the issues of generalizability from a variety of perspectives, and use the term “generalizability” in a

number of different ways. Most often, the term is used to refer to the inferential link between an observation and an interpretation. However, it is also used to refer to an inference from the observation of a single case to a larger group, and sometimes it refers to the dependability or consistency of the observations themselves. Several of the authors point out the limitations of viewing consistency as a desirable quality of observations, and discuss other qualities, such as trustworthiness, credibility, verisimilitude, authenticity, and comparability, as being equally, if not more, important.

In this response, I discuss what I see as the nature of generalizability, and some of the issues this raises about the nature of research. I would hasten to point out that, as might be expected in a response chapter, there is some overlap between what I discuss here and the discussions of generalizability in the chapters in this volume. The chapters by Duff, Chapelle and McNamara, in particular, provide excellent overview discussions of many of the issues I address here. Thus, what I discuss in this response chapter is partly a selective discussion of issues they raise. At the same time, I believe that what I present here reflects a slightly different perspective on issues of generalizability from the views presented in these chapters, as well as in the other chapters in this volume.

I first lay out a broad view of generalizability in terms of logical links from observed performance to an observation report, to an interpretation, to the use and consequences of that interpretation. I argue that in addition to considering generalizability as consistency, or reliability, across observations, and as the meaningfulness, or validity, of the interpretations we make from these observations, we also need to consider the uses we make of our research results, and the consequences of this use for various individuals who may be affected by these uses.

I then discuss five dimensions of the research process that I believe determine the ways in which these different aspects of generalizability are addressed by different researchers and in different research studies. I argue that these dimensions need to be accounted for or at least considered when we address the generalizability of our research results. I also argue that these dimensions can help capture the richness and complexity of the varied approaches to empirical research in applied linguistics, and provide a way of getting beyond the “quantitative-qualitative, positivist-constructivist” division that has long been recognized as an oversimplification of the diversity of research approaches in applied linguistics. I then describe a “research use argument” and propose this as a basis both for guiding the conceptualization or design of research studies in applied linguistics, and for providing the justification for interpretations

JB[v.20020404] Prn:1/02/2006; 9:50 F: LLLT1209.tex / p.4 (207-279)

Lyle F. Bachman

and uses of research results. Finally, I offer some observations on the viabilityof combining multiple approaches to research in applied linguistics.

What is the nature of “generalizability”?

In most empirical research in applied linguistics, we are interested in linking observations of phenomena to interpretations. That is, we want to attach meaning to our observations. In our field, these observations are typically of language use, or the performance of language users. In addition to attaching meaning to an observation of performance, in much applied linguistics research this meaning, or interpretation, is used, sometimes by the researcher, sometimes by others, to make a generalization or a decision, or to take an action that goes beyond the observation and its particular setting. Thus, in much of our research in applied linguistics, we need to make a series of inferences that link performance to use. The inferential links between a performance and our use of that performance are illustrated in Figure 1, after Bachman (2004a).

The observed performance in the bottom box in this figure is an instance of the observable phenomenon upon which we are focusing our research. This performance might be, for example, a conversation, a “collaborative dialogue”, a group interaction among members of a speech community (e.g., family, classroom, applied linguists at a conference presentation), a grammaticality judgment, or performance on an assessment task. The observation report, in the next box up, is the initial extraction, or filtering, by the researcher, of the performance that is observed. This observation report might consist of, for example, a score, a verbal description, a transcription, a picture or a video clip. This observation report is inferred from the observation. This is because, in order to arrive at the observation report, the researcher makes decisions about what to observe and how to record it. These decisions “filter” both the performance to be observed out of all the possible performances that might be observed, and the observations themselves, since even the most detailed observation report will necessarily reflect the selective perceptions of the researcher. Other researchers might make different decisions, filter the performance differently, and hence arrive at, or infer, different observation reports.

The researcher arrives at the observation report by deciding what to observe and by deciding to use observation and analytic procedures that are consistent with and appropriate for the particular methodology that is employed. These decisions about what and how to observe, and what and how to transcribe or report, will be informed by the researcher’s own methodological and theoretical perspectives, or biases (Ochs 1979). With a conversation analysis, this methodology will include the way in which the particular conversation is selected and the medium with which it is recorded, as well as the procedures employed to transcribe the details (e.g., words, non-verbal sounds, pauses) of the conversation. With an ethnography, this will consist of the reason for selecting a particular group for observation, the roles of the observer and the other participants, the medium with which the interaction is recorded, and the procedures for transcribing or otherwise making field notes about the verbal and non-verbal information in the interactions. For a language assessment, the methodology will consist of how the assessment developer has defined the ability or attribute to be assessed, the kinds of assessment tasks that are presented to the test takers, the administrative procedures, and the scoring method used. The rigor with which the observation and analytic process are described and followed by the researcher provides the primary justification for the link between the performance and the observation report.

[Figure 1 diagrams this chain as a series of boxes, from bottom to top: Observed Performance; Observation Reports 1, 2, … n (e.g., score, verbal description, transcription, audio or video recording); Interpretation; and Use: Generalization, Decision or Action. The successive links are labeled Consistency (Reliability), Meaningfulness (Validity), and Consequences (Impact), grouped under the overall label Generalizability.]

Figure 1. Inferential links from an observation to use (after Bachman 2004a:725)

If we assume that a given observation report is but one instance of a number of possible observation reports based on the performance, then we might be interested to know whether this particular report might generalize to a theoretically infinite universe of all such reports, as illustrated in the three boxes in the middle of the figure. Other similar reports might be transcriptions of a conversation by different researchers, verbal descriptions of an interaction by different observers or participants in a speech event, or scores obtained from different forms of a test aimed at measuring the same ability. Generalizability in this sense is related to consistency of observation reports, which is often associated with the quality of reliability.

The next box up in the chain is the interpretation, or the meaning that we want to infer from our observation report. This interpretation might consist of the recognition of a recurring pattern in the observations, to which the researcher might assign a label, such as “adjacency pair”, “knowledge of L2 vocabulary”, or “the ability to use the existential ‘there’ appropriately in English”. Inferring meaning from an observation report involves a broader type of generalization: from very concrete verbal descriptions or narratives, visible images, or numbers, to an inferred meaning, which will generally be an abstract concept. Generalizability in this sense pertains to the meaningfulness, or validity, of the interpretation.

The top box is the use the researcher or consumers of the research results make of the interpretation. The inference from interpretation to use typically consists of a generalization to a larger group or to other situations, or a decision, or an action or implication for action, that the researcher or consumers of the research results make on the basis of the interpretation. Generalizability in this sense pertains to the consequences, or impact, of the use of the interpretation.

In summary, I would argue that in applied linguistics research, the links between a performance, an observation report, an interpretation, and a use consist essentially of a chain of logical inferences, or generalizations, from one level to the next. I would also argue that each inference in the chain must be supported both by logical arguments and by empirical evidence in support of those arguments.1 I further argue that the inferential and evidentiary burden increases as the researcher’s purpose moves from that of describing the performance in the observation report, to interpreting that report, to using this interpretation in some way.

If the researcher’s objective is to provide a careful and detailed description of the phenomenon, she will need to consider the consistency, or reliability, of that description, with respect to how it might differ from those of other researchers. Two conversation analysts, for example, might differ in how they transcribe particular details of a conversation, and different participants in a speech event might provide different reports of that event. Similarly, scores for an individual from two different elicitations of grammaticality judgments, or from two different forms of a language test intended to assess the same aspect of language ability, might differ. The extent to which such inconsistencies are considered problematic, and how they are resolved, differ across the different research approaches that are commonly used in applied linguistics. In some approaches, inconsistencies may be regarded as beneficial, in that they provide differing perspectives. The building up of multiple observations into a thick description in ethnography, for example, enables the researcher to report both commonalities and differences in the observations of different participants. In other approaches, such as experimental or quasi-experimental designs, or language assessment, lack of reliability is generally considered to be a problem, and various statistical procedures have been developed to estimate this and to minimize it.

If the researcher’s purpose is to generalize beyond a particular observation, to interpret the observation report in a way that gives it meaning within the framework of an existing body of research or a theory, she needs to consider both the consistency of the observation report and the meaningfulness of her interpretation. She will need to articulate a logical argument that she can, indeed, generalize from the observation report (or from multiple observation reports) to the intended interpretation, and then provide evidence to support this argument. The links described above provide the scaffolding she needs for developing a logical argument that the meanings she attaches to her observations are valid. The links also provide a guide for identifying the kinds of evidence she needs to gather in support of this argument. As with consistency, the way the researcher provides convincing support for the meaningfulness of interpretations varies from one research approach to another. In some approaches, this consists essentially of achieving a well-grounded interpretation based on multiple sources of information, while in others, it may also include the statistical analyses of quantitative (numerical) data.

Finally, if the purpose of the research is to provide a basis for a decision or an action, or if it is highly probable that the research results may have implications for actions that might be taken by the consumers of the research, then the researcher needs to consider the consequences of the decision or action. She, or the consumers of the research, will need to articulate a logical argument that the interpretation of the observation report is relevant to the intended use, that it is actually useful for the intended use, that the consequences of the intended use are beneficial, and that the probability of potential unintended adverse consequences is minimal.

In considering how to articulate, and support with evidence, the logical links between a given observation, an observation report, an interpretation of the observation report, and a use of that interpretation, there are three issues that need to be addressed: consistency, meaningfulness and consequences.

Consistency of observations and reports

Consistency has to do with the extent to which any given observation report provides essentially the same information, or generalizes, across different aspects, or facets, of the observation and reporting procedure (e.g., instances, events, observers, ethnographers, categorizers, analysts, raters). When considering the consistency of research results, we need to question both the desirability of consistency and the level of consistency that will be acceptable. Is consistency of observation reports necessarily desirable? It is well known that reliability is an essential quality of measurements. If a score is not a consistent measure of anything, how can it be a measure of something? However, as Swain (1993) has pointed out, concerns with internal consistency may limit the capacity of measures for providing information about the construct of interest.2 In the literature on writing assessment, some researchers view different “readings” of a composition by different raters as a positive outcome (Weigle 2002). Likewise, the “rich description” of ethnography is built up by bringing together multiple perspectives of a given event, all of which are likely to vary in some details, as well as in their emphasis or stance. So whether consistency is viewed as desirable or necessary will reflect the researcher’s approach and perspective, as well as her purpose. If the researcher’s purpose is to interpret different observation reports, or scores, as indicators or descriptions of essentially the same construct, as is the case with language assessment, then consistency among reports is essential. If, however, the purpose is primarily to describe the phenomenon in all its richness, as in conversation analysis and ethnography, then variations in reports may be viewed as adding to and enhancing the description of the phenomenon.3

If consistency is considered desirable or necessary, then another question that we need to address is, “What level of consistency is acceptable?” How consistent, and in what ways, do the observations, or data collection procedures, need to be? How consistent, and in what ways, do test scores, transcriptions, ethnographies, analyses, or reports of different observers need to be, in order to be reliable, trustworthy, or credible? The answer to this question will depend, of course, on the importance and types of decisions or actions to be taken. If one is making potentially life-affecting decisions about large numbers of people, as in high-stakes testing, we need to have highly reliable measures. However, if we are testing a theoretical hypothesis in research, how consistent do our reports need to be? What if we are primarily interested in providing a description of a particular phenomenon? How trustworthy does this description need to be? The way we answer these questions will determine how we view and operationalize generalizability as consistency in our research, whether this is in the form of a statistical estimate of reliability or a consensus of researcher observations.

Meaningfulness of interpretations

When a researcher interprets her observations, she is, in effect, attaching meaning to these with respect to some domain of prior research or experience, or a theoretical position, in the field. Thus, the research may enrich her understanding of general processes involved in interaction, or of co-constructing discourse, or it may lead to a new generalization about these. But while the researcher may feel very confident about the meaningfulness of her interpretation, how do other researchers evaluate her claim of meaningfulness? Duff (this volume) provides an excellent discussion of meaningfulness, or validity, in quantitative and qualitative research, and I will thus provide only a brief summary of these issues here.

Meaningfulness in quantitative research

Within quantitative approaches to research, meaningfulness has been conceptualized from two perspectives: research design and measurement. In a classic article, Campbell and Stanley (1963) describe two types of meaningfulness, or validity, in experimental and quasi-experimental designs: external validity and internal validity. External validity is what might be called generalizability or extrapolation from the results of a study based on a sample to a population or to other similar settings. External validity thus has to do with the extent to which the research results generalize beyond the study itself. Internal validity, on the other hand, is the extent to which the causal connection between treatment and outcome that is inferred from the results is supported by features of the research design itself (e.g., randomization, distinctiveness of treatments, non-interaction between treatments). In measurement, validity refers to the meaningfulness and appropriateness of interpretations (Messick 1989). In the current “standard” paradigm for measurement, validity is seen as a unitary concept, and evidence in support of the validity, or meaningfulness, of interpretations can be collected in a variety of ways (e.g., content coverage, concurrent relatedness, predictive utility).4 Much current research in language testing is conducted within this paradigm, but as McNamara (this volume) points out in his chapter, this view of validity is quite narrow, in that it is based on essentially quantitative, psychometric methods. However, recent argument-based approaches to conceptualizing validity in educational measurement have taken a much broader view of validity, and provide for the inclusion of multiple types of evidence, both qualitative and quantitative, to support a validity argument (e.g., Kane 2001, 2004; Kane et al. 1999; Mislevy 2003; Mislevy et al. 2003). Bachman (2005) has extended this argument-based approach so as to link not only observations with interpretations, but also to provide a logical means for linking interpretations to intended uses – the decisions or actions that may be taken on the basis of interpretations – and the consequences of these decisions or actions.

Meaningfulness in qualitative research

Within the approaches to research that are broadly referred to as “qualitative”, meaningfulness, or validity, is also of concern, although the way it is conceptualized and implemented in practice differs, understandably, from that of quantitative research design and measurement. As with quantitative research, validity within the qualitative research tradition has been conceived from several different perspectives.5

Kirk and Miller (1999) define the issue of validity in qualitative research as “a question of whether the researcher sees what he or she thinks he or she sees” (p. 21). In a similar vein, Lynch (1996), citing Maxwell (1992), defines validity as “the correspondence between the researcher’s ‘account’ of some phenomenon and their ‘reality’ (which may be the participant’s constructions of the phenomena)” (p. 55). Lynch discusses several different types of validity, as defined by Maxwell: descriptive validity (the factual accuracy of the research account), interpretive validity (how accurately the account describes what the phenomenon means to the participants – excluding the researcher), theoretical validity (how well the account explains the phenomenon), generalizability (internal generalization within the group being studied and external generalization to other groups) and evaluative validity (how accurately the research account assigns value judgments to the phenomenon) (Lynch 1996:55). Drawing on the research in naturalistic inquiry (Guba & Lincoln 1982), Lynch also discusses validity in terms of “trustworthiness criteria”: credibility, transferability, dependability and confirmability (1996:56). Lynch describes a variety of techniques for assessing and assuring validity in qualitative/naturalistic research (e.g., prolonged engagement, persistent observation, negative case analysis, thick description, dependability audit and confirmability audit) (p. 57). Lynch also discusses the use of triangulation, which he describes as “the gathering and reconciling of data from several sources and/or from different data-gathering techniques” (Lynch 1996:59). Citing the limitations of the navigational metaphor of triangulation, Lynch discusses another metaphor for this approach, attributed to Mathison (1988): that of detective work. In this metaphor, the researcher sifts through the data collected from multiple sources, looking not only for agreement, but also searching for examples of disagreement. In this way, triangulation is not a confirmation of meaningfulness through the collection of research results that converge, but is rather “the search for an explanation of contradictory evidence” (Lynch 1996:61).

In summary, meaningfulness, or validity, is of paramount interest and concern in both quantitative and qualitative/naturalistic approaches to research. For both research approaches, the essential question is the extent to which the researcher’s interpretation of, or the meaning he attaches to, the phenomenon that is observed can be justified in terms of the evidence collected in the research study. In evaluating the meaningfulness of his interpretations, the researcher is, in effect, constructing a logical argument that his interpretation is justified, and collecting evidence that is relevant to supporting or justifying this interpretation. While the logic and structure of the validity argument may well be consistent irrespective of the research approach used, the types of evidence collected and the methods for collecting these will vary considerably across different research approaches, and will be highly specific to any given research study.

Consequences of decisions or actions for stakeholders

The research that we conduct as applied linguists may lead to a decision or action, based on the meaning that is attached to the observation. These decisions or actions will have an impact on, or consequences for, different groups of stakeholders.6 If, for example, the interpretation of our observation report leads to placing in question a particular hypothesis or proposition that might be part of a theory or widely held view in the field, it may result in a reformulation of the theory or view, or in its rejection by the field. This, in turn, may affect the research that is subsequently conducted, and that gets published by other researchers, or the ways in which the results of other research are interpreted. In much of our research as applied linguists, however, the intended use of the research is to find solutions to real-world problems. This applied purpose is explicit in Larsen-Freeman’s intended uses of grammatical explanations that are derived from data, for example, “to inform the identification of the language acquisition/learning challenge of language students” (Larsen-Freeman this volume, p. 1). It is also clearly implicit in Duff’s statement, “while the research may speak to issues of how they [language learners] would engage in one particular type of task, it does not shed light on how curriculum can be developed linking such tasks in meaningful, educationally sound ways” (Duff this volume, p. 6).

Thus, in much applied linguistics research, the use we make of an interpretation takes the form of an action, or an implication for action, in the real world. These actions will have consequences for individuals beyond those whose performance was part of our research. For example, language educators may use our interpretations of classroom interactions, or of performance on grammaticality judgments, to implement changes in language curricula and pedagogical practice. Or, a language program administrator might interpret a score on a test as an indicator of high language aptitude and on this basis place students with high test scores into an intensive language course. Or, test users may use our interpretations of performance on a language test to make decisions about certifying individuals’ professional qualifications, about issuing them a visa, or about hiring them. When our interpretations are used to make decisions or take actions beyond the research itself, we need to extend our notion of generalizability to include the consequences of the way we use our results. That is, we need to consider the impact of our research on those who will be affected by our decisions or implications. This is of particular importance when the lives of other individuals, such as language learners, teachers or students, or specific groups of language users, may be affected by the uses that are made of our research results.

The researcher’s responsibility for consequences

If the results of our research do, indeed, have consequences, or impact, beyond the research itself, then it would be reasonable to consider the question of the extent or limit of the researcher’s responsibility for the way research results are used. It is in the field of language assessment, where language testers are most often held accountable for the ways in which the results of their tests are used, that the issue of responsibility for consequences has been discussed most extensively. In the past decade or so there has been an expanding discussion of ethics and consequences in the language testing literature, and a full review of this is neither feasible nor appropriate here.7 Rather, I will focus on the discussions of responsibility that seem to be most relevant and generalizable to research in applied linguistics. Davies (1997a) and Hamp-Lyons (1997a) both discuss the language tester’s (researcher’s) responsibility for the way the results of a given test (research study) are used. Hamp-Lyons (1997b) argues that language testers have responsibility “for all those consequences which we are aware of” (p. 302), leaving unclear what responsibility the language tester has for making herself aware of the possible consequences of test use. Davies (1997a) similarly avoids the issue of the tester’s responsibility for anticipating possible misuses of tests when he states that language testers should be willing to be held responsible for “limited and predictable social consequences we can take account of and regard ourselves responsible for” (p. 336). But what about possible unpredictable consequences? Bachman (1990, 2004a) takes a more proactive stance toward the language tester’s responsibility for anticipating unintended consequences. Bachman (1990) argues that the language test developer should “list the potential consequences, both positive and negative, of using a particular test” and then “rank these [potential consequences] in terms of the desirability or undesirability of their occurring” (p. 283). Bachman (2004a) takes this a step further, arguing that specific statements about possible unintended consequences need to be articulated as part of an assessment use argument by the test developer. These discussions of consequences and the responsibility of the language test developer are, I believe, directly relevant to other kinds of research in applied linguistics.

If we accept this line of reasoning, then researchers in applied linguistics should articulate, as part of their research design, an argument supporting the potential uses of their research results, in which they indicate what they believe would be appropriate uses and beneficial consequences of their research results, and what uses they would consider inappropriate or likely to have negative consequences for different groups of stakeholders. It is currently quite common for funding agencies to require researchers to provide a statement of the possible impact of their research on various individuals, groups, or institutions. What is not so common is asking researchers to anticipate what the possible negative impacts of their research might be.

Dimensions of applied linguistics research

I have described generalizability as comprising consistency, meaningfulness and consequences, and have argued that this concept applies to all empirical research in applied linguistics. However, the way in which generalizability is considered will vary from one empirical study to another, and from one researcher to another. In order to better understand the ways in which issues of generalizability are addressed in any given empirical study, I believe it is useful to discuss such research in terms of five dimensions. I would argue that the way in which a particular study or researcher deals with generalizability is a function of where the research and researcher are situated with reference to these dimensions. The five dimensions that I will discuss are: (1) the observer/researcher, (2) the entity of interest, (3) the context, (4) the observation and report, and (5) the interpretation. These dimensions are summarized in Table 1.

Table 1. Dimensions of research in applied linguistics

I. The observer/researcher

A. What is the researcher’s view of the world and perspective on knowledge?
B. What is the researcher’s purpose for conducting the research?
C. Who are the researcher’s intended audiences?

II. The entity of interest: the “construct”

A. What are constructs and where do they reside?
B. Where do constructs come from?
C. Underspecification and grain-size

III. The nature and role of “context”

A. What is the relationship between the context and the entity/construct?
B. What is the range or scope of context?

IV. The observation and report

A. What counts as an “observation”?
B. What is the unit of analysis?
C. How is the observation reported?

V. The researcher’s interpretation of the observation

A. What is the researcher’s ontological stance towards the observation?
B. What is the relationship between observations and interpretations?

The observer/researcher

The researcher, as observer, will bring to the research enterprise, either explicitly or implicitly, a set of philosophical assumptions about the nature of the world (ontology), and about the nature of our knowledge of that world and about how we acquire that knowledge (epistemology). The researcher will also have a purpose for conducting the research, and one or more audiences to whom she intends to report her results.

What is the researcher’s view of the world and perspective on knowledge?

Researchers’ views of the nature of the entities that exist or that may exist differ in terms of how they see the relationship between the mind and matter – what has been called the “mind-body problem” (see citations below for monism and dualism). This problem can be articulated as a question: “How is it possible for two entities that appear to be distinct – one apparently physical and the other apparently not – to interact, as our minds seem to do with our bodies?” Stated more broadly, in terms of empirical research, the question becomes, “How is it possible for two distinct entities, the mind of the observer and the object of the observation, to interact, as researchers seem to do with the phenomena they observe?” The way the researcher answers this question leads to two very different ontologies: monism and dualism. Monism holds that mind and matter are essentially the same. This view has been associated with philosophers such as Parmenides, Heraclitus, Democritus, the Stoics, Spinoza, Berkeley, Hume, and Hegel, and has been articulated more recently in the work of Churchland (1996), Damasio (1994) and Dennett (1991). The opposing view is that of dualism, which holds that mind and matter are fundamentally different. This view has been associated with philosophers such as Plato, Kant, and Descartes, and has been articulated more recently in the work of Chalmers (1996) and Hart (1988).8

The ontological view that the researcher adopts has clear implications for epistemology as well. One implication of a monist view for research is that there is no “objective reality” that is distinct from the researcher and that can form the basis for our understanding of the world. Thus, the researcher is necessarily a part of the phenomena he wants to investigate. Since each observer may experience his own reality, a corollary of this view is that the researcher accepts the possibility of multiple realities, all of which are equally “true”. A dualist ontological view, on the other hand, leads to the epistemological position that the researcher is separate from an objective reality, and that this reality can be discovered by the researcher through the process of observation.

JB[v.20020404] Prn:1/02/2006; 9:50 F: LLLT1209.tex / p.16 (802-855)

Lyle F. Bachman

Another distinction that is often discussed in research is that between “etic” and “emic” perspectives. An emic perspective is the “insider’s” perspective, associated with a participant-oriented approach to knowledge construction.9

An etic perspective, on the other hand, is the “outsider’s” perspective, and is typically associated with a researcher-oriented approach to knowledge construction. This distinction has implications for epistemology, if we extend it to claims about knowledge. Conceived in this way, “emic knowledge” represents claims or interpretations expressed in terms that are considered meaningful and appropriate by the members of the particular group that is being observed. The researcher’s attention to the emic perspective on knowledge is illustrated, I believe, by Markee’s chapter in this collection, when he states explicitly that “the warrant for any analytic claims that are made about how ordinary conversation and institutional talk are organized must be located in the local context of participants’ talk, and explicated in terms of what members understand each other to be doing at the time that they are doing it” (p. 12). “Etic knowledge”, on the other hand, is the “outsider’s” perspective, and represents claims or interpretations expressed from the perspective of accepted practice and terminology in a particular research community. Generalizability of emic knowledge depends largely on consensus among the participants in the speech event, who generally seek to agree that the interpretation is consonant with their shared understanding. Generalizability of etic knowledge, on the other hand, is supported through logical argumentation and the collection of supporting evidence, ideally from multiple sources.

Both perspectives, it seems to me, are illustrated in Swain’s chapter. The emic perspective is evident in her use of stimulated verbal protocols to elicit participants’ own perceptions of reformulations of stories they have written. The etic perspective, on the other hand, seems to inform her interpretations of these verbalizations as instances of language use – speaking, in this case – mediating language learning. Similarly, both perspectives are implicit in Duff’s statement that qualitative research in applied linguistics “seeks to produce an in-depth exploration. . . in some cases, of participants’ and researchers’ own positionality and perceptions with respect to the phenomenon” (p. 8).

What is the researcher’s purpose for conducting the research?
Research is conducted for a range of purposes. One purpose is to describe the phenomenon so as to expand our knowledge or understanding of it. Duff, for example, drawing on Stake’s work, says that the purpose of an intrinsic case study is “not to come to understand some abstract construct or generic phenomenon. . . The purpose is not theory-building” (p. 26). Rather, the intrinsic case study is of interest in its own right. Another purpose is to induce one or more general statements that might provide a basis for linking the phenomenon observed to other, similar phenomena, or to the phenomenon observed in other contexts, or in other individuals or groups. Here, the researcher goes beyond a description of the phenomenon itself, and attempts to attach meaning to it by generalizing to other individuals or groups of individuals, or to other contexts. An example of this inductive purpose, I believe, can be seen in Duff’s depiction of the goal of applied linguistics research: “Rather than understand the phenomenon in terms of its component parts, the goal is to understand the whole as the sum or interaction of the parts” (Duff this volume, p. 8). Another purpose of research is to explain the phenomenon, or predict its occurrence or operation in other contexts. This generalization may or may not be related to a particular theory about the phenomenon, but might provide a basis for theory-building. When explanation or prediction is the researcher’s purpose, the phenomena to be observed are typically identified on the basis of inductive generalizations from previous research or experience. Another purpose is to use the interpretation to inform a theory about the nature of the phenomenon. Chapelle (this volume), for example, discusses linking scores on a vocabulary test to “a broader theoretical construct framework of vocabulary knowledge” (p. 9). In some research, this linkage may entail decisions about the extent to which the theory itself is either supported or placed into question. As McNamara (this volume) points out, “investigation of the relationship between test construct [theory] and test performance cuts both ways: test data. . . can also be used to question the constructs on which our inferences have been based” (p. 8). If theory falsification is the researcher’s purpose, the specific phenomenon to be observed will be selected on the basis of the kinds of hypotheses that can be deduced from the theory.

In addition to these various specific purposes, the researcher may have an overarching goal in conducting the research. One goal, which is typically associated with the publication of research results in scholarly journals, is to share the research results with the community of scholars. Since research is never a neutral, objective undertaking, an implicit goal in much published research is to convince other researchers that the findings of the research provide important insights into the phenomenon. Another goal, which is typically associated with the “applied” aspect of our discipline, is to use the research results to influence decisions or actions in the “real world” about people or programs. This was discussed above in connection with consequences of research.


Who are the researcher’s intended audiences?
When we report the results of our research, the way we report it will depend, to a large extent, on the audience with whom we want to communicate. One audience consists of other members of a particular community of researchers who have been conducting research in the same area as the study we wish to report. Published reports or articles for this audience are likely to be the most focused, and include the most detailed information about the design and procedures followed in the study, the kinds of observations made and reported, and how these were analyzed and interpreted. Articles for this audience are typically published in journals that focus on a specific area of applied linguistics, such as language assessment (e.g., Language Testing, Language Assessment Quarterly), second language acquisition (e.g., Language Learning, Studies in Second Language Acquisition, Second Language Research), or discourse analysis (e.g., Research on Language and Social Interaction, Discourse Processes, Journal of Pragmatics). Another audience consists of the members of a larger community of scholar/researchers. Articles for this audience may address the broader implications of the research results, and are typically published in journals that publish research from a range of areas within a field (e.g., Applied Linguistics, International Review of Applied Linguistics in Language Teaching, Issues in Applied Linguistics, TESOL Quarterly).

In addition to the various research communities with whom we share our research results, there is another, more public, audience, who might be referred to as the “consumers” of our research. These include people and agencies that are in a position to make decisions in the real world on the basis of our research. This audience includes individuals who set educational policy at various levels, as well as practitioners; both of these groups may want to use research results to inform policy or practice. In some cases, research results may inform public opinion and influence actions taken by the society at large.

Finally, the results of our research, in the form of publications, research reports and grant applications, are evaluated by committees that consider our research “track record”. This audience includes agencies and committees that review grant proposals, as well as committees that evaluate our work with respect to granting tenure and promotion. Thus, any given research report that is published is likely to be read by more than one of these audiences.

The entity of interest: The “construct”

When a researcher observes some phenomenon in the real world, he generally does this because he wants to describe, induce or explain something on the basis of this observation. That something is what can be called a “construct”. When Swain (this volume), for example, asks the question, “What do verbal protocols represent?”, she is implicitly looking for an explanation of the phenomenon captured in her verbal protocols in terms of one or more constructs. Constructs are useful in research because they enable the researcher to distinguish the entities that he describes, induces or explains from other entities. I would argue that all research in applied linguistics entails the definition of one or more constructs, even if this is implicit. For example, if a researcher wants to describe a conversation, he needs to start with a definition of what distinguishes this from other types of oral language use, such as an interview, a lecture, or a TV news report. He may also decide to focus only on face-to-face conversations, as opposed to, say, phone conversations, and this, I would argue, also implies an operational definition of the construct, “conversation”. Or, if a researcher wants to measure an individual’s knowledge of vocabulary, he needs to define this construct in a way that will distinguish it from, say, knowledge of syntax and cohesion. When we define and specify what it is we intend to describe, induce or explain, we are, in effect, constructing a definition of this for the specific purpose of conducting research. And when we construct such a definition, we create a “construct”. Constructs, then, are the entities that we want to describe, induce, or explain, on the basis of our observations of phenomena. The meanings, or interpretations, that we infer from our observation reports will be stated in terms of these constructs. As will be seen below, the entities that researchers define as constructs can include the traits, or attributes, of the language users, features of the contexts, or interactions between language users and contexts.
I would note that Chapelle (1998, this volume) and McNamara (this volume) provide excellent discussions, from the perspective of language assessment, of the role of constructs in research, and of different approaches to defining them.

What are constructs and where do they reside?
The nature of constructs, and where these are believed to “reside”, are two issues that are controversial in many areas of applied linguistics. From the perspective of research methodology, the way in which the researcher views these issues is one dimension that characterizes her research approach, and will inevitably be related to her view of the world, as discussed above. Thus, from a monist ontology, the researcher would view the construct as part of a reality within which she is included, while from a dualist perspective, the researcher would view the construct as part of an objective reality that is essentially independent of herself, as researcher. Drawing on Messick (1981), and working within an implied dualist ontology, Chapelle (1998) has described three alternate approaches to defining constructs, in terms of how researchers interpret consistencies across different observations of phenomena. In a trait perspective, consistencies across observations are attributed to characteristics of language users, typically as knowledge, ability, attitudes, feelings, or cognitive processes. In a behaviorist perspective, consistencies in observations are attributed to characteristics of the context. Finally, in an interactionalist perspective, consistencies across observations are interpreted as reflecting the interactions between traits and contexts.

Trait perspective. From a trait perspective, the researcher would view the construct as a characteristic of the individual language user that can be inferred from consistencies in performance across different observations. From this perspective, the construct “resides” in the individual language user. Conversational turns, adjacency pairs, and repairs in conversation analysis, for example, could be viewed from a trait perspective in terms of the capacity that individual language users have that enables them to perform these aspects of conversational interaction. This is the perspective that Bachman and Palmer (1996) take, for example, when they include “knowledge of conversational organization” as an area of language knowledge. Similarly, Bachman and Palmer view constructs such as morphosyntax, the semantics and pragmatics of a grammatical structure, or patterned use of grammatical features in texts (Larsen-Freeman this volume) as areas of language knowledge that reside in the language users.

Behaviorist perspective. From a behaviorist perspective, the researcher would identify the construct with the consistencies in the phenomena that are observed, in which case the construct comprises the characteristics of the phenomena, or could be said to “reside” in the phenomena themselves. Conversational turns, adjacency pairs, and repairs in conversation analysis, for example, could be viewed from a behaviorist perspective essentially as consistencies in the phenomena that the conversational analyst observes. Thus, when Markee (this volume) points out that the construct, “conversational repair”, occurs more or less frequently, depending, for example, on the type of task and the types of interactants, it suggests that this construct is a characteristic of conversations, which would imply a behaviorist perspective. Similarly, morphosyntax, the semantics and pragmatics of a grammatical structure, or patterned use of grammatical features in texts (Larsen-Freeman this volume) could be viewed from a behaviorist perspective as constructs that reside in the language use that is observed by the researcher, through introspection, direct observation, or through the analysis of language corpora.

Interactionalist perspective. From an interactionalist perspective, conversational turns, adjacency pairs, morphosyntax, the semantics and pragmatics of a grammatical structure, and so forth, could be seen as constructs that are the products not solely of either the interactants or of the context, but of the interaction between language users and the context. These constructs might be induced through generalization from the observation and analysis of a wide range of conversational or textual discourse phenomena, and can thus be expected to occur in conversations and other discourse in general. Thus, constructs such as these can also be viewed from an interactionalist perspective as abstract constructs that the conversation or grammar analyst induces from the observations of language use.

Any particular researcher may, either implicitly or explicitly, adopt one of these three perspectives, which will then inform the way he interprets his observations. However, it is important to also understand that different researchers may view essentially the same phenomena from different perspectives and thus interpret them in different ways. Thus, while Bachman and Palmer (1996) may make an inference about the language ability of language users from consistencies in observations of conversational repairs in an oral interview or a group discussion task, it would appear that most conversational analysts would view these consistencies as aspects of the conversations in which they occur. Similarly, it would appear that Larsen-Freeman is interpreting the phenomena she observes in language use as qualities of the texts or discourses in which they occur, rather than as areas of knowledge that language users have.

A variety of constructs, in addition to those mentioned above, are discussed in the chapters in this volume. Coming largely from the field of language assessment, Deville and Chalhoub-Deville, Chapelle, and McNamara provide differing perspectives on how the construct “language ability/proficiency”, or some aspect of this, might be defined. Deville and Chalhoub-Deville first describe the “standard” definition of language ability as a fairly stable ability within an individual. Defining constructs this way is essentially the trait perspective on construct definition. However, Deville and Chalhoub-Deville go on to point out the person by task/situation interactions that are typically found in research in measurement, and propose a different, interactionalist, construct, “ability-in-language user-in-context”. This construct “does not reside in the head of the test taker, but is bound inextricably to the interaction of the person and the task/context” (p. 14). Chapelle (this volume) discusses several different dimensions, or traits, that have been proposed for the construct, “vocabulary knowledge”, e.g., vocabulary size, organization of the L2 lexicon, knowledge of morphological characteristics, and vocabulary strategies, any or all of which might be included in a broad theory of vocabulary knowledge. She also alludes to a construct, “vocabulary ability”, which would include both vocabulary knowledge and “the processes for putting the knowledge to use” (Note 1), and which might be considered an interactionalist definition of this construct. McNamara provides a much more general discussion of the relationships among constructs, a “criterion” domain, and test performance. Drawing on different theoretical perspectives on the relationship between cognition and language, Swain (this volume) interprets the phenomena she observes – language learners’ verbal protocols of reformulating a story they have written in the target language – in terms of a construct, “mental forms of activity” (p. 13).

Where do constructs come from?
Researchers arrive at construct definitions in a variety of ways. As I’ve argued above, constructs are always implied by what, where, when and how the observer chooses to observe – by where the researcher chooses to point her camera, and what kind of camera she chooses to use, so to speak. More explicitly, constructs may be induced by generalizing from experience or observation. That is, constructs can be derived from the data. Markee’s discussion of the construct, “conversational repair”, is, I would argue, an example of a construct that is induced from observation. Another source of constructs is theory. Thus, McNamara (this volume) asserts, at the beginning of his chapter, that in language assessment, “reasoning about the construct. . . draws on theories of language knowledge and language performance, which are essentially cognitive” (p. 6). In the latter part of his chapter he argues that language testers need to consider the possibility of expanding the ways in which constructs are defined to include both the socio-interactional aspects of language ability, as well as the political ends which language assessments are used to support.

Swain’s (this volume) chapter provides an enlightening example of the role of theory in defining constructs, and of how the results of empirical observation can serve to clarify theory. Swain traces the implications of two differing theories of mind – cognitivist and sociocultural – for how one would interpret the results of verbal protocols of learners’ reformulations of a story, as reported in a collaborative dialogue. What I find most insightful, in terms of generalizability, about Swain’s analysis is that although the two theoretical perspectives are clearly distinct, in terms of how they view the relationship between language and mind, and the claims they might make about this, she concludes that “the inferences one anticipates drawing from verbal protocols are not dissimilar across these two theoretical perspectives” (p. 17). From this she reaches a higher generalization about the similarity between these two theoretical perspectives on the mind: “Both aim to develop claims about the higher mental processes participants make use of in carrying out a specific task” (p. 17).

Underspecification and grain-size
Two issues arise in defining constructs: underspecification and grain size. The language phenomena that we observe in applied linguistics research are rich and complex. In addition, what we observe is often but a fleeting moment in the span of someone’s life of language use, or a single instance that is part of some larger phenomenon we want to describe or explain. Even though we may be able to capture this moment with a high degree of technical accuracy in video and audio media, our descriptions of the phenomenon will necessarily be limited in two ways. First, the observer may not perceive all of the elements of the speech event, no matter how many times he replays the video. Even if he were to perceive all the physical details, he may still miss the significance of some of these details as they are perceived by and affect the participants in the speech event. Because of this, any constructs that the researcher may induce from his observations may be underspecified, or incomplete, to some degree. That is, the researcher may unwittingly omit or overlook a particular feature that is of relevance to the description or explanation. Similarly, when a rater assigns a score to a test taker’s performance on a language assessment, she may inadvertently miss some critical aspect of that performance that might warrant a different score. Another rater might pick up on this aspect, while missing or ignoring other features of the performance.

Second, in much research, the researcher chooses to be selective, attending only to those features that he believes or hypothesizes to be distinctive in a way that is relevant to his research questions. Thus, in order to focus on the aspects of the phenomenon that are of interest, the researcher typically makes certain simplifying assumptions that determine, to a large degree, what he observes and when, where and how he makes his observations. When a conversation analyst or ethnographer chooses a particular speech event or community to observe, or when a functional grammarian chooses a particular linguistic structure to examine, and selects a particular linguistic corpus or elicitation procedure, she is selecting or simplifying the range of phenomena to be included in the research. Similarly, when a researcher extracts one or more scores from an individual’s performance on a language assessment, he is focusing on only those specific aspects of that performance that will enable him to make the kinds of interpretations about ability that he is interested in, and hence simplifying the range of language performance he needs to observe. Another example of selectivity, or simplification, is when the researcher places the limitations of experimental design on the phenomena and artificially manipulates the treatment, or experience, which research participants undergo. I thus would argue that virtually all research in applied linguistics involves selectivity or simplification, so that our descriptions of phenomena are necessarily incomplete and underspecified to some degree.

A second issue is that of grain-size, or the level of detail we capture in our descriptions of phenomena. How much detail does our description of the speech event need to include? If, for example, a researcher is interested in describing the acoustic features of speech, she will need a very high-quality recording and precise instruments to display the acoustic characteristics of the sound waves that are generated in the speech event. Here, she may need to measure time in a way that is accurate to nanoseconds. Another researcher, on the other hand, might be interested in doing a conversational analysis of speech. This researcher can most likely forgo an acoustic analysis of the sound waves, and may find that timings in seconds are sufficiently accurate, but will nevertheless want a very accurate transcription of the speech, including pauses and non-verbal utterances. A third researcher might be interested in understanding the strategies that a test taker reports using while taking a speaking test. In this case, she might not transcribe the speech at all, but might make notes about what processes or strategies the test taker reports using. To give another example, if a researcher wanted to know how well a student can use grammar in writing, it might be sufficient to obtain a single rating of grammar based on a written composition. If, on the other hand, he wanted to diagnose areas of strength and weakness in grammatical knowledge, he would need to obtain more specific information about different aspects of grammatical knowledge, such as knowledge of propositions, tense and aspect markers, or agreement in grammatical gender. Thus, the grain-size – the amount and level of detail – that is included in the researcher’s description will depend, to a large extent, on how broadly or narrowly he defines the construct.

The nature and role of context

“Context” is a comfortable, albeit slippery, term in applied linguistics;10 we all use the term, and know what it means, yet we don’t all agree on precisely what, where or even when it is. If the researcher adopts a monist ontology, then the notion of context becomes largely irrelevant, since the observer, the entity, or construct to be described, induced or explained, and the context are all part of the same phenomenon, or reality, and it makes little sense to try to delineate where one ends and the other begins.11 This monist view can be illustrated as in Figure 2 below.

Entity/Construct    Context

Figure 2. Monist view: Observer, the entity/construct and the context all part of phenomenon

It is this ontological view that is implicit, I believe, in ethnographic research that aims at providing a rich description, or ethnography, of the phenomenon, and takes an “emic” view of this. In such research the entity/construct and the context are essentially indistinguishable, and the researcher’s role is that of “insider”, or participant, who is part of the phenomenon to be described. This view seems to be implicit in much research in sociolinguistics, and is the source of the observer’s paradox, the methodological conundrum discussed by Labov (1990).

Much research in applied linguistics, however, is conducted within a dualist ontology, so that the observer is essentially extracted from the context and the entity that is of interest. In this view, the researcher makes observations of the phenomenon as it happens in an objective reality in order to come to some understanding of the entity of interest. Distinguishing the entity/construct from the context, however, as well as delineating the relationship between these, becomes potentially problematic. One often-cited approach to describing context is Hymes’ (1974) well-known “SPEAKING” acronym for remembering the features of a speech event (Setting, Participants, Ends, Act sequence, Key, Instrumentalities, Norms, Genres). In the area of language assessment, Bachman and Palmer (1996) have proposed a set of characteristics that can be used to describe language use tasks and language assessment tasks. Neither of these sets of characteristics, however, recognizes the dynamic, changing nature of context, as described by Erickson and Schulz (1981). From these perspectives, context is essentially the external setting, or situation, in which the phenomenon takes place, and in which the observer observes it. One of the most inclusive definitions of context I have seen is that of research in language education settings: “the larger sociopolitical or historical context (where relevant), the participants and their interests, the tasks or instructional practices used and the participants’ understandings or views of these, in some cases, and how the research itself, whether inside the classroom or in a research office of some kind, creates a special sociolinguistic context, system or ecology that is temporally as well as socially and discursively situated” (Duff this volume, p. 12, citing van Lier 1988, 1997). In a dualist ontology, context may thus be viewed as relatively static, or as dynamic and changing, but it is nevertheless clearly distinct from the observer. Within this dualist perspective, in whatever way we may choose to define context for a particular study, the critical issue that needs to be addressed is that of the relationship between that context and the entity to be described, induced or explained.

What is the relationship between the context and the entity/construct?
One dualist view of the relationship between the entity/construct and the context is that these are essentially inseparable. In this view, the researcher treats the language use activity or speech event as a unitary construct, describing both what the language users say and do and the features of the context or situation in which the speech event occurs. This view is illustrated in Figure 3 below.

This ontological view is implicit, I believe, in much ethnographic research that investigates the ways in which language functions in the socialization of children, and in the creation of a sense of community, through the co-construction of discourse. In such research, the entity and the context are essentially indistinguishable, and the researcher’s role is that of observer of the phenomenon (e.g., Ochs 1988; Ochs & Capps 2001).

Entity/Construct    Context

Figure 3. Dualist view: Observer independent of the phenomenon; entity/construct and context not distinguished

Another dualist view would be to treat the entity/construct and the context as separate and essentially unrelated, in which case the research focuses solely on describing the entity/construct itself, independently of the context. This is the perspective that characterizes “trait” approaches to defining the construct. This view is illustrated in Figure 4 below.

Entity/Construct    Context

Figure 4. Dualist view: Entity/construct exclusive of context

An example of this would be giving a language assessment or eliciting grammaticality judgments in which the tasks are entirely decontextualized, and then scoring the research participants’ performances on these tasks and interpreting these scores independently of either the context in which they have responded to the task, or the language use context or linguistic structure to which the researcher wants to generalize.

A much more typical dualist view in applied linguistics research, I believe, is an interactionalist one. This view entails the presupposition that the entity/construct that is of interest exists or happens within a context, and that these interact in ways that mutually affect each other. In this view, it is the interaction itself that is the entity of interest. This dualist interactionalist view is illustrated in Figure 5 below.

An example of this interactionist view, I believe, is Swain’s (this volume)analysis of verbal protocols, which, she concludes, need to be considered aspart of the phenomenon – the interactions among participants in the speechevent and the cognitive changes that these bring about. This perspective alsounderlies the interaction hypothesis in SLA research (discussed by Markee thisvolume). Another example of an interactionalist perspective on context is thatproposed by Douglas and Selinker (1985), who have argued that in additionto context as an external milieu in which speech events happen, and withwhich individual language users interact, language users utilize what Douglas

JB[v.20020404] Prn:1/02/2006; 9:50 F: LLLT1209.tex / p.28 (1350-1380)

Lyle F. Bachman


Figure 5. Dualist interactionalist view: Entity/construct and context separate but interact with each other

and Selinker call discourse domains, which are essentially personalized internal contexts that are created through language use.

The way a researcher views the relationship between the entity of interest and the context will be influenced by the purpose of her research, her ontological view and her approach to construct definition, as discussed above. These are illustrated in Figure 6.

[Figure 6 diagram, in three columns:
Trait – purpose: infer from or explain phenomenon; ontological view: dualist; attribute (ATR) and context (CTX) separate.
Interactionalist – purpose: infer from or explain phenomenon; ontological view: dualist; ATR and CTX distinguishable but interacting.
Behaviorist – purpose: describe phenomenon; ontological view: monist or dualist; ATR/CTX indistinguishable.]

Figure 6. Differing purposes and approaches to construct definition (ATR – attribute; CTX – context)

A trait approach to construct definition, illustrated in the leftmost part of Figure 6, implies a dualist ontology, in which the researcher observes consistencies in observations as indicative of an attribute (e.g., language ability) that is a characteristic of individual language users, and that is essentially independent of the context in which language use takes place. This is typically associated


Generalizability

with the purpose of inferring from or explaining or predicting phenomena. As pointed out by Deville and Chalhoub-Deville (this volume) and McNamara (this volume), this clear distinction between individual ability and context has been the dominant approach, historically, for defining constructs in the area of language assessment, where the purpose is typically to explain test performance in terms of underlying abilities, or to predict some future performance.

In a behaviorist approach to construct definition, illustrated by the rightmost part of Figure 6, the attribute and the context are indistinguishable, both from each other, and from the observer. This approach could imply either a monist or a dualist ontology. The “emic” perspective in ethnographic research, I believe, provides an example of a monist ontology underlying a behaviorist approach to construct definition. In this case, the phenomenon that is observed includes all of the interactions among the participants, including those with the researcher as participant-observer, and these interactions are situated in and part of a context. When the researcher steps back and takes the “etic”, outsider’s stance, on the other hand, this implies a dualist ontology. In this case, a dualist ontology, with the observer observing an objective reality in which the attributes of language users and context are seen as a unitary phenomenon, could underlie a behaviorist approach to construct definition. Conversational analysis, I believe, provides an example of a dualist ontology underlying a behaviorist approach.

The last approach to construct definition, the interactionalist, illustrated in the middle part of Figure 6, is, I believe, the most recent to emerge in applied linguistics research, and is thus the one that is still the least well-problematized. This approach implies a dualist ontology, but unlike the other two approaches, which view attributes of language users and contexts as either completely separate and independent or as essentially indistinguishable, the interactionalist approach treats attributes and contexts as distinguishable but interacting entities. The extent to which attributes and contexts are seen to interact may differ from one researcher to another, or from one study to another, with the attribute and context interacting very little in some cases and a great deal in others. As has been pointed out by Chalhoub-Deville (2003), although an interactionalist approach to defining language ability is theoretically compelling, in terms of its recognition of language use as co-constructed through social interaction, it is not without its challenges, in terms of real-world application. In language assessment, she discusses as a challenge “the notion that language ability is local and the conundrum of reconciling that with the need for assessments to yield scores that generalize across contextual boundaries” (Chalhoub-Deville 2003:373). I would argue that this is also a challenge for


any research in applied linguistics in which the purpose is to generalize beyond the specific observations that are part of the research.

What is the range or scope of the context?

Whatever perspective on context the researcher may adopt, for any given study he will generally need to focus, or delimit, the range or scope of the context he intends to observe. Duff (this volume), for example, points out that “in case studies and classroom ethnographies, it is necessary for researchers to acknowledge the delimited context- (or culture-) bounded nature of their observations” (p. 11). In her examples, the scope of the context ranges from the individual language learner to the entire classroom, to extracurricular activities. In conversational analysis, the scope of the context may be a single conversation, or a single adjacency pair. Larsen-Freeman (this volume) gives an example of an investigation of Wh-clefts in which the contexts included face-to-face conversations, telephone conversations, talk radio broadcasts, group therapy sessions, and a 1.8 million word computerized corpus of spoken academic discourse, containing an amalgam of contexts. For Swain (this volume) there are two contexts: (1) the collaborative dialogue that takes place between two language learners and (2) their subsequent interaction with the researcher as they verbalize in stimulated recalls. In other areas of applied linguistics research, such as the evaluation of language programs, language planning, or ethnography, the context may be an entire educational program or a country. Contexts in applied linguistics research thus cover a vast range, in terms of their scope.

The observation and report

What counts as an “observation”?

As has been suggested above, the phenomena we observe in empirical research in applied linguistics are varied, as are the ways in which we observe them. Thus, we might well ask what distinguishes the observations we make in research from the casual observations of the lay person. As I have argued elsewhere (Bachman 2004b), I believe that our observations in research are distinguished by two characteristics: (1) they are systematic and (2) they are substantively grounded. Observations in empirical research are systematic in that they are planned and carried out in a way that is clearly described and thus accessible to other researchers and consumers of research. In other words, the observations are conducted according to explicit procedures that are accepted and current practice in a particular field, and that are open to scrutiny


by other researchers. Observations that we make as part of research are also substantively grounded, in that they are related to or based on accumulated knowledge from prior research or a widely accepted theory about the nature of the phenomenon we want to describe or explain. The observations reported in the chapters in this volume vary from elicited responses, such as grammaticality/acceptability judgments (Larsen-Freeman), test scores (Chapelle, Deville & Chalhoub-Deville, McNamara), and stimulated recall verbal protocols (Swain), to “naturally occurring” events, such as conversations (Markee), classroom interactions (Duff) and language corpora (Larsen-Freeman).

What is the unit of analysis?

Related to the issue discussed above, about grain-size, is what the researcher identifies as the unit of analysis. If a researcher wanted to investigate the relationship between English language learners’ (ELLs) scores on a classroom assessment and a standardized test of English proficiency, the scores received by the individual students might be the unit of analysis. However, recognizing that teachers and classroom interactions may affect ELLs’ English proficiency, it might be of interest to aggregate their scores within classrooms to investigate these effects, in which case the teacher/classroom becomes the unit of analysis (e.g., Llosa 2005). In conversational analysis, the unit of analysis might be a single utterance, an adjacency pair, or an entire conversation, while in ethnographic research the unit of analysis might be a group of language users involved in a speech event, the interactions between students and a teacher in a classroom, or the interactions that take place among members of a family or an entire community.

How is the observation reported?

Researchers may report their observations in a number of ways. If we give a group of students a test, for example, we will report the results as scores, with perhaps a separate score for each part of the test. In much research in applied linguistics, however, the reports of research consist of words, or verbal descriptions of what the researcher has observed. Researchers also use pictures, charts, audio and video recordings to report their observations.

The observer’s interpretation of the observation

As illustrated in Figure 1 above, the researcher’s interpretation of the phenomenon is based on some performance that is observed and the report of that observation. The way the researcher interprets the observation will depend on


her ontological stance toward the research. The credibility, trustworthiness, or validity of the interpretation will depend on the cogency of the researcher’s argument linking the observation and the interpretation, and the persuasiveness of the evidence the researcher is able to marshal in support of that argument.

What is the researcher’s ontological stance towards the interpretation?

In research whose purpose is to “draw inferences from the data collected” (Swain this volume), or to “generate new insights and knowledge” (Duff this volume) on the basis of observations, the way the researcher views his interpretation and the knowledge he claims to have acquired through the research will depend on his ontological stance. The researcher’s ontological “stance” is the status the researcher accords to his interpretation of his observations in a particular study and what this represents, in terms of knowledge acquired.12 In other words, the researcher’s ontological stance is the way he views the knowledge that is acquired on the basis of his research. Three different ontological stances are relevant to research in applied linguistics: operationalist, realist and constructivist.

Operationalist stance. In an operationalist stance, the meaning of the construct is essentially synonymous with the operations or procedures that are used to observe the phenomenon. That is, the construct is defined in terms of the method of observation, and “meaning is constituted in empirical operations” (Bickhard 2001:2). In language assessment, an operationalist stance would define the construct, “language proficiency”, for example, as whatever a particular test of this measures. One of the logical consequences of an operationalist stance is that each observation defines, in effect, a different construct. Despite the fact that two forms of a reading comprehension test, for example, might present test takers with very similar reading passages and ask the same kinds of comprehension questions, the operationalist would claim that the abilities measured by these two tests are essentially different constructs. An operationalist stance is thus clearly problematic if the researcher’s purpose is to generalize beyond the particular observation, to either another group or context, or to attach meaning to the observation in terms of an abstract construct. However, if the purpose of the research is to provide a rich description of the phenomenon, then this could imply, essentially, an operationalist stance.

Realist stance. A realist stance holds that the phenomenon we observe exists independently of the researcher herself and what she may believe or theorize about the phenomenon. According to this stance, the product of empirical research is knowledge of largely theory-independent phenomena. Furthermore, such knowledge is possible and actual, even in those cases in which the relevant constructs are not directly observable (Boyd 2002). Realism about the everyday world entails two claims: existence and independence. The realist claims that objects such as tables, rocks and the moon all exist, as do the facts that a particular table is square, a rock is granite and the moon is spherical. The second claim is that these objects in the real world and their properties (e.g., squareness, graniteness, sphericity) are independent of what the researcher may happen to think or say about them (Miller 2004).

Constructivist stance. The constructivist stance regards the inference or interpretation as nothing more than a construction of the researcher’s mind, which may or may not be independent of the observation (Borsboom 2003:207).13

Thus, the constructivist rejects the realist’s claim about the nature of the construct. For the constructivist, the knowledge that we acquire through research is neither “real” in the realist sense, nor fixed, but is constructed by the researcher through her own observation of the phenomenon. Unlike the realist, who may be searching for a single, “objective”, “correct” interpretation of the observed phenomena, the constructivist admits to, and may even seek, multiple perspectives on or constructions of the phenomenon. It is important not to confuse the constructivist stance with the way in which the researcher arrives at or constructs an interpretation. In virtually all applied linguistics research, the interpretations are constructed by the researcher, even though he may believe that his interpretations are about phenomena that exist in the real world. Similarly, a constructivist stance should not be identified with the way the researcher views the phenomenon to be observed. Thus, many researchers in applied linguistics view the phenomenon – speech event, conversation – as constructed through interaction, but still view these as entities that exist in the real world.

In my view, all of the chapters in this collection adopt an essentially realist stance. Chapelle, McNamara and Deville and Chalhoub-Deville, while they may not agree on exactly where the construct resides (in language users, in contexts, or in the interactions between language users and contexts), all discuss the constructs that they infer on the basis of language assessments in realist terms. Chapelle discusses constructs such as “vocabulary size” and “organization of L2 lexicon”, and makes repeated references to a “mental lexicon”, which I interpret as a claim that these constructs have some reality in the minds of test takers. McNamara also appears to adopt a realist stance, discussing constructs, initially in terms of what individual learners know and can do in a criterion domain,


and as cognitive and social characteristics of test takers (p. 6). He then discusses constructs as co-constructions within oral discourse, and as “an expression of the outcomes based and functionalist demands of current government policy on adult education and training” (p. 38). All of these – performance in criterion domains, characteristics of test takers, oral discourse, and government policy – presumably exist in a real world. Similarly, Deville and Chalhoub-Deville initially discuss language ability as “a stable construct residing within individuals” (p. 12), and then discuss it as “ability-in-language user-in-context” (p. 18), which resides in the interactions between language users and the context. Again, I infer that these constructs within individuals, and these interactions in contextualized language use, exist in a real world. While Markee and Swain both clearly view the phenomena they observe (conversations and stimulated recalls, respectively) as constructed, the constructs (conversational repairs and thought, or cognitive processes, respectively) that they infer from their observations imply a realist ontological stance. For Markee, the conversational repair is a real feature of human conversation; for Swain, thought and cognition exist somewhere in language users. Larsen-Freeman infers constructs (e.g., the form, meaning and use of the WH-cleft in English spoken discourse) from a variety of phenomena (prior research, intuition, observed language use, linguistic corpora), and seems to adopt the realist stance that these constructs, or patterns, exist in the language use of English speakers. Duff constructs interpretations on the basis of her observations or perceptions, which might not be the same as other researchers’ perceptions and interpretations. Nevertheless, she views the constructs she infers from her observations as existing in the real world, independent and uninfluenced by her own perceptions of them.

What is the relationship between observations and interpretations?

The inferential links between the observed performance, the observation results, the interpretation, and the use of the research results were discussed above, and are illustrated in Figure 1. But while the researcher may claim that these links are justified, if she wants to convince the various audiences to whom the research is directed, the inferences that constitute these links need to be supported by a coherent logical argument and evidence. While the specific details of that argument and the kinds of evidence that are acceptable, or convincing, vary across the different research approaches in applied linguistics, I would propose that the basic structure of these arguments is essentially the same, and can be characterized more generally as a “research use argument”.

Bachman (2005) has argued that in order to support the interpretations and uses we make of individuals’ performance on language assessments, we


need to articulate an assessment use argument. Such an argument provides a logical framework for linking performance on language assessments to intended interpretations and uses, and can be used not only as a guide in the design and development of language assessments, but can also inform a program of research for collecting the most critical evidence in support of the interpretations and uses for which the assessment is intended. As described by Bachman, the structure of an assessment use argument follows Toulmin’s (2003) argument structure. At the center of an assessment use argument is a link between some data, or observations, and a claim, or an interpretation of those observations. This link is supported by various warrants, which are statements that we use to justify the link between the observation and the interpretation. These warrants are in turn supported by backing, which consists of evidence that may come from prior research or experience, or that is collected by the researcher specifically to support the warrant. In addition to these parts, the argument also includes rebuttals, or potential alternative interpretations or counterclaims to the intended interpretation. These rebuttals may be either weakened or strengthened by rebuttal data (Bachman 2005:10).14

I would suggest that a similar “research use argument” can provide a framework for guiding empirical research in applied linguistics. The primary function of a research use argument would not be to falsify a theory, in a positivistic sense, but rather to convince or persuade a particular audience – fellow researchers, journal reviewers, funding agencies, tenure committees, or consumers of the research – that the researcher’s claim or interpretation is useful for some purpose. To paraphrase Chapelle (this volume), I believe that many of the problems in applied linguistics can best be addressed by research that is seen as “true enough to be useful” (p. 1). The usefulness of research in applied linguistics should be judged, in my view, not by the extent to which it captures a glimpse of the “Truth”, but by the cogency of the research use argument that underlies the research and the quality of the evidence that is collected by the researcher to support the claims made in the argument. I believe that using such an argument as a rationale and organizing principle for research would enable researchers in applied linguistics to break away from our attempts to emulate “scientific” research in the physical sciences, and from the never-ending paradigm debate between the so-called quantitative and qualitative approaches to research. Indeed, I believe that a research use argument can provide a logical framework and rationale for the appropriate use of multiple approaches, or mixed methods, for collecting evidence to support the claims made by the researcher.


Combining different perspectives and approaches in research

Is the complementary use of multiple approaches in a single study desirable or even possible? What is the value added of attempting to use complementary approaches? Is it possible to gain important insights about a phenomenon through multiple approaches? As McNamara (this volume) points out, there are numerous examples of research in the area of language assessment that have productively combined qualitative and quantitative approaches to yield richer and, in my view, better supported interpretations and insights into the phenomena they have investigated.15 As mentioned above, the metaphor of triangulation is often used to describe a viable way to obtain converging evidence from multiple approaches. However, multiple approaches can also be used to investigate possible alternative interpretations, or what Lynch (1996) calls “negative case analysis”, the search for instances in the data that do not fit with an interpretation suggested by other data in the study (p. 57). Thus, with respect to using multiple approaches, we need to ask, “at what level are the approaches combined – philosophical perspective, approach to defining the phenomenon or construct that is of interest, procedures for observing and reporting, the researcher’s ontological stance?” In many studies, it is not always clear that there is a genuine combining of approaches; rather, the combination appears to be opportunistic and unplanned. For example, quantitative research with a qualitative “add on” to hopefully help make some sense of the quantitative analyses, or counting frequencies of occurrences of categorizations based on naturalistic observation and then using a test of statistical significance, might superficially combine aspects of different research approaches without integrating these into a single coherent study. There is thus a need for a principled basis for determining, for a particular study, whether to combine approaches, and if so, at what level, how, and why.

There is a rich literature, going back to the mid-1980s, in the social sciences and education, discussing multiple approaches, or what is more commonly referred to as “mixed methods” research, and several collections and handbooks of mixed methods research are available (e.g., Brewer & Hunter 1989; Creswell 2003; Johnson & Christensen 2004, 1998; Tashakkori & Teddlie 2003). (See also www.ehr.nsf.gov/EHR/REC/pubs/NSF97-153/START.HTM). Recently, Johnson and Onwuegbuzie (2004) have argued for a pragmatic and balanced or pluralist position to inform the use of qualitative, quantitative or mixed approaches to research. Drawing on the work of the American pragmatic philosophers Peirce, James and Dewey, they arrive at a pragmatic principle that is essentially the same as the criterion of the usefulness of research that I have


discussed above: “when judging ideas we should consider their empirical and practical consequences” (Johnson & Onwuegbuzie 2004:17). Extending this principle to research methods, they argue that the decision on what research approach to take – quantitative, qualitative or mixed – for a given research study should be based not on doctrinaire or philosophical positions, but rather on the consideration of which approach is the most appropriate and likely to yield the most important insights into the research question.

Consideration and discussion of pragmatism by research methodologists and empirical researchers will be productive because it offers an immediate and useful middle position philosophically and methodologically; it offers a practical and outcome-oriented method of inquiry that is based on action and leads, iteratively, to further action and the elimination of doubt; and it offers a method for selecting methodological mixes that can help researchers better answer many of their research questions.

(Johnson & Onwuegbuzie 2004:17)

Johnson and Onwuegbuzie go on to describe a typology of mixed research methods, along with a process model for making decisions about research methodologies in the design and implementation of research.

Empirical research in applied linguistics has, I believe, much to learn from the literature on mixed methods research. As the examples of mixed methods research from language assessment discussed by McNamara (this volume) show, such research is not only feasible, but is also highly desirable, in that it expands the methodological tools that are at the researcher’s disposal, thereby increasing the likelihood that the insights gained from the research will constitute genuine advances in our knowledge.

Conclusion

I have taken the opportunity presented in this response chapter to outline a broad view of generalizability in empirical research in applied linguistics. According to this view, generalizability is a concept that has different aspects – consistency, meaningfulness and consequences – that provide empirical researchers in applied linguistics a basis for interpreting their observations, and for considering the possible uses and consequences of these interpretations. I have suggested that we adopt an epistemology of argumentation that moves away from one that seems to be driven by a preoccupation with our differences (e.g., quantitative, positivistic, theory falsification vs. qualitative, naturalistic,


descriptive; “science” vs. “non-science”), and toward one that admits a wide range of empirical stances and methodologies. Many of the issues that I have touched on here, such as the mind-body problem and monism vs. dualism, have perplexed philosophers and scientists for centuries, and continue to be energetically debated in philosophy, the social sciences, cognitive science and neuroscience to this day. Many researchers in our field have also discussed issues related to epistemology and the goals of research (e.g., Gregg 1993; Jordan 2004; Lantolf 1996; Long 1990, 1993; Schumann 1984; Schumann 1993). I would thus be the first to say that my discussion of these issues here is by no means either authoritative or conclusive. Indeed, much of what I have discussed may be of little interest or concern to many very productive and creative researchers in our field. However, having recently co-taught a graduate proseminar in epistemology in applied linguistics, I am only too painfully aware of how uninformed many of our students are about these issues. Thus, my primary reason for discussing such arcane topics as monism and dualism, trait, behaviorist and interactionalist perspectives, and operationalist, realist and constructivist ontological stances, is that without an awareness of these issues, and of the alternatives in empirical research they present, it is difficult for researchers to apply the kind of critical self-reflection to their work that I believe is essential for the health and vitality of the field. Similarly, an uncritical application of one particular research methodology, without an understanding of the ontological and epistemological assumptions on which it is based, as well as the possible consequences of the outcomes, can, I believe, lead to students’ robot-like emulation of the research of senior scholars and mentors, with little of their creativeness and insight.

Notes

* I would like to thank Carol Chapelle, Patsy Duff, Lorena Llosa and Adrian Palmer for their helpful comments and suggestions on earlier drafts of this chapter.

1. See Bachman (2005) for a discussion of the logical structure of an assessment use argument.

2. As discussed below, the “construct” is what the researcher wants to describe, infer or explain on the basis of observing some phenomenon.

3. See Heider (1988) for a discussion of different ways of interpreting differences among ethnographies, and factors that may contribute to these.

4. See Bachman (2004b) for a discussion of procedures and approaches to validation in language assessment.


5. For an excellent discussion of validity in both quantitative and qualitative research, see Lynch (1996), from which I have drawn in my discussion of validity in qualitative research.

6. See Chapelle (1998) for a discussion of the social consequences of using language tests in SLA research.

7. Discussions of ethics and consequences in language testing can be found in the following edited collections (Davies 1997b, 2004; Kunnan 2000).

8. The debate about the mind-body problem has produced a wide variety of specific monist and dualist perspectives, and a discussion of this issue and these perspectives is far beyond the scope of this chapter.

9. For an excellent set of discussions of the issues surrounding the emic-etic distinction, see Headland (1990).

10. Chalhoub-Deville and Deville (forthcoming) also characterize context as a “slippery issue”.

11. In this section, I use the terms “entity” and “construct” to distinguish these from the context. The entity or construct can consist of a trait, or attribute of the language users, of the phenomenon itself (attributes and context not distinguished), or of the interaction between attributes of the language users and features of the context.

12. See Borsboom, Mellenbergh and van Heerden (2003) for an excellent discussion of these stances in the context of latent variable models in psychology.

13. The constructivist ontological stance with respect to the acquisition of knowledge through empirical research draws on the same philosophical perspective as do constructivist views of learning.

14. See Koenig and Bachman (2004) for a discussion of how an assessment use argument could inform the investigation of the validity of interpretations based on accommodated standardized achievement tests. See Llosa (2005) and Ockey (2005) for applications of an assessment use argument to validation studies in language assessment.

15. In addition to the studies cited by McNamara, I would point to Weigle (1994, 1998), Sasaki (1996, 2000), and Sawaki (2003).

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: OUP.
Bachman, L. F. (2004a). Linking observations to interpretations and uses in TESOL research. TESOL Quarterly, 38(4), 723–728.
Bachman, L. F. (2004b). Statistical analyses for language assessment. Cambridge: CUP.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: OUP.
Bickhard, M. H. (2001). The tragedy of operationalism. Theory and Psychology, 11(1), 35–44.


Lyle F. Bachman

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.
Boyd, R. (2002). Scientific realism. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Summer 2002 Edition): http://plato.stanford.edu/archives/win2004/entries/scientific-realism/.
Brewer, J. & Hunter, A. (1989). Multimethod research: A synthesis of approaches. Newbury Park, CA: Sage.
Campbell, D. T. & Stanley, J. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369–383.
Chalhoub-Deville, M. & Deville, C. (forthcoming). Old, borrowed, and new thoughts in second language testing. In R. L. Brennan (Ed.), Educational measurement (4th ed.). New York: American Council on Education and Macmillan Publishing Company.
Chalmers, D. (1996). The conscious mind: In search of a fundamental theory. Oxford: OUP.
Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). New York, NY: CUP.
Chapelle, C. A. (this volume). L2 vocabulary acquisition theory: The role of inference, dependability and generalizability in assessment. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Churchland, P. M. (1996). The engine of reason, the seat of the soul. Cambridge, MA: The MIT Press.
Creswell, J. W. (2003). Research design: Qualitative, quantitative and mixed approaches. Thousand Oaks, CA: Sage.
Damasio, A. R. (1994). Descartes’ error: Emotion, reason and the human brain. New York, NY: Grosset/Putnam.
Davies, A. (1997a). Demands of being professional in language testing. Language Testing, 14, 328–339.
Davies, A. (1997b). Special issue: Ethics in language testing. Language Testing, 14(3).
Davies, A. (2004). Special issue: The ethics of language assessment. Language Assessment Quarterly, 1(2/3).
Dennett, D. C. (1991). Consciousness explained. New York, NY: Little, Brown & Co.
Deville, C. & Chalhoub-Deville, M. (this volume). Old and new thoughts on test score variability: Implications for reliability and validity. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Douglas, D. & Selinker, L. (1985). Principles for language tests within the ‘discourse domains’ theory of interlanguage: Research, test construction and interpretation. Language Testing, 2, 205–226.
Duff, P. A. (this volume). Beyond generalizability: Contextualization, complexity, and credibility in applied linguistics research. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Erickson, F. & Schulz, J. (1981). When is context? Some issues in the analysis of social competence. In J. Green & C. Wallat (Eds.), Ethnography and language in educational settings. Norwood, NJ: Ablex.
Gregg, K. R. (1993). Taking explanations seriously; or, let a couple of flowers bloom. Applied Linguistics, 14, 276–294.
Guba, E. G. & Lincoln, Y. S. (1982). Epistemological and methodological bases of naturalistic inquiry. Educational Communication and Technology, 30(4), 233–252.
Hamp-Lyons, L. (1997a). Ethics in language testing. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education (Vol. 7: Language testing and assessment, pp. 323–333). Dordrecht: Kluwer Academic.
Hamp-Lyons, L. (1997b). Washback, impact and validity: Ethical concerns. Language Testing, 14, 295–303.
Hart, W. D. (1988). The engines of the soul. Cambridge: CUP.
Headland, T. N., Pike, K. L., & Harris, M. (Eds.). (1990). Emics and etics: The insider/outsider debate. Dallas, TX: Summer Institute of Linguistics.
Heider, K. G. (1988). The Rashomon effect: When ethnographers disagree. American Anthropologist, 90(1), 73–81.
Hymes, D. (1974). Foundations in sociolinguistics: An ethnographic approach. Philadelphia, PA: University of Pennsylvania Press.
Johnson, R. B. & Christensen, L. B. (2004). Educational research: Quantitative, qualitative and mixed approaches. Boston, MA: Allyn and Bacon.
Johnson, R. B. & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–26.
Jordan, G. (2004). Theory construction in second language acquisition research. Amsterdam: John Benjamins.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.
Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2(3), 135–170.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Kirk, J. & Miller, M. L. (1999). Reliability and validity in qualitative research. Newbury Park, CA: Sage Publications.
Koenig, J. A. & Bachman, L. F. (Eds.). (2004). Keeping score for all: The effects of inclusion and accommodation policies on large-scale educational assessment. Washington, DC: National Research Council, National Academies Press.
Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando. Cambridge: CUP.
Labov, W. (1990). Sociolinguistic patterns. Philadelphia, PA: University of Pennsylvania Press.
Lantolf, J. P. (1996). Second language theory building: Letting all the flowers bloom. Language Learning, 46, 713–749.
Larsen-Freeman, D. (this volume). Functional grammar: On the value and limitations of dependability, inference, and generalizability. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Llosa, L. (2005). Building and supporting a validity argument for a standards-based classroom assessment of English proficiency. Unpublished PhD dissertation, University of California, Los Angeles.
Long, M. H. (1990). The least a second language acquisition theory needs to explain. TESOL Quarterly, 24, 649–666.
Long, M. H. (1993). Assessment strategies for SLA theories. Applied Linguistics, 14, 225–249.
Lynch, B. K. (1996). Language program evaluation: Theory and practice. Cambridge: CUP.
Markee, N. (this volume). A conversation analytic perspective on the role of quantification in second language acquisition research. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Mathison, S. (1988). Why triangulate? Educational Researcher, 17(2), 13–17.
Maxwell, J. A. (1992). Understanding and validity in qualitative research. Harvard Educational Review, 62(3), 279–299.
McNamara, T. (this volume). Validity and values: Inferences and generalizability in language testing. In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Messick, S. (1981). Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89, 575–588.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education and Macmillan Publishing Company.
Miller, A. (2004). Realism. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Winter 2004 Edition): http://plato.stanford.edu/archives/win2004/entries/realism/.
Mislevy, R. J. (2003). Substance and structure in assessment arguments. Law, Probability and Risk, 2, 237–258.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.
Ochs, E. (1979). Transcription as theory. In E. Ochs & B. B. Schieffelin (Eds.), Developmental pragmatics (pp. 47–72). New York, NY: Academic Press.
Ochs, E. (1988). Culture and language development: Language acquisition and language socialization in a Samoan village. Cambridge: CUP.
Ochs, E. & Capps, L. (2001). Living narrative. Cambridge, MA: Harvard University Press.
Ockey, G. J. (2005). DIF techniques to investigate the validity of math word problems for English language learners. PhD Qualifying Paper, University of California, Los Angeles.
Sasaki, M. (1996). Second language proficiency, foreign language aptitude, and intelligence: Quantitative and qualitative analyses. New York, NY: Peter Lang.
Sasaki, M. (2000). Effects of cultural schemata on students’ test-taking processes for cloze tests: A multiple data source approach. Language Testing, 17(1), 85–114.
Sawaki, Y. (2003). A comparison of summary and free-recall as reading comprehension tasks in a web-based placement test of Japanese as a foreign language. Unpublished PhD dissertation, University of California, Los Angeles.
Schumann, J. H. (1984). Art and science in second language acquisition research. In A. Z. Guiora (Ed.), An epistemology for the language sciences (Language Learning, special issue) (pp. 49–76). Ann Arbor.
Schumann, J. H. (1993). Some problems with falsification: An illustration from SLA research. Applied Linguistics, 14, 295–306.
Swain, M. (1993). Second language testing and second language acquisition: Is there a conflict with traditional psychometrics? Language Testing, 10(2), 193–207.
Swain, M. (this volume). Verbal protocols: What does it mean for research to use speaking as a data collection tool? In M. Chalhoub-Deville, C. A. Chapelle, & P. A. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives. Amsterdam: John Benjamins.
Tashakkori, A. & Teddlie, C. (1998). Mixed methodology: Combining qualitative and quantitative approaches. Thousand Oaks, CA: Sage.
Tashakkori, A. & Teddlie, C. (Eds.). (2003). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: Sage.
Toulmin, S. E. (2003). The uses of argument (Updated ed.). Cambridge: CUP.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.
Weigle, S. C. (1998). Using facets to model rater training effects. Language Testing, 15(2), 263–267.
Weigle, S. C. (2002). Assessing writing. Cambridge: CUP.


Generalizability

What are we generalizing anyway?

Susan Gass
Michigan State University

This paper pulls together strands of generalizability from the papers in the entire volume by focusing on the why and what of generalizability. The paper reflects on issues of complementary data sources and scope of inquiry, recognizing that many authors approach the why question from different perspectives (e.g., testing, pedagogy, acquisition). The paper is primarily concerned with issues of acquisition and raises questions relating to what we are trying to generalize and how we might know if we have been successful.

The papers in this volume approach the issue of generalization from widely different perspectives. In some respects they deal with a discussion of the virtues of one type of data over another or one approach over another. In a positive sense, this diversity helps us understand the great complexity involved in a discussion of generalization and inferences and the richness of traditions available to Applied Linguists. My own view is that there is a place for both, but we need to examine the questions that are being asked and the type of data that might or might not be obtainable to address those questions. It is a mistake for the field of applied linguistics in general, or second language acquisition in particular, to think that one approach yields “better” answers. McNamara, from a testing perspective, states this succinctly: “. . .we should not be constrained from embracing the full range of research methods open to us, be they neo-positivist or otherwise, if they deepen (as I believe they do) our ability to understand the bases for inferences in language tests” (p. 20). That is not to say that “anything goes,” for it may be the case that one type of data does in fact yield a more complete picture (see Swain’s chapter), but a more conservative approach is to look at all sides of the methodological issue and determine in the first instance how approaches might complement one another.


Larsen-Freeman asks the most basic question of all: why should we generalize and, given the context of applied linguistic research, what level of generalizability should we aim for? It is clear that our data need to be dependable, and it is also clear that the explanations for our data need to be generalizable. But, how do we obtain reliable and dependable data? Larsen-Freeman discusses complementary data sources. What is not always obvious, however, is which complementary data sources should be used, how they are to be gathered, and how the data should be interpreted.

As an illustration of complementary data sources, I turn to a subset of verbal protocols, known as stimulated recalls, also discussed by Swain. Verbal protocols, as Swain notes, are a type of data obtained “by asking individuals to vocalize what is going through their minds as they are solving a problem or performing a task” (Gass & Mackey 2000:13). I choose to discuss this methodology because stimulated recall data represent a means to generalize with greater confidence beyond raw quantifiable data. To exemplify this, I turn to a recent study (Polio, Gass, & Chapin 2005) that used stimulated recall data not as an end, but as a means of supporting other data, in their specific case, interaction data. The specific research area in that study was an investigation of feedback by two groups: 10 preservice teachers and 2 experienced teachers.1 Polio, Gass and Chapin (2005) asked whether there was a difference in the amount of implicit feedback in a dyadic task that was dependent on the amount of experience native speakers (in their case, teachers) have interacting with learners. The first round of quantified results left the researchers with the difficulty of interpreting the findings, which were not consistent with the results of an earlier study by Mackey, Polio, and McDonough (2004). Generalizability and inferrability, as many papers in this volume discuss, are, of course, dependent on making sure the results do in fact reflect some semblance of reality, either the reality of learners, as in most work in second language acquisition, or the reality of teachers, as was the case in the Polio, Gass and Chapin study. In that study, the stimulated recall comments allowed the researchers to ask the more general question of how these two groups of teachers perceive interactions with nonnative speakers. In other words, through the recall comments, they asked if one could verify the conclusions drawn from quantitative data and thus be confident in the ability to generalize the differences found in preservice versus experienced teachers. One such difference was noted in the way the experienced teachers used language to get students to produce language. The experienced teachers did this through a variety of means. For example, they initiated the exchange by putting the burden on the student to begin the interaction, as can be seen in the examples below (ET=experienced teacher).


The primary way of doing this was by asking open-ended questions, as in (1) and (3), or by starting with an open-ended imperative, as in (2).

(1) ET1: Ok we’re looking at a picture. What does your picture look like?

(2) ET2: Yea. Um, well, we can kind of work together to describe the picture. So why don’t you start first?
S: Oh.
ET2: Just tell me what’s in your picture.

(3) ET3: So, you have a picture and I have a picture, right?
S: Yeah.
ET3: We can see our pictures, so what we have to do is to to find what’s different in my picture from your picture.
S: Yeah.
ET3: There’s ten differences, right?
S: Yeah.
ET3: So why don’t you tell me what’s in your picture? Just generally.

In these data we can only state that the teachers gave the learners an opportunity to speak, but we cannot address issues of motivation, nor can we infer what the intent of the question or imperative was. We might make a leap and infer that teachers understood the importance of output and hence wanted to provide learners with opportunities to produce language, but it is only an inference awaiting confirmation.

Following each video-taped dyadic interaction, the researchers showed each teacher (both preservice and experienced) a videotape of his/her interaction, asking for comments on particular parts of the interaction. Below are stimulated recall comments from these teachers that support the inferred interpretation of teacher behaviour.

(6) Comment, ET1: Ok, I was already thinking here, y’know, how much should I lead? Ok, should I- do I want to lead her a lot or how much will she talk? Y’know, that kinda thing.

(7) Comment, ET2: I was just trying to get her to start first so that she would take the lead in communicating.

(8) Comment, ET3: I guess I was thinking, um, I’m glad she’s talkative.

The preservice teachers are quite different; their openings (9)–(12) involve specific yes/no questions, and their comments (13)–(17) focus on task-related or procedural issues rather than on the need to have students produce language. This was strongly reflected in the stimulated recall comments, where they overwhelmingly commented on the nature of the task at hand or their strategies (or lack thereof) for completing the task. A few examples from the preservice teachers are given below (PS=preservice).

(9) PS1: Um, ok. So let’s start by comparing the pictures. So, is your- is your window closed?

(10) PS2: Um, is there a table in-?

(11) PS4: Well, mine looks like a picture of a dining room.
S: Um-hm.
PS4: Like with a window and a china cabinet and a picture and a stove and a rug under the table. Is that what yours kind of looks like?

(12) PS7: All right. Is your picture a picture of a kitchen?

Again, as with the experienced teachers, all we can do is describe what they are doing. That is, these preservice teachers are asking specific questions that can in fact be answered with little language.

(13) Comment, PS1: Um, yeah, at the beginning of experiment or whatever you call it . . . activity, like I wasn’t sure exactly what to do, like, and, and so, like I just, just was kinda trying to figure this stuff out without really thinking about the ESL student.

(14) Comment, PS2: I remember thinking that it was a difference and that that was the first thing that we had- I feel- I felt like he finally had found a difference in the picture. Um, and then I wasn’t sure if, I wasn’t quite sure at first if I had to be the one to keep asking questions or if we would start asking questions, but I think, eventually, towards the end, when we got down to the last few, we were both trying to really work on it. But I think, at first, I felt like I was doing more of the asking and he was just answering.

(15) Comment, PS4: I remember thinking, like, I should probably start because I’m like, I’m the native-speaker, sort of, in this and um, she wasn’t really saying anything, and I thought, since I have the paper and I tend towards leadership anyway, I’m like, ok, well I’ll just start the conversation and see what, see what happens, so. . .

(16) Comment, PS7: . . . I-I remember at the beginning, it was hard to figure out where to start. That’s about it.

The stimulated recall comments give us confidence to assume that language production is not the issue.

A second area where the stimulated recall comments demonstrated how the experienced teachers attempted to get the NNSs to produce language can be seen in their use of words such as get her/get him, have him/her, ask him/her, as in (17). But, it is only by considering the recall comments that we can be confident in our interpretation that learner output was important to the way the experienced teachers approached the dyadic task.

(17) Emphasis on output

ET1: . . .I want to just see what she can say. . .
ET1: I should have had her asking me questions. . .
ET2: I was just trying to get her to start first. . .
ET6: so I was gonna to try to get him to shape- to describe
ET6: I was going to try to see if he knew the word ‘cabinet’
ET7: I was just trying to get her to describe it . . .
ET7: . . .and then trying to help her get it out.
ET7: Just trying to get her to elaborate. . .

Thus, through the recall comments we can see how this group of teachers focused on the need to have learners produce language as a way of learning. This focus was not evident in the recall comments of the preservice teachers. Confidence in interpretations, and hence generalizations, was not possible through an inspection of the quantitative data alone. Chapelle argues, in relation to vocabulary assessment, that the crucial link is the concept of inferences, a point I return to below: “What is needed is not a single method of measurement but defensible inferences to appropriate constructs and a single framework” (p. 12). In other words, the way we make inferences, or interpret the data, is an essential ingredient of good research. In the case of stimulated recalls, the step of inferencing is reduced, if not in some cases eliminated, because we take an individual’s statement as the valid interpretation of data (see Gass & Mackey 2000 for a discussion of the strengths and limitations of this methodology).

Larsen-Freeman answers her own question about the if of generalization and argues that generalization within limits is a necessary part of applied linguistics research: “. . .I have staked out the position that generalizability is desirable – that what we would like is to have claims that apply beyond particular instances” (p. 21). It is my view that we must have the capacity to generalize, for, if not, research remains at the “that’s interesting” stage rather than moving the field along in any theoretically serious way. Larsen-Freeman goes on to point out that there are limits of generalizability “at least for applied purposes.” She specifically states: “While some linguists pursue the broadest possible generalizations, which are therefore necessarily abstract. . .it is difficult to see how such abstract principles are useful for all the purposes to which applied linguists wish to put them” (p. 21).

JB[v.20020404] Prn:13/12/2005; 15:19 F: LLLT1210.tex / p.6 (325-371)

Susan Gass

This brings us to a central question: what is the scope of inquiry for which we are making generalizations? Larsen-Freeman’s emphasis seems to be on generalizations about language for pedagogical purposes. Others in this volume are more focused on issues of assessment (e.g., Chapelle, McNamara, Deville, & Chalhoub-Deville), while others (e.g., Swain, Markee) are focused on issues of learning.

Larsen-Freeman is concerned with actual language use, including form, meaning and contextual use (that is, pragmatics). It is important, as she points out, to have a large corpus of data in order to understand the when, why, and how of language, and, within the context of this volume, to be able to generalize the data to new contexts. In other words, it is only with large numbers of tokens of data in a range of contexts that we can begin to provide answers to the questions of what and why.

Chapelle, in her paper about vocabulary assessment, makes the important point that we need to consider inferences, that is, the interpretation of performance. With regard to generalizability, she links the concepts of inference, dependability (similar to what I referred to above as confidence), and generalizability. With specific regard to vocabulary assessment, she brings in a basic assumption that there is similarity across tasks: “[t]he idea is that when performance is observed on one task, it is assumed that this performance is a good representative of what would be displayed, on average, across a large number of similar tasks. . .” (pp. 12–13). This being the case, we infer that performance on one task generalizes across tasks. If this is the case, then performance is dependable across tasks. All of this, as she notes, is dependent on the construct definition that a test is based on.

Both McNamara and Deville & Chalhoub-Deville approach the issue of generalizability from a strict testing framework. McNamara states this succinctly: “Language tests are procedures for generalizing” (p. 2). In fact, the results per se are not of interest. Rather, results are of interest “only as a basis for generalizing beyond the particular performance” (McNamara, p. 10). Thus, the issue that is of interest is: how far can we generalize? For example, Deville and Chalhoub-Deville make clear what language testers aim to do: “We language testers wish to make generalizations – sometimes across test takers, other times across items or tasks, and/or at still other times across occasions or contexts” (p. 2). What is interesting is the notion of “across. . .contexts,” which they return to at the end by calling for “the interaction of language user and context” (p. 19) as the object of study. McNamara picks up on this point by “drawing inferences about the test-taker’s ability beyond the immediate testing context” (p. 1). He brings in the concept of sociopolitical issues when talking about generalizations, and makes it clear that any sort of testing, or any generalization that comes out of it, must be cognizant of the consequences of such tests and generalizations.

Thus, within the realm of testing, the significance of generalizability is unquestioned; rather, the question refers back to the purpose that testing is put to and the context (broadly interpreted) in which testing takes place and which the results often impact.

Swain and Markee both focus on issues of second language acquisition. Markee sees the issue of generalizability as secondary within a conversation analytic perspective. What becomes primary in this tradition is an understanding of how talk is organized. Within a CA context, he demonstrates how a detailed analysis of classroom talk can lead to an understanding of the organization and social interaction of talk in a way that quantitative measures, which would undoubtedly allow greater inferences and generalizability, cannot.

Swain takes the perspective that, in my view, we should be taking when looking at learning. First, she is concerned with the extent to which researchers can generalize from the inferences gained from data collected in “local and situated contexts” (p. 1) (which is clearly related to all data in SLA). Similar to the Polio et al. (2005) study discussed above, she analyzes stimulated recall data from French immersion students. In this and other research (e.g., Swain & Lapkin 1998, 2002, in press), she convincingly shows the effect that verbalizations have on cognition. In other words, the recall data actually impacted learner performance. If we return to the notion of generalizability, and in particular generalizability in terms of methodology, she argues that methodologies are not neutral. Specifically, with regard to verbal protocols, they are more than a means for eliciting data; they profoundly affect learning.

I return to the issue raised by Larsen-Freeman regarding breadth and abstractness. She specifically ties this to the purposes that “applied linguists” deal with. While taking us away from the direct focus of this volume, namely, generalizability, it is important to be clear on “who we are” as researchers. This volume has “Applied Linguistics” in its title and, further, it includes research focused on SLA. It is important, however, that it be understood that not all SLA research has an applied focus (e.g., work focused on SLA from a UG perspective); rather, a great deal of SLA work, having no applied focus, is in fact interested in abstractness. Further, for SLA research it is important, if not crucial, to generalize across languages (both native languages and target languages) and across contexts. One way of doing this is through replication (see Polio & Gass 1997), which is also one way of arriving at a better understanding of the phenomenon of learning. And perhaps this is the point that Duff is raising when she discusses controlled and laboratory-like studies – the more controlled, the less generalizable. I would add that this is precisely where replication is necessary. In SLA research, one is often generalizing to models, whereas in research with a more applied focus, the focus is on generalizations to populations (see Duff’s paper).

With specific regard to generalizability, Duff is particularly clear on this point in stating that with quantitative research it is relatively easy (with appropriate design and sampling procedures) to come up with generalizable results. Duff’s point about paradoxes is particularly well-stated: what should we generalize? And how do we know what is generalizable? Quantitative? Qualitative? One case study?

As Applied Linguists, we do live within the real world and are subjectedto real-world constraints on data. This is clearly exemplified in Markee’s andDuff ’s papers, who both acknowledge that issues of generalizability are placedin the background of their research paradigms, and perhaps in all or most qual-itative research, and in Markee’s case the particular context of ConversationAnalysis. As Duff says “[g]eneralizability to larger populations, in the tradi-tional positivist sense, or prediction is not the goal” (pp. 8–9) (see also Mackey& Gass 2005, particulary Chapter 6). Duff further points out that the term gen-eralizability itself is a loaded one, belonging more appropriately to a positivisttradition than to a tradition that is more concerned with context-bounded in-terpretation. But regardless of how we view the notion of generalizability, itis important that we exploit the possible synergy that can be generated by amultiplicity of approaches. So, the tension that is often created between quan-titative and qualitative may be a red herring. Rather, it is merely a reflection ofour particular research questions, and perhaps the issue is which research areais more interesting. There may theoretically be some mid-point on the con-tinuum where the tension might be real, for example, large-scale research inschool settings. Even in this context, however, certain types of research ques-tions truly drive one to one methodology or another. Thus, it may actuallynot be a question of mid-point tension. Because educational research has topay attention to student voices and teacher voices and, perhaps more impor-tant, the interaction and dynamics of many factors that students face, includingthe home environment, generalizability may not be as much an issue as is theneed to develop an understanding of the various factors that affect childrenin schools, for example, ESL children in schools. In some arenas, quantitativedata are not appropriate (see Crandall 2002). 
The logistical problems of doing quantitative data collection with immigrant children in a school environment (e.g., bad databases, mobility) are many, and this may, indeed, be an argument for why qualitative data might be more appropriate for a particular study.

What I want to do in the remainder of this discussion chapter is spin off on the linguistic area of Chapelle's paper, that of acquisition of vocabulary, and in a sense on the content area of the paper by Swain in which she is talking about protocols as an impetus for learning. In particular, I want to problematize the issue of generalizability and deal with the question of learning and knowledge. What is it? And, how do we know when we get there?

We often lose sight of the fact that our definition of learning is dependent on how we view the construct in question, a point that is a focal area of Chapelle's contribution. If we think of a model of learning such as Connectionism, learning involves ever increasing strengthening of connections. Other models of learning take a more traditional view of learning, basing knowledge on correct answers on a test or a certain amount of output. Such concepts as percentage of correct instances in obligatory contexts were prevalent in early research on second language acquisition (e.g., Bailey, Madden, & Krashen 1974; Dulay & Burt 1974; Pica 1984). But it is not uncommon for researchers to use some measure such as 4 out of 6. Talking about generalizability of knowledge or learning cannot take place without a theory of what it means to say that learning has taken place (why would 4/6 be chosen and not 5/6?). Looking at Markee's example, he states that his excerpt illustrates issues about the relationship between second language acquisition and use. Unfortunately, the excerpt didn't go far enough to determine learning, but it appears to be the case that the teacher turns to the class (lines 56–58) to see if someone could define "pretend," but we don't have information about how learning is defined – a necessary prerequisite to saying something about the relationship between acquisition and use. Swain's data are interesting in that she talks about learning moments resulting from verbalization, in particular through verbalization. Perhaps this is what Markee referred to when he mentioned the criticism of self-report data distorting previous behaviour. But Swain's point is an important one: verbalizations are part of the process of cognitive change. They may, in fact, distort previous behaviours, but that is the point. They represent moments of change and, if this is correct, they should "distort" previous behaviours in the sense that they should be different.

In what follows I present an example of some of the difficulties in determining what "knowledge" means and hence in knowing what we are generalizing. As Duff said, and as mentioned above, how do we know what is generalizable? So, with this example, the intent is to problematize some of these issues. The exchange reported below happened to eight applied linguists and reflects the thoughts of one of them.

Socializing the cost: A learning moment: Or is it?

Location: 8 women, attending a conference in a large Midwestern city, at a French Vietnamese restaurant.

Participants: 6 NSs and 2 NNSs, although all the principal players in this episode were NSs of English:
Sharon
Dana
Paula
Mary

Background: Prior discussion about connectionism and differences between representation and processing and what it means to acquire something and how do we measure acquisition (e.g., 90% suppliance in obligatory context?). Discussions included the meaning of learning within a connectionist framework as being strengthening of connections, the possibility that learning constantly takes place if only to involve continued strengthening of associations, and the possibility that learning means ability to generalize to novel contexts. Sharon boldly states that the only learning that will take place at dinner that night would be content learning; she would learn nothing new relating to language.

Immediate context: Paula was negotiating with the waitperson about the bill. She wanted the wine (2 bottles) to be charged to her and to Mary, with the food to be divided amongst the 8 individuals present.

Conversation:
Dana: (overhearing the conversation between Paula and the waitperson): I don't mind socializing the cost.
Sharon: (only partially attending, was struck by the turn of phrase): What?
Dana: I don't mind socializing the cost.
Sharon: (now enamoured by this expression, turns to Mary): You and Paula are picking up the wine tab, but Dana says that she doesn't mind socializing the cost.
Mary: What?
Sharon: Dana says that she doesn't mind socializing the cost of the wine.

What followed was a continuation of this conversation, but now involving a fifth participant. This conversation revolved around the use of this phrase with the participants attempting to generalize this new phrase to new contexts, for example:

socialize the supervision of theses

but

*socialize the workload.

Dana (the NS of "socializing the cost") and Sharon had similar intuitions about where this phrase could and could not be used. Sharon had by now admitted that she did learn something new about language that evening and had agreed to be evaluated on a posttest in three weeks' time.

Following this event, Sharon thought about the conversation and became convinced that she had learned this phrase and had even shown some good knowledge of its limitations (i.e., she had good intuitions); she was worried that she might forget (by now confusing the phrase with "sharing the cost"). She therefore rehearsed and rehearsed and rehearsed in her mind. So, the question to be asked is: When did learning take place? Was it

– At the moment of hearing the phrase
– During the subsequent discussion and verbalization of the limitations
– During the rehearsal phase
– Or, at a later time when she begins to use it?

Verbalization, as Swain describes it, can definitely be a learning moment, but learning is not often a one-time event; it may require follow-up information (see Gass 1997), which can inter alia consist of verbalization, hearing something, thinking about it. I have argued elsewhere that interaction can be a priming device (Gass 1988). It is that moment when attention is drawn to something, but there needs to be follow-up to reinforce or possibly to strengthen associations. I bring up this example and my thinking about it to raise the question in the context of generalizability as to what it is we are generalizing. We need to know when learning takes place to truly understand the concept of generalizability and how it takes place. This is precisely Chapelle's point; we need to understand the construct – knowledge of language. And, finally, we need to know the domain of generalizability.


Susan Gass

Note

The details of the two groups of teachers go beyond the scope of the paper. However, as a general comment, suffice it to say that, of the preservice teachers, no one had any ESL teaching experience, although 5 of the 11 had tutored non-native speakers of English. In the experienced teacher group, the range of ESL teaching experience was from 4 to 27 years.

References

Bailey, N., Madden, C., & Krashen, S. (1974). Is there a "natural sequence" in adult second language learning? Language Learning, 24, 235–243.

Crandall, J. (2002). A delicate balance: The need for qualitative research in an increasingly quantitative environment. Paper presented at American Association for Applied Linguistics, April, Salt Lake City.

Dulay, H. & Burt, M. (1974). Natural sequences in child second language acquisition. Language Learning, 24, 37–53.

Gass, S. (1988). Integrating research areas: A framework for second language studies. Applied Linguistics, 9, 198–217.

Gass, S. (1997). Input, interaction and the development of second languages. Mahwah, NJ: Lawrence Erlbaum.

Gass, S. & Mackey, A. (2000). Stimulated recall in second language research. Mahwah, NJ: Lawrence Erlbaum.

Mackey, A. & Gass, S. (2005). Second language research: Methodology and design. Mahwah, NJ: Lawrence Erlbaum.

Mackey, A., Polio, C., & McDonough, K. (2004). The relationship between experience, education, and teachers' use of incidental focus-on-form techniques. Language Teaching Research, 8, 301–327.

Pica, T. (1984). Methods of morpheme quantification: Their effect on the interpretation of second language data. Studies in Second Language Acquisition, 6, 69–78.

Polio, C. & Gass, S. (1997). Replication and reporting: A commentary. Studies in Second Language Acquisition, 19, 499–508.

Polio, C., Gass, S., & Chapin, L. (2005). Preservice and experienced teachers' perceptions of feedback during interaction. Paper presented at Voice and Vision in Language Teacher Education Conference, University of Minnesota, June.

Swain, M. & Lapkin, S. (1998). Interaction and second language learning: Two adolescent French immersion students working together. Modern Language Journal, 82, 320–337.

Swain, M. & Lapkin, S. (2002). Talking it through: Two French immersion learners' response to reformulation. International Journal of Educational Research, 37, 285–304.

Swain, M. & Lapkin, S. (in press). "Oh, I get it now!" From production to comprehension in second language learning. In D. M. Brinton & O. Kagan (Eds.), Heritage language acquisition: A new field emerging. Mahwah, NJ: Lawrence Erlbaum.


Negotiating methodological rich points in applied linguistics research

An ethnographer’s view*

Nancy H. Hornberger
University of Pennsylvania

"Methodological rich points" are those points that make salient the pressures and tensions between the practice of research and the changing scientific and social world in which researchers work – points where our assumptions about the way research works and the conceptual tools we have for doing research are inadequate to understand the worlds we are researching. In this paper, I highlight some methodological rich points around issues of inference and generalizability in applied linguistics research, drawing on the papers in this volume and on my own and others' ethnographic applied linguistics research. At the same time, I seek to reframe these issues in the context of more basic questions of research methodology and ethics.

Introduction

Applied linguists research language learning and teaching in a range of settings including classrooms, communities, test and experimental situations, using a variety of methods, tools, and strategies, in order to understand, inform policy, and transform language teaching/learning. This is, at its most basic, the applied linguistics research endeavor. Within that endeavor, there are diverse ways applied linguists might approach issues of inference and generalizability, and the authors in this volume provide a substantial and thoughtful sample, exploring limits and possibilities of these principles in advancing applied linguistics research within their particular specializations.

For me as an ethnographer of language and education, these papers provoke reflection on what I have come to think of as "methodological rich points," points that make salient the pressures and tensions between the practice of research and the changing scientific and social world in which researchers work.1

In other words, methodological rich points are those times when researchers learn that their assumptions about the way research works and the conceptual tools they have for doing research are inadequate to understand the worlds they are researching. When we pay attention to those points and adjust our research practices accordingly, they become key opportunities to advance our research and our understandings.

I have borrowed and adapted the term methodological rich points from ethnographer Michael Agar's notion of "rich points" as those times in ethnographic research when something happens that the ethnographer doesn't understand; those times when "an ethnographer learns that his or her assumptions about how the world works, usually implicit and out of awareness, are inadequate to understand something that had happened" [in the corner of the world he or she is encountering] (1996:31). Agar discusses rich points as one of three important pieces of ethnography: participant observation makes the research possible, rich points are the data you focus on, and coherence is the guiding assumption by which you seek out a frame within which the rich points make sense (1996:32). Rich points, then, are points of experience that make salient the differences between the ethnographer's world and the world the ethnographer sets out to describe. Methodological rich points are, by extension, points of research experience that make salient the differences between the researcher's perspective and mode of research and the world the researcher sets out to describe.

The authors herein provide insight into just such methodological rich points around issues of inference and generalizability in applied linguistics research. In what follows, I try to highlight these, drawing on the papers in this volume and on my own and others' ethnographic applied linguistics research. At the same time, I seek to reframe these issues in the context of more basic questions of research methodology and ethics. As a sociolinguist, I borrow as organizing rubric to do this the paradigmatic heuristic for sociolinguistic analysis first offered by Fishman (1971:219) and here adapted to applied linguistics research to ask: who researches whom and what, where, how, and why?

Who researches whom in applied linguistics?

At the most basic level, applied linguists research language learners and users – teachers, students, community members, policy makers. The papers here are no exception – the research described is that of applied linguists researching students, teachers, and language learners. There is not a lot of problematizing of the researcher/researched relationship, beyond on the one hand a concern with finding ways to involve participants in the research to enhance findings, and on the other a concern with paying attention to the social consequences of testing practices for those tested. The former is exemplified by involving participants in member checks (Duff) or stimulated recall (Swain) as ways of deepening the researchers' understanding and interpretation of observed behaviors. The latter is explicitly addressed as an aspect of validity in Messick's framework as discussed by McNamara.

For Duff, the use of member checks, whereby the researcher consults with participants during analysis and write-up of the findings, is one of several ways of attending to establishing the credibility of research findings and generalizations (along with other strategies such as sufficient amount and diversity of data, consideration of counter-examples, triangulation not only of data and methods but also of theory and researchers). Beyond consulting participants for the sake of credibility, there is, particularly in ethnographic research, another set of participant-related concerns I want to highlight here as a methodological rich point – these are the concerns around questions of collaboration, authority and representation.

Authority refers to the researcher's authority over the interpretation of the data – the right to claim that he or she has 'got it right' in reporting findings. On what basis does the researcher have (or not) the authority to speak for the participants (the researched)? This issue is closely linked to that of collaboration. "Ethnographic research is collaborative . . . It's always been that way. . . What the new ethnography calls for is attention to the way collaborative work leads to the results" (Agar 1996:16). Agar goes on to note that the authority issue also puts the spotlight on the ethnographer and the question of who studies whom (1996:17), leading to questions about who is self and who is other, and even what is emic and what is etic (Agar 1996:21).

In applied linguistics research, this methodological rich point has been forcefully and articulately raised in terms of the slogan "research on, for, and with subjects," put forward by Cameron, Rampton and colleagues (Cameron et al. 1992). After first discussing issues of power and of positivist, relativist, and realist paradigms of research, the authors introduce a distinction between an ethics-based approach (research on subjects), which seeks to balance the needs of a discipline in pursuit of knowledge with the interests of the people on whom the research is conducted; an advocacy-based approach (research on and for subjects), which despite its commitment to participants nevertheless still tends toward a positivist notion that there is one true account; and an empowerment-oriented approach (research on, for, and with subjects), which uses interactive, dialogic methods and seeks to take into account the subjects' research agenda, involve them in feedback and sharing of knowledge, consider representation and control in the reporting of findings, and take seriously the policy-making implications of the research. The authors clearly advocate the last approach and offer examples of attempts to implement it in their own research.

Educational ethnographer Reba Page speaks of a crisis in representation in qualitative research. She writes that increasing recognition of limits to "the qualitative claim that researchers could document and explain, fully and accurately, another's life-world as it is" (Page 2000:5) presents both an aesthetic challenge centering on how knowledge is represented in texts and a political challenge centering around whose representations are the ones put forward. The political challenge has given rise to new interdisciplinary alignments, fieldwork relations, and advocacy stances (Page 2000:6–7). Similarly, in response to the aesthetic challenge, scholars have "experimented with modes of reproduction that [give] more prominence to their own meaning-making, the artfulness of accounts, and the diverse 'voices' and alternate views of informants" in the form of dialogic scripts, collaborative authorship, autobiographical ethnographies, and even novels (Page 2000:6).

Paying more attention to issues of authority, collaboration, and representation in applied linguistics research may take a number of forms – it may be about working with multiple members of a research team; it may also be about relationships between researcher and researched, and may range from consultative to fully participatory relationships. It may be about collecting and analysing data; it may also be about writing up and reporting findings. It is without doubt about incorporating multiple voices in the research process and producing multi-voiced texts.

What do we research?

A succinct (and partly oversimplified) way of stating what it is that applied linguist researchers concern ourselves with is that we research language teaching and learning, as well as the role of language in learning and teaching (cf. Hornberger 2001 on educational linguistics). In the papers collected here we find these concerns instantiated in investigations of: language learning and instructional practices in classrooms at elementary, secondary, and college levels (Duff), language acquisition and use in classroom interaction (Markee), the relation between language and thought (Swain), the ways people mobilize grammatical resources to accomplish communicative and social purposes (Larsen-Freeman), the acquisition of vocabulary in a second language (Chapelle), and methods of assessing what a language learner knows and can do (McNamara; Deville & Chalhoub-Deville).

Striking for an ethnographer in reading these essays is the extent to which a qualitative concern for context and contextualization comes up in relation to the content (the "what") of the authors' research and its validity, reliability, or dependability, even in research which does not necessarily use qualitative methods. McNamara discusses at length the four quadrants of Messick's validity framework for test constructs, of which three are about contextual uses, values, and consequences attached to the test; he also draws attention to the growing body of work using discourse-based approaches to the validation of oral language assessment. Larsen-Freeman appeals to contextual analysis as a research methodology for accounting for form, meaning, and use of linguistic forms – in this case, context refers primarily to the discourse context in which naturally observed spoken and written language occurs. Swain argues that verbal protocols are not simply a means of reporting what one is thinking, but rather actually enable changes in thinking; they are therefore not a neutral means of collecting data (i.e. content), but in fact are part of the treatment (i.e. context). Markee demonstrates through detailed conversation analysis of a stretch of classroom interaction that the social relationship between the teacher and learner (i.e. language-learning context) had to be repaired before further language-learning oriented talk (i.e. language-learning content) could occur. Meanwhile, Duff, writing explicitly as a qualitative researcher, includes contextualization and ecological validity of tasks among other means of addressing internal validity in qualitative research; and goes on to provide examples from her own ethnographic classroom research in Hungary of how contextualization – specifically, attention to sociopolitical and historical context, the participants and their interests, the tasks or instructional practices used and the participants' understandings of these – is crucial for establishing internal validity of her findings. Duff emphasizes the need to consider context from the point of view of how "sociopolitical structure not only influences and mirrors, but also is constituted in" language learning and teaching events and interactions in everyday classrooms.

These various insights point to another methodological rich point in applied linguistics research – namely the recognition that knowledge, and specifically language learning, is co-constructed in contexts of social interaction. McNamara takes note of the growing critique of individualistic models of proficiency, in favor of work that stresses the co-construction of performance (raising questions around, for example, what exactly is being assessed in an oral proficiency interview). Markee offers a social constructivist critique of second language acquisition (SLA) research on social interaction. Duff recalls psychologist Lee Cronbach's early (1982) paradigm-shifting acknowledgement that "human action is constructed, not caused." Swain adopts as premise that language, and specifically verbalization, is constitutively involved in thinking. Larsen-Freeman is interested in how people use language (specifically grammatical resources) to, among other things, manage interpersonal relationships and position themselves socio-politically. These observations rest on social constructionism, the "view that instead of being the product of forces that actors neither control nor comprehend, human reality is extensively reproduced and created anew in the socially and historically specific activities of everyday life" (Rampton 2000:10; citing Giddens 1976, 1984).

For sociolinguistic ethnographers and linguistic anthropologists of education, social and cultural context has always been the bedrock of research method, evident in such long established strands of work as the ethnography of communication – documenting and comparing ways of speaking (Hymes 1964, 1968; Philips 1983; Heath 1983), interactional sociolinguistics – revealing the multiple linguistic means by which we embed social meanings in interaction (Gumperz 1982), and microethnography – demonstrating the importance of situationally emergent social identity and co-membership (Erickson & Shultz 1982) (see Hornberger 1995 for a review of these three sociolinguistic approaches to school ethnography). More recently, scholars working in these traditions have begun to frame their research with more explicit attention to social constructionism, documenting patterns of language use and social relations in multilingual classrooms and communities, and exploring dimensions of discourse that maintain the status quo in societal power relations (e.g. Martin-Jones & Jones 2000; Heller & Martin-Jones 2001; Hornberger 2003; Wortham & Rymes 2003; Creese & Martin 2003; Arkoudis & Creese 2005; McCarty 2005). There is also increasingly explicit attention in this work to the heterogeneous, multilingual, multicultural, multiliterate classroom and community contexts in which language learning, teaching, and use take place. This we take up next.

Where do we research?

Classrooms at all levels, community sites of formal and non-formal education, testing and (semi-)experimental situations are the usual venues for applied linguistics research, and the papers herein are no exception. In keeping with the volume's theme of generalizability, the typicality or representativeness of sites or cases comes up in several of the papers, usually with the cautionary acknowledgement that typicality is quite elusive (and perhaps in some cases too easily assumed). Larsen-Freeman points out that intuitional data on sentence grammaticality must be complemented by other data (observational and elicited), since "intuitions can be unreliable or undependable when looking for typical patterns because humans tend to notice the unusual more than the typical." Duff reminds us that typicality may not always be the aim in selecting sites or cases, anyway; atypical, unique, resilient, extreme, or even pathological cases or instances may be purposely sought out for the potential insight they offer – these are so-called intrinsic cases, in Stake's (2000) terms, as distinct from instrumental cases examined explicitly in relation to a generalization.

Markee offers a two-pronged argument about research settings and typicality in discussing what he calls the domain problem, following Schegloff (1993). On the one hand, he acknowledges that Conversation Analysis (CA) originally focused on the study of ordinary conversation and has only more recently taken up the analysis of institutional talk (e.g. news, medical, courtroom, or classroom talk), and that one can't necessarily generalize from the former speech exchange system to the latter. On the other, he argues that the predominantly experimental tradition in second language acquisition research on negotiated interaction has not adequately studied the interactional speech exchange system (i.e. domain) in which the negotiation occurs. He suggests that CA work on classroom interaction may provide a means to address this mutual lack and a qualitative basis on which to build a quantified search for generalizations.

Apart from typicality of settings, the papers also allude to the complexity of any particular research setting. Indeed, when Duff suggests that one means of enhancing generalizability of cases or settings is to conduct multi-site or multiple-case studies, she acknowledges that this nevertheless brings with it a concomitant reduction of possibilities for in-depth description and contextualization of the kind that would do justice to the complexity of any one site or case.

For applied linguists, one very clear aspect of research setting complexity – one which constitutes another methodological rich point for applied linguistics research – is the increasingly diverse range of settings where we do research, along with the increasingly heterogeneous, multilingual nature of those settings. So that, in addition to research in classrooms or language learning or testing settings, we also have (comparative) ethnographic studies of language, literacy, and education in school-and-community settings (e.g. Heath 1983; Hornberger 1988; Delgado-Gaitan 1990; McLaughlin 1992), in out-of-school settings such as adult literacy programs, workplaces, and religious settings (e.g. Heath & McLaughlin 1993; Spener 1994; Knobel 1999; Hull & Schultz 2002), in bilingual and multilingual classroom settings around the world (e.g. Heller & Martin-Jones 2001; Creese & Martin 2003), in language education professional development and practice settings (e.g. Henning 2000; Pérez et al. 2003; Brutt-Griffler & Varghese 2004; Hawkins 2004; Arkoudis & Creese 2005), and in language education policy-making settings and activities (e.g. Tollefson 1995, 2002; Freeman 2004; Johnson 2004; Tollefson & Tsui 2004; Canagarajah 2005b).

For applied linguist ethnographers, the crux of the methodological rich point here is the changing nature – or perhaps more accurately, the deepening understanding – of the concept of speech community as research setting for the ethnographic study of language use and language learning. Defined in sociolinguistic work of the 1960s as a community whose members share at least one language variety and the norms for its use (Hymes 1972:54; Fishman 1971:232), the underlying assumption in the concept is not that there is uniformity of communicative resources and practices within a speech community, but rather that there is a patterned diversity of those resources and practices: as Hymes often repeated, it is "not replication of uniformity but organization of diversity" (Hymes citing Wallace 1961). The task of the ethnography of a speech community is to "Take as context a community, investigating its communicative habits as a whole, so that any given use of channel and code takes its place as but part of the resources upon which the members of the community draw" (Hymes 1964:3).

In response to the rise of post-modernity and social constructionism, Rampton (2000) tells us, analysis of the speech community has moved on the one hand toward "investigation of 'community' as itself a semiotic sign and ideological product" and on the other toward "close-up analysis of face-to-face interaction in relatively consolidated social relationships", sometimes termed "communities of practice (e.g. unions, trades, boards of directors, marriages, bowling teams, classrooms)", i.e. "a range of social relationships of varying duration" (Rampton 2000:10, 12). Since the 1990s, there has been a shift from an exclusive focus on speech communities (and descriptions of deficit, difference, or domination across them) to a focus also on communities of practice and communicative practices, yielding "fine-grained and complex account[s] of imposition, collusion and struggle" (Rampton 2000:12), where randomness and disorder are more important than system and coherence, and anomalous social difference is treated as central rather than peripheral (Rampton 2000:9, 18). These tendencies are evident in ethnographic research in all the variety of speech communities mentioned above, from classroom to school to out-of-school and beyond.

Indeed, this methodological rich point is not only about the increasingly heterogeneous nature of any one research setting, but also about the increasingly diverse range of settings in which applied linguistics research takes place – from face-to-face interactions at the micro-level to policy discourses and globalizing forces at the macro-level. This is particularly true of ethnographic and sociolinguistic studies, which increasingly turn their attention toward the discourses of language planning and policy as well as those of classroom and community interaction, encompassing both the global and the local. There is growing recognition that language planning and policy-making happens as much at the micro-level of the classroom as it does at the macro-level of government (Ricento & Hornberger 1996; Ricento 2006); acknowledgement of the tensions in language-in-education policies and practices, especially in post-colonial contexts undergoing simultaneous and contradictory processes of decolonization and globalization (Lin & Martin 2005); and movement toward a more localized orientation that takes seriously the tensions, ambiguities, and paradoxes of language allegiances and sociolinguistic identities in order to construct policies from the ground up (Canagarajah 2005a; see also Hornberger 1996). Ecological approaches, in particular, have been proposed as a way to do this (Hornberger 2003; Canagarajah 2005a, 2005b).

How do we collect, analyze, and interpret data?

As applied linguists, our primary data are bits or stretches of spoken or written language, collected primarily by observation, recording, elicitation, and testing; analyzed usually in some way for form, function, and meaning; and interpreted within a variety of conceptual frameworks ranging from highly specified to more loosely configured. Among the methods of data collection discussed or exemplified in this volume are transcription of recorded ordinary conversation/institutional talk (Markee); ethnographic participant observation (Duff); intuition, observation and elicitation (Larsen-Freeman); collaborative dialogue, verbal protocols including concurrent think-alouds and retrospective introspection or stimulated recall, and videotaped interaction in the classroom (Swain); and assessment or testing (McNamara; Chapelle; Deville & Chalhoub-Deville).


Nancy H. Hornberger

All the authors define and discuss how they approach inference as integral to processes of data analysis and interpretation. Whether it be from people's observable behavior to their underlying knowledge systems (Duff), from language performance to forms, meanings, and functions (Larsen-Freeman), from observed test performance to performance under non-test conditions (McNamara), or from observed performance to what the performance means (Chapelle), inference is always about the logical connection the researcher draws from research evidence to claims about that evidence – and also reflexively back on the inferential construct itself. So McNamara tells us that test data can be used to validate inferences we draw based on constructs, but also to validate the constructs themselves; while Chapelle distinguishes in vocabulary acquisition research between construct inference (from performance to vocabulary construct) and theory inference (from vocabulary construct to a theory of vocabulary knowledge). Markee argues that CA work on classroom interaction could provide the missing logical link from the accumulated evidence provided by experimental SLA research on negotiated interaction to theoretical claims about language learning.

The authors also problematize inference in a variety of ways. Larsen-Freeman reminds us that inferences are always provisional and partial, subject to refutation as other data are considered; but she also points out that more data do not necessarily lead to better inferences – after all, a large corpus may increase the dependability of observed patterns, but may also reveal more variations within the patterns; and it remains true that insightful inferences have often been drawn from one or only a few instances; furthermore, since language is always changing, it is a moving target in any case. Markee muses on the greatest inferential puzzle of SLA, namely the question of "how psycholinguistic questions of language learning intersect with sociolinguistic aspects of language use"; and he also raises the "significance" problem (following Schegloff), pointing out that conclusions drawn about a great number of instances of interactional repair are not useful if the categories of repair are ambiguous and decontextualized.

Other authors demonstrate that different inferences can be drawn from the same data, based on one's theoretical approach or conceptual framework and concomitant construct. Chapelle reminds us that construct definition for vocabulary assessment may not look exactly the same when undertaken from trait, behaviorist, or interactionist perspectives. Swain demonstrates that different inferences may be drawn from verbal protocols by information processing theorists (who see language primarily as communication of thought) and sociocultural theory of mind theorists (who see language as crucially implicated in human thinking).

The issues these authors highlight – sufficiency of data as a basis for inference and the inferential relation between theory and data – constitute methodological rich points that are, if anything, even more central in ethnographic applied linguistics research, which is by definition interpretive. Interpretation is after all a kind of inference, a search for patterns and understandings (Duff). Interpretation and induction, both terms used more often than inference in ethnographic research, explicitly highlight the subjective involvement of the ethnographer in mediating between theory and data, crucial to achieving two of ethnography's defining characteristics – a holistic and an emic view. Emic in that the ethnographer attempts to infer the "native" point of view: to describe the culture, cultural situation, or cultural event as its members understand it and participate in it (i.e. as they make sense of it). Holistic in that the ethnographer seeks to create a whole picture, one that leaves nothing unaccounted for and that reveals the interrelatedness of all the component parts (Hornberger 1992:186, 1994:688).

The emic/etic distinction so often invoked in ethnographic research was first proposed by Pike (1954), in direct parallel to the phonemic/phonetic distinction in phonology. In the study of human behavior, the etic standpoint is one situated outside the system studied, in which units and classifications are determined on the basis of existing knowledge of similar systems, and against which the particular system is measured; while the emic standpoint is one situated inside the particular system studied, which views the system as an integrated whole, and in which units and classifications are determined during and not before analysis, and are discovered and not created by the researcher. Both standpoints are necessary, and it is the movement back and forth between them that takes our understanding forward. Hymes speaks of Pike's three moments, etic-1, emic, etic-2, in terms of a "dialectic in which theoretical frameworks are employed to describe and discover systems, and such discoveries in turn change the frameworks" (Hymes 1990:421). This dialectic movement from theory to data and back again is essential to the process of ethnographic interpretation, and it is the ethnographer who provides the inferential link.

In a recent essay on the development of conceptual categories in ethnographic research, Sipe and Ghiso emphasize the paradoxical nature of the interpretive process wherein "theoretical frameworks are essential to structuring a study and interpreting data, yet the more perspectives we read about, the greater the danger of overdetermining conceptual categories and the ways in which we see the data" (Sipe & Ghiso 2004:473). Demonstrating a process in which "induction and deduction are in constant dialogue" (Erickson 1986:21), Sipe and Ghiso provide a detailed example of a breakthrough in Sipe's development of categories for his classroom data that came precisely from his reading Bakhtin at the time. In a commentary on their essay, Erickson underlines this point, noting that if Sipe had been reading someone else, e.g. Fish, Foucault, or Habermas, the analysis might have gone in a different direction (Erickson 2004:489).

Part of what drove Sipe to look for a further category in the first place was the existence of data that didn't fit the categories he had worked out up to that point – outlier data that he became increasingly uncomfortable categorizing as simply "off-task" (Sipe & Ghiso 2004:480). It was a question of sufficiency not so much in the amount of data as in the kinds of data that posed an inferential challenge for Sipe. Erickson comments on this, too, noting that whereas quantitas is always first about "what amounts?," qualitas is about "what kinds?" (Erickson 2004:487). Grappling with the data that didn't fit, the discrepant cases, Sipe achieved an interpretive breakthrough – and Erickson reinforces this point, emphasizing that Sipe's example demonstrates that neither ethnographic data themselves nor interpretive themes and patterns simply emerge, but rather must be found by the researcher (Erickson 2004:486).

Erickson applauds Sipe and Ghiso's demystification of ethnographic data construction and analysis and takes the demystifying process one step further by considering alternative approaches to the "exhaustive analysis of qualitative data," contrasting Sipe's bottom-up approach with a top-down approach that would "parse analytically from whole to part and then down again and again, successively identifying subsequent next levels and their constituents at that level of contrast [rather than] start by trying to identify parts first and then work up analytically from there" (Erickson 2004:491). He prefers the top-down approach himself, in part because he thinks that is what social actors do, and in part because it invites "parsing all the way down on both sides of [the] analytic divide" (Erickson 2004:491). Whether bottom-up or top-down, the quest is for holism. It is, ultimately, the holistic and emic quality of the ethnographer's account that determines its validity and generalizability. In a similar move toward demystifying data construction and analysis, Chapelle argues for the centrality of construct definition in vocabulary acquisition research and suggests that recent explicit discussion of vocabulary knowledge as a construct seems to mark progress toward a more generalized theory of L2 vocabulary acquisition. These allusions to generalizability bring us to the last part of our heuristic question.


Negotiating methodological rich points in applied linguistics research

Why do we research?

The goal of applied linguistics research is, at its most fundamental, to understand and inform the teaching and learning of language. Through their research, applied linguists seek to understand processes, inform policies, and transform practices of language learning and teaching. In this vein, Larsen-Freeman identifies three purposes in her work: to inform the identification of the language acquisition/language learning challenge; to better understand processes contributing to or interfering with meeting that challenge; and to adopt pedagogical strategies, design materials, and educate teachers based on understandings of those challenges and processes. Markee seeks to explicate the complex phenomenon of how and why second languages are learned, Chapelle to evaluate both materials and tests used in L2 vocabulary learning and teaching, and Swain to shed light on the role of verbalization in language learning and in research on language learning. McNamara and Deville & Chalhoub-Deville look at issues around the construction, validation, and use of tests in language learning and teaching.

In all of this and for all the authors herein, generalizability becomes an important consideration. McNamara points out that, of all the fields of applied linguistics, language testing is probably where questions of inference and generalizability are of most concern, since language tests are in essence procedures for generalizing. Yet it is also true that in every area of applied linguistics research, the overall aim of generalizability – i.e. "establish[ing] the relevance, significance and external validity of findings for situations or people beyond the immediate research project" (Duff) – holds true. A number of distinctions and related terms are discussed in the papers. Duff, following Firestone (1993), adopts the term analytic generalizability to refer specifically to generalization at an abstract or theoretical level, as distinct from generalization at the level of cases, populations or sites. Chapelle differentiates between a generalization inference (from performance on one task to performance on other similar tasks) and an extrapolation inference that generalizes from performance on the task (or tasks) to performance beyond the task. She notes that generalization inference is the same as dependability; and Deville & Chalhoub-Deville likewise use generalizability, dependability, and reliability interchangeably.

Concerns about generalizability raised by the authors revolve in particular around notions of variability and abstraction. Deville & Chalhoub-Deville emphasize that variability is critical to any discussion of reliability (and generalizability); Duff warns that generalizability may be inadvertently reduced by key sociocultural contextual variables, regardless of claims made by researchers. Larsen-Freeman tellingly observes that the price for increased generalizability is increased abstraction, to the point that one loses any sense of the particularity of the case; and Swain problematizes the extent to which we can generalize from inferences drawn from local and situated contexts. In addition to Schegloff's domain and significance problems mentioned in earlier sections, Markee discusses Schegloff's denominator and numerator problems, arguing that the quantified search for generalizability is premature and possibly misleading when it is not preceded by adequate specification of the analytically relevant denominator (e.g. repairs per task type) and numerator (e.g. absence as well as presence, rarity as well as frequency of repairs). With regard to procedures for generalizing from tests, McNamara addresses challenges arising from social science critiques of an overly abstract positivism, in relation on the one hand to the particular values implied within test constructs and on the other to the particular social and political uses made of test scores.

These concerns around variability and abstraction constitute another methodological rich point in applied linguistics research, one that ethnographers tend to grapple with in terms of transferability and particularity. Transferability, as Duff points out, assigns responsibility to readers to determine whether findings apply to another context; variability across contexts is taken for granted, but if the ethnographer provides enough rich and detailed description of one local context, it should be possible for the reader familiar with another local context to sort out what findings might or might not transfer. In that regard, the greater the particularity of description and interpretation (and the less the abstraction), the more likely it is that a reader will be able to determine whether these particular findings apply to another context. In this sense, Duff suggests that in qualitative research the goal is not generalization or prediction but rather a search for particularity – what Geertz famously called "thick description" (Geertz 1973).

Final reflections

Across all the methodological rich points these essays highlight for me, there is a common thread of recognition of the existence of multiple possible actors, multiple possible trajectories, and multiple possible truths (Duff) in applied linguistics research. This is all to the good, not because they should or could all be combined in some grand "mixed methods" research project; I am not much for mixed methods approaches anyway, partly because I don't think they're particularly new and partly because in my experience there is usually one dominant method subsuming the others. Rather, the value of multiple research actors, trajectories, and truths is that they all offer different versions of and perspectives on the vast language learning and teaching enterprise. To the degree that these multiple approaches share common methodological rich points such as those highlighted here, they are all potentially enriched by dialogue across and within the applied linguistics field.

Notably, along with inference and generalizability, these authors have also talked about credibility and contextualization in relation to research questions, collaboration and representation in relation to research participants, heterogeneity and co-construction in relation to research methods, and demystification and ecology in relation to analytical approaches. Underpinning all these are critical concerns that go beyond inferring or generalizing findings to transforming the realities they describe; there is increasingly explicit attention to power and inequality and to the role of research and of the researcher in interrogating those. As Agar puts it in relation to critical ethnography, "you look at local context and meaning, just like we always have, but then you ask, why are things this way? What power, what interests, wrap this local world so tight that it feels like the natural order of things to its inhabitants?" (Agar 1996:26). Or to paraphrase Pennycook on discourse analysis in applied linguistics research, the critical question becomes not only what language means and how that meaning is constructed across sentences, but also why those particular meanings, out of all possible available meanings, are expressed at that particular moment in time and place (Pennycook 1994:116). As research increasingly locates communicative practices as parts of larger systems of social inequality (Gal 1989:347), it becomes natural to ask what we, as researchers, can do about transforming those practices and those inequalities.

As applied linguists set about that task in our multiple and varied research endeavors, there are two more methodological rich points which to me seem basic – humility and respect. Humility before the rich diversity of language learning and teaching practices and contexts we have the privilege to observe and seek to understand, and respect for the language teachers, learners, and users, both individuals and communities, who untiringly and insightfully ply their language and pedagogical knowledge and skills, day in and day out, the world over.

I close with an example of ethnographic work in language and education that epitomizes many of the methodological rich points highlighted above. Pippa Stein recounts experiences with two projects she has worked on with pre-service and in-service language teachers in Johannesburg, both of which encourage students' use of a range of representational resources in their meaning making, including the linguistic mode in its written and spoken forms, the visual, the gestural, the sound, and the multimodal performance (Kress & van Leeuwen 1996; Kress 1997). A reflective practitioner, she is exploring "pedagogies in which the existing values attached to representational resources are reconfigured to take into account a broader notion of semiotic resources" (Stein 2004:37), specifically with the goal of ascribing equal value to resources brought by historically advantaged and historically disadvantaged students. Both the Performing the Literacy Archive Project and the Photographing Literacy Practices Project focus on literacy because "issues of literacy and access to it are at the heart of educational success in South African schools" (2004:37), but in them the students explore "the use of multiple semiotic modes in the making of meaning" (2004:37). Drawing on her reflections and on written and video documentation of the students' work over the several years she has done these projects with language teachers, she shows how these pedagogies "work with what students bring (their existing resources for representation) and acknowledge what [historically disadvantaged] students have lost" (2004:50). As she puts it, it is "the saying of the unsayable, that which has been silenced through loss, anger or dread, which enables students to re-articulate their relationships to their pasts. Through this process of articulation, a new energy is produced that takes people forward. I call this process of articulation and recovery re-sourcing resources" (2004:39). Stein's reflective practitioner account is about transforming language teaching and learning; that, I believe, is what applied linguistics research is most fundamentally about.

Notes

* This paper is based (very loosely) on my keynote talk at the 2000 Conference on Qualitative Research in Education at (then) Rand Afrikaans University in Johannesburg, South Africa. I would like to express my appreciation to Elizabeth Henning, who invited me and thereby got me started thinking on many of these issues, and whose own ethnographic work in teacher education (e.g. Henning 2000) beautifully expresses humility of stance and respect for persons in her research context.

1. Methodological rich points are akin to Eisenhart's (2001) muddles in educational ethnography. She discusses three, which are also reflected here to some degree: (1) the meaning of culture in postmodern times, (2) the increasing popularity of ethnographic research across disciplines, along with the backlash against it, and (3) the ethics of representation.


References

Agar, M. (1996). Ethnography reconstructed: The professional stranger at fifteen. In M.Agar, The professional stranger (pp. 1–51). New York, NY: Academic Press.

Arkoudis, S. & Creese, A. (Eds.). (2005). Teacher teacher talk in multilingual contexts.International Journal of Bilingual Education and Bilingualism. Special issue.

Brutt-Griffler, J. & Varghese, M. M. (Eds.). (2004). Bilingualism and language pedagogy.Clevedon, UK: Multilingual Matters.

Cameron, D., Frazer, E. et al. (1992). Researching language: Issues of power and method.London: Routledge.

Canagarajah, A. S. (2005a). Accommodating tensions in language-in-education policies:An Afterword. In A. Lin & P. Martin (Eds.), Decolonisation, globalisation: Language-in-education policy and practice. Clevedon, UK: Multilingual Matters.

Canagarajah, A. S. (Ed.). (2005b). Reclaiming the local in language policy and practice.Mahwah, NJ: Lawrence Erlbaum.

Creese, A. & Martin, P. (Eds.). (2003). Multilingual classroom ecologies: Inter-relationships,interactions and ideologies. Clevedon, UK: Multilingual Matters.

Cronbach, Lee J. (1982). Designing Evaluations of Educational and Social Programs. SanFrancisco: Jossey-Bass.

Delgado-Gaitan, C. (1990). Literacy for empowerment: The role of parents in children’seducation. London: Falmer Press.

Eisenhart, M. (2001). Educational ethnography past, present, and future: Ideas to think with.Educational Researcher, 30(8), 16–27.

Erickson, F. (1986). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.),Handbook of research on teaching (3rd ed., pp. 119–161). New York: Macmillan.

Erickson, F. (2004). Demystifying data construction and analysis. Anthropology andEducation Quarterly, 35(4), 486–493.

Erickson, F. & Shultz, J. (1982). Counselor as gatekeeper: Social interaction in interviews. NewYork, NY: Academic Press.

Firestone, W. A. (1993). Alternative arguments for generalizing from data as applied toqualitative research. Educational Researcher, 22, 16–23.

Fishman, J. A. (1971). The sociology of language: An interdisciplinary social scienceapproach to language in society. In J. A. Fishman (Ed.), Advances in the sociology oflanguage (pp. 217–237). The Hague: Mouton.

Freeman, R. D. (2004). Building on community bilingualism. Philadelphia, PA: CaslonPublishing.

Gal, S. (1989). Language and political economy. Annual Review of Anthropology, 18, 345–367.Geertz, C. (1973). The interpretation of cultures: Selected essays. New York, NY: Basic Books.Giddens, A. (1976). New rules of sociological method. London: Hutchinson.Giddens, A. (1984). The constitution of society: Outline of the theory of structuration. Berkeley,

CA: University of California Press.Gumperz, J. J. (1982). Discourse strategies. Cambridge: CUP.Hawkins, M. R. (Ed.). (2004). Language learning and teacher education: A sociocultural

approach. Clevedon, UK: Multilingual Matters.

JB[v.20020404] Prn:13/12/2005; 13:29 F: LLLT1211.tex / p.18 (907-1007)

Nancy H. Hornberger

Heath, S. B. (1983). Ways with words: Language, life and work in communities and classrooms.New York, NY: CUP.

Heath, S. B. (2000). Linguistics in the study of language in education. In B. M. Brizuela, J.P. Stewart, et al. (Eds.), Acts of inquiry in qualitative research (pp. 27–36). Cambridge,MA: Harvard Educational Review.

Heath, S. B. & McLaughlin, M. (Eds.). (1993). Identity and inner-city youth: Beyond ethnicityand gender. New York, NY: Teachers College Press.

Heller, M. & Martin-Jones, M. (Eds.). (2001). Voices of authority: Education and linguisticdifference. Norwood, NJ: Ablex.

Henning, E. (2000). Walking with ”barefoot” teachers: An ethnographically fashionedcasebook. Teaching and Teacher Education, 16, 3–20.

Hornberger, N. H. (1988). Bilingual education and language maintenance: A southernPeruvian Quechua case. Berlin: Mouton.

Hornberger, N. H. (1992). Presenting a holistic and an emic view: The Literacy in TwoLanguages project. Anthropology and Education Quarterly, 23, 160–165.

Hornberger, N. H. (1994). Ethnography. TESOL Quarterly, 28(4), 688–690.Hornberger, N. H. (1995). Ethnography in linguistic perspective: Understanding school

processes. Language and Education, 9(4), 233–248.Hornberger, N. H. (Ed.). (1996). Indigenous literacies in the Americas: Language planning

from the bottom up. Berlin: Mouton.Hornberger, N. H. (2001). Educational linguistics as a field: A view from Penn’s program on

the occasion of its 25th anniversary. Working Papers in Educational Linguistics, 17(1/2),1–26.

Hornberger, N. H. (Ed.). (2003). Continua of biliteracy: An ecological framework foreducational policy, research and practice in multilingual settings. Clevedon, UK:Multilingual Matters.

Hull, G. & Schultz, K. (Eds.). (2002). School’s out! Bridging out-of-school literacies withclassroom practice. New York, NY: Teachers College Press.

Hymes, D. H. (1964). Introduction: Toward ethnographies of communication. AmericanAnthropologist, 66(6), 1–34.

Hymes, D. H. (1968). The ethnography of speaking. In J. A. Fishman (Ed.), Readings in thesociology of language (pp. 99–138). The Hague: Mouton.

Hymes, D. H. (1972). Models of the interaction of language and social life. In J. Gumperz &D. Hymes (Eds.), Directions in sociolinguistics: The ethnography of communication (pp.35–71). New York, NY: Holt, Rinehart, and Winston.

Hymes, D. H. (1990). Epilogue to ’The things we do with words’. In D. Carbaugh(Ed.), Cultural communication and intercultural contact (pp. 419–429). Hillsdale, NJ:Lawrence Erlbaum.

Johnson, D. C. (2004). Language policy discourse and bilingual language planning. WorkingPapers in Educational Linguistics, 19(2), 73–97.

Knobel, M. (1999). Everyday literacies: students, discourse, and social practice. New York, NY:Peter Lang.

Kress, G. (1997). Before writing : Rethinking the paths to literacy. London: Routledge.Kress, G. & van Leeuwen, T. (1996). Reading images: The grammar of visual design. London:

Routledge.

JB[v.20020404] Prn:13/12/2005; 13:29 F: LLLT1211.tex / p.19 (1007-1104)

Negotiating methodological rich points in applied linguistics research

Lin, A. & Martin, P. (Eds.). (2005). Decolonisation, globalisation: Language-in-educationpolicy and practice. Clevedon, UK: Multilingual Matters.

Martin-Jones, M. & Jones, K. (Eds.). (2000). Multilingual literacies: Reading and writingdifferent worlds. Amsterdam: John Benjamins.

McCarty, T. L. (Ed.). (2005). Language, literacy, and power in schooling. Mahwah, NJ:Lawrence Erlbaum.

McLaughlin, D. (1992). When literacy empowers: Navajo language in print. Albuquerque,NM: University of New Mexico Press.

Page, R. (2000). The turn inward in qualitative research. In B. M. Brizuela, J. P. Stewart, etal. (Eds.), Acts of inquiry in qualitative research (pp. 3–16). Cambridge, MA: HarvardEducational Review.

Pérez, B., Flores, B., & Strecker, S. (2003). Biliteracy teacher education in the U.S.Southwest. In N. H. Hornberger (Ed.), Continua of biliteracy: An ecological frameworkfor educational policy, research, and practice in multilingual settings (pp. 207–231).Clevedon, UK: Multilingual Matters.

Pennycook, A. (1994). Incommensurable discourses? Applied Linguistics, 15(2), 115–138.Philips, S. U. (1983). The invisible culture: Communication in classroom and community on

the warm springs reservation. New York, NY: Longman.Pike, K. (1954). Emic and etic standpoints for the description of behavior. In K. Pike (Ed.),

Language in relation to a unified theory of the structure of human behavior (pp. 37–72).The Hague: Mouton.

Rampton, B. (2000). Speech community. In J. Verschueren, J. Östman, J. Blommaert, & C.Bulcaen (Eds.), Handbook of pragmatics 1998 (pp. 1–34). Amsterdam: John Benjamins.

Ricento, T. K. (Ed.). (2006). An introduction to language policy: Theory and method. NewYork, NY: Blackwell Publishing.

Ricento, T. K. & Hornberger, N. H. (1996). Unpeeling the onion: Language planning andpolicy and the ELT professional. TESOL Quarterly, 30(3), 401–428.

Schegloff, E. A. (1993). Reflections on quantification in the study of conversation. Researchon Language and Social Interaction, 26, 99–128.

Sipe, L. R. & Ghiso, M. P. (2004). Developing conceptual categories in classroom descriptiveresearch: Some problems and possibilities. Anthropology and Education Quarterly,35(4), 472–485.

Spener, D. (Ed.). (1994). Adult biliteracy in the United States. Washington, DC: Center forApplied Linguistics.

Stake, R. (2000). Case studies. In N. Denzin & Y. Lincoln (Eds.), Handbook of qualitativeresearch (pp. 435–454). Thousand Oaks, CA: Sage.

Stein, P. (2004). Re-sourcing resources: Pedagogy, history and loss in a Johannesburgclassroom. In M. R. Hawkins (Ed.), Language learning and teacher education: Asociocultural approach (pp. 35–51). Clevedon, UK: Multilingual Matters.

Tollefson, J. W. (Ed.). (1995). Power and inequality in language education. New York, NY:CUP.

Tollefson, J. W. (Ed.). (2002). Language policies in education: Critical issues. Mahwah, NJ:Lawrence Erlbaum.

Tollefson, J. W. & Tsui, A. B. M. (Eds.). (2004). Medium of instruction policies: Which agenda?Whose agenda? Mahwah, NJ: Lawrence Erlbaum.

JB[v.20020404] Prn:13/12/2005; 13:29 F: LLLT1211.tex / p.20 (1104-1112)

Nancy H. Hornberger

Wallace, A. F. C. (1961). Culture and personality. New York, NY: Random House.
Wortham, S. & Rymes, B. (Eds.). (2003). Linguistic anthropology of education. Westport, CT: Praeger.


Author index

A
Adams, L.
Adams, R.
Adler, P.
Agar, M.
Aitchison, J.
Allport, F. H.
Altheide, D. L.
American Educational Research Association [AERA]
American Psychological Association [APA]
Anshen, F.
Appel, G.
Ard, J.
Arkoudis, S.
Aronoff, M.
Aston, G.
Atkinson, J. M.

B
Bachman, L. F.
Bailey, N.
Bard, E.
Barlow, M.
Barnes, D.
Baxter, G. P.
Beeckmans, R.
Benjamin, A.
Biber, D.
Bickhard, M. H.
Biklen, S. K.
Block, D.
Boden, D.
Bogdan, R. C.
Bolinger, D.
Boorsboom, D.
Borg, W. R.
Borg, W. T.
Boucher, J.
Boustagui, E.
Boyd, R.
Bracht, G. H.
Breen, M.
Breivik, L.
Brennan, R. L.
Brewer, J.
Brezinski, K. L.
Brindley, G.
Brock, C.
Bromley, D. B.
Brown, A.
Brown, J. D.
Brutt-Griffler, J.
Buck, G.
Burrows, C.
Burt, M.
Burton, E.
Button, G.

C
Cameron, D.
Campbell, D. T.
Canagarajah, A. S.
Canale, M.
Capps, L.
Carroll, S.
Carruthers, P.
Carter, R.
Castaños, F.
Celce-Murcia, M.
Chalhoub-Deville, M.
Chalmers, D.
Chapelle, C. A.
Chapin, L.
Charter, R. A.
Chaudron, C.
Chomsky, N.
Christensen, L. B.
Chun, D. M.
Churchland, P. M.
Clapham, C.
Clark, A.
Clayman, S.
Coady, J.
Cohen, A.
Cole, M.
Conrad, S.
Cook, V.
Coughlan, P.
Coulthard, M.
Council of Europe
Crabtree, B.
Crandall, J.
Creese, A.
Creswell, J.
Creswell, J. W.
Cronbach, L. J.
Crookes, G.
Crooks, T.
Crutcher, R. J.
Csizér, K.
Cumming, A.
Curtiss, S.

D
Dörnyei, Z.
Damasio, A. R.


Davies, A.
Davis, K.
Day, R.
DeKeyser, R.
Delgado-Gaitan, C.
Dennett, D. C.
Denzin, N.
Deville, C.
Dillon, J. T.
Donato, R.
Donmoyer, R.
Doughty, C.
Douglas, D.
Drew, P.
Duff, P. A.
Dufranne, M.
Dulay, H.
Dunbar, S. B.
Duranti, A.

E
Eades, D.
Early, M.
Edge, J.
Eisenhart, M.
Eisner, E.
El Tigi, M.
Ellis, R.
Embretson, S.
Erickson, F.
Ericsson, K. A.
Eychmans, J.

F
Fairbank, J. K.
Falodun, J.
Feifel, H.
Feldt, L. S.
Firestone, W. A.
Firth, A.
Fishman, J. A.
Fiske, D. W.
Foster, P.
Fraenkel, J. R.
Francis, G.
Freeman, R. D.
Frisbie, D. A.
Frota, S.

G
Gal’perin, P. Ya.
Gal, S.
Gall, J. P.
Gall, M. D.
Gaskill, W.
Gass, S.
Geertz, C.
Ghisletta, P.
Ghiso, M. P.
Giddens, A.
Givón, T.
Glass, G. V.
Goa, X.
Goldman, M.
Goodwin, C.
Gregg, K. R.
Griffin, P.
Gruba, P.
Guba, E. G.
Gumperz, J. J.

H
Haccius, M.
Hammersley, M.
Hamp-Lyons, L.
Harklau, L.
Harley, B.
Hart, W. D.
Hatch, E.
Hawkins, B.
Hawkins, M. R.
Haynes, M.
Hazenberg, S.
Headland, T. N.
Heath, C.
Heath, S. B.
Heider, K. G.
Heller, M.
Henning, E.
Heritage, J.
Herron, C.
Hill, K.
Hogan, T. P.
Holliday, A.
Holunga, S.
Hornberger, N. H.
Huang, L.
Huberman, A. M.
Huckin, T.
Huebner, T.
Hull, G.
Hulstijn, J.
Hunston, S.
Hunter, A.
Hymes, D.

I
Inagaki, S.
International Language Testing Association [ILTA]
Ioup, G.
Iwashita, N.

J
Jacoby, S.
Janssens, V.
Jefferson, G.
Johnson, D. C.
Johnson, D. M.
Johnson, J. M.
Johnson, M.
Johnson, R. B.
Jones, K.
Jordan, G.

K
Kanagy, R.
Kane, M.
Kantor, R.
Kasper, G.
Kemmer, S.
Kennedy, G.
Kim, K.-H.
Kirk, J.
Knobel, M.
Knowles, P.
Kobayashi, M.


Koenig, J. A.
Kormos, J.
Koschmann, T.
Koshik, I.
Kouritzen, S.
Kowal, M.
Kramsch, C.
Krashen, S.
Krathwohl, D.
Kress, G.
Kruse, H.
Kuehn, T.
Kunnan, A.
Kunnan, A. J.

L
Labov, W.
Lado, R. L.
Lakoff, G.
Langacker, R.
Lantolf, J.
Lapkin, S.
Larsen-Freeman, D.
Larsen-Freeman, L.
Laufer, B.
Lazaraton, A.
Lee, G.
Lee, Y.-W.
Leki, I.
Lerner, G. H.
Leung, C.
Leutner, D.
Lewis, L.
Li, D.
Liddicoate, A.
Lightbown, P. M.
Lin, A.
Lincoln, Y.
Linn, R. L.
Linnville, S.
Liskin-Gasparro, J. E.
Little, D.
Liu, G.
Llosa, L.
Long, M. H.
Lorge, I.
Lowe, P.
Lumley, T.
Luria, A. R.
Lynch, B.
Lyster, R.

M
Macken, M.
Mackey, A.
MacWhinney, B.
Madden, C.
Magnusson, D.
Markee, N.
Marr, D.
Martin, P.
Martin-Jones, M.
Mathison, S.
Matthiessen, C. M. I. M.
Maxwell, J. A.
May, L.
Mayer, R. E.
McCarthy, M.
McCarty, T. L.
McDonough, K.
McGroarty, M.
McHoul, A.
McKay, S. L.
McLaughlin, D.
McLean, M.
McNamara, T. F.
Meara, P.
Meehl, P. E.
Mehan, H.
Mehnert, U.
Mehrens, W. A.
Merriam, S.
Messick, S.
Michigan Corpus of Spoken Academic English
Miles, M.
Miller, A.
Miller, M. L.
Miller, W.
Mincham, L.
Mislevy, R. J.
Mollaun, P.
Morgenthaler, L.
Mori, J.
Morita, N.
Moselle, M.
Moss, P. A.
Muranoi, H.

N
Nabei, T.
Nassaji, H.
Nation, I. S. P.
National Council on Measurement in Education [NCME]
Negueruela, E.
Nesselroade, J. R.
Neuman, W. L.
Newman, D.
Newmeyer, F.
Nicholas, H.
North, B.
Norton, B.
Nunan, D.

O
Ochs, E.
Ockey, G. J.
Ohta, A.
Oliver, R.
Oller, J. W., Jr.
Olsher, D.
Omaggio, A. C.
Onwuegbuzie, A. J.
Ortega, L.

P
Pérez, B.
Page, R.
Palincsar, A.
Palmberg, R.
Palmer, A. S.
Palys, T.
Pankhurst, J.


Parkes, J.
Peck, S.
Pennycook, A.
Peräkylä, A.
Perkins, K.
Peshkin, A.
Philips, S. U.
Philp, J.
Pica, T.
Pike, K. L.
Plass, J. L.
Polio, C.
Popham, W. J.
Porter, P. A.

Q
Qualls, A. L.

R
Rampton, B.
Ranta, L.
Read, J.
Reppen, R.
Ricento, T. K.
Richards, K.
Richardson, L.
Robertson, D.
Robinson, J. D.
Robinson, P. J.
Rogers, T. S.
Ross, S.
Roth, A. L.
Rulon, K. A.
Rutherford, W.
Rymes, B.

S
Sacks, H.
Salica, C.
Salomon, G.
Sasaki, M.
Sato, C.
Schachter, J.
Schegloff, E. A.
Schmidt, R. W.
Schmitt, D.
Schmitt, N.
Schneider, G.
Schofield, J. W.
Schultz, K.
Schumann, J.
Schwanenflugel, P. L.
Schwartz, J.
Searle, J. R.
Seedhouse, P.
Selinker, L.
Sharwood Smith, M.
Shavelson, R. J.
Shehadeh, A.
Shoben, E. J.
Shohamy, E.
Silverman, D.
Simon, H. A.
Sinclair, J.
Singleton, D.
Sipe, L. R.
Skehan, P.
Slade, D.
Smagorinsky, P.
Snow, R. E.
Sokolov, A.
Sorace, A.
Spada, N.
Spener, D.
Spolsky, B.
Stake, R.
Stanley, J.
Stein, P.
Sternberger, J. P.
Stivers, T.
Swain, M.

T
Talyzina, N.
Tarone, E.
Tashakkori, A.
Teddlie, C.
Telchrow, J. M.
Tollefson, J. W.
Tomasello, M.
Toohey, K.
Toulmin, S. E.
Trim, J. L. M.
Tsui, A. B. M.
Tulviste, P.
Tyler, L. K.

U
Uchida, Y.

V
Valsiner, J.
Van de Velde, H.
Van Leeuwen, T.
Van Lier, L.
VanPatten, B.
Varela, C.
Varghese, M. M.
Varonis, E. M.
Votaw, M. C.
Vygotsky, L. S.

W
Wagner, J.
Wallace, A. F. C.
Wallen, N. E.
Watson-Gegeo, K.
Weigle, S. C.
Wells, G.
Wertsch, J.
Wertsch, J. V.
Wesche, M.
Wessels, J.
White, J.
White, L.
Whitehead, A. N.
Widdowson, H.
Willett, J.
Willey, B.
Williams, H.
Williams, J.
Wong, P.
Wong, S. C.
Wortham, S.
Wray, A.

Y
Yi’an, W.
Yin, R.
Yip, V.
Young, R. F.
Yu, C. H.

Z
Zimmerman, C.
Zimmerman, D. H.


Subject index

A
Alpha coefficient
American Association of Applied Linguistics (AAAL)
Applied Linguistics

B
Behavior
Behaviorist
Bilingual

C
C-test
Case study
  Instrumental case study
  Intrinsic case study
Chains of evidence
Chaos and complexity theory
Claim
Classroom assessment
Co-construct
Code of Ethics
Cognitive
Collaboration
Common European Framework
Communicative language ability
Comparability
Complementary data sources
Connectionism
Consequences
Consistency
Consistent
Construct definition
  Behaviorist
  Interactionalist
Construct inference
Construct irrelevant variance
Construct validation
  see also Validity
Context
  Ability-in-language user-in-context
  Theory of context
Contextual analysis
Converging evidence
  see also Triangulation
Conversation
  Conversational act
  Conversational repair
  Institutional talk
Conversation analysis
Conversation analytic perspective
Credibility
Criterion domain

D
Dependability
Dependable
Depth
Discourse
  Academic
  Classroom
  Domain
  Hypothesis


  Michigan Corpus of Spoken Academic Discourse (MICASE)
  Validation theory
Discrete/embedded
Domain

E
Ecological approach
Effect
  Differential
  General
Emic
Empirical
Epistemology
Error
  Case study
  Construct
  Correction
  Context
  Measurement
  Method
  Scores
  Performance
Ethnographer
Ethnographic research
Ethnography
Etic perspective
Evidence centered test design
Existential

F
Feedback

G
Generalizability
  see also Dependability
  see also Reliability
  Analytic
  Analytic generalizability
  Optimal level
  Quantitative vs. qualitative
  Quantification
  Theoretical
  Theory
  Transferability
  Utility
Grammar
  Functional grammar
  Universal grammar

H
Holism
Hypothesis
  Critical period
  Discourse
  Formation
  Input
  Interaction
  Output

I
Immersion
Impact
Induction
Inference
  Appropriateness
  Context
  Construct
  Criterion domain
  Extrapolation
  Performance
  Provisional
  Qualitative
  Quantitative
  Score
  Social/cognitive
  Statistical
  Theory
Inferential processes
Integrative tasks
Interaction
  Complexity
  Effect
  Hypothesis
  Independent
  Interdependent
  Language development
  Person and task
  Reciprocal
  Social
  Transactional
Interactionalist
Interlocutor
International English Language Testing System (IELTS)
International Language Testing Association (ILTA)
Introspection


Introspective method
Intuitional data
Item

L
Language ability
Language assessment
Language learning
Language performance
Language proficiency construct
Language test
Language testers as researchers
Language testers as testers
Language use
Learning
  Acquisition
  Community
  Processes
  Social
  Strategies
Lexicogrammatical
Linguistic corpora

M
Measurement
  Educational measurement
  Measurement error
Meta-analysis
Morphosyntax
Multi-site

N
Naturalistic

O
Observation report
Observational data
Observed behaviors
Observed performance
Observer
Ontological stance
  Constructivist
  Operationalist
  Realist
Ontology
  Dualism
  Monism
Oral Proficiency Interview (OPI)

P
Paradigm
Paradigm debate
Pedagogy
Philosophy of science
Population
Positivism
Positivist
Pragmatics
Productive knowledge
Psycholinguistic

R
Random variation
Rasch
Rater
Real-life
Recall protocol
Reflective practitioner
Refugee
Reliability
Replication
Representation
Representativeness
Research approaches
  Advocacy-based
  Empowerment-oriented
Research ethics
Research on, for, and with subjects
Research use argument
  Inferential link
Rich points
Roles of language testers

S
Sampling
Scope of inquiry
Second language acquisition
Semantics
Situation
Social constructionism
Sociocultural
Sociolinguistic
Stimulated recall
Strategic competence


Strategies
Structural word knowledge

T
Task
  Task-based language teaching
  Universe of
Test design
Test method
Test method variability
Test of English as a Foreign Language (TOEFL)
Test score
Test taker
Test use
Testlet
Test user
Theoretical construct framework
Theory-for-practice
Thick description
Trait
  see also Construct
Transfer
Triangulation
  see also Converging evidence
Typicality

U
Universe of tasks
Unmotivated variation

V
Validation
Validation argument
Validity
  see also Construct validation
  Confirmability
  Credibility
  Ecological
  External
  Internal
  Meaningfulness
  Transferability
Validity argument
Validity evidence
Values
Variability
Variance
Verbal protocol
Vocabulary
Vocabulary acquisition
Vocabulary depth
Vocabulary knowledge
Vocabulary Levels Test
Vocabulary organization
Vocabulary size
Yes/No vocabulary test

W
Wh-clefts
Word association test

In the series Language Learning & Language Teaching the following titles have been published thus far or are scheduled for publication:

13 NORRIS, John M. and Lourdes ORTEGA (eds.): Synthesizing Research on Language Learning and Teaching. xiv, 341 pp. + index. Expected August 2006

12 CHALHOUB-DEVILLE, Micheline, Carol A. CHAPELLE and Patricia A. DUFF (eds.): Inference and Generalizability in Applied Linguistics. Multiple perspectives. 2006. vi, 248 pp.

11 ELLIS, Rod (ed.): Planning and Task Performance in a Second Language. 2005. viii, 313 pp.

10 BOGAARDS, Paul and Batia LAUFER (eds.): Vocabulary in a Second Language. Selection, acquisition, and testing. 2004. xiv, 234 pp.

9 SCHMITT, Norbert (ed.): Formulaic Sequences. Acquisition, processing and use. 2004. x, 304 pp.

8 JORDAN, Geoff: Theory Construction in Second Language Acquisition. 2004. xviii, 295 pp.

7 CHAPELLE, Carol A.: English Language Learning and Technology. Lectures on applied linguistics in the age of information and communication technology. 2003. xvi, 213 pp.

6 GRANGER, Sylviane, Joseph HUNG and Stephanie PETCH-TYSON (eds.): Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. 2002. x, 246 pp.

5 GASS, Susan M., Kathleen BARDOVI-HARLIG, Sally Sieloff MAGNAN and Joel WALZ (eds.): Pedagogical Norms for Second and Foreign Language Learning and Teaching. Studies in honour of Albert Valdman. 2002. vi, 305 pp.

4 TRAPPES-LOMAX, Hugh and Gibson FERGUSON (eds.): Language in Language Teacher Education. 2002. vi, 258 pp.

3 PORTE, Graeme Keith: Appraising Research in Second Language Learning. A practical approach to critical analysis of quantitative research. 2002. xx, 268 pp.

2 ROBINSON, Peter (ed.): Individual Differences and Instructed Language Learning. 2002. xii, 387 pp.

1 CHUN, Dorothy M.: Discourse Intonation in L2. From theory and research to practice. 2002. xviii, 285 pp. (incl. CD-rom).