Language corpora in applied linguistics
1. Introduction
There have been considerable debates about whether corpus linguistics should be defined as a
new discipline, a theory or a methodology. While some linguists working with corpora tend to
think that corpus linguistics goes well beyond the purely methodological role by re-uniting the
activities of data gathering and theorizing, thus leading to a qualitative change in our
understanding of language (Halliday, 1993:24), more researchers (e.g., Botley & McEnery, 2000; Leech, 1992; McEnery & Wilson, 2001:2; Meyer, 2002:28) seem to share the view that corpus linguistics is a methodology contributing to many hyphenated disciplines in linguistics. This
methodological nature has brought about the convergence of corpus linguistics with many
disciplines, of which applied linguistics probably enjoys the most benefits corpus linguistics can
offer.
2. Empiricism, corpus linguistics, and electronic corpora
The philosophical opposition between rationalism and empiricism, according to Robins (1997), has run throughout the history of linguistics. The former can be traced back to René Descartes (1596-1650), who emphasized that introspection is often necessary when acquiring knowledge, hence his formula ‘I think, therefore I am’. The latter is often
represented in the ideas of John Locke (1632-1704), who believed that the mind starts out as a
‘blank slate’, which we fill with experience and generalizations based on experience (Hudson,
1999). Among other things, the major difference between the two lies in how generalizations
should be made. So far as linguistics is concerned, the former holds that great ideas about
language are generated on the basis of intuitions of linguists working in the ‘armchair’ (hence
‘armchair linguists’. See Fillmore, 1992:35), while the latter believes that generalizations derive
from observations and analysis of large amounts of data (hence ‘corpus linguists’ or ‘field
linguists’. See Fillmore, 1992:35). The controversy between rationalism and empiricism has lasted
a long time, with the ‘pendulum’ swinging back and forth at different times in history.
The 1950s witnessed the peak of empiricism in linguistics, with Behaviorists dominating
American linguistics, and Firth, a leading figure of British linguistics at the time, forcefully
advocating his data-based approach with his well-cited belief that “You shall know a word by the
company it keeps” (Firth, 1957). However, with the rapid rise of generative linguistics in the late 1950s, empiricism gave way to the Chomskyan approach, which is based on the intuitions of ‘ideal speakers’ rather than on collected performance data. It was not until the 1990s that there was
again a resurgence of interest in the empiricist approach (Church & Mercer, 1993).
Corpus linguistics is empirical. Its object is real language data (Biber et al., 1998; Teubert
2005). In fact, there is nothing new about working with real language data. As early as the 1960s, when generative linguistics was at its peak, Francis and Kucera created their one-million-word corpus of American written English at Brown University (Francis and Kucera, 1982). Then,
in the 1970s, Johansson and Leech at the University of Oslo and the University of Lancaster
compiled the Lancaster-Oslo/Bergen Corpus (LOB), the British counterpart of the American
Brown Corpus. Since then, there has been a boom of corpus compilation. Spoken corpora (such as
the London-Lund Corpus, see Svartvik, 1990) and written corpora, diachronic corpora (such as
ICAME) and synchronic corpora, large general corpora (such as the BNC and the Bank of
English) and specialized corpora, native speakers’ language corpora and learner corpora and many
other types of corpora were created one after another. These English corpora are constantly
providing data to meet the various needs of applied linguistics.
After a few decades of world-wide corpus-related work, several tendencies now seem to be
obvious in corpus building in the new empiricist age.
First, modern corpora are becoming ever larger and more balanced, and therefore often
claimed to be more representative of the language concerned. As of December 2004, the size of
the Bank of English was reported to have reached 524 million words1. More interestingly, Kehoe & Renouf (2002) have exploited the ever-growing World Wide Web to build WebCorp, which treats the language of the web as a vast, constantly updated corpus.
Second, rationalism and empiricism often need to work together. One case in point is that corpus builders now seem to follow their intuitions when deciding on the data to be collected. Similarly, in corpus-based research, hypotheses do not simply emerge by themselves: the researcher has some hypotheses in mind before she explores her data. There is now a tendency for intuition-based and data-based approaches to become more compatible.
Third, many types of specialized language corpora are being built to serve specific purposes.
To meet the needs of different research and pedagogic projects, for example, certain components
of large general corpora may be extracted to make up specialized corpora, and some researchers
manage to create their own specialized corpora, without which their research goals would be hard to achieve.
Fourth, many researchers in applied linguistics have realized the usefulness of corpora for the
analysis of learner language. As the role of native language transfer in second language learning has become well established, researchers around the world have created corpora of English produced
by learners of different linguistic and cultural backgrounds. Such work is most typically
represented by the ICLE (International Corpus of Learner English) project headed by S. Granger
at Universite Catholique de Louvain, Belgium (Granger et al., 2002).
Finally, as corpus annotation is believed to bring “added value” (Leech, 2005) to a corpus,
there is a tendency that corpus annotations are becoming more refined, providing detailed
information about lexical, phraseological, phonetic, prosodic, syntactic, semantic, and discourse
aspects of language.
3. Applications of corpora in applied linguistics
This section first presents an overview of the extensive use of corpora in applied linguistics. Following the overview, our discussion will focus on a few major areas of applied linguistics where corpora
are playing increasingly important roles, namely, syllabus design, data-driven learning, language
testing, and interlanguage analysis.
3.1 An overview of the use of corpora in applied linguistics
Leech (1997) summarizes the applications of corpora in language teaching with three
concentric circles.
Figure 1: The use of corpora in language teaching (from Leech, 1997). The three concentric circles are: direct use of corpora in teaching (innermost), use of corpora indirectly applied to teaching, and further teaching-oriented corpus development (outermost).
Drawing on Fligelstone (1993), Leech (1997) claims that ‘the direct use of corpora in
teaching’ (the innermost circle) involves teaching about [corpora], teaching to exploit [corpora],
and exploiting [corpora] to teach. According to Leech (1997), teaching about corpora refers to courses on corpus linguistics itself; teaching to exploit corpora refers to courses which teach students to exploit corpora with concordance programs, and learn to use the target language
from real-language data (hence data-driven learning, to be discussed in more detail later in this
section). Finally, exploiting corpora to teach means making selective use of corpora in the
teaching of language or linguistic courses which would traditionally be taught with non-corpus
methods.
The more peripheral circle, ‘the use of corpora indirectly applied to teaching’, according to
Leech (1997), involves the use of corpora for reference publishing, materials development, and
language testing. In reference publishing, Collins (now HarperCollins), Longman, Cambridge
University Press, Oxford University Press and many other publishers have been actively involved
in the publication of corpus-based dictionaries, electronic corpora, and other language reference
resources, especially those in the electronic form. As Hunston states, “corpora have so
revolutionised the writing of dictionaries and grammar books for language learners that it is by
now virtually unheard-of for a large publishing company to produce a learner’s dictionary or
grammar reference book that does not claim to be based on a corpus” (Hunston, 2002:96) (For
more detailed accounts of the use of corpora in dictionary writing, see Sinclair, 1987; Summers,
1996). In materials development, increasing attention is paid to the use of corpora in the
compilation of syllabuses (to be discussed later in this section) and other teaching materials. Also
in the second of the three concentric circles is language testing (See Section 3.4 for more detail),
which “benefits from the conjunction of computers and corpora in offering an automated, learner-centred, open-ended and tailored confrontation with the wealth and variety of real-language data” (Leech, 1997).
Finally, the outermost circle, ‘further teaching-oriented corpus development’, involves the
creation of specialized corpora for specific pedagogical purposes. To illustrate the need to build
such specialized corpora, Leech (1997) mentions LSP (Language for Specific Purposes) corpora,
L1 and L2 developmental corpora (also an important data source for Second Language
Acquisition research), and bilingual/multilingual corpora. Leech (1997) believes these are
important resources that language teaching can benefit from.
Of course, Leech’s (1997) discussion focuses on the applications of corpora in language teaching; Second Language Acquisition research, one of our major concerns in this book, receives less attention in his scheme. In fact, corpus-based approaches to SLA studies, particularly to interlanguage analysis, have become one of the most prevalent research methodologies for the analysis of learner language. This issue will be discussed in more detail later in the chapter.
3.2 The Lexical Syllabus: a corpus-based approach to syllabus design
The notion of a ‘lexical syllabus’ was originally proposed by Sinclair and Renouf (1988), and
further developed by Willis (1990), who points out a contradiction between syllabus and
methodology in language teaching:
The process of syllabus design involves itemising language to identify what is to be learned.
Communicative methodology involves exposure to natural language use to enable learners to
apply their innate faculties to recreate language systems. There is an obvious contradiction
between the two. An approach which itemises language seems to imply that items can be learned
discretely, and that the language can be built up from an accretion of these items. Communicative
methodology is holistic in that it relies on the ability of learners to abstract from the language to
which they are exposed, in order to recreate a picture of the target language. (Willis, 1990: viii)
The lexical syllabus attempts to reconcile the contradiction. Rather than relying on a clear
distinction between grammar and lexis, the lexical syllabus blurs the distinction and builds itself
on the notion of phraseology.
Several decades ago, some researchers (e.g., Nattinger and DeCarrico, 1989, 1992; Pawley & Syder, 1983) came to realize that phrases (also called ‘lexical phrases’, ‘lexical bundles’, ‘lexical chunks’, ‘formulae’, ‘formulaic language’, ‘prefabricated routines’, etc.) are frequently used and are therefore important for the teaching and learning of English. Later researchers such as Cowie (1998), Skehan (1998) and Wray (2002) are also convinced that a better
command of such phrases can improve the fluency, accuracy and complexity of second language
production. Unfortunately, most syllabuses for ELT are based on traditional descriptive systems of
English, which consist of grammar and lexis (Hunston, 2002). Such descriptive systems,
according to Sinclair (1991), fail to give a satisfactory account of how language is used. Besides,
combining the building blocks (lexis) with grammar does not result in a good account of the
phrases in a language. Against this background, Sinclair (1991) proposes a new descriptive system
of language, completely denying the distinction between grammar and lexis and putting
phraseology at the heart of language description.
The lexical syllabus is not a syllabus consisting of vocabulary items. It comprises
phraseology, which “encompasses all aspects of preferred sequencing as well as the occurrence of
so-called ‘fixed’ phrases” (Hunston, 2002:138). Such a syllabus, according to Hunston (2002),
differs from a conventional syllabus only in that the central concept of organization is lexis. To put
it simply, as a relatively small number of words in English account for a very high proportion of
English text, it makes sense to teach learners the most frequent words in the target language.
However, as learning words out of context is not a good idea, a syllabus should not merely specify the vocabulary items to be learned. What a lexical syllabus specifies in detail is not the lexis of the target language but its phraseology: words together with their most frequently used patterns.
According to Sinclair & Renouf (1988), ‘the main focus of study should be on (a) the
commonest word forms in the language; (b) the central patterns of usage; (c) the combinations
they usually form’ (1988:148). This is exactly what is covered in Willis’ (1990) syllabus.
Willis (1990) argues that the lexical syllabus effectively addresses “the main focus of study”
mentioned in Sinclair & Renouf (1988): (a) Willis’ (1990) syllabus consists of three different
levels. Level I covers the most frequent 700 words in English, which, according to Willis (1990),
make up around 70% of all English text. Level II covers the most frequent 1,500 and Level III
covers the most frequent 2,500 words in English. (b) The lexical syllabus illustrates the central
patterns of usage of the most frequent words in English. Such patterns of usage were later
developed into a system of grammar called “pattern grammar” (Hunston & Francis, 2000). (c)
Typical combinations of word forms (collocations) from authentic language are highlighted to
provide information about the phraseology of the words concerned.
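The frequency-coverage figures behind Willis’ levels are easy to reproduce in miniature. The sketch below is purely illustrative (the toy corpus is invented; the real percentages come from corpora of many millions of words): it computes what proportion of all running words the N commonest word forms account for.

```python
# Compute lexical coverage: the share of tokens in a corpus accounted
# for by the n most frequent word forms.
from collections import Counter

def coverage(tokens, n):
    """Proportion of running words covered by the n commonest forms."""
    freq = Counter(tokens)
    top = freq.most_common(n)
    return sum(count for _, count in top) / len(tokens)

# Invented toy corpus for illustration only.
text = ("the cat sat on the mat and the dog sat on the rug "
        "and the cat saw the dog").split()
print(coverage(text, 3))  # share covered by the 3 commonest forms
```

Applied to a large general corpus, the same computation yields figures of the kind Willis cites, such as roughly 70% coverage for the 700 commonest word forms.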
It can be seen that the lexical syllabus is not a syllabus which lists all the vocabulary items
learners are required to learn. To Willis (1990), “English is a lexical language”, meaning that
“many of the concepts we traditionally think of as belonging to ‘grammar’ can be better handled
as aspects of vocabulary” (Hunston, 2002:190). Conditionals, for example, could be handled by
looking at the hypothetical meaning of the word ‘would’. The most productive way to interpret grammar, therefore, is as lexical patterning rather than as rules about word sequences, rules which often fail to work in practice.
As pointed out in Hunston (2002:190), Willis’ most radical suggestion is that a syllabus can consist of a corpus. Whereas most syllabuses are presented as lists of items, the lexical syllabus can be presented as a list of the lexical patterning of the most frequent words in a language. The task of the teacher and the course designer is to find ways to encourage the learner to engage actively with the material in the corpus.
However, the lexical syllabus, as proposed in Willis (1990), is not without challenges.
First, frequency is a useful factor to take into consideration when a syllabus is designed.
However, language learning is not that simple. Native language influence, cultural differences, learnability, usefulness, and many other factors can all complicate language learning. As Cook (1998:62) notes, “an item may be frequent but limited in range, or infrequent
but useful in a wide range of contexts”. Proponents of the Lexical Syllabus seem to have ignored
this fact. They believe their syllabus will work for learners of all linguistic and ethnic
backgrounds. To many researchers and practitioners of applied linguistics, such a belief is too
simplistic to account for complicated processes such as language learning.
Second, it is not an easy task to select a manageable yet meaningful number of items from the
entire vocabulary of a language for inclusion in a lexical syllabus. While it may be true that the
2,500 words in Willis’ syllabus can account for 80% of all English text, knowing 80% of the
words in a text does not guarantee good comprehension. In fact, most of the 80% are functional
words or delexicalized words, which do not have much content. In other words, the role of the
commonest words in language comprehension and production may not be as significant as the
high frequency counts suggest. Also, while it may be relatively easy to include a content word as an entry in a lexical syllabus, functional words can be a big problem: a single such entry may run to tens of pages in the lexical syllabus. If the most frequent words are to be accounted for in detail, it is very likely that the syllabus will become a comprehensive guide to the usage of functional words.
Finally, the size of the syllabus may also be a problem. Willis (1990), when creating his
“elementary syllabus”, had to work with huge amounts of data. He complained that “The data
sheets for Level I alone ran to hundreds of pages which we had to distil into fifteen units” (Willis,
1990:130). We just cannot help wondering how many thousands of data sheets would have to be
created if a syllabus had to be written for advanced learners.
3.3 Data-Driven Learning (DDL)
The use of corpora in the language teaching classroom owes much to Tim Johns (Leech, 1997), who began to make use of concordancers in the language classroom as early as the 1980s. He then wrote a concordance program himself, and later developed the concept of Data-Driven Learning (DDL) (see Johns, 1991).
DDL is an approach to language learning by which the learner gains insights into the language
that she is learning by using concordance programs to locate authentic examples of language in
use. Authentic examples are claimed to help learners of English as a second or foreign language better than invented or artificial ones (Johns, 1994), since examples made up by teachers unavoidably lack authenticity. In DDL, the learning process is no longer based solely on what the
teacher expects the learners to learn, but on the learners’ own discovery of rules, principles and
patterns of usage in the foreign language. Drawing on George Clemenceau’s (1841-1929) claim that ‘War is much too serious a thing to be left to the military’ (quoted in Leech, 1997), Johns believes that ‘research is too serious to be left to the researchers’ (Johns, 1991:2), and that research is a natural extension of learning. The theory behind DDL is that students can act as ‘language detectives’ (Johns, 1997:101), exploring a wealth of authentic language data and discovering facts about the language they are learning. In other words, learning is motivated by active involvement and driven by authentic language data, and learners become active explorers rather than passive recipients. To a great extent, this resembles researchers working in a scientific laboratory, and the advantage lies in the direct confrontation with data (Leech, 1997).
Murison-Bowie, the author of the MicroConcord Manual, gives some very persuasive reasons
for using a concordancer:
… any search using a concordancer is given a clearer focus if one starts out with a problem in
mind, and some, however provisional, answer to it. You may decide that your answer was basically
right, and that none of the exceptions is interesting enough to warrant a re-formulation of your
answer. On the other hand, you may decide to tag on a bit to the answer, or abandon the answer
completely and to take a closer look. Whichever you decide, it will frequently be the case that you
will want to formulate another question, which will start you off down a winding road to who knows
where. (Murison-Bowie, 1993:46, cited in Rezeau, 2001:153)
It can be seen that in DDL, language learning is a hypothesis-testing process: whenever the learner has a question, she goes to the concordancer for help. If what she expects coincides with the authentic language data shown in the concordance lines, her hypothesis is confirmed and her knowledge of the language is reinforced; conversely, if the concordance data contradict her hypothesis, she modifies it and comes closer to how the language is actually used.
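The concordancer the learner consults can be sketched in a few lines. The following minimal KWIC (‘key word in context’) routine is an illustration only: the two-sentence corpus and the search pattern are invented, and real concordancers such as MicroConcord add sorting, wildcard searches, and corpus indexing.

```python
# Minimal KWIC concordancer: find every occurrence of a search term
# and show it centred in a fixed-width window of context.
import re

def concordance(text, pattern, width=30):
    """Return KWIC lines for every match of `pattern` (a regex)."""
    lines = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group()} {right:<{width}}")
    return lines

# Invented two-sentence "corpus" for illustration.
corpus = ("The storm caused havoc across the region. "
          "Deregulation led to a boom on the stock market.")
for line in concordance(corpus, r"caused|led to"):
    print(line)
```

Run over a large corpus, a search like this produces exactly the kind of citation lists shown in the cause / lead to example below, which the learner can then scan for patterns of connotation and syntax.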
Higgins (1991) mentions that classroom concordancing tends to have two characteristic
objectives: “using searches for function words as a way of helping learners discover grammatical
rules, and searching for pairs of near synonyms in order to give learners some of the evidence
needed for distinguishing their use” (p.92). The following is an example taken from Tim Johns’
Kibbitzer2, in which two near-synonyms are distinguished:
This Kibbitzer is based on a sentence from an essay by a Chinese-speaking postgraduate student of Tourism.
Original: Although economic improvement may be caused by tourism, the investment and operational costs of tourism must also be considered.
Revision: Although tourism may lead to economic improvement, the investment and operational costs of tourism must also be considered.
Is cause an appropriate verb to use here? And if not, what alternatives are there? In the course of the consultation we did a
'quick trawl' of cause*, with results similar to the following (the 'Deep Object' appearing after the verb in citations 1-20 and
before the verb in citations 21-30):
1 iumph, because the Home Secretary had been caused a few moments of discomfort. Here was one of th
2 return of the money. The main charges will cause a furore among MPs on the Commons Public Account
3 been carefully balanced, "which should not cause a haemorrhaging of building society deposits". !
4 pened, "the resultant explosion would have caused a major disaster involving a considerable loss
5 the east German city of Halle. The report caused a public outcry against neo-Nazi violence in Ge
6 y condition of the digestive tract. It can cause abdominal pain, severe diarrhoea, constipation,
7 oss domestic product. Such a measure would cause acute embarrassment to the UK, which would need
8 of VAT on domestic heat and lighting will cause administrative chaos unless it is defeated this
9 Temple, other journalists and athletes has caused anger and bitterness, Radford, who only took up
10 e trial Three men accused of conspiring to cause explosions on or before October 9, last year, an
11 main story has been the wet weather which caused flooding over parts of southern Britain, as wel
12 satisfaction. On one visit to the city he caused great offence by using the word "aesthetics".
13 y effect, if any, of legal sanctions is to cause grief and ignorance. The morality or immorality
14 nfall for February. Another tropical storm caused havoc over the Australian state of Queensland
15 realise, of course, that your actions will cause his death. But this is not why you remove the or
17 ed discussions of male harassment that may cause resentment among male workers. Often, experts sa
18 le buying ''impure'' substances that could cause serious physical harm. Dabbling, evidently, does
19 fs quoted by ministers. Unemployment could cause tension at home. "But my main worry is financial
20 ladder to office: "Gimme a job or I might cause trouble. "It is a time-honoured tactic, so that,
21 woman joins the team to replace a casualty caused by a frozen sprout, brings in a murder mystery
22 a director, Chris Muir. "A delay has been caused by differences of opinion and interpretation be
23 the claim that the American Revolution was caused by disagreements over the doctrine of the Trini
24 d bribes for official contracts. The shock caused by Mr Portillo's 'exaggeration' was so great th
25 phed if the West had financed the hardship caused by Russia's speedy transition to a market econo
26 claims from people suffering from illness caused by stress at work, the Association of British I
27 will see a rebound from the heavy setback caused by the weakness of the Do-It-All joint operatio
28 rective of the terrible injuries and scars caused by this repulsive attack." In Bonn the federal
29 story for some people. Alarm is also being caused by tighter contracts being laid down by local a
30 wn to the department. Her brain damage was caused by whooping cough vaccine, for which the then
It is worth noting that
1. The Deep Objects ('Effects') of cause highlighted are overwhelmingly negative in connotation, from the relatively
mild discomfort (1) and resentment (17) through to the more powerful havoc (14) and death (15). The one
exception in this sample would appear to be the American Revolution (23), though, given the effect of all the other
contexts in which we find cause, we may suspect that the writer does not think that that was such a Good Thing
after all.
2. With this verb the relation between the Cause (which may be human - see citations 10, 12, 20 etc.) and Effect is relatively
direct and immediate.
Is there another verb in English that could take the place of cause, without its overwhelmingly negative connotations? We
decided to investigate lead to, which produced a set of citations similar to the following:
31 to ignite a civil war that would certainly lead to a complete destruction of Yemen as a single sta
32 r to Mrs Bottomley. "Further cuts can only lead to the loss of life." But the authority, which fa
33 cle to take them to court. 'This will also lead to a delay in getting a settlement and fixing thei
34 verseas to satisfy demand in Britain. That leads to a deterioration in Britain's trade balance, wh
35 not envisaged as a total ban, which would lead to accusations that the TCCB was merely maximising
36 ny retreat from free market policies could lead to economic collapse, and renewed confrontation wi
37 er, Thwaites warns that the decision could lead to further miscarriages of justice. "Anyone who th
38 mong the three classroom organisations may lead to more problems in schools than last year. Then,
39 n the 12th February 1938, a northerly gale led to severe flooding down the east coast. The Februar
40 arguments behind the present conflict also led to the first world war, which was ignited by the as
41 h good if his disconcerting outburst is to lead to a frank discussion about the future of Zionis
42 odle from this man's pen might, one feels, lead to a great discovery. Galileo is as global as hi
43 omic deregulation and free- market reforms led to a massive boom on the Bombay stock market; infla
44 ng and searching. It reports that this has led to a notable improvement in drug finds in the priso
45 e flexible working arrangements. This will lead to a relatively strong growth in productivity and
46 le His indomitable spirit was a force that led to great successes on the field in later years, cli
47 ence of any gains. Claims that trusts have led to greater efficiency are difficult to substantiate
48 to the customer. It is not as if subsidies lead to lower prices. If they did, then why is it that
49 vember. It concluded that the courses have led to significant benefits to parents in terms of con
50 tc) and to believe that free markets would lead to some sort of rural utopia is a delusion. As the
The results show:
1. The split between negative and positive (or at least non-negative) contexts is now 50/50 (this has been emphasised in the
printout above, citations 31-40 being negative, and 41-50 being positive). In other words, an expectation of a negative
result does not seem to be 'built into' lead to as it is built into cause. The approximately 50/50 split between negative and
positive contexts was confirmed from an examination of 500 contexts of lead to.
2. Apart from the difference in connotation, lead to differs both syntactically and semantically from cause:
● lead to is not used with a human subject, and does not appear in the passive.
● lead to is less direct than cause, implying a series of steps between cause and effect.
These features of lead to seemed to make it entirely suitable to replace cause, the bar on the passive making it necessary to
re-write the sentence in the active, as shown above.
What the foregoing example shows is that in data-driven learning, the learner often has a
question in mind. She then goes to explore the data for an answer. The whole process, as
mentioned before, involves testing a hypothesis with hands-on activities. In so doing, the learner is
more likely to be motivated.
It is worth mentioning that many of the features of DDL fit exactly into the constructivist
approach to language teaching, in which:
learners are more likely to be motivated and actively involved;
the process is learner-centred;
learning is experimentation;
learning proceeds through hands-on activities and experiential learning;
teachers work as facilitators.
As pointed out by Gavioli (2005:29), DDL raises several pedagogic questions, which we have to answer when we encourage our students to engage in data-driven learning:
1) If learners are to behave as data analysts, what should be the role of the teacher?
2) The work of language learners is similar to that of language researchers insofar as “effective language learning is itself a form of linguistic research” (Johns 1991:2). So, should we ask learners to perform linguistic research exactly like researchers do?
3) Provided that learners adopt the appropriate instruments and methodology to actually perform language research, are the results worth the effort?
And, to this I would like to add Barnett’s (1993) note:
4) “the use of computer applications in the classroom can easily fall into the trap of leaving
learners too much alone, overwhelmed by information and resources”. What can we do to
improve this situation in data-driven learning?
3.4 Corpora in language testing
It is only recently that language corpora have begun to be used for language testing purposes
(Hunston, 2002), though language testing could potentially benefit from the conjunction of
computers and corpora in offering an “automated, learner-centred, open-ended and tailored
confrontation with the wealth and variety of real-language data” (Leech, 1997). In fact, the use of
corpora in language testing is such a new field of study that Leech (1997) only mentions the advantages of corpus-based language testing and predicts that “automatic scoring in terms of ‘correct’ and ‘incorrect’ responses is feasible”. Hunston (2002:205) also states that the work reported in her book is “mostly in its early stages”. Alderson (1996), in a paper entitled “Do corpora have a role in language assessment?”, could only “concentrate on exploring future possibilities for the application of computer-based language corpora in language assessment” (Alderson, 1996:248).
To the best of my knowledge, the use of corpora in language testing roughly falls into two
categories. One is the use of corpora and corpus-related techniques in the automated evaluation of
test questions (particularly subjective questions like essay questions or short-response questions),
and the other is the use of corpora to make up test questions (mostly objective questions). The
reason behind this is simple: making up subjective questions does not take much effort but evaluating them does; conversely, evaluating objective questions does not take much effort but making them up may take a lot of time.
The earliest attempt to automate essay scoring was made several decades ago, when Ellis
Page and his collaborators devised a system called PEG (Project Essay Grade) to assign scores for
essays written by students (Page, 1968). From a collection of human-rated essays, Page extracted
28 text features (such as the number of word types, the number of word tokens, average word
length, etc.) that correlate with human-assigned essay scores. He then conducted a multiple
regression, with human-assigned essay scores as the dependent variable and the extracted text
features as independent variables. The multiple regression yielded an estimated equation, which
was then used to predict the scores for new essays. Following Page’s methodology, ETS produced
another automated essay scoring system called the E-rater, which has been used to score millions
of essays written in the GMAT, GRE and TOEFL tests (Burstein, 2003). Liang (2005) also made
an exploratory study, in which he extracted 13 essay features to predict scores for new essays. All
the above-mentioned studies involved the use of corpora and corpus-related techniques. They
reported good correlations between human-assigned essay scores and computer-assigned essay
scores. For more information about automated essay scoring, see Shermis & Burstein (2003).
The direct use of corpora in the automatic generation of exercises (or test questions) is also a
relatively new field of study. To the best of my knowledge, two studies have been reported in the
literature. Wojnowska (2002) reported a corpus-based test-maker called TestBuilder. The
Windows-based system extracts information from a tagged corpus. It can help teachers prepare
two types of single-sentence test items --- gapfilling (cloze) and hangman. While the system may
be useful for English teachers, the fact that the questions generated with the system can only be
saved in plain text format greatly reduces its practicality. In other words, the test generated
allows no interaction with the test-taker, and the teacher or the test-taker herself has to judge the
test performance and calculate the scores manually. Besides, gapfilling and hangman are very
similar types of exercises, both involving the test-taker filling gaps with letters or combinations of
letters (words). Obviously, this system has not made full use of the capacity of the computer and
the corpora. Wilson (1997) reported her study on the automatic generation of CALL exercises
from corpora. Unfortunately, the study has not resulted in a computer program, and the exercises
generated are not interactive either.
To take better advantage of corpora for language testing purposes, we started a project of our
own. From this study, a Windows-based program (called CoolTomatoes) has been derived, which is
capable of automatically generating interactive test questions or exercises from any user-defined
corpus.
Program Description
In this study, computer programs were written to extract sentences from a user-defined corpus
(called Corpus File). The sentences extracted from the corpus contain the answer options defined
by the user in a separate file (called Answer Options File). The sentences extracted are to become
the stems for the multiple-choice questions, gapfilling questions or matching exercises, while the
answer options contain the right answer as well as the distractors. For example, an entry in the
Answer Options File may read “compatible with#consistent with#equal to#equivalent to#5”. This
informs the program that the user would like to have five questions (the user can define any one or
more types of questions) with the four “adj_prep” phrases as answer options. If more than five
sentences in the corpus are found to match, the program will randomly choose five of them.
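The extraction step just described can be sketched roughly as follows. This is a hypothetical illustration of the procedure, not the actual CoolTomatoes code; the entry format follows the example above.

```python
# A hypothetical sketch of the extraction step: parse one Answer Options
# File entry ("opt1#opt2#...#N"), find corpus sentences containing any of
# the options, and turn up to N of them into gapped multiple-choice stems.
import random
import re

def parse_entry(entry):
    """'opt1#opt2#opt3#opt4#N' -> (list of options, N questions wanted)."""
    *options, n = entry.split("#")
    return options, int(n)

def make_questions(corpus_text, entry, seed=0):
    options, n = parse_entry(entry)
    # Naive sentence splitting; a real system would use the corpus markup.
    sentences = re.split(r"(?<=[.!?])\s+", corpus_text)
    candidates = []
    for sent in sentences:
        for opt in options:
            if opt in sent:
                # Gap the matched option to form the question stem.
                candidates.append({"stem": sent.replace(opt, "_____", 1),
                                   "answer": opt,
                                   "options": options})
                break
    random.seed(seed)  # reproducible sampling for illustration
    if len(candidates) > n:
        candidates = random.sample(candidates, n)
    return candidates
```

Each resulting question pairs a corpus-derived stem with the full option set, the matched phrase serving as the key and the remaining options as distractors.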
The stems and the answer options make up the questions, which are temporarily stored in a
text file called Questions File. This Questions File can be edited and/or printed as
necessary. This file, therefore, can provide the user with test questions for the traditional print
media (as does the TestBuilder). More importantly, the Questions File can be loaded (with the
“Load Questions” option) so that interactive test questions (in html format) can be automatically
generated.
The fact that both the Corpus File and the Answer Options File are user-defined and that the
Questions File can be further edited makes it possible to generate any number of user-defined
questions. While the whole process can be done without human intervention, it also allows
the user to make changes to the decisions made by the program.
In order for CoolTomatoes to be user-friendly, a GUI (graphical user interface) is also
provided (Figure 2). As shown in the figure, the user can choose to: 1) define the Corpus File and
the Answer Options File from the “Load Corpus” and “Load Answer Options” options; or, 2)
define an edited Questions File from the “Load Questions” option. After 1) or 2) is complete, the
user simply clicks the “Generate Multiple-Choice Test” button or the “Generate Cloze Test”
button. A set of interactive test questions will pop up (Figure 3).
Figure 2: GUI of CoolTomatoes
More interestingly, given the fact that HotPotatoes (a product of Half-Baked Software Inc.) is
a much-used program for generating interactive exercises (though not from a corpus),
CoolTomatoes also has a user-friendly interface with HotPotatoes, which is available in the
“Export to HotPotatoes” tab. The user can choose to export his or her tests to any of the major
tools in the HotPotatoes program.
1 Information from http://www.arts-humanities.net/wiki/corpus_linguistics_linguistics, accessed September 2, 2007.
2 Available at http://www.eisu.bham.ac.uk/Webmaterials/kibbitzers/kibbitzer24.htm, accessed September 2, 2007.
Figure 3: Interactive multiple-choice test generated with CoolTomatoes
CoolTomatoes has a number of useful features.
Questions are based on real language data. As the corpus comprises real language data,
the test questions generated by CoolTomatoes are naturally more meaningful. These
questions are more likely to test and enhance test-takers’ ability to use real language to
fulfill real-life tasks.
Question generation is fully customizable. Both the corpus and the answer options can be
customized. The user can choose to add a new text to an existing corpus, or choose to
load different corpora. This is important because test questions need to be at different
difficulty levels for different students, and the difficulty of the test questions is
determined by the difficulty of the corpus text. Customizing the answer options is also
important because every teacher/tester almost always wants to have their own questions
(sometimes for their own students) in the test.
Interaction with the test-maker. To eliminate chances of highly similar questions or poor
questions due to an inappropriate corpus, CoolTomatoes allows the user to preview the
questions to be generated. If necessary, the user can further edit, delete, and add
questions.
Dual output if needed: The user can choose to generate a print-out test (for the traditional
class) or an interactive test to be done on a stand-alone computer or on the web.
Real-time feedback for the test-takers: In an interactive test generated by CoolTomatoes,
test-takers are given real-time feedback about: 1) whether a question has been correctly
answered; 2) the overall performance of the test-taker (Figure 3).
Automatic submission of test performance: If a webserver and an email address are
defined, all test-takers’ test performance data can be automatically submitted to the email
address whereby learning can be monitored.
Suggestions
While the high customizability of CoolTomatoes brings considerable convenience to the
language teacher/tester and makes learning and testing more fun, a few precautions have to be
taken before a test is released to the end-user learner.
First, the difficulty of the corpus on which to base the test questions has to be carefully
monitored. It is, needless to say, never rewarding to test students on material that is far
beyond their ability or too easy to be worthwhile.
Second, the answer options have to be carefully chosen. If an achievement test is to be made,
for example, the answer options need to include the syntactic and/or lexical focuses the learners
have just been exposed to. If a proficiency test is to be made, several trials may be necessary to
determine what is going to be in the test. It may be a good idea to generate a large number of test
questions and manually filter out the less appropriate ones.
Third, because authentic corpus texts may contain unexpected content, it is always necessary
to know what is in a test before you ask your students to do it.
Finally, there is the danger of overworking the students. Learners’ motivation to
learn and to be tested should be carefully preserved and encouraged. It is never a good idea to
drown them in tons of automatically generated exercises.
As seen from our study reported in this section, the advantages of corpus-based language
testing are obvious. As a corpus comprises collections of real-life language data, corpus-based
language tests or exercises are unmistakably genuine real-life samples. In an age when
authenticity of language use is greatly emphasized in language testing, corpus-based approaches
certainly enjoy a promising future. The second advantage is that corpus-based language tests can be
automatically graded: since the questions are retrieved from corpora, the right answers to
these questions are always readily extractable. With the questions and the answers at hand,
interactive test questions can be devised. One more advantage of corpus-based
language tests is speed. The computer can work faster than the best human assessor if it is told
exactly what to do.
Of course, a lot more remains to be explored in how language corpora can be appropriately
used in language testing, and whether subjective test items such as essay questions can be
automatically generated and graded. Besides, before the reliability and validity of these
automatically generated test questions are verified, it is questionable whether such questions are
suitable for high-stakes tests.
3.5 Corpus-based interlanguage analysis
While Svartvik (1996), perhaps not without hesitation, predicts that “corpus is becoming
mainstream”, Thomas & Short (1996), without any hesitation, contend that “corpus has become
mainstream”. More encouragingly, Čermák (2003) says that “it seems obvious now that the
highest and justified expectations in linguistics are closely tied to corpus linguistics”, and that
“it is hard to see a linguistic discipline not being able to profit from a corpus one way or
another, both written and oral”. Teubert (2005) concludes that “Today, the corpus is considered
the default resource for almost anyone working in linguistics. No introspection can claim credence
without verification through real language data. Corpus research has become a key element of
almost all language study.”
Research on interlanguage is no exception. As researchers in many linguistic disciplines are
reaping the benefits corpus linguistics constantly brings, SLA researchers have also come to realize
that most of the methodologies (and perhaps also some important notions about what language is)
of corpus linguistics shed important light on interlanguage analysis.
Needless to say, corpus-based interlanguage analysis requires interlanguage corpora (or
learner corpora) to be collected. This is a relatively recent endeavor: it was not until the late 1980s
and early 1990s that academics and publishers started collecting learner corpora (Granger, 2002),
and corpus-based research on interlanguage began to be published when Granger and her
collaborators edited their collection of papers in a volume called Learner English on Computer
(Granger, 1998). Not surprisingly, Granger’s efforts have been followed by many SLA researchers
the world over, and the learner corpus has become “the default resource” for almost anyone
involved in interlanguage analysis.
According to Granger (2002), corpus-based interlanguage analysis usually involves one of
two methodological approaches: Computer-aided Error Analysis (CEA) and Contrastive
Interlanguage Analysis (CIA).
Computer-aided Error Analysis
Computer-aided Error Analysis (CEA) evolved from Error Analysis (EA), which was a
widespread methodology for interlanguage analysis in the 1970s. EA has been criticized for
several weaknesses: EA is based on heterogeneous data and therefore often impossible to
replicate; EA often suffers from fuzzy error categories; EA cannot cater for avoidance; EA only
deals with what learners cannot do; EA cannot provide a dynamic picture of L2 learning.
According to Dagneaux et al. (1998), these weaknesses of EA highlight the need for a new
direction in error analysis. Granger (2002:13) forcefully contends that CEA is a great improvement
on EA in that the new approach is computer-aided, involves a high degree of standardization, and
renders it possible to study errors in their context, along with non-erroneous forms.
The first way to conduct a computer-aided error analysis is to search the learner corpus for
error-prone linguistic items using a concordancer. The advantage is that it is very fast and errors
can be viewed in their context. However, phenomena such as avoidance and non-use remain a
problem.
The second way to conduct a computer-aided error analysis is more common and powerful.
This method requires a standardized error typology to be developed beforehand. This typology is
then converted to an error-tagging scheme, with conventions about what label is to be used to flag
each error type in the learner corpus. Then, experts or native speakers are employed to read the
data carefully, and label each error in the corpus with its corresponding tag in the error-tagging
scheme. This process is very time-consuming, but the efforts are rewarding. Once the tagging is
complete, concordancers can be used to search the corpus for each type of error. The advantages
of this method are also obvious: error categorization is standardized; all errors are counted and
viewed in their context.
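Once a corpus has been error-tagged, the retrieval step can be as simple as a regular-expression search. The sketch below assumes a hypothetical inline tag format, `<err type="...">...</err>`; actual error-tagging schemes (such as the one used in CLEC) differ in their conventions, but the principle is the same: every error is countable and viewable in context.

```python
# A small illustration of the second CEA method: retrieve and count every
# instance of each error type in an error-tagged learner corpus. The tag
# format <err type="...">...</err> is a hypothetical example.
import re
from collections import Counter

ERR_TAG = re.compile(r'<err type="([^"]+)">(.*?)</err>')

def count_errors(tagged_text):
    """Frequency of each error type in the tagged corpus."""
    return Counter(m.group(1) for m in ERR_TAG.finditer(tagged_text))

def concordance(tagged_text, err_type, width=30):
    """Each error of the given type, shown with its surrounding context."""
    lines = []
    for m in ERR_TAG.finditer(tagged_text):
        if m.group(1) == err_type:
            left = tagged_text[max(0, m.start() - width):m.start()]
            right = tagged_text[m.end():m.end() + width]
            lines.append(f"{left} [{m.group(2)}] {right}")
    return lines
```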
CEA as a corpus-based research methodology has been adopted by many interlanguage
analysts. Granger and her collaborators published several of their research articles, most of which
are collected in Granger (1998) and Granger et al. (2002).
In China, following Granger’s practice, Gui & Yang (2003) published CLEC (Chinese
Learner English Corpus), the first error-tagged learner corpus in China. With this error-tagged
corpus, many researchers in China conducted their own interlanguage research projects, and
hundreds of academic papers and monographs have been published. The research findings
convincingly revealed the typical errors Chinese EFL learners tend to make, offering important
implications for ELT in China.
Contrastive Interlanguage Analysis
Contrastive Interlanguage Analysis (CIA) involves comparing a learner corpus with another
corpus or other corpora. Very often, the purpose of such comparison is to shed light on features of
non-nativeness in learner language. Such features can be learner errors, but more often, they are
instances of overrepresentation and underrepresentation (Granger, 2002). That is to say, by means of
comparison, CIA expects to find out which linguistic items in learner language are misused,
overused or underused. Besides, CIA also attempts to discover the reasons for such deviations.
According to Granger (1998; 2002), CIA involves two types of comparison: NS/NNS
comparisons and NNS/NNS comparisons. NS/NNS comparisons are intended to reveal non-native
features in learner language. The assumption implicit in such comparisons is
that native speakers’ language can be regarded as a norm3, to which the learners are expected to
come closer in the process of language learning. The purpose of NNS/NNS comparisons is
somewhat different. Granger maintains that comparisons of learner data from different L1
backgrounds may indicate whether features in learner language are developmental (characteristic
of all learners at a certain stage of their language learning) or L1-induced. In addition, Granger
believes that bilingual corpora containing learners’ L1 and the target language may also provide
evidence for possible L1 transfer. It is worth mentioning that whichever comparison applies, a
statistical test is often necessary to reveal whether the difference found is statistically significant.
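One widely used option for such a statistical test is the log-likelihood (G2) statistic. The sketch below computes it for a single item whose raw frequency is compared across a learner corpus and a reference corpus; the function and variable names are my own illustration.

```python
# A sketch of the statistical step in CIA: given raw frequencies of an item
# in a learner corpus and a reference corpus (with their sizes in tokens),
# the log-likelihood (G2) statistic tests whether the difference indicates
# significant overuse or underuse.
import math

def log_likelihood(freq_learner, size_learner, freq_ref, size_ref):
    """G2 statistic for one item's frequency across two corpora."""
    total = freq_learner + freq_ref
    expected_l = size_learner * total / (size_learner + size_ref)
    expected_r = size_ref * total / (size_learner + size_ref)
    g2 = 0.0
    for obs, exp in ((freq_learner, expected_l), (freq_ref, expected_r)):
        if obs > 0:  # a zero observed count contributes nothing
            g2 += obs * math.log(obs / exp)
    return 2 * g2
```

With one degree of freedom, G2 values above 3.84 indicate a difference significant at p < 0.05; whether the learner frequency lies above or below its expected value then tells us whether the item is overused or underused.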
3 This methodology has drawn some criticism: some linguists insist that any variety of a language is legitimate in its own right.
The CIA methodology has yielded abundant research findings. Most remarkably, Granger and
her collaborators published dozens of research articles, most of which are included in Granger
(1998) and Granger et al. (2002). Besides, the empiricist methodology in the new era has also
aroused the interest of SLA researchers the world over, who have been enthusiastically involved in
the creation of new learner corpora and corpus-based interlanguage studies, with the result that an
ever-growing number of such studies are published in renowned international journals. In China,
after the publication of CLEC, two more learner corpora, namely COLSEC (Yang & Wei, 2005)
and SWECCL (Wen et al., 2005) have been released, providing data for corpus-based
interlanguage studies in the Chinese context. It is particularly worth noting that SWECCL has two
components, SECCL (Spoken English Corpus of Chinese Learners) and WECCL (Written English
Corpus of Chinese Learners), each comprising about one million words.
Undoubtedly, CIA is an effective methodology to reveal non-native features in learner
language. However, how to interpret non-native features in learner language seems to be a more
complicated matter, which cannot be simply attributed to L1 induction or developmental
patterning. Factors such as input may also come into play. It may be necessary to isolate sources
of observed effects (cf. Juffs, 2001). For example, Tono (2004) reports an interesting study, in
which a multifactorial analysis was used to separate L1-induced variation from L2-induced
variation and learner input variation. Just as Cobb (2003) claims, learner corpus research
“amounts to a new paradigm, and a great deal of methodological pioneering remains to be done”.
4. Future trends
Several trends now seem obvious in the use of language corpora in applied linguistics:
1. With the growth of the power of computer hardware and software and the increasing
availability of more texts, the sizes of general corpora may grow to billions of words. While such
large general corpora may still be necessary for our understanding of language, several types of
specialized corpora may be necessary in applied linguistics.
» For language learning and language teaching purposes, large is not always beautiful. It is
the usefulness of corpora that really matters. After all, too much may not be better than
enough. For a pedagogic corpus, size and representativeness are not as important as they are
for a general corpus. Neologisms, for example, may be interesting to researchers, but they
may not always be what we would like learners to learn. It is very likely that corpora for
learners to explore will be carefully controlled in terms of the typicality and difficulty of the
language therein. This will probably also be the corpus on which to base language testing,
though not necessarily for syllabus designing.
» For interlanguage researchers, representativeness of data will always be an important
consideration. However, for different research purposes, what is going to be represented
may also be radically different. Therefore, various types of specialized corpora may be
necessary. Researchers may have to build their own corpora to meet their specific needs. In
corpus-based interlanguage analysis, native speaker norms are unlikely to be abandoned for at
least a few more decades.
2. Corpus-based methods will have to be supplemented with other methods (such as
experimentation) if their findings are to become more meaningful and convincing. Corpus
linguistics is virtually exclusively based on frequency (Gries, to appear). “The assumption
underlying basically all corpus-based analyses”, according to Gries, “is that formal differences
correspond to functional differences such that different frequencies of (co-)occurrences of formal
elements are supposed to reflect functional regularities” (ibid: 5). The problem lies in that corpus
methods exclusively rely on searchable formal elements. The result is that it can be difficult, if not
altogether impossible, to infer the cognitive process going on in the learners’ mind on the basis of
what can be found in text. In other words, corpus linguistics works on the product. It does not help
much when our interest is in the process. Whenever necessary, other methodologies may have to
be resorted to as supplements.
3. With Construction Grammar of Cognitive Linguistics becoming more widely accepted, interest
in the role of phraseology will keep growing. The idiom principle, which states that “a language
user has available to him or her a large number of semi-preconstructed phrases that constitute
single choices, even though they might appear to be analyzable into segments” (Sinclair
1991:110), will have an increasing effect on language teaching, interlanguage analysis, and
language research in general.
4. Interlanguage analysis will focus more of its attention on speech, and on syntactic aspects of
learner language. These aspects of interlanguage have so far not been adequately explored.
Besides, longitudinal studies may be especially rewarding. Just as Cobb (2003:403) claims, it is
“characteristic of learner corpus methodology to extrapolate from cross-sectional to longitudinal
data” in order to address developmental issues with respect to learner interlanguage. Such data
types are more likely to “accurately elucidate developmental pathways”.
Whether taken as a new discipline, a new theory, or a new methodology, corpus linguistics is
exerting influence on an ever-growing number of people in applied linguistics. These people
might virtually be anyone in the applied linguistics circles, from the syllabus designer to the
materials writer, from the language publisher to the dictionary user, from the teacher to the learner,
from the test maker to the test-taker, and from the researcher to the researched. No matter how
language corpora are used in applied linguistics, for whatever purposes, the central theme of
corpus linguistics remains the same: what is more frequent is also more important.
References
Alderson, J.C. (1996). Do corpora have a role in language assessment? In Thomas, J. & Short, M.
(eds.), Using Corpora for Language Research. London: Longman.
Barnett, L. (1993). Teacher off: Computer technology, guidance and self-access. System 21 (3),
295-304.
Biber, D., S. Conrad, and R. Reppen. (1998). Corpus Linguistics: Investigating Language
Structure and Use. Cambridge: CUP.
Botley, S., and A. McEnery. (2000). Corpus-based and Computational Approaches to Discourse
Anaphora. Amsterdam: J. Benjamins. Studies in Corpus Linguistics, No. 3.
Burstein, J. (2003). The e-rater scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Cobb, T. (2003). Analyzing late interlanguage with learner corpora: Quebec replications of three
European studies. Canadian Modern Language Review, 59(3), 393-423.
Čermák, F. (2003). Today's corpus linguistics: Some open questions. International Journal of
Corpus Linguistics, 7(2), 265-282.
Church, K. and B. Mercer. (1993). Introduction to the special issue of Computational Linguistics
Using Large Corpora. Computational Linguistics 19, 103-121.
Cook, G. (1998). The uses of reality: a reply to Ronald Carter. ELT Journal 52(1): 57-63.
Cowie, A.P. (1998). Phraseology. Theory, Analysis and Applications. Oxford: Clarendon Press.
Dagneaux, E., S. Denness, and S. Granger. (1998). Computer-aided error analysis. System 26: 163-
174.
Fillmore, C. (1992). 'Corpus linguistics' or 'Computer-aided armchair linguistics'. In J. Svartvik
(ed.) (1992), pp. 35-60.
Firth, J. (1957). A synopsis of linguistic theory 1930-1955. In Palmer, F. (ed.) (1968), Selected
Papers of J. R. Firth. Harlow: Longman.
Fligelstone, S. (1993). Some reflections on the question of teaching, from a corpus linguistics
perspective. ICAME Journal 17: 97-109.
Francis, W.N. and Kucera, H. (1982). Frequency Analysis of English Usage. Boston: Houghton
Mifflin.
Gavioli, L. (2005). Exploring Corpora for ESP Learning. Amsterdam: John Benjamins.
Granger S., Dagneaux E. and Meunier F. (2002). The International Corpus of Learner English.
Handbook and CD-ROM. Louvain-la-Neuve: Presses Universitaires de Louvain.
Granger, S. (1998). Learner English on Computer. London: Longman.
Granger S. (2002). A Bird's-eye View of Computer Learner Corpus Research. In Granger S., Hung
J. and Petch-Tyson S. (eds), pp. 3-33.
Granger S., Hung J. and Petch-Tyson S. (eds). (2002). Computer Learner Corpora, Second
Language Acquisition and Foreign Language Teaching. Amsterdam & Philadelphia:
Benjamins.
Gries, Th. (to appear). Corpus-based methods in analyses of SLA data. In: Robinson, Peter and
Nick Ellis (eds.), Handbook of cognitive linguistics and second language acquisition. New
York: Routledge, Taylor & Francis Group. Available at
http://www.linguistics.ucsb.edu/faculty/stgries/research/SLA_Constructions_Corpora.pdf.
[Accessed 2007-09-03]
Gui S.C. and H.Z. Yang. (2003). Chinese Learner English Corpus. Shanghai: Shanghai Foreign
Language Education Press.
Halliday, M.A.K. (1993). Quantitative studies and probabilities in grammar. In M. Hoey (ed.),
Data, Description, Discourse. London: Longman, pp. 1–25.
Higgins, J. (1991). Which concordancers: a comparative review of MSDOS software. System,
19(1/2), 91-100.
Hudson, G. (1999). Essential Introductory Linguistics. Blackwell Publishing.
Hunston, S., & Francis, G. (2000). Pattern Grammar - A corpus-driven approach to the lexical
grammar of English. Amsterdam/ Philadelphia: John Benjamins.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: CUP.
Johns, T. F. (1991). Should you be persuaded: Two examples of data-driven learning. In Johns, T.
F. and King, P. (eds.), Classroom Concordancing. Birmingham: ELR, pp. 1-13.
Johns, T. F. (1994). From printout to handout: Grammar and vocabulary teaching in the context of
data-driven learning. In Odlin, T. (Ed.), Perspectives on Pedagogical Grammar. New York:
CUP, pp. 293-313.
Juffs, A. (2001). Verb classes, event structure, and second language learners' knowledge of
semantics-syntax correspondences. Studies in Second Language Acquisition 23.2:305-13
Jurafsky, D., A. Bell, M. Gregory, and W. D. Raymond. (2001). Probabilistic relations between
words: Evidence from reduction in lexical production. In Bybee, J. and P. Hopper (eds.),
Frequency and the Emergence of Linguistic Structure. Amsterdam: John Benjamins.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. London: Longman.
Kehoe, A. & A. Renouf. (2002). WebCorp: Applying the Web to Linguistics and Linguistics to
the Web. In Proceedings of World Wide Web 2002 Conference, Honolulu, Hawaii, 7-11 May
2002.
Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (ed), (1992).
Leech, G. (1997). Teaching and language corpora: a convergence. In Wichmann, A., S.
Fligelstone, T. McEnery, and G. Knowles, (eds), Teaching and Language Corpora. London:
Addison Wesley Longman.
Leech, G. (2005). Adding Linguistic Annotation. In Wynne, M. (ed.). pp. 17-29.
Liang, M. C. (2005). Constructing a model for the automated scoring of Chinese EFL learners’
argumentative essays. Unpublished PhD thesis, Nanjing University.
McEnery, T. and A. Wilson. (2001). Corpus Linguistics: An Introduction (2nd ed.). Edinburgh:
Edinburgh University Press.
Meyer, C. (2002). English Corpus Linguistics. Cambridge: CUP.
Nattinger, J. R., & J. S. DeCarrico. (1992). Lexical phrases and language teaching. Oxford: OUP.
Page, E.B. (1968). Analyzing Student Essays by Computer. International Review of Education, 14,
210–225.
Pawley, A. and H. Syder. (1983). Two puzzles for linguistic theory: nativelike selection and
nativelike fluency. In J. Richards and R. Schmidt (eds.), Language and communication.
London: Longman, pp. 191–226.
Rézeau J. (2001). "Concordances in the classroom: the evidence of the data". In Chambers A. &
Davies G. (eds.), Information and Communications Technology: a European perspective.
Lisse: Swets & Zeitlinger.
Robins, R. H. (1997). A Short History of Linguistics. London: Longman.
Shermis, M. D., & J. Burstein. (2003). Automated essay scoring: A cross-disciplinary perspective.
Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Sinclair, J. (1987). Grammar in the dictionary. In J. Sinclair. (ed), Looking Up: An Account of the
CoBuild Project. London: HarperCollins.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: OUP.
Sinclair, J. and A. Renouf. (1988). A lexical syllabus for language learning. In: R. Carter and M.
McCarthy (eds.), Vocabulary and language teaching. Harlow: Longman.
Skehan, P. (1998). A Cognitive Approach to Language Learning. Oxford: OUP.
Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Blackwell.
Summers, D. (1996). Computer lexicography: the importance of representativeness in relation to
frequency. In Thomas & Short (1996).
Svartvik, J. (ed.). (1990). The London-Lund Corpus of Spoken English: Description and Research.
Lund Studies in English 82. Lund: Lund University Press.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics. Berlin: Mouton de Gruyter.
Svartvik, J. (1996). Corpora are becoming mainstream. In J. Thomas & M. Short (eds.), Using Corpora for Language Research. London: Longman.
Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), 1-13.
Thomas, J. and Short, M. (1996). Using Corpora for Language Research. London: Longman.
Tono, Y. (2002). The role of learner corpora in SLA research and foreign language teaching: the
multiple comparison approach. Unpublished Ph.D. thesis, University of Lancaster.
Tono, Y. (2004). Multiple comparisons of IL, L1 and TL corpora: the case of L2 acquisition of
verb subcategorization patterns by Japanese learners of English. In: Aston, G., S.
Bernardini, and D. Stewart (eds.). Corpora and language learners.
Amsterdam/Philadelphia: John Benjamins, p. 45-66.
Waara, R. (2004). Construal, convention, and constructions in L2 speech. In Achard, M. and S.
Niemeier (eds.), Cognitive Linguistics, Second Language Acquisition and Foreign Language
Pedagogy. Berlin: Mouton de Gruyter, pp. 51-75.
Wen, Q. F., L. F. Wang, and M.C. Liang. (2005). Spoken and Written English Corpus of Chinese
Learners. Beijing: Foreign Language Teaching and Research Press.
Willis, D. (1990). The Lexical Syllabus. London and Glasgow: Collins COBUILD.
Wilson, E. (1997). The automatic generation of CALL exercises from general corpora. In
Wichmann, A., S. Fligelstone, T. McEnery, and G. Knowles, (eds), Teaching and Language
Corpora. London: Addison Wesley Longman.
Wojnowska, A.(2002). Corpus-based language test preparation and automation: the TestBuilder
Project. Unpublished BA diploma paper, School of English, Adam Mickiewicz University.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: CUP.
Wynne, M. (ed.). (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford:
Oxbow Books. Available online from http://ahds.ac.uk/linguistic-corpora/ [Accessed 2007-09-03].
Yang, H.Z. and N.X. Wei. (2005). Zhongguo Xuexizhe Yingyu Kouyu Yuliaoku Jianshe yu
Yanjiu (The creation of and research on the College Learners' Spoken English Corpus).
Shanghai: Shanghai Foreign Language Education Press.