ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES introduction (02) Bambang Kaswanti Purwo [email protected]

Post on 29-Dec-2015


corpus linguistics

not a particular linguistic paradigm

• sociolinguistics
• psycholinguistics
• computational linguistics

a way of doing linguistics
a methodological basis for pursuing linguistic research

[Meyer 2002]

• theory-driven approach
• data-based approach
[to be elaborated later]

the term “corpus linguistics” first appeared in the early 1980s

[McEnery et al. 2006]

corpus-based language study
corpus methodology

pre-Chomskyan period
Boas (1941); Sapir, Newman, Bloomfield, Pike

data storage: shoeboxes filled with paper slips rather than computers

▪ simple collections of written or transcribed texts
▪ not representative

corpus-based = empirical and based on observed data

late 1950s: corpus methodology
▪ severely criticized
▪ marginalized
▪ abandoned

the size of “shoebox corpora”: very small

developments of powerful computers

▪ increasing power
▪ massive storage
▪ relatively low cost

corpus
▪ “any body of text” (McEnery and Wilson 2001), i.e. any collection of recorded instances of spoken or written language
▪ a collection of texts or parts of texts upon which some general linguistic analysis can be conducted (Meyer 2002)
▪ any collection of texts, written or spoken, which is stored on a computer (O’Keeffe et al. 2007)

large amounts of texts can be stored and analyzed using analytical software

▪ collections of texts (or parts of texts) that are stored and accessed electronically (Hunston 2002)

linguistic theory and description

Chomsky’s three levels of adequacy:
• observational adequacy
• descriptive adequacy
• explanatory adequacy

What does it mean if a theory or a description achieves observational adequacy?

It is able to describe which sentences in a language are grammatically well formed.
(a) He studied for the exam.
(b) *Studied for the exam.

descriptive adequacy:
▪ not only describe
▪ specify the abstract grammatical properties making the sentences well formed: English requires an explicit subject

explanatory adequacy:
▪ use abstract principles applicable beyond the language under study: universal grammar (UG)
▪ English, unlike Spanish or Indonesian, is not a language that permits “pro-drop”

Chomsky’s theory of principles and parameters

language acquisition: “the parameters of UG” vs. “the norms of the language being acquired”

pro-drop is a consequence of the “null-subject parameter”

▪ speakers acquiring English set the parameter to negative
▪ speakers acquiring Indonesian set the parameter to positive

generative grammar
▪ emphasis is on universal grammar
▪ explanatory adequacy is a high priority

elements of a language
▪ part of the “core”: “pure instantiations of UG”
▪ part of the “periphery”: “marked exceptions”

Generative Grammar (GG)
a. little concern for variation in a language
b. variation is limited to nonsubstantive elements of the lexicon and general properties of lexical items
c. (a) and (b) belong to the periphery of a language
d. only the elements that are part of the core are relevant for purposes of theory construction
e. (d) is the idealist view of language
f. this is the goal of the minimalist theory, “a theory of the initial state”: a theory of what humans know about language “in advance of experience”
g. the real world of the language and the complexity of the structure that comes out of it is not (yet) their concern

Corpus Linguistics (CL)
h. (g) is what CL is interested in studying
i. complexity and variation are inherent in language
j. very high priority on descriptive, not explanatory, adequacy
k. CL is very skeptical of the highly abstract and decontextualized discussion of language (promoted by GG)
l. such discussions are too far removed from actual language use

the primary concern of
• CL is an accurate description of language
• GG is a theoretical discussion of language that advances our knowledge of universal grammar

“formalists” (generative grammarians) vs. “functionalists”

functionalists are interested in
• language as a communication tool
• how speakers and writers use language to achieve various communicative goals

functionalists approach the study of language from a perspective different from formalists (generative grammarians)

formalists are interested in
• describing the form of linguistic constructions
• using these descriptions to make general claims about Universal Grammar (UG)

I made mistakes [active] vs. Mistakes were made by me [passive]

generative grammarians are interested in
• the structural changes in word order
• making more general claims about the movement of constituents in natural language: the movement of NPs in English actives and passives is part of a more general process, “NP [noun phrase]-movement”

a functionalist is more interested in
• the communicative potential of actives and passives in English
• to study this potential, investigating the linguistic and social contexts favoring or disfavoring the use of, e.g., a passive rather than an active construction

context: a politician embroiled in a scandal

of these three possible constructions, which one to choose?
(1) I made mistakes.
(2) Mistakes were made by me.
(3) Mistakes were made.

the agentless passive construction (3) allows him/her
• to admit that something went wrong
• [at the same time] to evade responsibility for the wrongdoing by being quite imprecise about exactly who made the mistakes

corpora
▪ consist of texts (or parts of texts)
▪ enable linguists to contextualize their analyses of language
▪ are very well suited to more functionally based discussions of language

(1) Jack gave a flower to Ann.
(2) Jack gave Ann a flower.

(3) A flower was given to Ann by Jack.
(4) Ann was given a flower by Jack.

(1) → (2): “dative movement”; “preposition deletion”

“passivization”: (3) and (4), two different analyses
▪ (1) → (3)
▪ (1) → (2) → (4)

syntactic analysis of generative grammarians

functional analysis

(1) Jack gave a flower to Ann.
(2) Jack gave Ann a flower.

▪ what drives an English speaker to utter (1) instead of (2) or (2) instead of (1)?

▪ what question triggers the speaker to say (1) instead of (2) or (2) instead of (1)?

sentence [- context] vs. utterance [+ context]

A1: What did Jack give to Ann?
A2: Whom did Jack give a flower to?

B1: Jack gave a flower to Ann.
B2: Jack gave Ann a flower.

corpus
▪ naturally occurring language
▪ assembled with a particular purpose in mind
▪ assembled to be representative of some language or text type
▪ not a random collection of texts

representative: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language (McEnery et al. 2006)

a corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
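A minimal sketch of what “sampled to be representative” could look like in practice, assuming a pool of texts that have already been labelled by category. The function name `stratified_sample`, the category labels, and the proportional-allocation rule are illustrative assumptions, not a method described in the slides; real corpus designs choose their categories and proportions by explicit linguistic criteria.

```python
import random
from collections import defaultdict

def stratified_sample(texts, total, seed=0):
    """Pick roughly `total` texts so that each category keeps its share of the pool.

    `texts` is a list of (category, text_id) pairs (hypothetical labels).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, ids in by_category.items():
        # Proportional allocation, with at least one text per category.
        quota = max(1, round(total * len(ids) / len(texts)))
        sample[category] = rng.sample(ids, min(quota, len(ids)))
    return sample

# Hypothetical pool: 60 news texts, 30 fiction texts, 10 academic texts.
pool = (
    [("news", f"news_{i}") for i in range(60)]
    + [("fiction", f"fiction_{i}") for i in range(30)]
    + [("academic", f"academic_{i}") for i in range(10)]
)
sample = stratified_sample(pool, total=10)
for category, ids in sample.items():
    print(category, len(ids))
```

Sampling within categories, rather than drawing texts at random from the whole pool, keeps small categories from disappearing from the sample.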

corpus: a possible classification

▪ medium: spoken corpora (e.g. London-Lund corpus) vs. written corpora (e.g. Lancaster-Oslo/Bergen corpus (LOB)) vs. mixed corpora (British National Corpus (BNC) or Bank of English)
▪ national varieties: British corpora (e.g. Lancaster-Oslo/Bergen corpus) vs. American corpora (e.g. Brown corpus) vs. an international corpus of English
▪ historical variation: diachronic corpora (Helsinki corpus, cf. the ICAME home page) vs. synchronic corpora (Brown, LOB, BNC) vs. corpora which cover only one stage of language history (corpus of Old or Middle English, Shakespeare corpora)
▪ geographical variation/dialectal variation: corpus of dialect samples (e.g. Scots) vs. mixed corpora (the BNC spoken component includes samples of speakers from all over Britain)
▪ age: corpora of adult English vs. corpora of child English (English components of CHILDES)
▪ genre: corpora of literary texts vs. corpora of technical English vs. corpora of non-fiction (e.g. news texts) vs. mixed corpora covering all genres
▪ open-endedness: closed, unalterable corpora (e.g. LOB, Brown) vs. monitor corpora (Bank of English)
▪ availability: commercial vs. non-commercial research corpora, online corpora vs. corpora on ftp servers vs. corpora available on floppy disks or CD-ROMs

Why do corpus linguists use computers to manipulate and exploit language data?

electronic corpora have advantages unavailable to their paper-based equivalents

▪ process and manipulate the data rapidly and easily (e.g. searching, selecting, sorting, and formatting)
▪ process machine-readable data accurately and consistently
▪ computers can avoid human bias in an analysis, making the result more reliable
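The searching, selecting, and sorting mentioned above can be illustrated with a small keyword-in-context (KWIC) concordance. This is only a sketch using Python's standard library; the function name `kwic`, the context width, and the sample text (taken from the active/passive examples earlier in these slides) are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=20):
    """Return (left context, hit, right context) for each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits

sample_text = "I made mistakes. Mistakes were made by me. Mistakes were made."

# Searching: find every occurrence of the keyword.
hits = kwic(sample_text, "mistakes")

# Sorting: order the concordance lines by what follows the keyword.
hits_by_right_context = sorted(hits, key=lambda hit: hit[2].lower())

# Formatting: right-align the left context so the keyword forms a column.
for left, word, right in hits_by_right_context:
    print(f"{left:>20} | {word} | {right}")
```

Sorting by the right-hand context groups similar continuations together, which is how recurring patterns around a word become visible in a concordance.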

The Brown corpus and the Lancaster-Oslo/Bergen corpus (LOB): Some well-known corpora from the beginnings of the computer age are the Brown corpus of written American English and the Lancaster-Oslo/Bergen corpus of written British English. The Brown corpus [the first modern corpus of the English language] was compiled in the 1960s, its British counterpart in the 1970s. Both consist of around one million tokens (i.e. words, counted every time they appear).
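Counting tokens, i.e. words counted every time they appear, can be sketched as follows. The contrast with types (distinct word forms) is standard terminology but is not introduced in these slides, and the regular-expression tokenizer is a deliberately crude assumption; real corpus tools handle punctuation, hyphens, and clitics far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

text = "Jack gave a flower to Ann. Jack gave Ann a flower."
tokens = tokenize(text)
frequencies = Counter(tokens)

print(len(tokens))       # 11 tokens: every occurrence is counted
print(len(frequencies))  # 6 types: each distinct word form is counted once
print(frequencies["flower"])
```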

The London-Lund corpus is another corpus of British English created around that time, but this corpus is different from the Brown and the LOB in that it exclusively contains transcripts from spoken material, collected at the Survey of English Usage at University College London. The London-Lund corpus, the Brown corpus, the LOB and other corpora are now available on CD-ROM as the ICAME collection of English texts. The International Computer Archive of Modern and Medieval English (ICAME), situated at Bergen in Norway, offers a wealth of information on these corpora.

The Bank of English was initiated in 1991 by COBUILD (a division of HarperCollins publishers) and the University of Birmingham. Its main purpose has been to provide a textual database for the compilation of dictionaries and for language studies. The Bank of English is a monitor corpus (i.e. new material is constantly added). The corpus now contains more than 320 million words.

The British National Corpus was compiled by a consortium of British publishers, academic institutions (such as Oxford University Computing Services and Lancaster University's Centre for Computer Research on the English Language), and the British Library. It is now a 100 million word corpus of modern British English, both written and spoken, including everyday conversations [a hundred times larger than the Brown corpus]. It is available on CD-ROM for research purposes; we have a copy at our department.

The International Corpus of English (ICE) will ultimately be a collection of 1,000,000 word corpora from each country or region where English is spoken as a first language. The corpus consists of a written and a spoken component. The Survey of English Usage, situated at University College London, is responsible for this project. The home page of the Survey provides information on a variety of research projects, including the International Corpus of English (ICE).

The CHILDES system (mirror of the American site in Antwerp): This is the home page for the Child Language Data Exchange System (CHILDES). In particular, you'll find the CHILDES database, a collection of child language transcript data from a number of projects in different languages (including English and German).

Examples of English language corpora
http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm

▪ The Bank of English: written and spoken English (used extensively by researchers and for the COBUILD series of English language books)
▪ The BNC: written and spoken British English (used extensively by researchers and for the Oxford University Press, Chambers and Longman publishing houses)
▪ CANCODE (Cambridge and Nottingham Corpus of Discourse in English): spoken British English (used extensively by researchers and Cambridge University Press)
▪ ICE (International Corpus of English): international varieties of spoken and written English (most of the corpus is not yet available)
▪ Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus: parallel corpora of written texts (but now rather outdated)
▪ London-Lund Corpus (Survey of English Usage): spoken British English (used very extensively by researchers, but it is now quite old)
▪ Santa Barbara Corpus: spoken American English (most of the corpus is not yet available)
▪ Hong Kong Corpus of Spoken English (still being compiled; 1 million of the target 1.5 million words have been collected so far)
▪ ICAME (International Computer Archive of Modern English): a centre which aims to coordinate and facilitate the sharing of computer-based corpora

Online corpora

▪ Experimental BNC Website: Bad Guys Dont Look: The British National Corpus consortium currently offers a BNC online service which allows everyone with access to the internet to register for an account on the BNC server (free for twenty days of unlimited usage).
▪ Shakespeare Online Corpus Concordance browsing: This site allows you to search a number of English literary classics, including the Brontë novels, Shakespeare and James Joyce's Ulysses, with the help of the concordance program TactWeb. It is easy to use, even for absolute novices in the area.