Corpus 3 Corpus-based Description

Post on 21-Dec-2015

Corpus 3

Corpus-based Description

Aspects of corpus-based studies

• lexis, morphology, syntax and discourse.

• Fig. 3.1 A classification of corpus-based research on English

Lexical description

• The most obvious use of corpora for lexical description is in lexicography.

• Not only to identify the set of different words and show when new types enter the language, but also to identify the various senses or uses of particular types and their relative frequencies.

• e.g. London-Lund Corpus: polysemous word good

• Table 3.1

• Identify neologisms

Pre-Electronic Lexical Description for Pedagogical Purposes

• Thorndike (1921): word frequency on the basis of a 4.5-million-word corpus of literary works and books read by younger children.

• The principle of vocabulary control in the design and editing of reading materials owes much to Thorndike's pioneering work.

• Michael West: General Service List of English Words (1953)

Pre-Electronic Lexical Description for Pedagogical Purposes

• Description of the most frequent 2,000 words in the written English of the time, supplemented by information on the frequency of the meanings or uses of these words, based on the work of Lorge.

• Fig. 3.2

• The Thorndike-Lorge corpus was biased towards more literary and formal styles of writing, and did not include speech at all.

Computer-based studies of lexicon

• With a computerized corpus and appropriate software, both significant and more trivial but interesting facts about the lexicon of a language can be uncovered.

• Table 3.2 The rank ordering of the 50 most frequent words in various corpora shows remarkable consistency and systematic differences.
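
A rank ordering of this kind can be sketched in a few lines of Python. This is only an illustration: the `rank_words` helper, the crude tokenizer and the sample sentence are assumptions made here, not part of the studies behind Table 3.2.

```python
from collections import Counter
import re

def rank_words(text, top_n=50):
    """Return (rank, word, count) rows for the top_n most frequent words."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered[:top_n], start=1)]

# Toy text standing in for a corpus.
sample = "The cat sat on the mat and the dog sat on the rug."
for rank, word, count in rank_words(sample, top_n=3):
    print(rank, word, count)
```

Running the same function over two corpora and comparing the lists side by side is one simple way to look for the consistency and the differences that the table reports.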

Computer-based studies of lexicon

Consistency: all the words except said are function words.

Word Occurrence

The fact that 40% of the words in a corpus of over five million words occur only once shows that a corpus of even that size is not a sound basis for lexicographical studies of low-frequency words.
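
The share of once-only words can be measured as sketched below. The sketch assumes that "words" here means word types (distinct word forms); the `hapax_share` name and the toy token list are illustrative only.

```python
from collections import Counter

def hapax_share(tokens):
    """Return the fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Toy token list standing in for a corpus (assumption).
tokens = "the cat sat on the mat and the dog sat".split()
print(f"{hapax_share(tokens):.0%} of types occur only once")
```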

Word Occurrence

Sharman found that there was an almost linear relationship between vocabulary size and corpus size. A new word appeared in the text approximately every 30 words on average.

The more narrowly focused the corpus, the more content words find their way into the higher frequency levels.
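
The relationship between vocabulary size and corpus size can be traced by counting distinct types as the text is read. The following is a minimal sketch; `vocabulary_growth`, the `step` parameter and the toy tokens are assumptions, and the toy figures are not meant to reproduce Sharman's.

```python
def vocabulary_growth(tokens, step=5):
    """Return (tokens_seen, types_seen) pairs sampled every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point at each step and once more at the end of the text.
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points

# Toy tokens standing in for a running text (assumption).
tokens = "a b a c b d a e f a".split()
points = vocabulary_growth(tokens, step=5)
print(points)

# Average number of running words per new type over the whole text.
total_tokens, total_types = points[-1]
print(total_tokens / total_types)
```

Plotting the pairs for a real corpus would show how close to linear the growth is over the range sampled.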

Word Classes

• Table 3.5 (written English): Relative proportions of major word classes in the Brown and LOB corpora

• Table 3.6 (spoken English): notably fewer nouns, and a considerable proportion of discourse items characteristic of speech.

Word Classes

• Table 3.7 shows that some sequences such as adjective + noun or noun + noun are very frequent indeed.

• Johansson and Hofland: occurrence of the 40 most frequent sequences of word-class tags at the beginnings and ends of sentences. Findings: the ends of sentences may be more predictable grammatically than the beginnings.
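
Counts of word-class sequences like these can be derived from any tagged corpus. The sketch below uses three hand-tagged sentences and a simplified tag set, both invented for illustration; they are not the Brown or LOB tags, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical hand-tagged sentences: lists of (word, tag) pairs.
tagged_sentences = [
    [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"), ("left", "VERB")],
    [("a", "DET"), ("stone", "NOUN"), ("wall", "NOUN"), ("fell", "VERB")],
    [("she", "PRON"), ("likes", "VERB"), ("green", "ADJ"), ("tea", "NOUN")],
]

def tag_bigrams(sentences):
    """Count adjacent tag pairs within each sentence."""
    counts = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        counts.update(zip(tags, tags[1:]))
    return counts

def edge_sequences(sentences, n=2):
    """Count the first and last n tags of each sentence separately."""
    starts, ends = Counter(), Counter()
    for sent in sentences:
        tags = tuple(tag for _, tag in sent)
        if len(tags) >= n:
            starts[tags[:n]] += 1
            ends[tags[-n:]] += 1
    return starts, ends

bigrams = tag_bigrams(tagged_sentences)
starts, ends = edge_sequences(tagged_sentences)
print(bigrams.most_common(3))
print(len(starts), "distinct openings;", len(ends), "distinct endings")
```

Fewer distinct ending sequences than opening sequences, for the same number of sentences, would be one rough indicator of the greater predictability of sentence ends.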

Register studies

• Table 3.8

• There are certain characteristics of the vocabulary of scientific English. Certain relational words are disproportionately more frequent in scientific English. Comparative adjectives and adverbs are similarly disproportionately frequent, whereas locative adverbs of space or time are disproportionately less frequent in scientific texts than in general written American English.

• Some items occur in one variety but are highly unlikely to occur in the other.
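
"Disproportionately frequent" can be made concrete by normalising counts to a common base and taking the ratio between two corpora. The following sketch assumes a per-thousand-words base; the helper names and the two tiny token lists are invented for illustration.

```python
from collections import Counter

def per_thousand(tokens):
    """Return each word's frequency per 1,000 running words."""
    total = len(tokens)
    return {w: 1000 * c / total for w, c in Counter(tokens).items()}

def frequency_ratio(word, specialised, general):
    """Ratio of normalised frequencies; above 1 means over-represented
    in the specialised corpus. Returns None if the word is absent from
    the general corpus, since no ratio can be formed."""
    spec = per_thousand(specialised).get(word, 0.0)
    gen = per_thousand(general).get(word, 0.0)
    if gen == 0.0:
        return None
    return spec / gen

# Toy token lists standing in for a scientific and a general corpus.
science = "the value is larger than the mean and larger than the median".split()
general = "we went there and then we came home earlier than planned today".split()

print(frequency_ratio("than", science, general))    # comparative marker
print(frequency_ratio("there", science, general))   # locative adverb
print(frequency_ratio("median", science, general))  # absent from general
```

A `None` result corresponds to the last bullet above: an item found in one variety and not in the other.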

Semantic information

• Longman Dictionary of Contemporary English

• noun entries 23,800

• 67% one sense 15946

• 20% two senses 4760

• 6.5% three senses 1547

• 2.5% four senses 595

Semantic information

• verb entries 7,921

• 55% one sense 4357

• 23.8% two senses 1885

• 10% three senses 792

• 4.4% four senses 348
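
The counts on these two slides follow from the percentages and the entry totals. A quick arithmetic check, using only the figures given above (the variable and function names are illustrative):

```python
# Entry totals and percentage shares as given on the slides.
noun_entries = 23_800
verb_entries = 7_921
noun_shares = {1: 67.0, 2: 20.0, 3: 6.5, 4: 2.5}
verb_shares = {1: 55.0, 2: 23.8, 3: 10.0, 4: 4.4}

# Counts as printed on the slides, keyed by number of senses.
noun_counts = {1: 15946, 2: 4760, 3: 1547, 4: 595}
verb_counts = {1: 4357, 2: 1885, 3: 792, 4: 348}

def expected_counts(total, shares):
    """Convert percentage shares into expected numbers of entries."""
    return {senses: total * pct / 100 for senses, pct in shares.items()}

for label, total, shares, printed in [
    ("nouns", noun_entries, noun_shares, noun_counts),
    ("verbs", verb_entries, verb_shares, verb_counts),
]:
    for senses, value in expected_counts(total, shares).items():
        print(label, senses, round(value, 1), printed[senses])
```

The computed and printed values agree to within one entry; the small differences for verbs come from rounding of the percentages.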

Collocation

• Some words tend to occur in the company of other words in certain contexts, e.g. pouring rain, statistically significant, intrinsic value, strong tendency

• Lexicalized unit: set phrase, idiomatic usage, cliché

Collocation

• Interest in recurring word combinations:

• Wong-Fillmore (1976): The strategy of acquiring formulaic speech is central to the learning of language.

Collocation

• Peters (1980): argued that unanalyzed sequences of words had a significant role among the units of language acquisition, and proposed ways of identifying such sequences.

• Nattinger and DeCarrico (1992): since first language learners can be seen to use varying, apparently unanalyzed, prefabricated chunks of speech, second language teaching might similarly be concentrated around the establishment of what they call lexical phrases.

Collocation

• Different characteristics of the sequence:

• Allow for no alteration: it's as easy as falling off a log.

• Allow certain changes (at the moment/at certain moments)

• Relatively free within a framework (too ... to, n ... of)

Collocation

• Problems in the definition of collocation:

• How often does a combination have to recur to be habitual?

• Who decides what sounds natural?

• Does a combination have to be well-formed or canonical to be a collocation?

• Do collocations have to be syntactic or are they primarily semantic?

• Do collocations have to consist of adjacent words or can they be discontinuous?

Collocation

• Can a sequence which occurs only once in a particular corpus, but which is intuitively recognized by native speakers as a sequence they have heard before, be listed as a collocation?

• How big does a corpus have to be in order to establish that a collocation does exist?

• Are there degrees of collocationality based on the flexibility of the bonding between words?

Collocation

• Can we lemmatize collocations so that similar or inflectionally related sequences are counted as a single collocation type?

• Can degrees of collocationality be established on the basis of the number of tokens of a type in a particular corpus?

Collocation

• Sinclair (1991) suggested that a span of up to four words on each side of a word is the environment in which collocation is most likely to occur, although computer software makes it possible to explore much larger spans, up to the size of a whole text.
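
A span-based search for collocates can be sketched as follows. The `collocates` function, its `span` parameter and the sample sentence are assumptions made for illustration. The sketch returns raw co-occurrence counts only; it does not test whether a pairing is statistically significant.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` words either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after
        counts.update(left + right)
    return counts

# Toy token list standing in for a corpus (assumption).
tokens = "heavy rain fell and then pouring rain flooded the low road".split()

print(collocates(tokens, "rain", span=1))
print(collocates(tokens, "rain").most_common(4))
```

Widening `span` towards the length of the text corresponds to the larger spans mentioned above, at the cost of picking up words with no real bond to the node.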

Tense and aspect of verbs

• Table 3.14 Rank order of the most frequent simple and complex finite verb forms

• Table 3.15 Relative frequencies of use of finite verb forms

• Table 3.16 Perfect and progressive verb forms in the Brown Corpus

• Table 3.17 Finite and non-finite verb forms

• Table 3.18 Past participle

Modals

• Table 3.19 frequency of nine modals

• Table 3.20 use of modals

• Table 3.21 use of modals in verb-phrase structures

Voice

• Table 3.22 active and passive predications

• Table 3.23 use of passives in different registers

• Table 3.24 verb-phrase structure of agentive passives

Verb and particle use

• Subjunctive

• Prepositions

• Conjunctions

Grammatical studies

• Corpus-based grammatical studies have revealed considerable genre differences in the use of syntactic patterns and in sentence length.

• Syntactic constructions are not in free variation.

• Grammatical study is more of a challenge than lexical study because the tagging and parsing needed for automatic analysis of texts, and the software to support them, have not been widely available or user-friendly.

Sentence length

• Sentence length is related to genre. The mean number of words per sentence in informative categories is much greater than in imaginative prose.

• There is much closer consistency in the number of predications per sentence regardless of genre.

• Table 3.41 Sentence length and predications
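
Mean sentence length per genre can be computed as in the sketch below. The naive sentence splitter, the `sentence_lengths` name and the two sample texts are assumptions. Counting predications per sentence would need parsed text and is not attempted here.

```python
import re
from statistics import mean

def sentence_lengths(text):
    """Return the number of words in each sentence of a text."""
    # Naive splitter (an assumption): sentences end at ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

# Invented samples standing in for genre categories of a corpus.
samples = {
    "informative": "The committee reviewed the annual budget and approved "
                   "the revised plan. Spending on roads will rise next year.",
    "imaginative": "She ran. The door was shut. Nobody answered her call.",
}

for genre, text in samples.items():
    lengths = sentence_lengths(text)
    print(genre, lengths, round(mean(lengths), 2))
```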

Syntactic processes

• Clause patterning

• Table 3.42 Distribution of recurrent verb-complement patterns

• SVC (adj.) 45%

• SVO 20.9%

Syntactic processes

• About half of the clauses are matrix clauses and half are embedded. Of the matrix clauses, 97.8% are finite, 1.5% are nonfinite, and 0.7% are elliptical.

• The vast majority of all informational subject clauses are extraposed (it is necessary that), reflecting a principle of end-focus from a functional sentence perspective or preferences in sentence organization for processing purposes.

Syntactic processes

• In informative prose the verb which precedes a finite that-clause is more likely to be a communication verb such as say or state, whereas in spoken conversation affective or cooperative verbs such as think, feel and hope tend to predominate.

Noun modification

• 98% of postmodifying clauses had one or other of the simpler clause patterns SVO (37%), SVO (38%), SVC (38%), suggesting that embedding tends to favor less complex sentence patterns.

• 70% of noun phrases function as subjects or prepositional complements, and noun phrases with postmodifying clauses tend to be disfavoured in subject functions.

Noun modification

• Postmodification is less frequent in nonfinal positions of sentences. This is because the subject or topic is familiar enough not to need identification or elaboration through postmodification, or because brief subjects are easier to process.

Causation

• Causation can be marked lexically (because, cause), by syntactic structure (because of) or by implicature.

• Choice for expressing causation is seldom free, but is influenced by various semantic, pragmatic, stylistic, cognitive and textual variables.

Pragmatics

• Table 3.5 Distribution of discourse items

• Comparisons of spoken and written English

• Table 3.58 pretty

• Table 3.59 really just right