corpus 06 discourse characteristics. reasons why discourse studies are not corpus-based: 1. many...

Post on 20-Dec-2015

Corpus 06

Discourse Characteristics

• Reasons why discourse studies are not corpus-based:

• 1. Many discourse features cannot be identified automatically. E.g. known/new information

• 2.  Analysis tools are not helpful.

• Solutions:

• 1.  Develop interactive programs

• 2. Use surface grammatical features of a text.

Questions

• 1. How are references marked in different ways in different kinds of texts?

• 2. How does the sequence of verbs within a text develop with respect to the marking of tense and voice?

Reference

• Noun phrases are a major device for referring to objects, people and other entities.

• Reference by noun phrase can take the form of a full noun phrase or a pronoun; the former typically expresses new information, the latter given information.

Types of reference

• Exophoric: also called text-external; refers directly to the speaker and addressee. E.g. you, I.

• Anaphoric: refers to a person or thing that has already been mentioned in the text. E.g. it, that.

• Inferrable: refers to something that can be inferred from common sense and that is neither exophoric nor anaphoric, such as "the restructuring" and "its debt burden" in: The engineering and consulting firm, which has been plagued by losses for five years, said the restructuring is required to relieve its debt burden and “acute shortage of cash.”

Characteristics of referring expressions

• Four parameters:

• Status of information: given versus new

• For given information, type of reference: anaphoric, exophoric, or inferrable

• For anaphoric reference, form of expression: pronoun, synonym, or repetition

• For anaphoric reference, the distance between the anaphoric expression and its antecedent

Steps

• 1. Grammatically tag all texts.

• 2. Run the interactive program over each text; it stops whenever it reaches a noun or pronoun.

• 3. At each stop, the program prompts the user to select the correct codes for that noun phrase.

Computer processing

• Information status: pronouns are automatically coded as given information. For each noun, the program automatically checks whether there is an earlier occurrence of the same noun in the text. If there is, the repeated noun is automatically coded as given information. All other full nouns are pre-coded as new information. These nouns are then checked interactively to determine whether they actually represent given information.
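The pre-coding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the tag names `PRON` and `NOUN`, the record fields, and the function name are all hypothetical, and "the same noun" is approximated by matching lowercased word forms rather than lemmas.

```python
# Hypothetical tag names; a real tagger's tagset will differ.
PRONOUN_TAG = "PRON"
NOUN_TAG = "NOUN"


def precode_information_status(tagged_tokens):
    """Pre-code every noun and pronoun as given or new information.

    tagged_tokens is a list of (word, tag) pairs from a grammatical tagger.
    Returns one record per noun or pronoun, in text order.
    """
    seen_nouns = set()
    records = []
    for position, (word, tag) in enumerate(tagged_tokens):
        if tag == PRONOUN_TAG:
            # Pronouns are automatically coded as given information.
            records.append({"position": position, "word": word,
                            "status": "given", "check_interactively": False})
        elif tag == NOUN_TAG:
            key = word.lower()  # assumption: same word form = same noun
            repeated = key in seen_nouns
            seen_nouns.add(key)
            # Repeated nouns are given; first occurrences are pre-coded as
            # new and flagged so the analyst can confirm or correct them.
            records.append({"position": position, "word": word,
                            "status": "given" if repeated else "new",
                            "check_interactively": not repeated})
    return records


example = [("The", "DET"), ("firm", "NOUN"), ("said", "VERB"),
           ("it", "PRON"), ("needs", "VERB"), ("cash", "NOUN"),
           ("because", "SCONJ"), ("the", "DET"), ("firm", "NOUN"),
           ("has", "VERB"), ("losses", "NOUN")]
records = precode_information_status(example)
```

In this toy input the second "firm" is pre-coded as given, while "cash" and "losses" are pre-coded as new and flagged for the interactive check.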

Type of reference

• The pronouns I and you are automatically coded as marking exophoric reference. Third person pronouns are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences. Nouns with given informational status are likewise automatically labeled anaphoric and checked interactively to identify exophoric and inferrable occurrences.

Forms of anaphoric expression

• If nouns have been coded as anaphoric and an earlier occurrence of the same noun was found in the text, the referring expression is automatically identified as a noun repetition. Other anaphoric nouns are coded as synonymous.
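A rough sketch of how the default codes for type of reference and form of expression might be assigned is given here. The function name, argument names, and tag labels are hypothetical. As a simplification, every pronoun other than I and you is treated as a third person pronoun.

```python
EXOPHORIC_PRONOUNS = {"i", "you"}  # the two pronouns named in the slides


def precode_reference(word, tag, status, repeated):
    """Assign default type-of-reference and form codes to one expression.

    word: the noun or pronoun; tag: "PRON" or "NOUN" (hypothetical labels);
    status: "given" or "new"; repeated: True if the same noun occurred earlier.
    """
    code = {"type": None, "form": None, "check_interactively": False}
    if status != "given":
        return code  # new information receives no reference type
    if tag == "PRON":
        if word.lower() in EXOPHORIC_PRONOUNS:
            code["type"] = "exophoric"
        else:
            # Default for other pronouns; the analyst may change the type
            # to exophoric or inferrable.
            code.update(type="anaphoric", form="pronoun",
                        check_interactively=True)
    else:
        # Given nouns default to anaphoric: a repetition if the same noun
        # occurred earlier, otherwise a synonym.
        code.update(type="anaphoric",
                    form="repetition" if repeated else "synonym",
                    check_interactively=True)
    return code
```

For example, `precode_reference("firm", "NOUN", "given", True)` yields an anaphoric repetition, while `precode_reference("I", "PRON", "given", False)` yields an exophoric code with no interactive check.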

Distance between the target referring expression and its antecedent

• The antecedent of all anaphoric nouns and pronouns must be identified. For repeated nouns, the antecedent is automatically pre-coded as the earlier occurrence of the same noun; these antecedents are checked interactively to determine whether there is a closer synonymous expression. For all other nouns and pronouns, the user of the interactive program must type in the antecedent. The distance between the target referring expression and its antecedent can then be computed automatically.
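Once each anaphoric expression is linked to its antecedent, the distance computation is simple arithmetic. The sketch below assumes that distance is counted as the number of referring expressions that intervene between antecedent and target; the slides do not state the unit at this point, so that choice and the function names are assumptions.

```python
def anaphoric_distance(antecedent_index, target_index):
    """Count the referring expressions strictly between antecedent and target.

    Both arguments are positions in the ordered list of all referring
    expressions (nouns and pronouns) in the text.
    """
    if antecedent_index >= target_index:
        raise ValueError("the antecedent must precede the target expression")
    return target_index - antecedent_index - 1


def average_distance(links):
    """Average distance over a list of (antecedent_index, target_index) pairs."""
    distances = [anaphoric_distance(a, t) for a, t in links]
    return sum(distances) / len(distances)


# Hypothetical links for one short text: three anaphoric expressions.
links = [(0, 1), (0, 3), (2, 6)]
mean_distance = average_distance(links)  # distances are 0, 2 and 3
```

Averaging such values separately for pronouns and for full nouns, and separately per register, gives figures comparable in kind to the per-register averages reported in this deck.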

Register and Types of Information

• Reference: Conversation and speech use referring expressions relatively frequently, although news has the largest number of referring expressions.

• Given/new information: Conversation and speech rely heavily on given information while news and academic prose have more new information.

Types of Reference

• Exophoric pronouns: these account for over half of all given references in conversation, but not in the written registers.

• Anaphoric: written registers rely heavily on it.

• The high proportion of expressions marking new information accounts for the reliance on anaphoric reference in written registers.

Average distance measures for four registers

• Conversation 4.5

• Public speeches 5.5

• News reportage 11.0

• Academic prose 9.0

This makes sense given the difference in the production and comprehension circumstances of written and spoken registers.

Conversation and speeches must be produced and comprehended on-line. Co-references with short anaphoric distance are easier to understand.

Frequent use of exophoric pronouns referring to the speaker or listener in conversation

Average distance measures for pronominal versus full noun anaphoric expressions

Register          Average pronominal distance   Average full noun distance
Conversation      3.0                           9.0
Public speeches   3.5                           10.0
News reportage    3.0                           13.5
Academic prose    2.5                           10.0

• Pronouns tend to occur much closer to their antecedents than repeated full nouns.

• The greater the number of intervening referring expressions, the greater the chance for ambiguity and confusion over the intended reference of pronominal forms. Thus full noun expressions are preferred for anaphoric reference over large distances.

Discourse maps of verb tense and voice

• There are shifts in communicative purpose within the course of a text.

• Example: research articles follow a standard four-part organization: Introduction, Methods, Results, Discussion (I-M-R-D).

Steps of analysis of 19 medical research articles

• Step 1: frequency counts of present tense, past tense and agentless passives across the IMRD sections.

• Step 2: calculate the average frequency counts for each type of section.

• Step 3: Compute ANOVA and correlation coefficients (r2) for each linguistic feature. The significance level is set at 0.001.
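The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the procedure actually run on the 19 articles: the function names are hypothetical and the rates in `toy_rates` are invented purely to show the arithmetic. The p-value is left out because it requires the F distribution; a statistics package (for instance `scipy.stats.f_oneway`) would supply it.

```python
from statistics import mean


def normed_rate(count, total_words, per=1000):
    """Step 1: convert a raw frequency count to a rate per 1,000 words."""
    return count * per / total_words


def one_way_anova(groups):
    """Step 3: return (F, r squared) for a dict of group name -> list of rates."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(values) * (mean(values) - grand_mean) ** 2
                     for values in groups.values())
    # Variation of individual texts around their own group mean.
    ss_within = sum((v - mean(values)) ** 2
                    for values in groups.values() for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return f_stat, r_squared


# Invented rates for two sections, three texts each (illustration only).
toy_rates = {"I": [1.0, 2.0, 3.0], "M": [4.0, 5.0, 6.0]}
section_means = {name: mean(values) for name, values in toy_rates.items()}  # Step 2
f_stat, r_squared = one_way_anova(toy_rates)
```

Here r squared is the share of total variation that lies between the groups, which is how the r2 values in this analysis are interpreted.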

Mean scores (per 1,000 words) of selected linguistic features across the I-M-R-D sections of English medical research articles (N=19)

Linguistic feature    I      M      R      D      F       p       r2
Present tense         47.9   21.1   35.9   60.6   29.25   <.001   .549
Past tense            20.7   48.5   40.3   13.0   36.74   <.001   .605
Agentless passives    18.4   39.9   16.9   16.3   33.17   <.001   .580

p<.001: H0 is rejected. The difference between groups is significantly larger than the difference within groups.

r2=.549: 54.9% of the variation in the normed counts for present tense can be accounted for by knowing which section (I, M, R, or D) each text sample comes from. The differences across sections in the use of present tense verbs are therefore substantial as well as statistically significant.
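In a one-way ANOVA, F and r2 are tied together by the degrees of freedom, so the reported values can be cross-checked against each other. The sketch below assumes 4 groups (the I, M, R and D sections) and 76 observations (19 articles with 4 sections each); the observation count is an inference from N=19, not something the slides state.

```python
def f_from_r_squared(r_squared, n_groups, n_observations):
    """Recover the one-way ANOVA F statistic from r squared."""
    df_between = n_groups - 1
    df_within = n_observations - n_groups
    return (r_squared / df_between) / ((1 - r_squared) / df_within)


# Reported (F, r squared) pairs from the I-M-R-D analysis.
reported = {
    "Present tense": (29.25, 0.549),
    "Past tense": (36.74, 0.605),
    "Agentless passives": (33.17, 0.580),
}

# Assumption: 19 articles x 4 sections = 76 observations in 4 groups.
recovered = {feature: f_from_r_squared(r2, n_groups=4, n_observations=76)
             for feature, (_, r2) in reported.items()}
```

Under that assumption the recovered F values come out at roughly 29.2, 36.8 and 33.1, each within rounding error of the reported figures, which suggests the reported statistics are internally consistent.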

Findings

• Present tense occurs most frequently in discussion sections, and somewhat less frequently in introductions. Both sections tend to emphasize the current state of our knowledge and the present implications of research findings.

• Past tense appears more in methods and results sections, reflecting a focus on reporting past events and procedures.

• Agentless passives have a high frequency in methods sections, where they present events impersonally.
