Corpus 06 Discourse Characteristics

Post on 20-Dec-2015


Page 1: Corpus 06 Discourse Characteristics. Reasons why discourse studies are not corpus-based: 1. Many discourse features cannot be identified automatically

Corpus 06

Discourse Characteristics

• Reasons why discourse studies are not corpus-based:

• 1. Many discourse features cannot be identified automatically, e.g. given/new information.

• 2.  Analysis tools are not helpful.

• Solutions:

• 1.  Develop interactive programs

• 2. Use surface grammatical features of a text.

Questions

• 1. How are references marked in different ways in different kinds of texts?

• 2. How does the sequence of verbs within a text develop with respect to the marking of tense and voice?

Reference

• Noun phrases are a major device for referring to objects, people and other entities.

• A referring noun phrase can be a full noun phrase or a pronoun. The former typically expresses new information, the latter given information.

Types of reference

• Exophoric: also called text-external, referring directly to the speaker and addressee. E.g. you, I.

• Anaphoric: referring to a person or thing that has already been referred to in the text. E.g. it, that.

• Inferrable: referring to something that can be inferred according to common sense, and that is neither exophoric nor anaphoric, such as "the restructuring" and "its debt burden" in: The engineering and consulting firm, which has been plagued by losses for five years, said the restructuring is required to relieve its debt burden and “acute shortage of cash.”

Characteristics of referring expressions

• Four parameters:

• Status of information: given versus new

• For given information, type of reference: anaphoric, exophoric, or inferrable

• For anaphoric reference, form of expression: pronoun, synonym, or repetition

• For anaphoric reference, the distance between the anaphoric expression and its antecedent

Steps

• 1. Grammatically tag all texts.

• 2. Run the interactive program over each text; it stops whenever it reaches a noun or pronoun.

• 3. The program prompts the user to select the correct codes for that noun phrase.
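The three steps above can be sketched as a small loop. This is only a minimal illustration, not the program used in the study: the tag names NOUN and PRON, the function name annotate_interactively, and the ask callback (standing in for the on-screen prompt) are all assumptions.

```python
def annotate_interactively(tagged_tokens, ask):
    """Walk through a tagged text and collect a code for each noun or pronoun.

    tagged_tokens: list of (word, tag) pairs from a grammatical tagger
                   (the tags "NOUN" and "PRON" are assumed names).
    ask:           hypothetical callback that shows the item to the user
                   and returns the code the user selected.
    """
    annotations = []
    for index, (word, tag) in enumerate(tagged_tokens):
        # Step 2: stop only at nouns and pronouns.
        if tag in ("NOUN", "PRON"):
            # Step 3: prompt the user for the correct code.
            code = ask(index, word, tag)
            annotations.append({"index": index, "word": word, "code": code})
    return annotations
```

In a real session ask would wrap input(); passing it in as a parameter keeps the loop easy to test without a keyboard.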

Computer processing

• Information status: pronouns are automatically coded as given information. For each noun, the program automatically checks whether there is an earlier occurrence of the same noun in the text. If there is, the repeated noun is automatically coded as given information. All other full nouns are pre-coded as new information. These nouns are then checked interactively to determine whether they actually represent given information.
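The pre-coding rule for information status might look like the sketch below. It assumes the same (word, tag) input with the hypothetical tags NOUN and PRON, and it treats "the same noun" as a case-insensitive match on the word form, which the slides do not specify.

```python
def precode_information_status(tagged_tokens):
    """Pre-code nouns and pronouns as 'given' or 'new'.

    Returns a list of (index, word, status, needs_check) tuples, where
    needs_check marks items that should be confirmed interactively.
    """
    seen_nouns = set()
    codes = []
    for index, (word, tag) in enumerate(tagged_tokens):
        if tag == "PRON":
            # Pronouns are always coded as given information.
            codes.append((index, word, "given", False))
        elif tag == "NOUN":
            key = word.lower()  # assumption: match on lowercased word form
            if key in seen_nouns:
                # A repeated noun is coded as given information.
                codes.append((index, word, "given", False))
            else:
                # A first occurrence is pre-coded as new, then checked by hand.
                codes.append((index, word, "new", True))
                seen_nouns.add(key)
    return codes
```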

Type of reference

• The pronouns I and you are automatically coded as marking exophoric reference. Third person pronouns are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences. Nouns with given informational status are likewise automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences.
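A minimal sketch of this rule follows. The function name, the tag names, and the returned (type, needs_check) pair are assumptions; only the two pronouns named on the slide are treated as automatically exophoric.

```python
# Only the pronouns named on the slide; other forms are left to the interactive check.
EXOPHORIC_PRONOUNS = {"i", "you"}


def precode_reference_type(word, tag, status):
    """Return (reference_type, needs_check) for one referring expression.

    Only given information receives a reference type, so new information
    yields (None, False).
    """
    if status != "given":
        return (None, False)
    if tag == "PRON" and word.lower() in EXOPHORIC_PRONOUNS:
        return ("exophoric", False)
    # Other pronouns and given nouns default to anaphoric and are then
    # checked by hand for exophoric and inferrable occurrences.
    return ("anaphoric", True)
```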

Forms of anaphoric expression

• If nouns have been coded as anaphoric and an earlier occurrence of the same noun was found in the text, the referring expression is automatically identified as a noun repetition. Other anaphoric nouns are coded as synonymous.
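The form coding can be sketched in the same style. The function name and the earlier_nouns argument (a set of lowercased nouns already seen in the text) are assumptions made for illustration.

```python
def code_anaphoric_form(word, tag, earlier_nouns):
    """Code the form of an anaphoric expression.

    earlier_nouns: set of lowercased nouns that occurred earlier in the text.
    Returns 'pronoun', 'repetition', or 'synonym'.
    """
    if tag == "PRON":
        return "pronoun"
    if word.lower() in earlier_nouns:
        # The same noun occurred earlier: a noun repetition.
        return "repetition"
    # Any other anaphoric noun is coded as a synonym.
    return "synonym"
```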

Distance between the target referring expression and its antecedent

• The antecedent of all anaphoric nouns and pronouns must be identified. For repeated nouns, the antecedent is automatically pre-coded as the earlier occurrence of the same noun; these antecedents are checked interactively to determine if there is a close synonymous expression. For all other nouns and pronouns, the user of the interactive program must type in the antecedent. The distance between the target referring expression and its antecedent can be computed automatically.
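Once the antecedent is known, the distance is a simple count. The sketch below assumes distance means the number of referring expressions that intervene between the antecedent and the target; the slides do not state the unit, so treat this as one plausible reading. The function name and the use of token positions are also assumptions.

```python
def anaphoric_distance(expression_positions, antecedent_pos, target_pos):
    """Count the referring expressions between an antecedent and its anaphor.

    expression_positions: token positions of all referring expressions.
    antecedent_pos, target_pos: token positions of the antecedent and target.
    """
    if antecedent_pos >= target_pos:
        raise ValueError("the antecedent must precede the target expression")
    return sum(
        1 for pos in expression_positions if antecedent_pos < pos < target_pos
    )
```

For example, with referring expressions at positions 0, 2, 5, 7 and 9, an anaphor at position 9 whose antecedent is at position 0 has three intervening expressions.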

Register and Types of Information

• Reference: Conversation and speech use referring expressions relatively frequently, although news has the largest number of referring expressions.

• Given/new information: Conversation and speech rely heavily on given information while news and academic prose have more new information.

Types of Reference

• Exophoric: exophoric pronouns account for over half of all given references in conversation, but this is not the case in written registers.

• Anaphoric: written registers rely heavily on anaphoric reference.

• The high proportion of expressions marking new information accounts for the reliance on anaphoric reference in written registers.

Average distance measures for four registers

• Conversation 4.5

• Public speeches 5.5

• News reportage 11.0

• Academic prose 9.0

This makes sense given the difference in the production and comprehension circumstances of written and spoken registers.

Conversation and speeches must be produced and comprehended on-line. Co-references with short anaphoric distance are easier to understand.

Conversation also makes frequent use of exophoric pronouns referring to the speaker or listener.

Average distance measures for pronominal versus full noun anaphoric expressions

Register           Average pronominal distance   Average full noun distance
Conversation       3.0                           9.0
Public speeches    3.5                           10.0
News reportage     3.0                           13.5
Academic prose     2.5                           10.0

Average distance measures for pronominal versus full noun anaphoric expressions

• Pronouns tend to occur much closer to their antecedent than repeated full nouns.

• The greater the number of intervening referring expressions, the greater the chance for ambiguity and confusion over the intended reference of pronominal forms. Thus full noun expressions are preferred for anaphoric reference over large distances.

Discourse maps of verb tense and voice

• There are shifts in communicative purpose within the course of a text.

• Example: research articles follow a standard four-part organization: Introduction, Methods, Results, Discussion (I-M-R-D).

Steps of analysis of 19 medical research articles

• Step 1: frequency counts of present tense, past tense and agentless passives across the IMRD sections.

• Step 2: calculate the average frequency counts for each type of section.

• Step 3: Compute ANOVA and correlation coefficients for each linguistic feature. The significance level is set at 0.001.
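The arithmetic behind these steps can be sketched in plain Python. This is an illustration under stated assumptions, not the study's procedure: the function names are invented, the toy numbers are made up, and r2 is assumed to be the ratio of between-group to total sum of squares. The p-value is not computed here because it requires the F distribution.

```python
def per_thousand(raw_count, total_words):
    """Norm a raw frequency count to a rate per 1,000 words."""
    return raw_count / total_words * 1000


def one_way_anova(groups):
    """Return (F, r_squared) for a list of groups of normed counts."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of individual texts around their own group mean.
    ss_within = sum(
        (v - sum(group) / len(group)) ** 2
        for group in groups
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return f_value, r_squared


# Made-up normed counts for three hypothetical section types (not the study's data).
toy_sections = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, r_squared = one_way_anova(toy_sections)
```

Norming to a rate per 1,000 words before averaging keeps long and short sections comparable, which is why Step 1 counts are normed before Step 2 averages them.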

Mean scores (per 1,000 words) of selected linguistic features across the I-M-R-D sections of English medical research articles (N=19)

Linguistic feature    I      M      R      D      F       p       r2
Present tense         47.9   21.1   35.9   60.6   29.25   <.001   .549
Past tense            20.7   48.5   40.3   13.0   36.74   <.001   .605
Agentless passives    18.4   39.9   16.9   16.3   33.17   <.001   .580

p<.001: H0 rejected. The difference between groups is significantly larger than the difference within groups.

r2=.549: 54.9% of the variation in the normed counts for present tense can be accounted for by knowing the category (here, the I-M-R-D section) of each text. The differences across sections in the use of present tense verbs are therefore important in addition to being statistically significant.
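As a consistency check on the reported values, F and r2 can be related directly. This assumes r2 is the between-group share of the total sum of squares, and that each of the 19 articles contributes four section texts (76 texts, giving 3 and 72 degrees of freedom); the slides state neither assumption explicitly.

```latex
\[
r^2 = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}}
    = \frac{F \cdot df_{\text{between}}}{F \cdot df_{\text{between}} + df_{\text{within}}}
\]
\[
\text{Present tense: } \frac{29.25 \times 3}{29.25 \times 3 + 72}
    = \frac{87.75}{159.75} \approx 0.549
\]
\[
\text{Past tense: } \frac{36.74 \times 3}{36.74 \times 3 + 72}
    = \frac{110.22}{182.22} \approx 0.605
\]
\[
\text{Agentless passives: } \frac{33.17 \times 3}{33.17 \times 3 + 72}
    = \frac{99.51}{171.51} \approx 0.580
\]
```

Under those assumptions all three reported r2 values follow from the reported F values.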

Findings

• Present tense occurs most frequently in discussion sections, and somewhat less frequently in introductions. Both sections tend to emphasize the current state of our knowledge and the present implications of research findings.

• Past tense appears more in methods and results sections, reflecting a focus on reporting past events and procedures.

• Agentless passives have a high frequency in methods sections, where they present events impersonally.