Investigating speech, thought and writing presentation in a corpus of spoken British English An AHRB funded project under the supervision of Mick Short, Elena Semino and Tony McEnery Research Assistants John Heywood and Dan McIntyre

Investigating speech, thought and writing presentation in a corpus of spoken British English

An AHRB funded project under the supervision of

Mick Short, Elena Semino and Tony McEnery

Research Assistants

John Heywood and Dan McIntyre

Project outline

To compare speech, thought and writing presentation in spoken and written English.

To build a new corpus of 260,000 words of spoken British English to compare with the ST&WP Written English Corpus (1995-99).

To investigate the presentation of speech, thought and writing in the ST&WP Spoken Corpus by tagging with the Leech and Short (1981) category set.

To further test and adapt the Leech and Short (1981) model of S&TP.

The project is funded until February 2003.

Construction of the corpus

120 texts - approximately 260,000 words.

Texts rich in ST&WP taken from the British National Corpus (BNC) and the Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University.

CNWRS interview tapes digitised to be time-aligned with text.

Number and distribution of NWRS files in the corpus

[Tree diagram: the NWRS Archive divides into the Family and Social Life Archive and the Childhood and Schooling Archive; beneath these are Male and Female speaker groups and the periods 1890-1940 and 1940-1970. Record counts at the lowest level, as extracted: 7, 7, 8, 8, 15, 15.]

i.e. 60 files with an equal balance of male and female speakers in each age-range

Number and distribution of BNC files in the corpus

BNC spoken data: Spoken Demographic, Spoken Context-Governed

AGE RANGE   MALE      FEMALE
0-14        5 files   5 files
15-24       5 files   5 files
25-34       5 files   5 files
35-44       5 files   5 files
45-59       5 files   5 files
60+         5 files   5 files

i.e. 60 files with an equal balance of male and female speakers in each age-range
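The balance described above can be checked with a few lines of Python. This is only an illustrative sketch: the names `SEXES`, `AGE_BANDS`, `FILES_PER_CELL` and `bnc_sampling_frame` are invented here and are not part of the project's own tooling.

```python
from itertools import product

# Cell labels and the per-cell file count are taken from the slide.
SEXES = ("Male", "Female")
AGE_BANDS = ("0-14", "15-24", "25-34", "35-44", "45-59", "60+")
FILES_PER_CELL = 5


def bnc_sampling_frame():
    """Return a mapping from (sex, age band) to the number of files sampled."""
    return {(sex, band): FILES_PER_CELL for sex, band in product(SEXES, AGE_BANDS)}


frame = bnc_sampling_frame()
total = sum(frame.values())
print(len(frame), "cells,", total, "files")
```

Two sexes by six age bands gives 12 cells, and 12 cells of 5 files each gives the 60 files stated on the slide.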

The development of the tag-set

Leech & Short (1981)

NRA NRSA NRS/IS FIS NRS/DS FDS

NRTA NRT/IT FIT NRT/DT FDT

The ST&WP Written Project (1995…)

3 main genres: Fiction, Biography & Autobiography, and Newspaper Journalism, each divided into Serious/Popular sections.

N NV NRSA-P NRS/IS FIS NRS/DS FDS

N NI NRTA-P NRT/IT FIT NRT/DT FDT

N NW NRWA-P NRW/IW FIW NRW/DW FDW

embedded, hypothetical, inferred, quote

The development of the tag-set – new tags

RM

A RV RSA-P RS/IS FIS RS/DS FDS

A RI RTA-P RT/IT FIT RT/DT FDT

A RN RWA-P RW/IW FIW RW/DW FDW

The ST&WP Spoken Project (2001)

BNC spoken demographic data and NWRS oral history interviews

embedded, negative / absence, hypothetical, inferred, quote, reiterated, interrogative, imperative, uncompleted, 2 / 3 / 4

A 15-field tag-set: 5 main categories

FIELD   CHARACTER                 ‘VALUE’
1       x, A, F                   Anything!, Free
2       x, #, R, I, D             Representation, Indirect, Direct
3       x, S, T, W, V, I, N, M    Speech, Thought, Writing, Voice, Internal state, WritiNg, Mention
4       x, A                      Act
5       x, P                      toPic
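To make the positional logic of the five main fields concrete, here is a minimal Python sketch of a decoder. It rests on assumptions not stated on the slide: that a tag's main part is written as a fixed-width five-character string, and that 'x' marks an unused field (so that, for example, FIS would be written "FISxx"). The names `MAIN_FIELDS` and `decode_main` are hypothetical, and the slide gives no value for '#' in field 2, so it is passed through unexplained.

```python
# Character-to-value tables for fields 1-5, copied from the slide.
# Assumption: 'x' means "field not used" and is skipped when decoding.
MAIN_FIELDS = [
    {"A": "Anything", "F": "Free"},
    {"#": "#", "R": "Representation", "I": "Indirect", "D": "Direct"},
    {"S": "Speech", "T": "Thought", "W": "Writing", "V": "Voice",
     "I": "Internal state", "N": "WritiNg", "M": "Mention"},
    {"A": "Act"},
    {"P": "toPic"},
]


def decode_main(tag):
    """Decode a five-character main-category string into its field values."""
    if len(tag) != len(MAIN_FIELDS):
        raise ValueError(f"expected {len(MAIN_FIELDS)} characters, got {len(tag)}")
    values = []
    for field, (char, table) in enumerate(zip(tag, MAIN_FIELDS), start=1):
        if char == "x":  # unused field
            continue
        if char not in table:
            raise ValueError(f"character {char!r} is not valid in field {field}")
        values.append(table[char])
    return values


print(decode_main("FISxx"))
print(decode_main("xRSAP"))
```

Because each character is looked up in the table for its own position, the same letter can carry different values in different fields: 'I' is Indirect in field 2 but Internal state in field 3, and 'A' is Anything in field 1 but Act in field 4.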

A 15-field tag-set: 10 category attributes

FIELD   CHARACTER          ‘VALUE’
6       x, #, 1, 2, 3, 4   # = odd, interesting, borderline cases; numbers = repeated (-ing or -ed) adjacent categories
7       xe                 embedded
8       xxg/a              negative action etc., e.g. 'we weren't allowed to go'; absence, e.g. 'I didn't say anything'
9       xxxh               hypothetical
10      xxxxi              inferred
11      xxxxxq             quote
12      xxxxxxr            iterative
13      xxxxxxxv/p         interrogative, imperative
14      xxxxxxxxu          uncompleted
15      xexxxxxxx2         level of embedding (2, 3, 4)
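The padded character strings in the table suggest that the ten attributes occupy fixed positions, with 'x' filling unused slots. The Python sketch below decodes such a ten-character string for fields 6 to 15. It is an illustration under assumptions: the names `ATTRIBUTE_FIELDS` and `decode_attributes` are invented, and the pairings g = negative, a = absence, v = interrogative, p = imperative are inferred from the order in which the slide lists them.

```python
# Character-to-value tables for fields 6-15, following the slide.
# Assumption: 'x' means "attribute not set" and is skipped when decoding.
ATTRIBUTE_FIELDS = {
    6: {"#": "odd/interesting/borderline case",
        "1": "repeated adjacent category (1)", "2": "repeated adjacent category (2)",
        "3": "repeated adjacent category (3)", "4": "repeated adjacent category (4)"},
    7: {"e": "embedded"},
    8: {"g": "negative", "a": "absence"},                # pairing assumed from slide order
    9: {"h": "hypothetical"},
    10: {"i": "inferred"},
    11: {"q": "quote"},
    12: {"r": "iterative"},
    13: {"v": "interrogative", "p": "imperative"},       # pairing assumed from slide order
    14: {"u": "uncompleted"},
    15: {"2": "embedding level 2", "3": "embedding level 3", "4": "embedding level 4"},
}


def decode_attributes(attrs):
    """Decode a ten-character attribute string (fields 6-15) into its values."""
    if len(attrs) != len(ATTRIBUTE_FIELDS):
        raise ValueError(f"expected {len(ATTRIBUTE_FIELDS)} characters, got {len(attrs)}")
    values = []
    for field, char in enumerate(attrs, start=6):
        if char == "x":  # attribute not set
            continue
        if char not in ATTRIBUTE_FIELDS[field]:
            raise ValueError(f"character {char!r} is not valid in field {field}")
        values.append(ATTRIBUTE_FIELDS[field][char])
    return values


# The field-15 example from the slide: embedded, at embedding level 2.
print(decode_attributes("xexxxxxxx2"))
```

Shorter strings from the table, such as "xxxh", would need padding with 'x' to the full ten characters before being passed to this function.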

Issues arising

Technical issues: Legibility. Comparability between NWRS and BNC data.

Tagging issues: Comparability between written and spoken corpora. What counts as ST&WP? Functional and formal criteria. Embedding. Repetition (e.g. he said he said well he said). Report of ‘mention’. Reading, hearing, listening and singing dogs!