Investigating speech, thought and writing presentation in a corpus of spoken British English
An AHRB-funded project under the supervision of
Mick Short, Elena Semino and Tony McEnery
Research Assistants
John Heywood and Dan McIntyre
Project outline
To compare speech, thought and writing presentation in spoken and written English.
To build a new corpus of 260,000 words of spoken British English to compare with the ST&WP Written English Corpus (1995-99).
To investigate the presentation of speech, thought and writing in the ST&WP Spoken Corpus by tagging with the Leech and Short (1981) category set.
To further test and adapt the Leech and Short (1981) model of S&TP.
The project is funded until February 2003.
Construction of the corpus
120 texts, approximately 260,000 words.
Texts rich in ST&WP taken from the British National Corpus (BNC) and the Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University.
CNWRS interview tapes digitised to be time-aligned with text.
Number and distribution of NWRS files in the corpus
NWRS Archive
Family and Social Life Archive Childhood and Schooling Archive
Male Female Male Female
1890-1940 1940-1970 1890-1940 1940-1970
7 records 7 records 8 records 8 records 15 records 15 records
i.e. 60 files with an equal balance of male and female speakers in each age-range
Number and distribution of BNC files in the corpus
BNC spoken data
Spoken Demographic    Spoken Context-Governed
Age      0-14     15-24    25-34    35-44    45-59    60+
Male     5 files  5 files  5 files  5 files  5 files  5 files
Female   5 files  5 files  5 files  5 files  5 files  5 files
i.e. 60 files with an equal balance of male and female speakers in each age-range
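The file counts on the two distribution slides can be checked against the 120 texts given for the corpus as a whole. The Python sketch below is illustrative only: the variable names are hypothetical, and the six NWRS record counts are taken in the order they are listed, without assuming which archive, sex or period each belongs to.

```python
# NWRS oral history record counts, in the order the slide lists them.
nwrs_records = [7, 7, 8, 8, 15, 15]

# BNC files: 5 per sex per age band, as on the slide.
age_bands = ["0-14", "15-24", "25-34", "35-44", "45-59", "60+"]
bnc_files = {(sex, band): 5 for sex in ("Male", "Female") for band in age_bands}

nwrs_total = sum(nwrs_records)
bnc_total = sum(bnc_files.values())
corpus_total = nwrs_total + bnc_total

# Rough mean text length, using the slide's figure of about 260,000 words.
mean_words_per_text = 260_000 / corpus_total

print(nwrs_total, bnc_total, corpus_total)
print(round(mean_words_per_text))
```

Both halves of the corpus come to the same number of files, and their sum matches the number of texts stated under "Construction of the corpus".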
The development of the tag-set
Leech & Short (1981)
NRA NRSA NRS/IS FIS NRS/DS FDS
NRTA NRT/IT FIT NRT/DT FDT
The ST&WP Written Project (1995…)
3 main genres: Fiction, Biography & Autobiography, and Newspaper Journalism, each divided into Serious/Popular sections.
N NV NRSA-P NRS/IS FIS NRS/DS FDS
N NI NRTA-P NRT/IT FIT NRT/DT FDT
N NW NRWA-P NRW/IW FIW NRW/DW FDW
embedded, hypothetical, inferred, quote
The development of the tag-set – new tags
The ST&WP Spoken Project (2001)
BNC spoken demographic data and NWRS oral history interviews
RM
A RV RSA-P RS/IS FIS RS/DS FDS
A RI RTA-P RT/IT FIT RT/DT FDT
A RN RWA-P RW/IW FIW RW/DW FDW
embedded, negative / absence, hypothetical, inferred, quote, reiterated, interrogative, imperative, uncompleted, 2 / 3 / 4
A 15-field tag-set: 5 main categories
FIELD   CHARACTER                ‘VALUE’
1       x, A, F                  Anything!, Free
2       x, #, R, I, D            Representation, Indirect, Direct
3       x, S, T, W, V, I, N, M   Speech, Thought, Writing, Voice, Internal state, WritiNg, Mention
4       x, A                     Act
5       x, P                     toPic
A 15-field tag-set: 10 category attributes
FIELD   CHARACTER          ‘VALUE’
6       x, #, 1, 2, 3, 4   # = odd, interesting, borderline cases; numbers = repeated (-ing or -ed) adjacent categories
7       xe                 embedded
8       xxg/a              negative action etc., e.g. 'we weren't allowed to go'; absence, e.g. 'I didn't say anything'
9       xxxh               hypothetical
10      xxxxi              inferred
11      xxxxxq             quote
12      xxxxxxr            iterative
13      xxxxxxxv/p         interrogative, imperative
14      xxxxxxxxu          uncompleted
15      xexxxxxxx2         level of embedding (2, 3, 4)
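Read as a fixed-width string, the tag-set lends itself to positional decoding. The Python sketch below is illustrative only and is not the project's own software. It assumes a tag is written as exactly 15 characters, one per field, with 'x' marking an unused field, which is how the CHARACTER columns above appear to work. The function name decode_tag, the wording of the value labels, and the pairing of v with interrogative and p with imperative (taken from the order in which they are listed) are assumptions. The '#' in field 2 is left unglossed because the slides do not explain it.

```python
# Hypothetical lookup tables, one dict per field, in field order 1-15.
# 'x' (field not used) is handled separately in decode_tag.
FIELDS = [
    {"A": "Anything", "F": "Free"},                                   # 1
    {"#": "# (not glossed in the slides)", "R": "Representation",
     "I": "Indirect", "D": "Direct"},                                 # 2
    {"S": "Speech", "T": "Thought", "W": "Writing", "V": "Voice",
     "I": "Internal state", "N": "WritiNg", "M": "Mention"},          # 3
    {"A": "Act"},                                                     # 4
    {"P": "toPic"},                                                   # 5
    {"#": "borderline case",
     **{n: f"repeated adjacent category {n}" for n in "1234"}},       # 6
    {"e": "embedded"},                                                # 7
    {"g": "negative", "a": "absence"},                                # 8
    {"h": "hypothetical"},                                            # 9
    {"i": "inferred"},                                                # 10
    {"q": "quote"},                                                   # 11
    {"r": "iterative"},                                               # 12
    {"v": "interrogative", "p": "imperative"},                        # 13 (pairing assumed)
    {"u": "uncompleted"},                                             # 14
    {n: f"embedding level {n}" for n in "234"},                       # 15
]


def decode_tag(tag):
    """Return (field number, value) pairs for every field that is set."""
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    decoded = []
    for number, (char, legal) in enumerate(zip(tag, FIELDS), start=1):
        if char == "x":  # assumed placeholder for an unused field
            continue
        if char not in legal:
            raise ValueError(f"field {number}: unexpected character {char!r}")
        decoded.append((number, legal[char]))
    return decoded


# Free Indirect Speech with no attributes, under the assumptions above.
print(decode_tag("FISxx" + "x" * 10))
# Direct Speech that is embedded (level 2) and hypothetical.
print(decode_tag("xDSxxxexhxxxxx2"))
```

Decoding by position rather than by character is what lets the same letter carry different values in different fields, for example 'I' as Indirect in field 2 and Internal state in field 3.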
Issues arising
Technical issues: Legibility. Comparability between NWRS and BNC data.
Tagging issues: Comparability between written and spoken corpora. What counts as ST&WP? Functional and formal criteria. Embedding. Repetition (e.g. he said he said well he said). Report of ‘mention’. Reading, hearing, listening and singing dogs!