The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech
of Brain-Damaged Patients and Healthy Individuals
Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c),
Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a)
(a) The Centre for Speech, Language and the Brain, University of Cambridge
(b) The MRC Cognition and Brain Sciences Unit, Cambridge
(c) Computation, Cognition and Language Group, RCEAL, University of Cambridge
Acknowledgments
• This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G0500842).
Outline of talk
• Motivation for Corpus
• Data collection
• Transcription Guidelines
Motivation
• To look at differences between speech populations: young versus old, and healthy individuals versus brain-damaged patients
• The brain-damaged patients have mainly left-lateralised damage (known speech processing areas)
• Desire to characterise speech output in these populations.
• This characterisation has not been done before with respect to language generation
Description of corpus
• The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description
• Brief statistics: 232 healthy individuals, 110 patients, ≈ 23 hours of speech, ≈15000 ‘sentences’
• Spontaneous speech task: 10-minute semi-prompted monologue
The ‘cookie-theft’ picture
From the Boston Diagnostic Aphasia Examination - Goodglass & Kaplan, 1983
Participants
• Healthy individuals
– volunteers who are part of a wider panel recruited for other behavioural and neuro-imaging studies
• Patients
– aetiology is varied, but damage is mainly left-lateralised
– patients were selected from a number of sources
• Neuro-imaging scans are available for a third, and the number is growing
The recordings
• For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio.
• For patients: recordings were sometimes made at their home, normally with a family member present
Transcription
• Producing a machine-parseable transcription
– XML based
– retaining prosodic information as far as possible
– paying special attention to speech phenomena (repetitions, hesitations, false starts)
• Comparable corpora and existing guidelines
• DTD validated XML
Meta & participant data
Interview transcription
Outline of the transcription schema
• Meta-data
– Gender
– Age
– Aetiology
– Type of damage
– Broad location of damage
– Date of recording
– Who was in the room
• Structural units
– Utterance: “And I’ve been in my van uhuh but i’ve been out all day”
– Segment: “(The kiddies are taking biscuits)(now one of them is falling off)”
– Sub-segment: “(erm)(mum)(washing up)”
• Representing the nature of speech
– Rep tag: “it is <rep no=1 >is</rep> <rep no=2 >is</rep> falling over”
– ‘…’ incompleteness: “oh dear the sink is ... and oh my the children”
– Unclear tag etc.: “and <unclear reason= ambiguous>taps</unclear> running”
• Suprasegmental features
– Shifts: laughing, language change etc.
• Phonological information
– phonological information: “The sink is <tr target=‘flooding’>blAdin</tr>”
– IPA transcriptions
• Anonymisation
– All personal names/places replaced with reference markers
• Misc
– Kinetic
– Vocal
– Incident etc.
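Illustration: a minimal sketch of how the inline tags from the schema might be consumed by a script. The tag and attribute names (rep/no, unclear/reason, tr/target) are taken from the examples in the schema outline; the wrapping <utterance> element, the quoted attribute values and the summarise function are assumptions made for this illustration, not the corpus's actual format or tooling. The sketch only handles tags that sit directly inside the utterance.

```python
import xml.etree.ElementTree as ET

# Hypothetical utterance: the <utterance> wrapper and the quoted
# attribute values are assumptions; only the inline tag names and
# attribute names come from the schema examples.
SAMPLE = (
    '<utterance>it is <rep no="1">is</rep> <rep no="2">is</rep> falling over '
    'and <unclear reason="ambiguous">taps</unclear> running '
    'the sink is <tr target="flooding">blAdin</tr></utterance>'
)


def summarise(xml_text):
    """Count speech phenomena and rebuild a cleaned-up word string."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "rep":
            pass  # drop the repeated word from the cleaned text
        elif child.tag == "tr":
            # use the intended target word instead of the produced form
            parts.append(child.get("target", child.text or ""))
        else:
            parts.append(child.text or "")  # e.g. <unclear>: keep best guess
        parts.append(child.tail or "")  # text following the tag
    return {
        "repetitions": len(root.findall(".//rep")),
        "unclear": len(root.findall(".//unclear")),
        "cleaned": " ".join("".join(parts).split()),
    }


if __name__ == "__main__":
    print(summarise(SAMPLE))
```

Keeping repetitions and produced forms as markup, rather than deleting them at transcription time, means the same file can serve both a count of disfluencies and a cleaned word string.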
The next phase
• On the corpus
– Addressing the gap in ages (between 25 and 63 years) for healthy individuals on the cookie-theft task
– Addressing the shortfall within each aetiology
• Work derived from the corpus
– Identifying ages based on the cookie-theft description
– Identifying damage based on the tasks
– Speech production issues more generally
References
• Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.
Thank you
• Any questions?