
The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals

Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c)

Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a)

(a) The Centre for Speech, Language and the Brain, University of Cambridge

(b) The MRC Cognition and Brain Sciences Unit, Cambridge

(c) Computation, Cognition and Language Group, RCEAL, University of Cambridge

Acknowledgments

• This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G0500842).

Outline of talk

• Motivation for Corpus

• Data collection

• Transcription Guidelines

Motivation

• To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients

• The brain-damaged patients have mainly left-lateral damage (known speech processing areas)

• Desire to characterise speech output in these populations.

• This characterisation has not been done before with respect to language generation

Description of corpus

• The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description

• Brief statistics: 232 healthy individuals, 110 patients, ≈ 23 hours of speech, ≈ 15,000 ‘sentences’

• Spontaneous speech task: 10-minute semi-prompted monologue

The ‘cookie-theft’ picture

From the Boston Diagnostic Aphasia Examination - Goodglass & Kaplan, 1983

Participants

• Healthy individuals

– volunteers from a wider panel recruited for other behavioural and neuro-imaging studies

• Patients

– aetiology is varied, but damage is mainly left-lateralised

– patients were selected from a number of sources

• Neuro-imaging scans are available for a third, and this number is growing


The recordings

• For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio.

• For patients: recordings were sometimes made at their home, normally with a family member present

Transcription

• Producing a machine-parseable transcription

– XML based

– Retaining prosodic information as far as possible

– Paying special attention to speech phenomena (repetitions, hesitations, false starts)

• Comparable corpora and existing guidelines

• DTD validated XML

Meta & participant data

Interview transcription

Outline of the transcription schema

• Meta-data

– Gender

– Age

– Aetiology

– Type of damage

– Broad location of damage

– Date of recording

– Who was in the room

• Structural units

– Utterance: “And I’ve been in my van uhuh but i’ve been out all day”

– Segment: “(The kiddies are taking biscuits)(now one of them is falling off)”

– Sub-segment: “(erm)(mum)(washing up)”
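The parenthesised grouping in the segment and sub-segment examples can be pulled apart mechanically. The following is a minimal sketch, assuming the parentheses are only the slide's shorthand for segment boundaries; the function name `split_segments` is hypothetical and the corpus itself may mark these units with XML elements instead.

```python
import re

def split_segments(utterance: str) -> list[str]:
    """Return the text of each parenthesised unit, in order of appearance."""
    # Match the contents of each non-nested pair of parentheses.
    return [s.strip() for s in re.findall(r"\(([^()]*)\)", utterance)]

segments = split_segments(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
sub_segments = split_segments("(erm)(mum)(washing up)")
print(len(segments), len(sub_segments))
```

Counting the units per utterance this way gives a simple per-speaker measure that could be compared across the healthy and patient groups.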

• Representing the nature of speech

– Rep tag: “it is <rep no="1">is</rep> <rep no="2">is</rep> falling over”

– ‘…’ incompleteness: “oh dear the sink is ... and oh my the children”

– Unclear tag etc.: “and <unclear reason="ambiguous">taps</unclear> running”
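Because repetitions are tagged rather than deleted, both the disfluency count and a "fluent" reading of the utterance can be recovered from the same transcription. The sketch below uses Python's standard `xml.etree.ElementTree`. The wrapping `<u>` element and the function names are assumptions made for illustration, and the attribute values are quoted so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

def _text_without_reps(elem: ET.Element) -> str:
    """Concatenate the text under elem, skipping the content of <rep> elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "rep":
            parts.append(_text_without_reps(child))
        # The tail is the text that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)

def fluent_text(utterance_xml: str) -> tuple[str, int]:
    """Return (utterance with repetitions removed, number of <rep> tags)."""
    root = ET.fromstring(utterance_xml)
    n_reps = len(root.findall(".//rep"))
    # Collapse the whitespace left behind by the removed repetitions.
    text = " ".join(_text_without_reps(root).split())
    return text, n_reps

# <u> is a hypothetical wrapper element, not taken from the corpus schema.
text, n_reps = fluent_text(
    '<u>it is <rep no="1">is</rep> <rep no="2">is</rep> falling over</u>'
)
print(text, n_reps)
```

Text inside other tags, such as `<unclear>`, is kept in the fluent reading; only `<rep>` content is dropped.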

• Suprasegmental features

– Shifts: laughing, language change etc.

• Phonological information

– Phonological information: “The sink is <tr target="flooding">blAdin</tr>”

– IPA transcriptions
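The `tr` tag pairs what the speaker actually produced with the word they were aiming for, so the two forms can be listed side by side for analysis. A minimal sketch with the standard library follows. The `<u>` wrapper and the function name `transcribed_forms` are hypothetical, and the `target` attribute is written with straight quotes so that the fragment parses.

```python
import xml.etree.ElementTree as ET

def transcribed_forms(utterance_xml: str) -> list[tuple[str, str]]:
    """Return (target word, produced form) for every <tr> element."""
    root = ET.fromstring(utterance_xml)
    return [
        (tr.get("target", ""), "".join(tr.itertext()).strip())
        for tr in root.iter("tr")
    ]

# <u> is a hypothetical wrapper element, not taken from the corpus schema.
pairs = transcribed_forms(
    '<u>The sink is <tr target="flooding">blAdin</tr></u>'
)
print(pairs)
```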

• Anonymisation

– All personal names/places replaced with reference markers

• Misc

– Kinetic

– Vocal

– Incident etc.

The next phase

• On the corpus

– Addressing the gap in ages (between 25 and 63 years) for healthy individuals with the cookie-theft task

– Addressing the shortfall within each aetiology

• Work derived from the corpus

– Identifying ages based on the cookie-theft description

– Identifying damage based on the tasks

– Speech production issues more generally

References

• Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.

Thank you

• Any questions?
