Multi-modal corpus design, construction and use
David Evans, Dawn Knight, Ronald Carter and Svenja Adolphs
BAAL 2007, 6-8th September 2007, The University of Edinburgh
Introducing the Digital Record Project:
• 3-year research initiative, funded by the Economic and Social Research Council (ESRC)
• Part of an e-Social Science ‘Node’ based at The University of Nottingham
• Interdisciplinary project, involving staff from Psychology, Applied Linguistics and Computer Science
• Develop a multi-modal corpus of spoken interaction: the Nottingham Multi-Modal Corpus (NMMC)
The Nottingham Multi-Modal Corpus (NMMC):
Corpus data:
• 250,000 words
• 125,000 words of 1-party data
  125,000 words of 2-party data
• Data in three different modes: textual, audio and video
Corpus tool-bench:
• Develop a reusable corpus tool (with appropriate linguistic software)
• Search lexical, prosodic and gestural features of spoken discourse
Key Methodological Issues:
1) Data collection and collation:
Capturing, transcribing and aligning, and adding gesture to transcription
2) Tracking, defining and coding gestures of interest:
Using specifically developed software to track and automatically encode gestures according to a pre-defined kinesic coding scheme
3) Representing the data in an easy-to-use interface for further analysis:
Constructing an intelligent corpus database and associated software (including a text/gesture concordancer)
1a) Capturing data
Naturalistic data v. Usable video image
1b) Transcribing and aligning data
• All data is transcribed using CANCODE transcription conventions.
• Data is also time-stamped using Transana, linking the textual and audio streams:
¤<139851><$1> But if it's if it's utterly irrelevant then you're alright.
¤<143459><$2> Right.
¤<143793><$1> Do you see what I mean cos cos you're not there's no interfering factor then.
¤<147602><$2> Yeah so s=
¤<148138><$1> Erm so that sounds like it's okay.
¤<150144><$2> Okay.
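The time-stamped excerpt above has a regular shape: a `¤<…>` time code, a `<$n>` speaker tag, then the utterance text. As a minimal sketch of how such lines could be read into records (this is not the project's own tooling), the Python below splits the stream on time codes. The names `Utterance` and `parse_transcript` are hypothetical, and the time codes are assumed to be millisecond offsets into the recording.

```python
import re
from typing import List, NamedTuple


class Utterance(NamedTuple):
    time_ms: int   # assumed: offset into the recording, in milliseconds
    speaker: str   # "1" for <$1>, "2" for <$2>, ...
    text: str


# A time code, a speaker tag, then text up to the next time code or end of input.
_UTTERANCE = re.compile(r"¤<(\d+)>\s*<\$(\d+)>\s*(.*?)(?=¤<\d+>|\Z)", re.DOTALL)


def parse_transcript(raw: str) -> List[Utterance]:
    """Split a time-stamped transcript into utterances.

    Line breaks are not relied on: several utterances may share a line,
    and one utterance may wrap across lines.
    """
    return [
        Utterance(int(time_ms), speaker, " ".join(text.split()))
        for time_ms, speaker, text in _UTTERANCE.findall(raw)
    ]


SAMPLE = (
    "¤<139851><$1> But if it's if it's utterly irrelevant then you're alright.\n"
    "¤<143459><$2> Right.¤<143793><$1> Do you see what I mean cos cos\n"
    "you're not there's no interfering factor then.\n"
    "¤<147602><$2> Yeah so s=¤<148138><$1> Erm so that sounds like it's okay."
    "¤<150144><$2> Okay."
)

utterances = parse_transcript(SAMPLE)
for u in utterances:
    print(u.time_ms, u.speaker, u.text)
```

Keeping the time code on each record is what later allows text to be linked back to the audio and video streams.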
• Do you see what I mean cos cos you're not there's [no interfering factor] then.
onset stroke retraction
1c) Adding gesture to transcription
2a) Defining gestures of interest for codification
Figure 2: Division of the gesture space for transcription purposes. (From McNeill, 1992: 378)
2b) Coding gestures of interest
Figure 3: Computer image tracking applied to video
We have developed a 4-point coding scheme for hand movement:
1) Left hand moves to the left
2) Left hand moves to the right
3) Right hand moves to the left
4) Right hand moves to the right
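The four codes above can be illustrated with a small sketch that turns a series of tracked horizontal hand positions into code numbers. This is not the project's tracker: the function `code_movements`, the `min_shift` threshold and the sample positions are all hypothetical. It also assumes that x increases towards the right of the video image; the slides do not say whether "left" and "right" are relative to the image or to the speaker.

```python
from typing import Dict, List, Sequence, Tuple

# (hand, direction) -> code number, following the 4-point scheme above.
CODES: Dict[Tuple[str, str], int] = {
    ("left", "left"): 1,
    ("left", "right"): 2,
    ("right", "left"): 3,
    ("right", "right"): 4,
}


def code_movements(
    xs: Sequence[float], hand: str, min_shift: float = 5.0
) -> List[Tuple[int, int]]:
    """Return (frame_index, code) for each frame-to-frame shift of at least min_shift.

    xs holds one horizontal position per video frame (assumed: pixels,
    increasing to the right). Smaller shifts are treated as jitter and skipped.
    """
    events = []
    for i in range(1, len(xs)):
        dx = xs[i] - xs[i - 1]
        if abs(dx) < min_shift:
            continue
        direction = "right" if dx > 0 else "left"
        events.append((i, CODES[(hand, direction)]))
    return events


# Hypothetical tracked positions, invented for illustration only.
left_hand_x = [100, 100, 90, 80, 80, 95]
right_hand_x = [200, 210, 210, 195]

left_events = code_movements(left_hand_x, "left")
right_events = code_movements(right_hand_x, "right")
print(left_events)
print(right_events)
```

A threshold of this kind is one simple way to separate deliberate movement from tracking noise; the value would need tuning against real footage.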
2c) Turning raw data to corpus data
Figure 4: An Excel output generated by the tracker
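One way to picture the step from raw tracker output to corpus data is to attach each coded gesture to the utterance in progress at that moment. The sketch below assumes the tracker output can be reduced to (time in milliseconds, gesture code) pairs; the actual column layout of the Excel output is not given here, so that shape, the function `attach_gestures` and the gesture values are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

UtteranceRow = Tuple[int, str, str]   # (start_ms, speaker, text)
GestureRow = Tuple[int, int]          # (time_ms, gesture code 1-4), assumed shape


def attach_gestures(
    utterances: List[UtteranceRow], gestures: List[GestureRow]
) -> List[Dict]:
    """Attach each gesture to the latest utterance that started at or before it.

    utterances must be sorted by start time. Gestures occurring before the
    first utterance are left unattached.
    """
    starts = [start for start, _, _ in utterances]
    aligned = [
        {"start_ms": start, "speaker": speaker, "text": text, "gestures": []}
        for start, speaker, text in utterances
    ]
    for time_ms, code in gestures:
        idx = bisect_right(starts, time_ms) - 1
        if idx >= 0:
            aligned[idx]["gestures"].append((time_ms, code))
    return aligned


utterance_rows = [
    (143793, "1", "Do you see what I mean cos cos you're not there's no interfering factor then."),
    (147602, "2", "Yeah so s="),
]
# Gesture times and codes below are invented for illustration only.
gesture_rows = [(143000, 4), (145200, 3), (146000, 4), (147900, 1)]

aligned = attach_gestures(utterance_rows, gesture_rows)
for row in aligned:
    print(row["start_ms"], row["speaker"], row["gestures"])
```

Because only utterance start times are available in the transcript, a gesture is assigned to whichever utterance began most recently; overlapping speech would need a richer model.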
3a) Requirements for MM corpus representation
Legend: 2nd Generation; 3rd Generation
Figure 5: Defining ‘3rd Generation’ corpus software
3b) Current shortcomings of corpus software
• Current tools tend to focus either on the management of data or upon the processes of coding and annotating previously collected data (examples include Transana, Anvil, NITE XML Workbench, ELAN)
• There does not appear to be a tool available to support the integration of these individual processes, supporting the research process from:
The ‘Record’ Phase
Organising Records
Analysing Data
Defining and Coding Data
3c) Introducing DRS: Basic user information
• The DRS (formerly ReplayTool) enables the replay and annotation of large quantities of time-based datasets.
• It allows for the simultaneous synchronized replay of multiple data sources, including videos, system log files and spatial data.
• In addition to the actual replay and annotation of such datasets, the DRS will also enable the user to perform tasks with their data files that aid the organisation of their datasets.
3d) DRS: A real-time demo
Demonstrating the basic corpus tool-bench interface, for the representation and replay of individual sets of encoded data, and the concordance tool that has been developed as part of the tool to enable detailed linguistic enquiry:
http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/webstart/drs.jnlp
http://www.mrl.nott.ac.uk/research/projects/dress/software/DRS/replaytool.zip
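For readers unfamiliar with concordancing, the sketch below shows the basic keyword-in-context idea over time-stamped utterances. It is not the DRS concordancer: the function `concordance`, its `width` parameter and the record layout are hypothetical. Each hit carries its time code, which is the piece of information a multi-modal tool would need in order to jump to the matching point in the audio or video.

```python
from typing import List, Tuple

UtteranceRow = Tuple[int, str, str]        # (time_ms, speaker, text)
Hit = Tuple[int, str, str, str, str]       # (time_ms, speaker, left, keyword, right)


def concordance(
    utterances: List[UtteranceRow], keyword: str, width: int = 3
) -> List[Hit]:
    """Return keyword-in-context hits with up to `width` words on either side.

    Matching ignores case and trailing punctuation; context does not cross
    utterance boundaries.
    """
    target = keyword.lower()
    hits = []
    for time_ms, speaker, text in utterances:
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token.strip(".,?!").lower() == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((time_ms, speaker, left, token, right))
    return hits


rows = [
    (147602, "2", "Yeah so s="),
    (148138, "1", "Erm so that sounds like it's okay."),
    (150144, "2", "Okay."),
]

hits = concordance(rows, "okay")
for time_ms, speaker, left, token, right in hits:
    print(f"{time_ms:>7} <${speaker}> {left:>20} [{token}] {right}")
```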
3f) Outlining ethical issues and concerns
• Defining ‘consent’
• Anonymisation in textual, audio and video data: the limitations of pixellisation
• Re-use and distribution problems
Contacts:
David Evans: [email protected]
Dawn Knight: [email protected]
Ronald Carter: [email protected]
Svenja Adolphs: [email protected]