CAVA: a human Communication Audio-Visual Archive

Matt Mahon[1], Suzanne Beeke[1], Merle Mahon[2] and Martin Moyle[3]

UCL Departments of Language and Communication[1], Developmental Science[2] and Library Services[3]

Clockwise from above: Dissemination-quality video (MPG)[a]; preservation video (AVI)[b]; preferred format standards.

Data and formats

Why is CAVA needed?

The CAVA project aims to establish a repository for audio-visual data on real-life human communication for spoken and signed languages.

•In order to investigate human communication and interaction, researchers need hours of audio-visual data, sometimes recorded over periods of months or years.

•Collecting and cataloguing such valuable data is time-consuming and expensive. Once it is collected and ready to use, it makes sense to get the maximum value from it by reusing it and sharing it among the research community.

Metadata

It is not enough simply to collect the data and standardise its quality; it must also be readily searchable.

•Natural audio-visual data tends to defy easy classification, and may lead to idiosyncratic solutions to preservation, metadata and access issues.

•CAVA uses a modified metadata standard based on the ISLE MetaData Initiative (IMDI), a schema designed for language resources.

•The CAVA subset, drawn principally from the UCL Deafness, Cognition and Language Research unit (DCAL) subset, presents a pragmatic solution.

•All the information required for the metadata record is information normally collected in the course of research; fields which do not apply may be left blank.
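As a rough illustration of a searchable record in which inapplicable fields are simply left blank, the Python sketch below may help. The field names are hypothetical and are not the actual elements of IMDI or of the CAVA subset; the search function is likewise an assumption for illustration, not the way the repository itself searches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# NOTE: these field names are hypothetical illustrations only; they are not
# the real element names of IMDI or of the CAVA metadata subset.
@dataclass
class MetadataRecord:
    title: str
    depositor: str
    study: Optional[str] = None                     # left blank if not applicable
    languages: List[str] = field(default_factory=list)
    files: List[str] = field(default_factory=list)

    def filled_fields(self) -> dict:
        """Return only the fields that were actually filled in."""
        return {k: v for k, v in vars(self).items() if v not in (None, "", [])}


def search(records, term):
    """Case-insensitive search over every filled-in value of each record."""
    term = term.lower()
    hits = []
    for rec in records:
        values = []
        for v in rec.filled_fields().values():
            values.extend(v if isinstance(v, list) else [v])
        if any(term in str(v).lower() for v in values):
            hits.append(rec)
    return hits
```

A record created with only a title and a depositor remains valid and searchable; the blank fields are skipped rather than treated as errors.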

Below: A complete metadata record. This record includes an MPEG video file, a WAV audio file and a transcription in Word format.

Still images from video:

[a, b]: ‘1 AB 10-04 T’, Mahon, M. Department of Health and University College London, EAL Deaf Children study, 2009.

[c]: ‘D3RA5’, Beeke, S. University College London, The Evaluation of a Novel Conversation-focused Therapy for Agrammatism study, 2009.

Our website: www.ucl.ac.uk/ls/cava

The archive: http://digitool.ucl.ac.uk

Pilot

The CAVA pilot launched in September 2009, with four objects in the archive.

•The repository, which is still in development, now contains four datasets with over 170 hours of audio-visual data.

•The CAVA team will also be piloting limited access to datasets through UCL’s VLE, Moodle.

•The CAVA team are currently accepting data for dissemination from researchers at a variety of institutions, and are considering requests to access data from the repository.

•If you are interested in including your data in the repository, or accessing the data we hold, please contact the Project Officer at [email protected].

Above: Preservation-quality video (AVI)[c].

Access

Well-implemented access management is crucial to the success of the repository, given the wide range of ethical and copyright restrictions on the data.

•As the data is collected it is stored using the UCL Library Services Digital Collections service, which runs on the Ex Libris DigiTool platform.

•Access to Digital Collections requires a unique login and password, which will be assigned by the CAVA team upon completion of the end user licence.

•Video clips, transcripts (where available) and descriptive metadata can be uploaded to the repository in batches, maintaining the relationships between the one or more versions of each video recording.

•Technical metadata is generated automatically, and appropriate access restrictions and exceptions are applied.

•All data accepted by the archive will have appropriate permissions for the various types of dissemination. Users will be able to download compressed video or uncompressed audio-only files.
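One way to picture how a batch might keep the versions of each recording together is the Python sketch below. It assumes, purely for illustration, that every version of a recording shares the same file name apart from its extension; the poster does not describe CAVA's actual naming convention or DigiTool's batch mechanism, and the file names in the example are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def group_versions(filenames):
    """Group a batch of files by recording identifier.

    Assumption (hypothetical): the identifier is the file name without its
    extension, so 'D3RA5.avi' and 'D3RA5.mpg' belong to the same recording.
    """
    groups = defaultdict(list)
    for name in filenames:
        path = PurePath(name)
        groups[path.stem].append(path.name)
    # Sort each group so the result does not depend on the batch order.
    return {stem: sorted(names) for stem, names in groups.items()}


# Hypothetical batch: preservation video, dissemination video, audio, transcript.
batch = [
    "D3RA5.avi",
    "D3RA5.mpg",
    "D3RA5.doc",
    "1 AB 10-04 T.mpg",
    "1 AB 10-04 T.wav",
]
grouped = group_versions(batch)
```

Each key in `grouped` then stands for one recording, and its list holds every version that should stay linked when the batch is uploaded.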

Above left: CAVA on the UCL Digital Collections front page. Above right: The CAVA repository main page.

•Natural data can often be used for more than the purpose its collector intended. Researchers may be able to save time and money, or improve the depth of their observations and conclusions, by reusing existing data instead of collecting their own.

What formats will CAVA manage?

•The data which will be placed in the repository comes from a wide range of sources, in a wide range of formats. Consequently it has a wide range of software requirements, depending on the equipment used to make the recordings.

•Our aim is to introduce uniformity where practical, ideally archiving an audio-only and a compressed video copy of each recording.

•As well as the data itself, a small sample video from each data set will be available by streaming at collection level, so that potential users can explore the repository and select the collections most appropriate to their work.

Below: A workflow for uploading data and gaining access to the repository.

Above: A pilot browse structure.

Uploading data:

START

•Depositor completes metadata form and licences (Project officer is available to help with completion of the metadata)

•CAVA team receives metadata form, licences and the data itself

•Project officer prepares data for upload to the repository

•Data is uploaded in batches

Gaining access:

START

•Prospective user completes licence forms

•CAVA team arranges user access to the repository

Both paths end at the same step:

•The data is made available through the repository, and appropriate users are given access
