TRANSCRIPT
Speaker Diarisation and Large Vocabulary Recognition at CSTR:
The AMI/AMIDA System
Fergus McInnes, 7 December 2011
History – AMI, AMIDA and recent developments
Architecture – processing graph, modules, directories
Getting and running the system
Points for further work
History (1): AMI and AMIDA
AMI Project (Augmented Multi-party Interaction): Edinburgh, Sheffield, Brno, Twente, IDIAP, ICSI, TNO and others; 2004–2006. Capture, analysis, indexing and browsing of meeting data.
AMIDA Project (AMI with Distance Access): AMI Consortium as above; 2006–2009. Extension to videoconferences.
Meeting transcription system: modules developed by multiple partners; multiple versions (IHM/MDM, offline/online, different architectures and platforms). Took part in NIST Rich Transcription evaluations in 2007 and 2009 – hence "RT07" and "RT09" versions.
History (2): Developments at CSTR
Legacy of AMI and AMIDA: RT09 offline system for multiple distant microphones (MDM), running on ECDF compute server Eddie (also individual headset microphone (IHM) and RT07 systems) – used since 2009 by several people at CSTR.
Developments in 2011 (FRM – Cisco Project): documentation written; scripts and config files tidied up; changes to a few modules; files placed in new Subversion repository; additional modules and interfacing to support PodCastle application.
Architecture (1): Overall structure
Audio signals from multiple microphones
→ Padding and noise reduction
→ Beamforming
→ Speech/non-speech segmentation
→ Speaker diarisation
→ Speech recognition
→ Speaker-attributed text with timings and scores
Architecture (2): Details of speech recognition
Waveform → PLP coding, CMN/CVN → VTLN estimation (warp factor per speaker)
Passes 1–2: PLP with VTLN, CMN/CVN, HLDA → decoding pass 1 → CMLLR adaptation → decoding pass 2
Passes 3–4: PLP and Fbank, each with VTLN, CMN/CVN, HLDA → feature merging, CMN/CVN → decoding pass 3 (CMLLR adaptation) → decoding pass 4 (MLLR adaptation)
Decoding by Juicer; HMMs and LM differ from passes 1–2 to passes 3–4
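The pass structure above can be summarised as a small table. The values below are read off the diagram; the dictionary layout and field names are purely illustrative, not part of the actual system.

```python
# Summary of the four decoding passes as data.  Contents follow the
# diagram above; the structure itself is illustrative only.
PASSES = [
    {"pass": 1, "features": "PLP + VTLN, CMN/CVN, HLDA", "adaptation": None},
    {"pass": 2, "features": "PLP + VTLN, CMN/CVN, HLDA", "adaptation": "CMLLR"},
    {"pass": 3, "features": "PLP and Fbank merged, CMN/CVN", "adaptation": "CMLLR"},
    {"pass": 4, "features": "PLP and Fbank merged, CMN/CVN", "adaptation": "MLLR"},
]

# All passes are decoded by Juicer; passes 3-4 use different HMMs
# and LM from passes 1-2.
adapted = [p["pass"] for p in PASSES if p["adaptation"]]
```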
Architecture (3): ROTK framework (Resource Optimisation Toolkit)
Modules (strictly module instances) connected together in processing graph (“mpg” file) – read by Python script sgproc, which creates directories, calls runmod script for each module instance and keeps track of progress
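What a graph runner like sgproc does can be sketched in a few lines. This is a minimal illustration, assuming the graph is given as a predecessor list; the function names, directory names and calling convention are assumptions, not the real ROTK code.

```python
# Minimal sketch of a processing-graph runner in the spirit of sgproc:
# create a directory per module instance, run each instance once its
# predecessors are done, and track progress.  Names and the graph
# format are illustrative assumptions, not the actual implementation.
import os
import tempfile

def run_graph(graph, workdir, runmod):
    """graph: dict mapping module-instance name -> list of predecessors."""
    done = set()
    pending = dict(graph)
    while pending:
        ready = [mi for mi, deps in pending.items()
                 if all(d in done for d in deps)]
        if not ready:
            raise RuntimeError("cycle in processing graph")
        for mi in ready:
            mi_dir = os.path.join(workdir, mi)
            os.makedirs(os.path.join(mi_dir, "in"), exist_ok=True)
            os.makedirs(os.path.join(mi_dir, "out"), exist_ok=True)
            runmod(mi, mi_dir)          # module-specific work
            done.add(mi)
            del pending[mi]
    return done

# Usage: a linear MDM-style graph, recording the order of execution.
graph = {"beamform": [], "segment": ["beamform"],
         "diarise": ["segment"], "decode": ["diarise"]}
with tempfile.TemporaryDirectory() as wd:
    order = []
    run_graph(graph, wd, lambda mi, d: order.append(mi))
```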
Directory structure per module instance:
  <MI directory>  [created by sgproc]
    in/
      dal/ – links to data files in preceding MIs' out directories
      in.dlp – list of dal files & job numbers
      ms0000.dal, ms0001.dal, ... – data lists
    out/
      dal/ – output data files
      out.dlp – list of dal files & job numbers
      ms0000.dal, ms0001.dal, ... – data lists
    working files and subdirectories (module-specific)  [created by runmod]
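The dal/dlp bookkeeping can be pictured with a toy helper. The actual ROTK file formats are not documented here, so the line layout below (one dal filename and job number per dlp line) is purely an assumption for illustration.

```python
# Toy illustration of per-MI data-list bookkeeping: msNNNN.dal files
# listing data items, indexed by a .dlp file pairing each dal file
# with a grid job number.  The real file formats may differ; this
# layout is an assumption, for illustration only.
import os
import tempfile

def write_lists(mi_dir, batches, job_numbers):
    """Write one msNNNN.dal per batch and an in.dlp index."""
    os.makedirs(mi_dir, exist_ok=True)
    dlp_lines = []
    for i, (batch, job) in enumerate(zip(batches, job_numbers)):
        dal = f"ms{i:04d}.dal"
        with open(os.path.join(mi_dir, dal), "w") as f:
            f.write("\n".join(batch) + "\n")
        dlp_lines.append(f"{dal} {job}")
    with open(os.path.join(mi_dir, "in.dlp"), "w") as f:
        f.write("\n".join(dlp_lines) + "\n")
    return dlp_lines

with tempfile.TemporaryDirectory() as d:
    lines = write_lists(d, [["ses01_spk1"], ["ses01_spk2"]], [1001, 1002])
```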
Architecture (4): Parallel processing
Dependent on runmod script for each module, but typically...
Different recording sessions, and speakers within each session, are processed in parallel (some modules also subdivide a speaker’s data if amount is large)
runmod (or a subsidiary script) submits jobs to grid and records the job numbers
Jobs for a later MI may be submitted before input data from an earlier MI are ready – using the "-w <jobnumber>" option to submitjob, which calls qsub with "-hold_jid <jobnumber>"
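The dependency mechanism can be sketched as follows. This only illustrates how a "-w <jobnumber>" option maps onto Grid Engine's "-hold_jid"; the real submitjob script's interface is not shown, and the function and script names here are hypothetical.

```python
# Illustrative mapping of a submitjob-style "-w <jobnumber>" option
# onto Grid Engine's "qsub -hold_jid <jobnumber>".  The real
# submitjob script's interface may differ.
def build_qsub_command(script, hold_jobs=()):
    """Return the qsub argument list for a job that must wait for
    the given job numbers to finish before it starts."""
    cmd = ["qsub"]
    if hold_jobs:
        cmd += ["-hold_jid", ",".join(str(j) for j in hold_jobs)]
    cmd.append(script)
    return cmd

# An early MI's job is submitted unconditionally...
early = build_qsub_command("runmod_beamform.sh")
# ...and a later MI's job is held until job 123456 completes.
later = build_qsub_command("runmod_decode.sh", hold_jobs=[123456])
```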
Architecture (5): File locations
In Subversion repository (https://svn-kerberos.ecdf.ed.ac.uk/repo/inf/cstr/rec/trunk):
  pkg/rotk/b0013 – sgproc and system utilities
  pkg/jet/v0.04 – submitjob and gridenv.<host>.csh files
  pkg/mod/<module-name> – module-specific files: runmod, subsidiary scripts, source code
  mdm/mpg/*.mpg – processing graphs
  mdm/cfg/*.env – config files for all module instances
  mdm/global.cfg – template for global config file
  mdm/run-mdm.* – templates for top-level scripts to call sgproc with specific processing graphs
On Eddie (in /exports/work/inf_hcrc_cstr_nst/amiasr/asrcore – locations specified in config files):
exp/sysopt/bin/*/* – program files (sox, SHoUT, HTK, STK, Juicer etc)
exp/sysopt/mdm-sys09dev/lib/*/* – HMMs, language models etc
Getting and running the system
1. Get an account on Eddie.
2. Get access to the repository (give me your UUN).
3. Create a system directory <SysD> and check out a copy of the system there.
4. Run copy_files (to copy and compile some binary files).
5. Create a working directory <WD> (somewhere with enough space – probably best under /exports/work).
6. Copy global.cfg and run-mdm.pad to <WD> and edit them to specify <SysD> and project code.
7. Create <WD>/data and put your wav files there (src-<source>_ses-<session>_chn-<channel>.wav – 16 kHz mono).
8. Create a list of data files (without ".wav" extension) in <WD>/default.dal.
9. Run run-mdm.* – results will appear in <WD>/JU-M1-CMLLR4_MLLR32_0_D/out.
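The naming convention in step 7 lends itself to a mechanical check. The helper below is hypothetical (the regex and function are not part of the system); it shows how the default.dal basenames in step 8 relate to the wav names.

```python
# Illustrative check of the expected wav naming convention,
# src-<source>_ses-<session>_chn-<channel>.wav, and generation of
# the corresponding default.dal entries (basenames without ".wav").
# Not part of the actual system.
import re

NAME_RE = re.compile(r"^src-[^_]+_ses-[^_]+_chn-[^_]+\.wav$")

def dal_entries(wav_names):
    """Return .dal basenames for correctly named wav files,
    raising on any name that breaks the convention."""
    entries = []
    for name in wav_names:
        if not NAME_RE.match(name):
            raise ValueError(f"bad wav name: {name}")
        entries.append(name[:-len(".wav")])
    return entries

entries = dal_entries(["src-mdm_ses-meet01_chn-0.wav"])
```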
Issues and points for further work
We don’t have source code for some of the programs (e.g. SFeaCat, SfeaStack) – if possible we should replace these by our own or other open-source equivalents
Many of the scripts are opaque (tcsh scripts calling Perl, building other scripts and then running them, etc)
Licensing for some components is too restrictive for a commercial application
Use of Juicer makes it difficult to adapt LM and vocabulary – desirable for many applications