TRANSCRIPT
Speaker Diarisation and Large Vocabulary Recognition at CSTR:
The AMI/AMIDA System
Fergus McInnes, 7 December 2011
History – AMI, AMIDA and recent developments
Architecture – processing graph, modules, directories
Getting and running the system
Points for further work
History (1): AMI and AMIDA
AMI Project (Augmented Multi-party Interaction): Edinburgh, Sheffield, Brno, Twente, IDIAP, ICSI, TNO and others; 2004–2006. Capture, analysis, indexing and browsing of meeting data.
AMIDA Project (AMI with Distance Access): AMI Consortium as above; 2006–2009. Extension to videoconferences.
Meeting transcription system: modules developed by multiple partners; multiple versions (IHM/MDM, offline/online, different architectures and platforms). Took part in NIST Rich Transcription evaluations in 2007 and 2009 – hence "RT07" and "RT09" versions.
History (2): Developments at CSTR
Legacy of AMI and AMIDA: RT09 offline system for multiple distant microphones (MDM), running on ECDF compute server Eddie (also individual headset microphone (IHM) and RT07 systems) – used since 2009 by several people at CSTR.
Developments in 2011 (FRM – Cisco Project): documentation written; scripts and config files tidied up; changes to a few modules; files placed in new Subversion repository; additional modules and interfacing to support PodCastle application.
Architecture (1): Overall structure
Audio signals from multiple microphones
→ Padding and noise reduction
→ Beamforming
→ Speech/non-speech segmentation
→ Speaker diarisation
→ Speech recognition
→ Speaker-attributed text with timings and scores
Architecture (2): Details of speech recognition
Waveform → PLP coding, CMN/CVN → VTLN estimation (warp factor per speaker)
Passes 1–2: PLP with VTLN, CMN/CVN, HLDA → decoding pass 1 → CMLLR adaptation → decoding pass 2
Passes 3–4: PLP and Fbank, each with VTLN, CMN/CVN, HLDA → feature merging, CMN/CVN → decoding pass 3 (CMLLR adaptation) → decoding pass 4 (MLLR adaptation)
Decoding by Juicer; HMMs and LM differ from passes 1–2 to passes 3–4
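The pass structure above can be summarised as a small table. The values below are read off the diagram; the dictionary layout and field names are purely illustrative, not part of the actual system.

```python
# Summary of the four decoding passes as data.  Contents follow the
# diagram above; the structure itself is illustrative only.
PASSES = [
    {"pass": 1, "features": "PLP + VTLN, CMN/CVN, HLDA", "adaptation": None},
    {"pass": 2, "features": "PLP + VTLN, CMN/CVN, HLDA", "adaptation": "CMLLR"},
    {"pass": 3, "features": "PLP and Fbank merged, CMN/CVN", "adaptation": "CMLLR"},
    {"pass": 4, "features": "PLP and Fbank merged, CMN/CVN", "adaptation": "MLLR"},
]

# All passes are decoded by Juicer; passes 3-4 use different HMMs
# and LM from passes 1-2.
adapted = [p["pass"] for p in PASSES if p["adaptation"]]
```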
Architecture (3): ROTK framework (Resource Optimisation Toolkit)
Modules (strictly module instances) connected together in processing graph (“mpg” file) – read by Python script sgproc, which creates directories, calls runmod script for each module instance and keeps track of progress
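What a graph runner like sgproc does can be sketched in a few lines. This is a minimal illustration, assuming the graph is given as a predecessor list; the function names, directory names and calling convention are assumptions, not the real ROTK code.

```python
# Minimal sketch of a processing-graph runner in the spirit of sgproc:
# create a directory per module instance, run each instance once its
# predecessors are done, and track progress.  Names and the graph
# format are illustrative assumptions, not the actual implementation.
import os
import tempfile

def run_graph(graph, workdir, runmod):
    """graph: dict mapping module-instance name -> list of predecessors."""
    done = set()
    pending = dict(graph)
    while pending:
        ready = [mi for mi, deps in pending.items()
                 if all(d in done for d in deps)]
        if not ready:
            raise RuntimeError("cycle in processing graph")
        for mi in ready:
            mi_dir = os.path.join(workdir, mi)
            os.makedirs(os.path.join(mi_dir, "in"), exist_ok=True)
            os.makedirs(os.path.join(mi_dir, "out"), exist_ok=True)
            runmod(mi, mi_dir)          # module-specific work
            done.add(mi)
            del pending[mi]
    return done

# Usage: a linear MDM-style graph, recording the order of execution.
graph = {"beamform": [], "segment": ["beamform"],
         "diarise": ["segment"], "decode": ["diarise"]}
with tempfile.TemporaryDirectory() as wd:
    order = []
    run_graph(graph, wd, lambda mi, d: order.append(mi))
```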
Directory structure per module instance:
  <MI directory>  [created by sgproc]
    in/
      dal/ – links to data files in preceding MIs' out directories
      in.dlp – list of dal files & job numbers
      ms0000.dal, ms0001.dal, ... – data lists
    out/
      dal/ – output data files
      out.dlp – list of dal files & job numbers
      ms0000.dal, ms0001.dal, ... – data lists
    working files and subdirectories (module-specific)  [created by runmod]
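The dal/dlp bookkeeping can be pictured with a toy helper. The actual ROTK file formats are not documented here, so the line layout below (one dal filename and job number per dlp line) is purely an assumption for illustration.

```python
# Toy illustration of per-MI data-list bookkeeping: msNNNN.dal files
# listing data items, indexed by a .dlp file pairing each dal file
# with a grid job number.  The real file formats may differ; this
# layout is an assumption, for illustration only.
import os
import tempfile

def write_lists(mi_dir, batches, job_numbers):
    """Write one msNNNN.dal per batch and an in.dlp index."""
    os.makedirs(mi_dir, exist_ok=True)
    dlp_lines = []
    for i, (batch, job) in enumerate(zip(batches, job_numbers)):
        dal = f"ms{i:04d}.dal"
        with open(os.path.join(mi_dir, dal), "w") as f:
            f.write("\n".join(batch) + "\n")
        dlp_lines.append(f"{dal} {job}")
    with open(os.path.join(mi_dir, "in.dlp"), "w") as f:
        f.write("\n".join(dlp_lines) + "\n")
    return dlp_lines

with tempfile.TemporaryDirectory() as d:
    lines = write_lists(d, [["ses01_spk1"], ["ses01_spk2"]], [1001, 1002])
```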
Architecture (4): Parallel processing
Dependent on runmod script for each module, but typically...
Different recording sessions, and speakers within each session, are processed in parallel (some modules also subdivide a speaker’s data if amount is large)
runmod (or a subsidiary script) submits jobs to grid and records the job numbers
Jobs for a later MI may be submitted before input data from an earlier MI are ready – using the "-w <jobnumber>" option to submitjob, which calls qsub with "-hold_jid <jobnumber>"
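The dependency mechanism can be sketched as follows. This only illustrates how a "-w <jobnumber>" option maps onto Grid Engine's "-hold_jid"; the real submitjob script's interface is not shown, and the function and script names here are hypothetical.

```python
# Illustrative mapping of a submitjob-style "-w <jobnumber>" option
# onto Grid Engine's "qsub -hold_jid <jobnumber>".  The real
# submitjob script's interface may differ.
def build_qsub_command(script, hold_jobs=()):
    """Return the qsub argument list for a job that must wait for
    the given job numbers to finish before it starts."""
    cmd = ["qsub"]
    if hold_jobs:
        cmd += ["-hold_jid", ",".join(str(j) for j in hold_jobs)]
    cmd.append(script)
    return cmd

# An early MI's job is submitted unconditionally...
early = build_qsub_command("runmod_beamform.sh")
# ...and a later MI's job is held until job 123456 completes.
later = build_qsub_command("runmod_decode.sh", hold_jobs=[123456])
```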
Architecture (5): File locations
In Subversion repository (https://svn-kerberos.ecdf.ed.ac.uk/repo/inf/cstr/rec/trunk):
  pkg/rotk/b0013 – sgproc and system utilities
  pkg/jet/v0.04 – submitjob and gridenv.<host>.csh files
  pkg/mod/<module-name> – module-specific files: runmod, subsidiary scripts, source code
  mdm/mpg/*.mpg – processing graphs
  mdm/cfg/*.env – config files for all module instances
  mdm/global.cfg – template for global config file
  mdm/run-mdm.* – templates for top-level scripts to call sgproc with specific processing graphs
On Eddie (in /exports/work/inf_hcrc_cstr_nst/amiasr/asrcore – locations specified in config files):
exp/sysopt/bin/*/* – program files (sox, SHoUT, HTK, STK, Juicer etc)
exp/sysopt/mdm-sys09dev/lib/*/* – HMMs, language models etc
Getting and running the system
1. Get an account on Eddie.
2. Get access to the repository (give me your UUN).
3. Create a system directory <SysD> and check out a copy of the system there.
4. Run copy_files (to copy and compile some binary files).
5. Create a working directory <WD> (somewhere with enough space – probably best under /exports/work).
6. Copy global.cfg and run-mdm.pad to <WD> and edit them to specify <SysD> and project code.
7. Create <WD>/data and put your wav files there (src-<source>_ses-<session>_chn-<channel>.wav – 16 kHz mono).
8. Create a list of data files (without ".wav" extension) in <WD>/default.dal.
9. Run run-mdm.* – results will appear in <WD>/JU-M1-CMLLR4_MLLR32_0_D/out.
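The naming convention in step 7 lends itself to a mechanical check. The helper below is hypothetical (the regex and function are not part of the system); it shows how the default.dal basenames in step 8 relate to the wav names.

```python
# Illustrative check of the expected wav naming convention,
# src-<source>_ses-<session>_chn-<channel>.wav, and generation of
# the corresponding default.dal entries (basenames without ".wav").
# Not part of the actual system.
import re

NAME_RE = re.compile(r"^src-[^_]+_ses-[^_]+_chn-[^_]+\.wav$")

def dal_entries(wav_names):
    """Return .dal basenames for correctly named wav files,
    raising on any name that breaks the convention."""
    entries = []
    for name in wav_names:
        if not NAME_RE.match(name):
            raise ValueError(f"bad wav name: {name}")
        entries.append(name[:-len(".wav")])
    return entries

entries = dal_entries(["src-mdm_ses-meet01_chn-0.wav"])
```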
Issues and points for further work
We don’t have source code for some of the programs (e.g. SFeaCat, SfeaStack) – if possible we should replace these by our own or other open-source equivalents
Many of the scripts are opaque (tcsh scripts calling Perl, building other scripts and then running them, etc)
Licensing for some components is too restrictive for a commercial application
Use of Juicer makes it difficult to adapt LM and vocabulary – desirable for many applications