Post on 27-Dec-2015
Macquarie RT05s Speaker Diarisation System
Steve Cassidy
Centre for Language Technology, Macquarie University
Sydney
2©19 Apr 2023 Macquarie University
System Goals
• Develop a simple end-to-end system for the SPKR task
• Platform for experimentation
• Improve on RT04s system
Overall Results
[Figure: bar chart of results per meeting source (AMI, CMU, ICSI, NIST, VT; two bars each), vertical axis 0 to 90]
System Overview
Pipeline: Feature Extraction → SAD → Segmentation → Turn Clustering → Speaker ID
• Single Distant Microphone
• Implemented in C and Tcl
• Runs in around 6x real time on single AMD64
• Developed with RT04 devtest data
– No AMI or VT data seen before eval
Feature Extraction
• 26 coefficients:
– 12 MFCC
– RMS Energy
– Delta Coefficients
• 10ms frame rate, 25.6ms window
• Mean subtraction based on mean of first 60 seconds of file
• Uses the KTH Snack toolkit
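The mean subtraction step can be sketched as follows. This is a minimal illustration rather than the system's C/Tcl code: the function name `subtract_initial_mean` and the list-of-lists frame layout are assumptions, and since the slide does not say which of the 26 coefficients are normalised, the sketch treats every coefficient alike.

```python
FRAME_RATE_HZ = 100   # 10 ms frame step, as stated on the slide
MEAN_WINDOW_S = 60    # the mean is estimated from the first 60 seconds

def subtract_initial_mean(frames, frame_rate_hz=FRAME_RATE_HZ, window_s=MEAN_WINDOW_S):
    """Subtract the per-coefficient mean of the first `window_s` seconds from every frame."""
    if not frames:
        return []
    head = frames[: frame_rate_hz * window_s]  # shorter files simply use all frames
    dim = len(frames[0])
    mean = [sum(f[d] for f in head) / len(head) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

Estimating the mean from a fixed initial stretch keeps the normalisation to a single pass over the start of the file.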
Speech Activity Detection
• Goal: find obvious regions of non-speech for gross segmentation of recording
• GMMs for speech and non-speech
– Speech model: 32 mixtures
– Non-speech model: 8 mixtures
• Trained on RT04s devtest data set
– Reference labels generated from speaker labelling
– Ignored silence regions < 0.3s
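One way to derive speech/non-speech reference labels from speaker labelling is sketched below. It assumes the speaker labels are available as `(start, end)` times in seconds, and it reads "ignored silence regions < 0.3s" as treating such short gaps as speech; the function name `speech_regions` is hypothetical.

```python
MIN_SILENCE_S = 0.3  # gaps shorter than this are treated as speech

def speech_regions(speaker_segments, min_silence_s=MIN_SILENCE_S):
    """Collapse (start, end) speaker segments into speech regions, bridging short gaps."""
    regions = []
    for start, end in sorted(speaker_segments):
        if regions and start - regions[-1][1] < min_silence_s:
            # Overlapping speech or a short pause: extend the current region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]
```

Everything outside the returned regions would then be labelled non-speech for training.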
Speech Activity Detection
• Evaluate frame classification error (%):

Dataset        NSPER   SPER
RT04s unseen   32      19
RT05s          47      15
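The slides do not define NSPER and SPER. The sketch below assumes NSPER is the percentage of reference non-speech frames classified as speech and SPER is the percentage of reference speech frames classified as non-speech; the function name and the boolean label layout are illustrative only.

```python
def frame_error_rates(reference, hypothesis):
    """Return (NSPER, SPER) in percent from per-frame boolean labels (True = speech)."""
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    n_speech = sum(1 for r in reference if r)
    n_nonspeech = len(reference) - n_speech
    speech_errors = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    nonspeech_errors = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    nsper = 100.0 * nonspeech_errors / n_nonspeech if n_nonspeech else 0.0
    sper = 100.0 * speech_errors / n_speech if n_speech else 0.0
    return nsper, sper
```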
Speech Activity Detection
• SAD is performed by classifying successive windows of 10 frames using the GMM models
• Consecutive regions are merged and labelled
• Non-speech < 0.35s merged with following segment
• Speech < 0.15s merged with following non-speech
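The merging rules above can be sketched as a post-processing pass over per-window decisions. The GMM scoring itself is left out: the sketch assumes each 10-frame window (0.1 s at a 10 ms frame step) has already been labelled `"speech"` or `"nonspeech"`. The function names, and applying the non-speech rule before the speech rule, are assumptions.

```python
WINDOW_S = 0.1           # 10 frames at a 10 ms frame step
MIN_NONSPEECH_S = 0.35
MIN_SPEECH_S = 0.15

def merge_runs(window_labels, window_s=WINDOW_S):
    """Merge consecutive identical window labels into [label, start, end] segments."""
    segments = []
    for i, label in enumerate(window_labels):
        if segments and segments[-1][0] == label:
            segments[-1][2] = (i + 1) * window_s
        else:
            segments.append([label, i * window_s, (i + 1) * window_s])
    return segments

def absorb_short(segments, label, min_dur):
    """Give segments of `label` shorter than `min_dur` the label of the following segment."""
    out = []
    for i, (lab, start, end) in enumerate(segments):
        has_next = i + 1 < len(segments)
        if lab == label and end - start < min_dur and has_next:
            lab = segments[i + 1][0]
        if out and out[-1][0] == lab:
            out[-1][2] = end          # same label as previous segment: extend it
        else:
            out.append([lab, start, end])
    return out

def smooth_sad(window_labels):
    """Apply both merging rules and return (label, start, end) tuples."""
    segments = merge_runs(window_labels)
    segments = absorb_short(segments, "nonspeech", MIN_NONSPEECH_S)
    segments = absorb_short(segments, "speech", MIN_SPEECH_S)
    return [tuple(s) for s in segments]
```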
Speech Activity Detection
• Evaluation
– Frame classification error
– Boundaries missed
  – nothing within 0.5s
– Boundaries inserted inside real segments

Meeting     Frame Error %   Boundary Error %   # Auto
            NSPER   SPER    Miss    FP
CMU 1415    89      7       91      77         45
ICSI 1100   99      4       85      88         99
NIST 0939   71      9       83      84         97
AMI 1206    43      18      25      79         348
VT 1430     100     0       99      50         2
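The boundary measures can be sketched as below, assuming boundaries are given as times in seconds. A reference boundary counts as missed when no automatic boundary lies within 0.5 s; an automatic boundary counts as a false positive when no reference boundary lies within 0.5 s. The denominators (reference count for misses, automatic count for false positives) are an assumption, as the slides do not state them.

```python
TOLERANCE_S = 0.5

def boundary_errors(reference, hypothesis, tol=TOLERANCE_S):
    """Return (miss %, false positive %) for two lists of boundary times in seconds."""
    missed = sum(1 for r in reference
                 if not any(abs(r - h) <= tol for h in hypothesis))
    inserted = sum(1 for h in hypothesis
                   if not any(abs(h - r) <= tol for r in reference))
    miss_pct = 100.0 * missed / len(reference) if reference else 0.0
    fp_pct = 100.0 * inserted / len(hypothesis) if hypothesis else 0.0
    return miss_pct, fp_pct
```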
Turn Segmentation
• Speech regions are segmented using the Bayesian Information Criterion (BIC)
• Compare the fit of a single Gaussian model of the sequence with that of a pair of models, one on each side of a candidate break
• Fixed windows of 200 frames advanced over speech region
• Peaks in delta BIC curve indicate change points
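A minimal sketch of the delta BIC computation follows. To stay short it uses diagonal-covariance Gaussians, which is a simplification; the slides do not say whether the covariance is full or diagonal. The window step of 10 frames, the split at the window midpoint, and the penalty weight of 1.0 are likewise assumptions.

```python
import math

def log_det_diag(frames, floor=1e-6):
    """Log-determinant of the diagonal covariance estimated from `frames`."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, floor))
    return total

def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two models (a change point) over one."""
    left, right = frames[:split], frames[split:]
    n, dim = len(frames), len(frames[0])
    gain = 0.5 * (n * log_det_diag(frames)
                  - len(left) * log_det_diag(left)
                  - len(right) * log_det_diag(right))
    # The two-model hypothesis has one extra mean and one extra variance vector.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty

def delta_bic_curve(frames, window=200, step=10, penalty_weight=1.0):
    """Slide a fixed window over the region, scoring a break at each window midpoint."""
    curve = []
    for start in range(0, len(frames) - window + 1, step):
        chunk = frames[start:start + window]
        score = delta_bic(chunk, window // 2, penalty_weight)
        curve.append((start + window // 2, score))
    return curve
```

Change points would then be taken at positive peaks of the returned curve.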
Turn Segmentation
[Figure: bar chart of segmentation % Error (axis 0 to 100), showing FP and Miss for CMU/98, ICSI/198, NIST/257, AMI/427 and VT/168]
Turn Clustering
• Given a set of speaker turns, find natural clusters
• Number of clusters unknown
• Requires:
– Distance metric on speaker turns
– Clustering algorithm
– Cluster evaluation metric
Speaker Similarity
Mean + variance of feature vectors
K-L distance metric
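As a sketch of such a distance, each turn can be summarised by a diagonal Gaussian (per-coefficient mean and variance) and two turns compared with the symmetrised Kullback-Leibler divergence. Plain K-L divergence is not symmetric; symmetrising it by summing both directions is an assumption here, as are the function names.

```python
def mean_var(frames, floor=1e-6):
    """Per-coefficient mean and (floored) variance of a turn's feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor)
           for d in range(dim)]
    return mean, var

def symmetric_kl(turn_a, turn_b):
    """KL(a||b) + KL(b||a) between diagonal Gaussians fitted to two turns."""
    mean_a, var_a = mean_var(turn_a)
    mean_b, var_b = mean_var(turn_b)
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        diff_sq = (ma - mb) ** 2
        total += va / vb + vb / va - 2.0 + diff_sq * (1.0 / va + 1.0 / vb)
    return 0.5 * total
```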
Turn Clustering
• Implementation:
– Select segments longer than 1.5s for clustering
– KL distance on mean/variance of features
– Hierarchical clustering
– Select labellings for 2, 3…N speakers
– Cluster evaluation performed after speaker ID
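The clustering steps can be sketched as agglomerative clustering over a precomputed distance matrix, recording one labelling for each speaker count from 2 to N. Average linkage is an assumption (the slides do not name the linkage), and `select_turns`, `agglomerate` and the `(start, end)` turn layout are hypothetical.

```python
MIN_TURN_S = 1.5

def select_turns(turns, min_dur=MIN_TURN_S):
    """Keep only (start, end) turns longer than `min_dur` seconds."""
    return [(s, e) for s, e in turns if e - s > min_dur]

def to_labels(clusters, n_items):
    """Convert a list of index lists into one cluster label per item."""
    labels = [None] * n_items
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def agglomerate(dist, max_speakers):
    """Return {k: labels} for k = 2..max_speakers using average linkage."""
    clusters = [[i] for i in range(len(dist))]
    labellings = {}
    while len(clusters) > 1:
        if 2 <= len(clusters) <= max_speakers:
            labellings[len(clusters)] = to_labels(clusters, len(dist))
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return labellings
```

Each candidate labelling would then be passed to the speaker ID stage before one is chosen.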
Speaker ID
• Use cluster labelled turns to train speaker models
– 32 mixture GMM
• Now classify and re-label all speaker turns
• Potentially correct poor clustering decisions
• Very small amounts of data to support models
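The train-then-relabel loop can be sketched as below. To keep the example short, a single diagonal Gaussian per speaker stands in for the 32 mixture GMM, so this shows the control flow rather than the system's actual models. The function names are hypothetical, and a label of `None` marks a turn that was left out of clustering.

```python
import math

def fit_diag_gaussian(frames, floor=1e-3):
    """Fit a diagonal Gaussian (a stand-in for the GMM speaker model)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor)
           for d in range(dim)]
    return mean, var

def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of a turn under a speaker model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def relabel_turns(turns, cluster_labels):
    """Train one model per cluster, then assign every turn to its best model."""
    models = {}
    for label in set(l for l in cluster_labels if l is not None):
        pooled = [f for t, l in zip(turns, cluster_labels) if l == label for f in t]
        models[label] = fit_diag_gaussian(pooled)
    return [max(models, key=lambda lab: avg_log_likelihood(t, models[lab]))
            for t in turns]
```

Because every turn is rescored, a turn placed in the wrong cluster can move to a better matching speaker, and turns too short to cluster also receive a label.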
Overall Results
[Figure: bar chart of results per meeting source (AMI, CMU, ICSI, NIST, VT; two bars each), vertical axis 0 to 90]