Speaker Diarization of Broadcast News Audios
Submitted in partial fulfillment of the requirements
of the degree of
Bachelor of Technology and Master of Technology
by
Parthe Pandit
(Roll no. 10D070009)
Supervisor:
Prof. Preeti Rao
Department of Electrical Engineering
Indian Institute of Technology Bombay
2015
Dedicated to
George, Elaine, Kramer & Jerry
Parthe Pandit/ Prof. Preeti Rao (Supervisor): “Speaker Diarization of Broadcast News
Audios”, Dual Degree Dissertation, Department of Electrical Engineering, Indian Institute of
Technology Bombay, July 2015.
Abstract
Speaker Diarization is a multimedia indexing technology that makes use of audio information to
answer the question “Who spoke when?” This thesis presents a step-by-step speaker diarization
system implemented in MATLAB that is evaluated using the Diarization Error Rate (DER) met-
ric. The proposed system, designed for segmenting audio recordings of broadcast news, provides
implementations of state-of-the-art i-vectors as well as the traditional GMM speaker models. A
graphical clustering algorithm introduced by Rouvier et al. in 2013 has also been implemented.
This clustering algorithm offers a lower DER as well as a computational advantage compared to
the conventional GMM based hierarchical agglomerative clustering. An unsupervised speech
activity detector (SAD) has also been developed that discards nonspeech in two stages - silence
removal followed by music removal. The music removal subsystem has been adapted to classify
speech segments with background music, e.g. news headlines sections, as speech. The proposed
SAD achieves a favourable performance on the January 2013 subset of the REPERE corpus
compared to the supervised SAD of the LIUM diarization toolkit.
Index terms: unsupervised, speech activity detection, MATLAB, ILP clustering, REPERE
Contents
Dissertation Approval ii
Declaration of Authorship iii
Abstract iv
List of Figures vii
List of Tables viii
1 Introduction 1
2 Evaluation of Speaker Diarization 4
2.1 Diarization Error Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Datasets for evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 State of the Art in Speaker Diarization 9
3.1 Feature Extraction for Speaker Diarization . . . . . . . . . . . . . . . . . . . . . 10
3.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Metric based segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Model based segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Speech Activity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1 Systems participating in NIST-RT evaluations . . . . . . . . . . . . . . . 16
3.3.2 Broadcast news systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Speaker models for clustering . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 System Description and Evaluation 25
4.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Speech Activity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Speech Activity Detection Algorithm . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Confidence measures for Speech Activity Detection . . . . . . . . . . . . . 29
4.3 Evaluation of Speech Activity Detection . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Evaluation on the NDTV dataset . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 REPERE dataset results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Speaker Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Choice of Segmentation parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Speaker Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6.1 Choice of speaker model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6.2 Choice of clustering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Evaluation of Speaker Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.7.1 HAC experiments with GMM speaker models . . . . . . . . . . . . . . . . 40
4.7.2 HAC with i-vector speaker models . . . . . . . . . . . . . . . . . . . . . . 40
4.7.3 ILP based experiments with GMM speaker models . . . . . . . . . . . . . 41
4.7.4 ILP clustering with i-vector speaker models . . . . . . . . . . . . . . . . . 42
4.7.5 Results on REPERE corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Conclusion and Future work 45
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Acknowledgements 53
List of Figures
1.1 Applications of Speaker Diarization - Joke-o-mat . . . . . . . . . . . . . . . . . . 2
2.1 Rich transcription generated from a speaker diarization system . . . . . . . . . . 4
2.2 Example diarization error rate calculation . . . . . . . . . . . . . . . . . . . . . . 6
3.1 A typical speaker diarization system . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Bayesian Inference Criterion for change detection . . . . . . . . . . . . . . . . . 13
3.3 Sliding window search for speaker change detection . . . . . . . . . . . . . . . . . 14
3.4 Growing window search for speaker change detection . . . . . . . . . . . . . . . . 15
3.5 Model based segmentation for SAD system . . . . . . . . . . . . . . . . . . . . . 17
3.6 Hierarchical agglomerative clustering . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Block diagram of proposed MATLAB system . . . . . . . . . . . . . . . . . . . . 25
4.2 Silence removal for Speech Activity Detection . . . . . . . . . . . . . . . . . . . . 28
4.3 Music removal for Speech Activity Detection . . . . . . . . . . . . . . . . . . . . 30
4.4 Extraction of i-vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 ILP Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 HAC clustering with GMM speaker models . . . . . . . . . . . . . . . . . . . . . 40
4.7 HAC clustering with i-vector speaker models . . . . . . . . . . . . . . . . . . . . 41
4.8 ILP clustering with GMM speaker models . . . . . . . . . . . . . . . . . . . . . . 42
4.9 Dimension of Total Variability space . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Tables
1.1 Comparison of audio domains in speaker diarization research . . . . . . . . . . . 3
2.1 Simplification of diarization error rate calculation . . . . . . . . . . . . . . . . . . 5
2.2 Example of annotation and hypothesis segmentation for DER . . . . . . . . . . . 6
2.3 Annotated shows in REPERE corpus and their respective times . . . . . . . . . . 8
4.1 Size of GMM for speech and silence model . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Shape of covariance matrix of GMM for speech activity detection . . . . . . . . . 32
4.3 Cascade system for speech activity detection . . . . . . . . . . . . . . . . . . . . 32
4.4 Refining music model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Results for REPERE with 60 hours annotated . . . . . . . . . . . . . . . . . . . 33
4.6 DER with ORACLE experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 DER with best clustering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8 Comparison of clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Chapter 1
Introduction
The number of multimedia uploads on the internet is ever-increasing, especially as
smartphones and smart devices have gained popularity in recent years. Online search engines
handle billions of search queries every day, and they now
also facilitate searching through audio, image and video databases. Keeping track of all the data
and storing it efficiently for possible future access is becoming more and more important. Hence
indexing techniques in multimedia formats are getting attention. Audio segmentation is one
such indexing technique where the content in a collection of multimedia recordings is organised
based on the semantic information provided through their audio data.
Speaker diarization is one such audio indexing problem where the question being asked to the
machine is “Who spoke when?” Speaker diarization acts as a precursor to many other speech
technologies. For instance performing automatic meeting transcriptions, automatic stenogra-
phers, Dictaphones all would function faster and more efficiently if the machine knows the
active speaker at any given instant.
Satirical TV shows sometimes also engage in investigative journalism. Last Week Tonight
on HBO, for example, comments on current issues using clips retrieved from past audio
and video archives that contain the relevant content [1], which in fact is a major reason for
the popularity of the show. Combing through previous content manually is tedious, and the
task is simplified manifold if the multimedia content is well organised and indexed.
Similarly while going through repeat telecasts of talk shows or comedy shows, a user would like
to navigate through their interesting parts only. Interestingly, a research group at University
of California, Berkeley made an online software called the Joke-o-mat [2] for fans of the famous
sitcom Seinfeld. It provides a temporal labeling for all the episodes of the show so that fans can
enjoy re-runs of the show looking only at the punch lines. Indexing of sports matches can also
enable finding out where a goal was scored or a batter hit a home run and thus enhance user
experience. The problem of speaker diarization has many interesting applications.
Figure 1.1: Screen shot of the Joke-o-mat Java applet which lets users browse through episodes of Seinfeld.
The research problem of speaker diarization has been applied in various domains – Telephone
conversations, meeting recordings and broadcast news archiving. Although the problem remains
the same, the design criteria for each of these are different due to the nature of the audio
recording and the variability they possess. The peculiarities of a typical recording from each of
the three domains are compared in Table 1.1. The focus of this thesis is on solving speaker
diarization for the broadcast news domain.
A system has been developed in MATLAB and tested on two datasets – (i) the
NDTV dataset which is a set of broadcast recordings of Indian English news readings and (ii)
the REPERE corpus from the French REPERE broadcast news people discovery campaign
competition. The system has 3 decoupled blocks viz. the speech activity detector, the speaker
change detector and the speaker clustering block.
The rest of the thesis is organised as follows. Chapter 2 reviews the evaluation techniques
for speaker diarization systems and gives a summary of the two datasets on which the system
was evaluated. Chapter 3 breaks down the problem of speaker diarization into signal processing
and statistical machine learning problems and reviews speaker diarization systems mentioned
in literature to study the algorithms and techniques used therein. Chapter 4 gives a detailed
description of the speaker diarization system that has been developed in MATLAB and does an
evaluation of the available subsystems using metrics mentioned in Chapter 2. The last chapter
makes conclusions about the methods implemented in the proposed system and makes a few
remarks about future development of the system.
Broadcast news                       Meeting conversations            Telephone conversations
Longer durations of                  Shorter durations of             Shorter durations of
uninterrupted speech                 uninterrupted speech             uninterrupted speech
Negligible speaker overlap           Higher speaker overlap           Moderate speaker overlap
Presence of music, jingles and a     Uniform background conditions    Uniform background conditions
variety of background noise
Dominant speaker (the anchor)        No dominant speaker              No dominant speaker
Number of speakers unknown           Number of speakers unknown       Number of speakers known
                                                                      (generally 2)

Table 1.1: Design criteria for speaker diarization systems for broadcast news, meeting recordings and telephone conversations.
Chapter 2
Evaluation of Speaker Diarization
A diarization system has to answer the question “Who spoke when?” without any a priori
information about the speakers present in the audio recording. The output expected from the
system is of the form shown in Figure 2.1. In particular, note that speaker segments are not
expected to be labelled by speaker name, only by a unique speaker id, which is indicated by colour
for the recording in the figure. Speaker diarization is thus different from speaker verification or
speaker recognition where prior information for target speakers may be made available to the
system beforehand in the form of speaker models or speaker biometrics.
2.1 Diarization Error Rate
Meeting Diarization
The evaluation of a diarization system is done using a metric called the Diarization Error Rate
(DER) [4], which is the percentage of the time of the audio for which the speaker was wrongly
labelled. The output of the system is compared with a segment level manually annotated
Figure 2.1: Rich transcription generated from a speaker diarization system [3]
temporal transcription indicating the speaker labels. The formula 2.1 was provided for the
NIST RT speaker diarization evaluations 2005, 2006, 2007 and 2009. In these evaluations,
competitor systems had to perform speaker diarization on meeting conversation recordings. In
the evaluation formula, for a segment s of duration dur(s), Nref and Nhyp are the number of
speakers indicated by the annotations and hypothesized by the system respectively, and Ncorrect
is the number of speakers in segment s that were a correct match between the annotation and
hypothesis.
DER = [ Σ_{s=1}^{S} dur(s) · (max(N_ref, N_hyp) − N_correct) ] / [ Σ_{s=1}^{S} dur(s) · N_ref ]    (2.1)
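As a concrete sketch, formula (2.1) can be implemented in a few lines. Python is used here purely for illustration; the function name and per-segment tuple layout are hypothetical and not part of the MATLAB system described in this thesis.

```python
def diarization_error_rate(segments):
    """DER as in Eq. (2.1), returned as a percentage.

    `segments` is a list of (dur, n_ref, n_hyp, n_correct) tuples, one per
    scored segment: segment duration, number of reference speakers, number
    of hypothesized speakers, and number of correctly matched speakers.
    """
    num = sum(d * (max(nr, nh) - nc) for d, nr, nh, nc in segments)
    den = sum(d * nr for d, nr, _, _ in segments)
    return 100.0 * num / den

# A 10 s segment scored perfectly plus a 5 s segment with the wrong speaker:
# 5 s of error out of 15 s of reference speech.
print(diarization_error_rate([(10, 1, 1, 1), (5, 1, 1, 0)]))
```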
Broadcast News Diarization
In broadcast news, there is very little overlap. For the case where overlap is absent, the formula
for DER can be simplified. This error calculation is used in the broadcast news diarization
systems. Pyannote [5] is a Python-based toolkit that facilitates this calculation specifically for
such systems.
Total speech time (S) =                    Total non-speech time (NS) =
Correctly Labelled Speech time (C1)        Correctly Labelled Non-speech time (C2)
+ Missed Speech time (T1)                  + False Alarm Speech time (T2)
+ Incorrectly Labelled Speech time (T3)

Table 2.1: Simplification of diarization error rate calculation
The DER can be broken down systematically into its component errors. Consider an audio with
S sec of speech and NS sec of non-speech as indicated by the annotation. Non-speech includes
silences, speaker pauses, music, jingles, noise etc. The two categories can be subdivided exhaustively
as shown in Table 2.1. Missed speech time is the time when the algorithm erroneously indicated
a segment as non-speech. False alarm speech time, on the other hand, is the time when the
algorithm erroneously indicated a segment as speech. These 2 errors occur during the speech
activity detection, which is a pre-processing step in almost all diarization systems. They are
quantified as E1 = T1 × 100/S and E2 = T2 × 100/S. E1 is called the missed speech rate (MSR)
and E2 the false alarm speech rate (FASR).
The error E3 = T3 × 100/S is partly contributed by errors in both the speaker segmentation and
the speaker clustering, and is often termed speaker error (SPK ERR). A speaker change if missed
during segmentation causes misclassification of the shorter of the two segments. If the system
segments the audio into more number of segments than indicated by the annotation, often
called oversegmentation, there is a chance to make up for the oversegmentation error during the
Figure 2.2: Example DER calculation. (a) annotation, (b) hypothesis [5]
clustering step by merging neighbouring segments together. However, if a segment happens to
be small, the possibility of error increases since there is less data from which to capture the
speaker information in the segment. Erroneously clustered speaker segments are the main
contributor to E3. Finally, DER is the total error in the system's hypothesis.
DER = (E1 + E2 + E3)
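Using the quantities of Table 2.1, this decomposition can be written out directly. The sketch below is illustrative only; the variable names follow the text above rather than any toolkit.

```python
def der_components(S, T1, T2, T3):
    """Split DER into missed speech (E1/MSR), false alarm speech (E2/FASR)
    and speaker error (E3/SPK ERR), each as a percentage of speech time S."""
    E1 = T1 * 100.0 / S   # missed speech rate
    E2 = T2 * 100.0 / S   # false alarm speech rate
    E3 = T3 * 100.0 / S   # speaker error
    return E1, E2, E3, E1 + E2 + E3   # last entry is the total DER
```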
Calculating SPK ERR makes use of the Hungarian algorithm to perform a matching between
the labels hypothesized by the system and the annotation labels. It finds the mapping between
the two sets of labels that maximizes their total temporal overlap. Finding the optimal mapping is
needed because the system does not need to identify speakers by name and therefore its speaker
labels will differ from the labels in the reference transcript. An example DER calculation is
shown in Table 2.2. Note that nonspeech segments are not labeled there, but the reference
annotation implies nonspeech between 10s-12s, 20s-24s and 27s-30s.
Annotation                           Hypothesis
[Segment(0, 10)] = 'Alice (A)'       [Segment(2, 13)] = 'a'
[Segment(12, 20)] = 'Bob (B)'        [Segment(13, 14)] = 'd'
[Segment(24, 27)] = 'Alice (A)'      [Segment(14, 20)] = 'b'
[Segment(30, 40)] = 'Charlie (C)'    [Segment(22, 38)] = 'c'
                                     [Segment(38, 40)] = 'd'

Table 2.2: Example of annotation and hypothesis segmentation for DER
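The label mapping for the Table 2.2 example can be illustrated in code. The sketch below uses a greedy best-overlap pairing, a simplification of the optimal Hungarian assignment mentioned above; the function and variable names are illustrative only.

```python
def overlap(segs_a, segs_b):
    """Total temporal overlap (s) between two lists of (start, end) intervals."""
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in segs_a for s2, e2 in segs_b)

def greedy_match(ref, hyp):
    """Greedily pair reference and hypothesis labels by maximum overlap."""
    pairs = []
    ref, hyp = dict(ref), dict(hyp)
    while ref and hyp:
        (r, h), best = max(
            (((r, h), overlap(ref[r], hyp[h])) for r in ref for h in hyp),
            key=lambda x: x[1])
        if best <= 0:
            break
        pairs.append((r, h, best))
        del ref[r], hyp[h]
    return pairs

# Table 2.2 example: per-label segment lists for annotation and hypothesis.
reference = {'A': [(0, 10), (24, 27)], 'B': [(12, 20)], 'C': [(30, 40)]}
hypothesis = {'a': [(2, 13)], 'b': [(14, 20)],
              'c': [(22, 38)], 'd': [(13, 14), (38, 40)]}
pairs = greedy_match(reference, hypothesis)
```

Here A maps to a (8 s overlap), C to c (8 s) and B to b (6 s), giving 22 s of correctly attributed speech; hypothesis label d remains unmatched.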
2.2 Datasets for evaluation
The diarization system developed in MATLAB was evaluated on two datasets –
the NDTV dataset, which has been manually annotated, and the REPERE dataset, which was
used in the REPERE video diarization campaigns. The REPERE corpus was annotated during
the course of the competition, 2012-2014.
NDTV dataset
The development dataset used for the system consists of 22 episodes of the Hindu news Headlines
Now show from the NDTV news channel. It consists of manual annotations of 4h15m of English
news reading with an Indian accent. The anchor is the dominant speaker (maximum active time)
in the episodes. The anchors are different across episodes. The dataset also has silence segments
with lengths varying from 1s to 5s. No advertisement jingles are present in the dataset although
the headlines of the show are announced with music in the background that is common across
episodes. In cases where active speaker is a field correspondent background noise is also present.
In the manually generated annotation the speakers in a single episode are labeled with the
following information – gender of the speaker, background environment (clean /noisy /music),
speaker ID in the episode (indicating the anchor separately), in that order. The nonspeech, labeled
as silence, noise, speaker pause or music, constitutes 7% of the total recording. Speaker overlap
has been annotated with the most dominant speaker in the overlap.
REPERE dataset
The REPERE dataset was used in the French video diarization campaign in 2012-2014. The
French REPERE challenge was a research evaluation competition aimed at systems performing
multimodal person discovery in video recordings of broadcast news. The systems participating
had to answer: (i) Who is speaking? (ii) Who is present in the video? (iii) What names are cited?
and (iv) What names are displayed? [6]
The original REPERE corpus is composed of various TV shows from two French TV channels
that were diverse in their content - news, debates, celebrity interviews etc. It has been distributed
by ELDA (Evaluation and Language resources Distribution Agency). 60 hours of data has been
manually annotated. These annotations provide identity-labeled speech turns. The nonspeech
segments, which consist largely of advertisements and show-specific music events, have not been
annotated; nonspeech accounts for 5% of the total 60 hours. Speaker overlap is annotated
with every speaker present in the overlap. The annotation is in the standard RTTM format [7].
The list of TV programmes and their annotated durations is given in Table 2.3.
TV Programme               Duration    TV Programme            Duration
BFMTV BFMStory             20h50m      LCP EntreLesLignes      5h20m
BFMTV CultureEtVous        2h45m       LCP LCPInfo13h30        10h03m
BFMTV PlaneteShowbiz       2h15m       LCP LCPInfo20h30        0h40m
LCP CaVousRegarde          5h25m       LCP TopQuestions        6h35m
LCP PileEtFace             5h20m       LCP LCPActu14h          0h30m
BFMTV RuthElkrief          0h21m       TOTAL                   60h04m
Table 2.3: Annotated shows in REPERE corpus and their respective times
Chapter 3
State of the Art in Speaker
Diarization
The problem of speaker diarization involves answering “Who spoke when”. It is generally broken
down into answering “is anyone speaking?” and then answering “which speaker in the audio is
speaking?” The first step is called speech activity detection, which is a pre-processing step
common in speaker recognition, speech recognition, speech coding and speech enhancement
[8]. The latter problem can be approached as finding the change in speaker (called speaker
segmentation) and then combining the contiguous segments belonging to the same speaker under
a unique label (called speaker clustering).
Initially, in the late 1990s, when research in diarization was still in its nascent stages, a few
systems attempted to perform speech activity detection as a by-product of the segmentation and
clustering [8]. Nonspeech was thought to be just another speaker. But owing to the acoustic
variability of nonspeech, systems with explicit speech activity detectors performed much better.
Often, the speaker segmentation and speaker clustering are performed iteratively and hence
shown as a single block [9] as in figure 3.1.
In this chapter previously used methods in speaker diarization have been reviewed and the
state-of-the-art algorithms implemented by various systems specialized in diarization of broad-
cast news, meeting recordings and telephone conversations are compared. In recent years,
the National Institute of Standards and Technology (NIST), USA has organised rich transcription
tasks for broadcast news and telephone diarization (2003-'04) and for meeting diarization
(2005, ‘07, ‘09). The Albayzin campaign of 2010, the ESTER (2008) [10] and REPERE (2012-14)
broadcast audio and video diarization campaigns have fueled research in broadcast news diariza-
tion and attracted developers to participate with their diarization engines to set up benchmarks.
Some of these competitor systems have also been reviewed in this chapter.
Figure 3.1: A typical speaker diarization system
3.1 Feature Extraction for Speaker Diarization
For the task of speaker diarization, acoustic features that discriminate speaker information in the
spectrogram but are invariant to the phone sequence being uttered are desired. Mel-frequency
cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients, although not
designed to distinguish between speakers, have been used widely in the areas of speaker verifica-
tion and speaker recognition. Since a similar task of modelling speaker information is tackled in
speaker diarization, MFCCs and other cepstral features are the most commonly used features.
During speaker segmentation 12-19 MFCCs have been used along with the short time energy,
while during clustering usage of higher order derivatives of these MFCCs has been reported
[11]. LFCCs extracted using a linear filter bank instead of the Mel scale filter bank [12] and
Linear Prediction Cepstral Coefficients (LPCCs) [13] have also been tested but no conclusion
has been reached regarding the better performance of either. Typical sizes of analysis windows
are 25-30 ms, with frame hops of 10 ms.
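As a quick illustration of these analysis parameters, the helper below (hypothetical, not part of the system described in this thesis) computes frame start positions for a 25 ms window hopped by 10 ms.

```python
def frame_starts(num_samples, sr, win_ms=25, hop_ms=10):
    """Start indices of short-time analysis frames."""
    win = int(sr * win_ms / 1000)   # window length in samples
    hop = int(sr * hop_ms / 1000)   # hop length in samples
    return list(range(0, num_samples - win + 1, hop))

# 1 s of 16 kHz audio: 400-sample windows hopped by 160 samples -> 98 frames.
print(len(frame_starts(16000, 16000)))
```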
For speech activity detection, acoustic features that discriminate between speech and non-
speech are sought after. Features such as energy [13], zero-crossing rate, spectral centroid,
spectral roll-off and spectral flux [14] have been used previously in speech activity detection.
However, these features have typically been used in concatenation with cepstral
features.
Other than the above mentioned short time analysis features, 4Hz modulation frequency
features that convey long term characteristics of the acoustic signal have also been investigated
[15], and have been applied in speaker overlap detection and speech activity detection. A
major challenge with these features, though, is their high dimensionality and the associated
computational cost. Long term cumulative features drawn over texture
windows of 500ms such as median of pitch, long time average spectrum, deviation of the 4th
and 5th formants, harmonics to noise ratio, formant dispersion etc. have been shown to be of use for
fast cluster initialization [9], while features providing vocal source and vocal tract information
[16] have shown better speaker discrimination when used along with MFCCs.
Recently Slaney et al. used features derived as activations of the bottleneck layer of a
neural network. The artificial neural network was trained to discriminate 500ms segments as
belonging to same or different speaker [17]. In another work [18], a 50% relative improvement was
reported for speech activity detection on a large Youtube corpus when a two dimensional soft-
max activation of a deep neural network was concatenated with 13 MFCCs. Another interesting
feature space, explored in 2011, sacrifices diarization accuracy only slightly to obtain a 10x speed-up
by using binary-valued features for clustering [19]. In this work, acoustic MFCC
features of segments are transformed into a binary feature space using likelihoods obtained from
GMMs.
3.2 Segmentation
In audio segmentation, the task is to create homogeneous and contiguous chunks of audio, each
showing dissimilarity from its neighbouring segments. It is also called acoustic change detection.
We will look at two approaches to audio segmentation with more focus on methods used in
speaker segmentation applied to speaker diarization.
3.2.1 Metric based segmentation
One of the most common audio segmentation methods to date is metric-based segmentation.
These methods are very popular in music segmentation tasks as well. In metric based
segmentation, a distance metric that indicates the similarity of two audio segments is first defined.
Then a change detection strategy is implemented using this metric. Compared to model based
methods, these methods have a great advantage in that they do not need any a priori information
about the data.
For music segmentation, distances are calculated between the features directly. In speech
processing however, the features (generally cepstral features) used are not suitable for frame-
wise distance computation for comparing speaker similarity, due to their variability with the
phones uttered. To aggregate speaker information from longer segments, it is assumed that
the features of every segment come from a probability distribution. Distance comparison is
done between these probability distributions using statistical similarity measures such as the
KL divergence, Cross Likelihood Ratio, Bayesian Inference Criterion etc. The most commonly
used probability distribution for modeling chunks of feature vectors during speaker segmentation
is the full covariance multivariate Gaussian distribution.
Bayesian Inference Criterion
The Bayesian Inference Criterion (BIC) is a model selection criterion, i.e., a statistical criterion
that compares candidate models for representing the data while penalizing over-fitting. For a set
of N vectors X, the BIC of a model M is defined as:
BIC(X, M) = log P(X | M) − λ · #(M) · log N
The first term calculates the likelihood of the data given the model, whereas the latter term
penalizes it proportional to the number of parameters #(M) that the model uses and the size
of available data on which the model was trained. The second term is called the complexity of
the model.
The BIC can be applied to indicate whether the two sets of feature vectors being compared
for similarity are drawn from the same distribution or from different distributions. To measure
similarity between blocks X1 and X2 the following hypotheses need to be compared: H0: The
feature vectors from X1 and X2 are from same distribution and H1: The feature vectors from
X1 and X2 are from separate distributions. Let the model for H0 be M which would be trained
on X i.e., X1 concatenated with X2 and let the models for H1 be M1 and M2 for X1 and X2
respectively. We define
∆BIC = BIC(M1) + BIC(M2) − BIC(M)
A positive value of ∆BIC suggests dissimilarity between the two blocks X1 and X2 and hence
indicates that there is a change between segments X1 and X2.
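A minimal numerical sketch of this test, using one-dimensional data and single Gaussians so that everything fits in a few lines. Real systems use full-covariance multivariate Gaussians over cepstral features; the penalty here follows the common simplification of charging only the extra parameters of the two-model hypothesis, and the sign convention is chosen so that positive values indicate a change (sign conventions vary in the literature).

```python
import math

def gaussian_loglik(x):
    """Log-likelihood of samples x under a maximum-likelihood 1-D Gaussian."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC between two 1-D segments. The two-Gaussian hypothesis has
    two extra parameters (one mean, one variance), giving a complexity
    penalty of lam * (2/2) * log N."""
    n = len(x1) + len(x2)
    penalty = lam * math.log(n)
    return (gaussian_loglik(x1) + gaussian_loglik(x2)
            - gaussian_loglik(x1 + x2) - penalty)

# Two clearly different segments give a large positive Delta-BIC;
# two interleaved halves of the same ramp give a negative one.
diff = delta_bic([i * 0.01 for i in range(50)], [5 + i * 0.01 for i in range(50)])
same = delta_bic([i * 0.01 for i in range(0, 100, 2)],
                 [i * 0.01 for i in range(1, 100, 2)])
```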
Chen et al. built a completely unsupervised system using ∆BIC, and their method has been
replicated a number of times for speaker/environment change detection [21] in the speaker
diarization domain. Improvements were made by [22] and [23] to produce faster implementations
of BIC-based change detection, reducing the number of computations with some compromise on
accuracy.
Metric based segmentation methods are implemented in two strategies, a fixed sliding window
Figure 3.2: BIC for change detection [20]
strategy and a growing window search strategy. In the former, there is a window of fixed size,
the centre of which is being inspected for a change [24]. If the feature vectors on either side of the
midpoint are better modeled by separate distributions, resulting in a higher distance between
distributions, the midpoint is declared as a change point. The size of the sliding window is
typically 5s, and the two 2.5s halves are compared for similarity. With a larger window size
the two segments would be modeled better. However, a larger window has a higher chance of
containing multiple change points, in which case the probability distribution estimated on a
half-window may get contaminated and a change may be missed.
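The sliding-window strategy can be sketched as follows. The distance used here is just the absolute difference of the half-window means on scalar data, a toy stand-in for the Gaussian-based measures (∆BIC, KL divergence) used in practice; all names and parameters are illustrative.

```python
def sliding_window_changes(x, win=500, threshold=1.0):
    """Toy sliding-window change search: score each midpoint by the distance
    between the two half-windows, then keep local maxima above a threshold."""
    half = win // 2
    scores = []
    for t in range(half, len(x) - half):
        left = x[t - half:t]
        right = x[t:t + half]
        d = abs(sum(left) / half - sum(right) / half)   # toy distance
        scores.append((t, d))
    # keep thresholded local maxima of the distance curve as change points
    return [t for i, (t, d) in enumerate(scores[1:-1], 1)
            if d > threshold and d >= scores[i - 1][1] and d >= scores[i + 1][1]]
```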
When implementing the growing window search strategy, a single change is pursued from
the start of the recording in a window of a certain initial size, generally about 5s. If no change is
detected in this window, the window is enlarged and the search is repeated in the new window.
Once a change is detected, the search is reset to start from the detected change point. The growing
window method has been reported with the BIC metric [21]. In recent years, the [25] and [26]
systems have replicated the growing window BIC segmentation followed by a BIC clustering
that merges only consecutive segments to reduce false alarm speaker changes.
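A toy sketch of the growing-window strategy, again with a stand-in mean-difference distance on scalar data; a real implementation would score each candidate split point with ∆BIC, and the parameter names here are hypothetical.

```python
def growing_window_changes(x, init=500, grow=250, min_seg=100, threshold=1.0):
    """Toy growing-window search: look for a single change in the current
    window; grow the window if none is found, and restart the search from
    each detected change."""
    def best_split(seg):
        # best candidate split point in the segment by mean-difference distance
        best_t, best_d = None, 0.0
        for t in range(min_seg, len(seg) - min_seg):
            left, right = seg[:t], seg[t:]
            d = abs(sum(left) / len(left) - sum(right) / len(right))
            if d > best_d:
                best_t, best_d = t, d
        return best_t, best_d

    changes, start, win = [], 0, init
    while start + win <= len(x):
        t, d = best_split(x[start:start + win])
        if t is not None and d > threshold:
            changes.append(start + t)
            start, win = start + t, init   # reset the search after the change
        else:
            win += grow                    # grow the window and retry
    return changes
```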
3.2.2 Model based segmentation
Model based segmentation methods train a GMM for every segmentation class. These GMMs
serve as the state emission PDFs in a hidden Markov model (HMM) in which each state is
connected to every other state with equal transition probability. Viterbi decoding using this
HMM gives a segmentation of the audio recording. A major disadvantage of model based
segmentation is that the GMMs need to be known beforehand, and hence require some external
training data.
Figure 3.3: Sliding window search for speaker change detection. Distance is computed between two halves of the sliding window and plotted with time. Peaks in the distance indicate a change [23].
Figure 3.4: Growing window search for speaker change detection. Every search is for a single change point. Search is reset after the change is found [23].
However, when segmentation and clustering are performed iteratively, the output of the
speaker clustering algorithm provides a set of speaker segments as training data for the GMMs
of the next iteration, which refine the speaker segmentation. A pre-clustering is often performed
to obtain an initial grouping of audio segments [9], with each group expected to correspond to
similar speaker information.
Often, model based segmentation methods have been used only as a post processing step to
achieve a refined segmentation [27]. Such model based techniques are popular for segmentation
during speech activity detection [25], where the acoustic change of interest is that between speech
and nonspeech. Some telephone diarization systems [20] pre-segment the audio recording by
bandwidth and gender using GMMs trained for each of the 4 classes (2 bandwidths x 2 genders).
Model based segmentation methods are more common in meeting diarization systems.
With the advent of better clustering algorithms with i-vector speaker models, the focus has
shifted to performing speaker segmentation and speaker clustering separately for diarization.
However, use of the GMM-HMM framework for refinement is still popular [25].
3.3 Speech Activity Detection
The task of finding contiguous segments of speech in an audio recording and segregating them
from other types of sounds is called speech activity detection (SAD). It benefits speech processing
systems since it is practical to process only the speech segments rather than entire recordings,
making a design more efficient by saving computation time and resources. Apart from the
computational advantages, the absence of an SAD often causes insertion errors in ASR systems.
Hence speech activity detection is a fundamental task in almost all fields of speech processing:
coding, enhancement and recognition [8].
In speaker diarization, the error metric itself highlights the need for a speech activity detec-
tor since missed speech and false alarm speech are included in the diarization error rate metric.
Moreover, with limited speaker data from short speech segments, the presence of non-speech
contaminates the estimated speaker models, thereby affecting the performance of the diarization
system. Initial approaches to diarization treated SAD as a by-product of the diarization
system [8], letting nonspeech form a single cluster that was discarded at the end. However,
it was soon noticed that systems with an explicit SAD gave better results.
SAD is often performed using frame-wise classification. Statistical models are trained and
estimated on a feature space most suitable for discriminating the speech and nonspeech classes.
In most cases, the statistical models are Gaussian mixture models and the feature space
consists of cepstral features. Some works have reported the use of acoustic features such as
energy [13], zero-crossing rate [28] and spectral flux [14]. A few speech activity detectors used
in previous diarization competitions and campaigns, for both the meeting and broadcast
news domains, are reviewed in subsections 3.3.1 & 3.3.2.
3.3.1 Systems participating in NIST-RT evaluations
NIST organised the Rich Transcription evaluations, which remain the benchmark in
meeting diarization. Results from four participant systems are summarized in
[8]. Typically, 1-3% missed speech and 2-4% false alarm speech rates represent the state of the art
in speech activity detection.
The SAD of the SHoUT diarization toolkit [29] uses a bootstrap segmentation performed with
speech and nonspeech models pre-trained on a Dutch broadcast news dataset. This is followed
by an iterative classification using a Viterbi decoder on a single HMM with 2 states representing
speech and nonspeech. The HMM makes it possible to control the minimum duration of the
speech and nonspeech segments, thereby preventing sporadic transitions from one class to the
other. The system uses
Figure 3.5: Model based SAD from SHoUT toolkit [29] using a GMM-HMM system
12 MFCCs concatenated with the zero crossing rate and their first and second derivatives into
a 39 dimensional feature vector. The system was used by ICSI [9] and LIA-Eurecom [12],
although the feature vectors used by the latter team for the iterative classification consisted of
linear frequency cepstral coefficients (LFCC).
The UPC system [30] made use of modified support vector machines (proximal SVMs) with
Gaussian kernels to segregate the speech and nonspeech in the audio. The modification allowed
faster retraining of the SVMs, as suited to an iterative classification.
The IIR-NTU [13] system performed a bootstrap segmentation based on an energy derived
confidence score. An iterative classification with GMMs trained for speech and nonspeech, using
high confidence frames from the bootstrap segmentation, followed the initial segmentation to
refine the speech and nonspeech classes. The authors reported use of Linear Prediction Cep-
strum Coefficients [13] for both the bootstrap segmentation and the iterative classification. This
approach was completely independent of external training data for the speech and nonspeech
models
3.3.2 Broadcast news systems
In the LIUM diarization toolkit [25], the authors developed a model based segmentation system
for speech activity detection using an 8 state HMM with 2 states of silence (wide and narrow
band), 3 states of wide band speech (clean, over noise or over music), 1 state of narrow band
speech, 1 state of jingles, and 1 state of music. Each state is modeled with a 64-component
GMM over MFCCs, their deltas and delta-deltas. All the models were trained using the extensive
data available for each class from the ESTER1 dataset. This system resulted in 1.1% false alarm
speech and 3.9% missed speech on the dev0 subset of the REPERE corpus. Results on the
ESTER2 and ETAPE databases are reported in [25]. Besides speech activity detection, the
LIUM toolkit also performs gender and bandwidth detection, again using a model based
segmentation with 128-component diagonal-covariance GMMs for each of the 4 classes
(2 genders x 2 bandwidths) and feature warping.
The Albayzin 2010 campaign saw five competing systems. The best results for SAD were
reported by [14]: although their DER was much worse than the others (55% DER), their SAD
error stood best at 3.4% (1.1% missed and 2.3% false alarm). The authors reported using multi-layer
perceptrons instead of GMMs to model the emission probabilities of a 5 state hybrid NN-HMM
system. The feature space was also expanded: 16 MFCCs were concatenated with 8 other audio
features, namely energy, zero-crossing rate, spectral centroid, spectral roll-off, maximum normalized
correlation coefficient and its frequency, harmonicity measure and spectral flux. Information
regarding other participating systems in the Albayzin campaign is mentioned in [10].
In the REPERE 2012-2014 evaluations, three consortia took part: SODA, QCompere and
PERCOL [31]. The SODA consortium used the LIUM toolkit described above. The QCompere
system had a 4 state HMM similar to the LIUM toolkit's, with one state each for speech, silence,
noise and music [32], each modelled by a 64-component GMM. The PERCOL system [33]
performed a 3 class GMM based SAD. Interestingly, their 3 classes were non-speech, overlapping
speech and non-overlapping speech, each modeled by a 256-component GMM trained on the
ETAPE corpus. The overlap detection reportedly also improved the DER over the baseline
clustering system.
3.4 Clustering
Clustering is a common problem in statistical data analysis. It has been addressed in many
scientific fields right from exploratory data mining to community detection in social networks. It
is the process of grouping a set of objects such that objects in each group, called a cluster, are
more similar to each other than they are to objects in other groups. The objects could be points
in a vector space or even statistical models. The similarity mentioned above is a distance-like
measure defined between the objects by the user. The word similarity is used because the
measure need not satisfy all the properties of a metric, viz. non-negativity, the triangle
inequality and symmetry. The words similarity and distance are used interchangeably here,
with less distance meaning more similarity and vice versa.
The process of clustering is generally translation invariant; hence the relative positions of
the objects in their space are more relevant than the objects themselves. Indeed, these relative
positions are indicative of pairwise similarity. For the problem of speaker diarization, the aim
is to cluster segments of audio based on the active speaker in each segment. Each cluster should
ideally represent a single speaker.
The dimensionality of the spectrogram of a single segment is large and comparison between
these segments based on their spectrograms is not computationally viable. Hence each segment
needs to be quantified in a low dimensional space so that the similarity of speaker information
can be compared. A few speaker models utilized in the past in the fields of speaker verification
and speaker recognition are reviewed in section 3.4.1. Every speech segment is given a
representative vector or a statistical model which characterizes its speaker information.
Clustering is performed on these speaker models. Section 3.4.2 reviews a traditional
clustering algorithm and a state-of-the-art algorithm based on a graphical approach.
3.4.1 Speaker models for clustering
Since speaker diarization needs to capture speaker information from audio segments, speaker
models commonly used in speaker verification and speaker recognition are adopted. The two
main speaker models, GMMs and i-vectors, are explained below. Of these, i-vectors have
recently become the state of the art in speaker verification tasks.
Gaussian Mixture Models
Gaussian Mixture Models (GMMs) of cepstral features are often used to model speakers. A
Gaussian mixture model is a popular tool for modeling multi-modal data and possesses the
following form:
p(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1    (3.1)
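As a minimal illustration (in Python rather than the thesis MATLAB), equation 3.1 can be evaluated directly in the one-dimensional case:

```python
import math

def gmm_pdf(x, weights, means, variances):
    """Evaluate the 1-D GMM density p(x) = sum_i w_i N(x; mu_i, var_i)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
```

With a single component the expression reduces to the ordinary Gaussian density.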
Since the segment durations can be small, the number of feature vectors available from
a segment is sometimes insufficient to estimate a full Gaussian mixture model. To overcome
this problem, a pre-trained Universal Background Model (UBM) is adapted to the segment to
obtain its speaker model [34]. The UBM is a comprehensive model of data pooled from multiple
speakers that captures the variability of speech. For GMMs of cepstral features, different
statistical similarity measures have been investigated, such as the symmetric KL divergence
and the normalized cross likelihood ratio (NCLR) [35]. The KL divergence is an information
theoretic measure of how different two probability distributions are from each other, while the
cross likelihood ratio compares P(X1|M2) and P(X2|M1) (equations 3.2 & 3.3).
\mathrm{CLR}(X_1, X_2) = \log \frac{P(X_1|M_1)}{P(X_1|M_2)} + \log \frac{P(X_2|M_2)}{P(X_2|M_1)}    (3.2)

\mathrm{NCLR}(X_1, X_2) = \frac{1}{|X_1|} \log \frac{P(X_1|M_1)}{P(X_1|M_2)} + \frac{1}{|X_2|} \log \frac{P(X_2|M_2)}{P(X_2|M_1)}    (3.3)
where Mi is the model estimated on Xi. As we can see, if the feature vectors of segments X1 and
X2 come from the same speaker, X1 fits the model of segment X2 well, so the cross likelihood
increases, decreasing the distance.
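A hedged sketch of equation 3.3 in Python, in which each segment model Mi is a single maximum-likelihood 1-D Gaussian standing in for the adapted GMM (an assumption made for brevity; the thesis uses GMM speaker models):

```python
import math

def fit_gauss(xs):
    """ML 1-D Gaussian (mean, variance) standing in for a segment's GMM."""
    m = sum(xs) / len(xs)
    v = max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-6)
    return m, v

def loglik(xs, model):
    """Total log-likelihood of the samples under a 1-D Gaussian."""
    m, v = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x in xs)

def nclr(x1, x2):
    """Normalized cross likelihood ratio of equation 3.3 (smaller = closer)."""
    m1, m2 = fit_gauss(x1), fit_gauss(x2)
    return (loglik(x1, m1) - loglik(x1, m2)) / len(x1) \
         + (loglik(x2, m2) - loglik(x2, m1)) / len(x2)
```

Because each model is an ML fit to its own segment, both terms are non-negative, so the NCLR behaves as a distance: zero for identical segments and large for dissimilar ones.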
Recent experiments in the SV and SR fields noted that the means of the Gaussians in a GMM
carry most of the speaker related information. Due to the high variability of the covariance
matrices and mixture weights across utterances [34], they are not reliable indicators of speaker
information. Hence, instead of calculating the above likelihood scores, the means of a GMM are
concatenated to get a single vector (called the GMM supervector) in a high dimensional vector
space. Distance measures such as the cosine distance and Mahalanobis distance [36] have been
investigated on this space. For two GMM supervectors to be comparable, both GMMs must be
adapted from the same UBM, ensuring that corresponding mean vectors are compared between
segments. The adaptation algorithm is detailed in [34].
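A minimal sketch of supervector comparison in Python, assuming toy two-component, two-dimensional mean vectors (illustrative only; real supervectors come from UBM adaptation [34]):

```python
import math

def supervector(means):
    """Concatenate GMM component means (adapted from a shared UBM) into
    a single high-dimensional vector."""
    return [m for comp in means for m in comp]

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)
```

Two segments adapted from the same UBM and spoken by the same speaker should yield supervectors at a small cosine distance.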
I-vector representation
The concept of i-vectors was first introduced in speaker verification as a feature extraction
from GMMs that reduces the dimensionality of the GMM hyperparameters. With UBM sizes
of the order of 512, 1024 or even 2048 Gaussians in some GMM systems, the supervector
becomes too large for further computation. Instead, factor analysis is used to reduce the
dimensionality of the supervector, leading to a new representative vector with a few hundred
dimensions. The corresponding subspace, called the total variability subspace, is hypothesized
to contain the spectral information of the speaker and background.
m = M + Tx (3.4)
where m is the mean adapted supervector of the utterance for which the i-vector x is sought. M
is the mean supervector of the UBM. The matrix T is a tall low rank matrix representing the
total variability subspace which needs to be learned on a training dataset. Although supervectors
typically have tens of thousands of dimensions, this representation constrains all supervectors
to lie in an affine subspace of the supervector space. The dimension of the affine subspace is at
most a few hundred.
Training an i-vector extractor requires speaker labeled data with multiple utterances of the same
speaker, with possible variations across utterances in phonetic balance and background noise.
The training algorithm for the total variability subspace [36] and the i-vector extraction
from the Baum-Welch statistics of an utterance are implemented in the MSR Identity
toolkit [37].
3.4.2 Clustering Algorithms
Given the similarity matrix between the speaker GMMs or i-vectors, a clustering algorithm aims
to reach the best set of clusters, with minimum intra-cluster variance and maximum inter-cluster
variance. We will look at two clustering algorithms used previously in diarization, (i)
hierarchical agglomerative clustering (HAC) and (ii) integer linear programming (ILP) based
clustering, which use different solving techniques and also have different criteria for arriving
at the best set of speaker clusters.

Figure 3.6: Hierarchical agglomerative clustering
Hierarchical Agglomerative Clustering
HAC is a greedy algorithm, i.e. it makes a locally optimal choice at each stage in the hope of
finding the global optimum. In an iterative process, the 2 most similar clusters are merged into
a single cluster, reducing the number of clusters by 1 at each step. This continues until only
one cluster remains. While merging 2 clusters, the data from the segments of the 2 clusters is
pooled and a single speaker model is re-estimated on it. The distances of every other cluster to
the newly formed cluster are re-calculated to update the similarity matrix for the next step.
Step 0: Compute the distance matrix d(X_i, X_j)
Step 1: Find i* and j* such that i* ≠ j* and d(X_{i*}, X_{j*}) = \min_{i,j} d(X_i, X_j)
Step 2: (Merge step) Replace X_{i*} and X_{j*} by a single object X_{k*}, where k* = min(i*, j*)
Step 3: Update the distance matrix entries d(X_i, X_{k*})
Step 4: If the number of clusters > 1, go to Step 1
Step 5: Select the best set of clusters using an optimality criterion
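The steps above can be sketched in Python as follows, using a threshold-based stopping criterion. The average-link distance update is an assumption of this sketch; the thesis instead re-estimates a single model on the pooled data of the merged clusters.

```python
def hac(dist, threshold):
    """Hierarchical agglomerative clustering over a symmetric distance
    matrix `dist` between initial objects. Merging stops when the smallest
    remaining inter-cluster distance exceeds `threshold`. Cluster-to-cluster
    distance is the average of pairwise member distances (an assumption of
    this sketch, not the thesis model re-estimation)."""
    clusters = [[i] for i in range(len(dist))]
    def d(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > 1:
        pairs = [(d(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > threshold:          # optimality criterion: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]               # j > i, so deletion is safe
    return clusters
```

With a distance matrix containing two tight pairs and a large gap between them, the algorithm merges each pair and then stops at the threshold.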
The optimal set of clusters is chosen based on an optimality criterion. One criterion is to
choose the set of clusters for which the minimum inter-cluster distance is greater than a
threshold. Another criterion was proposed by Nguyen [38]: from among the clusterings of
all iterations, the one in which the histograms of intra-cluster distances and inter-cluster
distances are farthest apart is chosen.
\arg\max_k \; \min_{i \neq j} d(X_i^{(k)}, X_j^{(k)}) \geq \theta    (3.5)

where d(X_i^{(k)}, X_j^{(k)}) is the distance matrix entry in the kth iteration.

\arg\max_k \; \frac{|m_{\mathrm{inter}} - m_{\mathrm{intra}}|}{\sqrt{\sigma_{\mathrm{inter}}^2 / n_{\mathrm{inter}} + \sigma_{\mathrm{intra}}^2 / n_{\mathrm{intra}}}}    (3.6)

where m_inter, σ_inter and n_inter are the mean, standard deviation and number of elements of
the inter-cluster distances, and similarly for the intra-cluster distances.
The ICSI system [9] performed an initial clustering of segments of 1s duration using prosodic
long term features. This segmentation was then refined iteratively in a GMM-HMM framework
over MFCC feature vectors derived from segments within each cluster. Each state of the HMM
represented a speaker and was modelled by a GMM. The use of the HMM allowed adding a
constraint for obtaining contiguous speech turns of 2s. In each iteration, the number of clusters
was reduced by merging state GMMs in an HAC using the BIC distance. The IIR-NTU system [13]
used LPCCs to form 30 initial clusters by uniformly dividing the audio, and then iteratively
used the same GMM-HMM framework to perform HAC with the CLR distance. The LIA-Eurecom
system [12] used the same framework, although they took a top-down approach (also called
divisive clustering) instead of HAC, splitting states starting from a single state. Many other
systems using the HAC approach with different distances have been implemented [19, 39].
i-vector speaker models adopted from speaker verification were first used in speaker diarization
of telephone conversations, where the number of speakers in the recording was known a priori;
hence k-means clustering of i-vectors was performed to arrive at 2 clusters [40, 41]. Later, an
HAC-like clustering of i-vectors for broadcast news was reported in [42], which demonstrated
better performance than the traditional BIC based GMM HAC architecture.
ILP based clustering
In 2012, Rouvier and Meignier proposed a global optimization approach to speaker clustering
[43] using an integer linear program (ILP). Clustering is posed as a combinatorial optimization
problem on a complete graph (each node connected to every other node). The speaker segments
are the nodes of the graph, and the incidence matrix is the similarity matrix. The integer
linear program that finds the optimal clustering is a variation of the k-centres problem. In
simple words, the k-centres problem is to choose K cities out of N for building warehouses so
that the worst case distance between a city and its closest warehouse is minimized. The ILP
is adapted for unsupervised speaker diarization since the number of speakers K is not known
a priori.
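As a toy illustration of the k-centres idea (a brute-force search over a handful of nodes, not the actual ILP formulation of [43], which scales to real recordings via a linear-programming solver), one can look for the fewest centres such that every segment lies within a distance delta of its closest centre:

```python
from itertools import combinations

def min_centres(dist, delta):
    """Smallest set of centre nodes such that every node is within `delta`
    of some centre; exhaustive search over subsets, illustrative only."""
    n = len(dist)
    for k in range(1, n + 1):                     # try 1 centre, then 2, ...
        for centres in combinations(range(n), k):
            if all(min(dist[i][c] for c in centres) <= delta
                   for i in range(n)):
                return list(centres)
    return list(range(n))
```

The chosen centres play the role of cluster representatives, and every remaining segment is assigned to its closest centre, which mirrors how the ILP selects one representative i-vector per speaker.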
With the introduction of ILP clustering in the broadcast news domain, diarization systems
now typically perform segmentation and clustering separately [10], with a post processing
step of Viterbi decoding using the GMM-HMM framework. In 2014, Meignier et al.
[44] improved upon the above ILP framework by reducing redundancies in the constraints,
making the clustering extremely fast. Hence, in recent years, ILP based methods have overtaken
traditional HAC based clustering due to their better performance in terms of both speed and
accuracy. The LIUM toolkit reported a 17.19% DER with GMM based CLR clustering and
15.46% DER with i-vector based ILP clustering on the REPERE corpus [44].
Summary
In this chapter a few background concepts used in segmentation, clustering and speech activity
detection have been presented, and previously implemented techniques in speaker diarization
literature are reviewed. Other than the speaker diarization task, where the number of speakers
and the presence of specific speakers is not known a priori, there have been many specialized
audio indexing tasks that have been investigated in the past. For example explicitly detecting
the presence of music [45], helping find the structure of a broadcast program [46] or locating
commercials to eliminate unwanted audio [47]. In another work, fast speaker change detection
was performed using speech transcriptions [48]. More generic systems include the Alize speaker
recognition toolkit, which has a diarization sub-block, and the SHoUT toolkit [29], which was
designed for meeting diarization. INRIA, IDIAP and DiarTK are some other diarization
toolkits still under development.
Chapter 4
System Description and Evaluation
A complete end-to-end system has been developed in MATLAB that performs speaker diarization
of audio recordings. This system has been tested and evaluated on data from broadcast news
recordings and debate audios from two corpora and using evaluation metrics described in Chapter
2. This chapter gives a complete description of the system right from the feature extraction to
error calculation. The choice of models and parameters is explained through the evaluations
section for each subsystem. The parameters were optimized for broadcast news using the NDTV
dataset as the development set. The system has also been tested on the REPERE corpus for
comparison with broadcast diarization systems previously mentioned in literature.
Meignier et al. did a comparative analysis of two approaches to diarization and discussed
the pros and cons of both step-by-step and integrated diarization systems [49]. In the
step-by-step approach, diarization is performed in a single pass through the speech activity
detector, the speaker segmenter and the speaker clustering subsystems. In the integrated
approach, the information from the clustering algorithm is used to refine the speaker
segmentation, and clustering is performed iteratively. With the advent of better clustering
algorithms with i-vector speaker models, the focus has shifted to performing speaker
segmentation and speaker clustering separately for diarization. However, use of the GMM-HMM
framework for refinement is still popular [25]. Figure 4.1 shows a block diagram of the proposed
system, which follows the step-by-step diarization approach.
Figure 4.1: Block diagram of proposed MATLAB system
4.1 Feature Extraction
MFCC features have been repeatedly used in diarization research for all subsystems including
speech activity detection, speaker segmentation and speaker clustering. The proposed system
uses the first 19 MFCC features for processing in all three subsystems. In addition to these
features, short time energy and zero crossing rate and their first and second order derivatives
have been used in the SAD. The speaker segmentation subsystem uses only the 19 MFCCs and
short term energy, whereas the clustering uses their first and second order derivatives as well.
The analysis windows are 30 ms long with a 20 ms overlap between consecutive frames.
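For concreteness, the framing can be sketched in Python; the 1 kHz sampling rate in the test values is an assumption chosen only to keep the numbers simple:

```python
def frame_signal(x, fs, win_ms=30, overlap_ms=20):
    """Split samples into overlapping analysis frames: 30 ms windows with
    a 20 ms overlap, i.e. a 10 ms hop, as used by all three subsystems."""
    win = int(fs * win_ms / 1000)
    hop = win - int(fs * overlap_ms / 1000)
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
```

Each returned frame is then passed to the MFCC, energy and zero crossing rate computations.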
4.2 Speech Activity Detection
Speech activity detection is the task of separating speech from nonspeech in an audio recording.
While designing a speech activity detector as a precursor to diarization, the challenges faced are
two-fold – achieving (i) minimum missed speech and (ii) minimum false alarm speech. Percent
speech misclassified as nonspeech by the SAD is called the missed speech rate (MSR), whereas percent
nonspeech misclassified as speech is called false alarm speech rate (FASR). Indeed these two are
the evaluation metrics for SAD. Typically 1-3% missed speech error and 2-4% false alarm speech
rates are the state-of-the-art in speech activity detection. If the diarization system is designed
to act as a precursor for systems such as ASR or key-word spotting, missed speech errors would
lead to deletion errors in the ASR system, while the false alarm speech might lead to insertion
errors. For SAD as a block in diarization, false alarm speech often leads to contamination of
speaker models during clustering and segmentation and affects the clustering process.
A common approach in speech activity detection is to attempt to classify all types of sounds
that are present in the recording. If the data being diarized is known beforehand and has some
peculiarities, such as audio indicators for marking sections in an episode, it becomes possible
to train statistical models for these markers in the data and classification is straightforward [2].
Unfortunately while developing generic systems, we do not have the luxury of possessing such
markers or sound effects to be expected in the input audio. In this section, the problem of
estimating non-speech models for SAD with no prior information about the data is addressed.
Sounds other than speech which are often seen in audio databases are either human produced
such as fillers, lip-smacks, laughter, clapping etc. or instrument sounds such as music, jingles
etc. Silence regions and pauses taken by speakers also form significant portions of most audio
databases. These sounds together form the non-speech category. Model based approaches are
popular in SAD where statistical models are trained for speech and nonspeech from external
data. The drawback of these systems, however, is their reliance on the acoustic conditions of
out-of-sample data. Hybrid systems [9, 12] make use of a classifier trained on external data to obtain
an initial bootstrap segmentation and then the models for speech and nonspeech are refined
iteratively over the audio to be segmented to adapt to the acoustic variations in nonspeech.
The bootstrap classification provides a subset of feature vectors of the recording being pro-
cessed that best represent the speech and nonspeech. Class models are initialized over these
token subsets on the feature space and frame-wise iterative classification is performed to refine
the classes. The bootstrap segmentation is generally performed on a smaller set of features which
are chosen based on heuristic information known about the speech and nonspeech classes.
4.2.1 Speech Activity Detection Algorithm
The speech activity detector in the proposed system is a model based classifier. It is independent
of external training data for modeling the nonspeech and speech classes. The approach to such
a model based speech activity detector is inspired by the SAD in the IIR-NTU submission to the
NIST RT2009 evaluations [13]. In our system speech activity detection is done in two decoupled
steps. First, silence is removed from the whole recording using an energy based bootstrapping
followed by iterative classification. In the second step, music and other audible nonspeech are
identified from the recording. For music removal the silence removed audio is fed to a music vs.
speech bootstrap discriminator. The frames of the audio which are music with a high confidence
are used to train a music model which is iteratively refined. In both steps, only segments with
duration 1s or longer have been labeled as nonspeech in order to avoid sporadic nonspeech
to speech transitions. This constraint are incorporated in [25] and [29] using a GMM-HMM
framework.
Silence Removal
The silence removal in the proposed system is done using 19 MFCC features concatenated with
short time energy (STE) and their first and second derivatives. A bootstrap segmentation assigns
a confidence value to every frame for both silence and speech classes. The bootstrap silence model
is trained using a Gaussian mixture of size 4 over the 60 dimensional feature space. A speech
model of the same size is also trained from high confidence speech frames.
In an iterative classification step, each frame is classified into one of two classes, viz. speech
and silence. The high confidence speech and silence frames from this classification are used to
train the speech and silence models for the next iteration. As the number of iterations increases,
the number of 60 dimensional Gaussians used to model the speech and silence GMMs is increased
up to a
Figure 4.2: Silence removal using energy based bootstrapping and iterative classification
maximum. The best results were obtained when the size of the Gaussian mixtures was limited
to 32 for speech and 16 for nonspeech. This removes silences and pauses, but high energy
nonspeech, also called audible nonspeech, such as jingles and music, is classified as speech,
since the MFCCs and frame energy of music resemble speech more than silence.
Music removal
In 2005, [28] introduced a model fitting based music vs. speech classifier that reported a
classification accuracy of 95%. The authors pre-segmented the audio recording into chunks of
1 s and extracted 50 feature vectors over 20 ms windows. These feature vectors were 2 dimensional:
(i) short time energy and (ii) zero crossing rate of the windowed signal. A histogram of short
time energy (STE) and zero crossing rate (ZCR) is computed for each 1 s chunk and compared
with the model histograms of speech and music, which were derived from a large database of
music and speech data. These ideal histograms were modeled with χ2 distributions. The chunk
is labeled music or speech after a comparison between its histogram and the χ2 models.
The music speech discriminator [28] fails when speech and music are present together. In
news broadcast archives, it is often the case that the most information dense parts of the archive
such as episode headlines have a characteristic background music that is specific to the show.
Taking this into consideration, porting the system as is would not just result in missed speech,
but would cause loss of highly informative speech data. Hence we use the output of the classifier
as a bootstrap segmentation. Initial estimate models for music and speech are trained from
high confidence frames of both classes. An iterative classification similar to the silence removal
system is done to refine the speech and music classes so as to discard music only segments. The
features used are 19 MFCCs concatenated with the zero crossing rate and their first and
second derivatives. The short time energy is not used during the iterative classification step.
It was observed that only after dropping the short time energy was the speech with background
music, which had been classified as music, recovered to the speech class.
4.2.2 Confidence measures for Speech Activity Detection
During the silence removal, a histogram of the energy of the frames is used to rank all frames
according to the energies. The frames with 20% lowest energies are called high confidence silence
frames, whereas the frames with the 10% highest energies are high confidence speech frames. Hence
in every iteration, only these frames are used for training the GMMs. For the music
removal, the aim is to recover the frames that contain speech with background music but
Figure 4.3: Music removal using a music-speech discriminator for bootstrapping
were classified as nonspeech. Hence only the 40% of frames with the highest zero crossing
rates in the ZCR histogram are taken as high confidence music frames to train the music model.
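The percentile-based selection of high-confidence frames can be sketched as a generic helper, with the 20%/10% energy cut-offs of the silence removal (or the 40% ZCR cut-off of the music removal) passed in as arguments:

```python
def confidence_frames(values, low_pct, high_pct):
    """Rank frames by a scalar feature (energy or ZCR) and return the
    indices of the bottom `low_pct` percent and the top `high_pct` percent
    as the high-confidence sets for the two bootstrap classes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_low = int(len(values) * low_pct / 100)
    n_high = int(len(values) * high_pct / 100)
    return order[:n_low], order[len(values) - n_high:]
```

For silence removal this is called with the frame energies and (20, 10); only the returned frames feed the GMM training in each iteration.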
4.3 Evaluation of Speech Activity Detection
The NDTV dataset was used as a development set to tune parameters for both silence and music
removal. These parameters were used to obtain results on the REPERE dataset as well.
In this section, results for 3 separate experiments are shown: the effect of the size of the
GMM in the iterative classification step of the SAD, the effect of the shape of the covariance
matrix of the GMM, and the effect of cascading the 2 systems. Errors have been obtained using
the Pyannote [5] library in Python. The missed speech rate (MSR) is the percentage of the
audio duration for which speech was misclassified as nonspeech, whereas the false alarm speech
rate (FASR) is the percentage of the audio duration misclassified as speech. The SAD error is
the sum of the MSR and FASR.
4.3.1 Evaluation on the NDTV dataset
Since the silence removal and music removal blocks are decoupled, the system parameters
are first chosen for the former, and its output is fed to the music removal. The evaluation
of the music removal is therefore done separately.
Size of GMM in Iterative Clustering during Silence Removal
After the bootstrap segmentation, the models trained from the limited speech and nonspeech
frames are GMMs of size 4. As the amount of data available to both classes increases, the
size of the GMMs used to model speech and silence is increased: in every iteration the size
is doubled until it reaches a maximum. In this experiment, we choose the GMM size best
suited for each class. Table 4.1 shows the % SAD error over 22 episodes of the NDTV dataset.
The best results were obtained with a 32-Gaussian GMM for speech and a 16-Gaussian GMM for
silence. For the (silence, speech) combination (32, 64), the iterations did not converge and
gave a high MSR in every iteration. A possible explanation is that the silence model becomes
a mixed model, with some Gaussian components representing speech frames and others
representing silence frames. For the combination (4, 64), the FASR was very high: of the two
competing models, the speech model always yielded a higher likelihood for most frames and
hence managed to capture silence through some of its Gaussians.
Silence \ Speech      8       16      32      64
4                   13.42    9.82    5.94     x
8                   10.62    5.79    5.54    5.89
16                   6.44    5.74    5.20    5.49
32                  13.50    6.52    5.96     x

Table 4.1: % SAD error for different sizes of GMM for the speech and silence models
Covariance matrix for GMM in iterative classification of silence and music
removal
In this experiment, the GMMs for speech, silence and music were modelled using 32, 16 and 16
Gaussians respectively in the final iterations of the iterative classification. Making the
covariance matrices full allows the GMM to be better trained for the speech class, which
results in a very low MSR. However, for the silence and music classes, where the number of
frames is limited, a full covariance GMM overfits, and hence the FASR goes up significantly.
                  Full    Diagonal
Silence removal   5.49    5.20
Music removal     8.13    7.69

Table 4.2: Shape of covariance matrix of GMM for SAD
Effect of cascading Silence removal and Music removal
Using the music removal system alone gives a high false alarm rate. The cascade of the 2
systems showed an improvement over using either music or silence removal alone. As expected,
after cascading, the MSR increased; however, the FASR reduced because frames containing only
music were classified as nonspeech.
                  MSR    FASR    Total SAD error
Silence removal   1.49   3.71    5.20
Music removal     1.59   6.11    7.69
Cascade           2.30   2.71    5.01

Table 4.3: Cascading silence removal and music removal for SAD
ZCR confidence score for training music model
Training the music model using only the high confidence music frames derived from ZCR
values, and discarding the low-ZCR frames, resulted in a lower missed speech rate. This
indicates the recovery of speech with music background to the speech class. The increase in
FASR is due to speech extracted from jingles.
                        MSR    FASR    Total SAD error
High ZCR music frames   2.30   2.71    5.01
All music frames        3.08   2.32    5.41

Table 4.4: Refining the music model using only high ZCR confidence frames
4.3.2 REPERE dataset results
SAD was evaluated on the January 2013 dev0 dataset, one of the 6 subsets of the REPERE
corpus. For the dev0 set, which consists of 3 hours of annotations, the model-based GMM-HMM
segmenter of the LIUM [25] toolkit gave 1.1% FASR and 3.83% MSR. The proposed MATLAB system
performs at 2.2% FASR and 3.2% MSR, a total error of 5.41% as against the 4.93% of the LIUM
toolkit.
The table below shows the SAD results for the complete REPERE corpus which accounts
for 51 hours of audio annotation with 2.5 hours of nonspeech.
                  MSR    FASR    Total SAD error
Silence removal   1.43   3.27    4.70
Cascade           1.45   3.01    4.46

Table 4.5: Results for REPERE with 60 hours annotated
4.4 Speaker Segmentation
The speaker segmentation algorithm used in the proposed system is a growing window search
[21] using the ∆BIC distance, as shown in Figure 3.4. Starting from the beginning of the
audio, a search is done for a single speaker change; at every change found, the search is
restarted from the next frame. The search window is initialized to 5 s and a ∆BIC value is
computed for each frame in the window. If the maximum of this array exceeds a threshold θ,
then a
change is declared at the point of the maximum. If no such maximum is located in the window,
the size of the window is increased by 2 s and the procedure is repeated until a change is
detected.
However, only speech frames are processed after discarding nonspeech indicated by the speech
activity detector. After finding the change points in the speech frames, their corresponding
locations in the original audio are found and declared as change points.
In two previous broadcast diarization toolkits, [25] and [29], the segmentation is carried
out in 2 steps: first, the ∆BIC based change detection described above is performed with the
threshold set to 0, and then consecutive segments for which the ∆BIC score is positive are
merged. The two steps are needed because the zero-threshold ∆BIC segmentation oversegments
the audio. To avoid the two-step process, only the maxima greater than a threshold θ were
chosen here, which significantly reduced the oversegmentation.
∆BIC(x_i) = N log|Σ| − N_1 log|Σ_1| − N_2 log|Σ_2| − (λ/2) (d + d(d + 1)/2) log N    (4.1)
For the speaker segmentation, 19 MFCCs with their short time energies have been used. The
segmentation algorithm varies as O(d^6), where d is the number of feature dimensions. Hence
most speaker segmentation systems [21, 25, 29] do not make use of the derivatives of cepstral
features while performing segmentation.
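The growing-window ∆BIC search can be sketched as follows, in Python/NumPy for illustration (the thesis system is in MATLAB). Full-covariance Gaussians are used as in equation 4.1, window and step sizes are given in frames rather than seconds, and the margin parameter is an added detail, not from the text, that keeps both halves of a candidate split large enough to estimate:

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    """Eq. 4.1: Delta-BIC score for splitting the frames X (N x d) at row i,
    using full-covariance Gaussian models for the window and its two halves."""
    N, d = X.shape
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
    return N * logdet(X) - i * logdet(X[:i]) - (N - i) * logdet(X[i:]) - penalty

def find_changes(X, theta, init=50, grow=20, margin=10, lam=1.0):
    """Growing-window search for speaker changes: look for one change at a
    time and restart from the frame after each detected change. init and
    grow play the role of the 5 s initial window and 2 s growth step."""
    changes, start, win = [], 0, init
    while start + 2 * margin + 1 < len(X):
        end = min(start + win, len(X))
        W = X[start:end]
        scores = [delta_bic(W, i, lam) for i in range(margin, len(W) - margin)]
        k = int(np.argmax(scores))
        if scores[k] > theta:                 # change declared at the maximum
            changes.append(start + margin + k)
            start, win = changes[-1] + 1, init
        elif end == len(X):                   # window cannot grow further
            break
        else:
            win += grow                       # grow the window and retry
    return changes
```

On synthetic two-speaker data the search locates the change near the true boundary; on real audio, only speech frames would be passed in and the detected indices mapped back to the original timeline, as described above.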
4.5 Choice of Segmentation parameters
The segmentation parameters were tuned on the NDTV dataset so as to minimize the diarization
error rate. The parameters tuned were θ and λ in equation 4.1, and their effect on the DER
was studied in two experiments. First, the DER was calculated in an ORACLE experiment [29],
in which the system was given the ground-truth annotation for speech activity detection and
clustering, and only the speaker segmentation output of the system was used. These
experiments were inspired by those performed for the SHoUT toolkit. This tests the system for
missed segments only, since the small segments caused by false alarm speaker changes get
labeled correctly in the ORACLE.
θ \ λ      1       10
0         0.89    1.24
1000      1.40    2.37
2000      2.55    3.75

Table 4.6: DER with ORACLE experiment
The effect of false alarm speaker changes is to increase the number of segments and hence
reduce their size. To test the effect of false alarm segments, the best clustering algorithm
(ILP clustering with i-vectors) was used and the DER was calculated.
It was observed that low values of θ and λ result in oversegmentation. As θ is increased,
the average duration of segments increases. This enables better speaker modeling for the
segments and results in a lower DER when combined with the best clustering algorithm.
θ \ λ      1       10
0        31.52   33.41
1000     23.67   16.54
2000     12.35   16.59

Table 4.7: DER with best clustering algorithm
Hence the combination of θ = 2000 and λ = 1 is the default in the proposed system.
4.6 Speaker Clustering
After the speaker changes have been detected by the speaker segmentation, the speaker
clustering subsystem aims to group together segments from the same speaker. For this, each
segment is represented with a speaker model; pair-wise similarity is computed between all the
speaker models, and a clustering algorithm performs the grouping of segments.
4.6.1 Choice of speaker model
The proposed system implements two speaker models that have been widely studied in speaker
verification and speaker recognition tasks: (i) Gaussian mixture models and (ii) i-vector
models. The GMM (equation 3.1) is a probabilistic model on the feature space. The features
used here are the short time energy concatenated with 19 MFCC features and their first and
second derivatives, giving a 60 dimensional feature space. The similarity between GMMs is
based on the cross likelihood of the model of one segment fitting the data of the other. The
GMM for a segment is trained on its feature vectors using the Expectation-Maximization
algorithm to obtain a diagonal covariance GMM of size 32. While evaluating the system with
GMM speaker models, the CLR and NCLR distances (equations 3.2 and 3.3) have been tested along
with the HAC and ILP clustering algorithms.
i-vectors are vectors in R^n, where n is of the order of 100. The similarity measures are the
ones used with Euclidean vectors, viz. the cosine distance, the Mahalanobis distance, etc.
The i-vector is derived from the GMM supervector¹ of the segment after a dimensionality
reduction using factor analysis [36]. The components of the i-vector are the speaker factors
of the eigenvoice vectors of the Total Variability space. The i-vector extraction process is
explained below.
i-vector extraction
To obtain the i-vectors, first a speech Universal Background Model (UBM) is trained on
training data. The UBM is a GMM with a large number of Gaussians, so that it captures all
possible variabilities of speech in the feature space. In the proposed system the TIMIT and
TIFR datasets have been used for UBM training. The TIMIT set consists of 168 speakers
uttering 10 English sentences each, while the TIFR set consists of 100 speakers uttering 10
Hindi sentences each, both from native speakers of the respective languages. The UBM is a
diagonal covariance GMM of size 512, and its training is a one time computation. The UBM is
mean-adapted to the feature vectors of the concerned segment to obtain a GMM for the segment.
The means of the adapted segment GMM are concatenated to get a 30720 dimensional supervector
(60 × 512).
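The mean adaptation step can be sketched as follows. This is a means-only relevance-MAP sketch in Python/NumPy; the frame posteriors are assumed to be precomputed from the UBM, and the relevance factor r = 16 is a conventional default, not a value stated in the text:

```python
import numpy as np

def map_adapt_means(ubm_means, frames, resp, r=16.0):
    """Relevance-MAP adaptation of the UBM means (means only): the step that
    yields the segment GMM whose stacked means form the supervector.
    resp[t, k] is the posterior of UBM Gaussian k for frame t."""
    n_k = resp.sum(axis=0)                                    # soft counts
    Ex = (resp.T @ frames) / np.maximum(n_k, 1e-10)[:, None]  # 1st-order stats
    alpha = (n_k / (n_k + r))[:, None]                        # adaptation weights
    return alpha * Ex + (1 - alpha) * ubm_means

# The segment supervector is the concatenation of the adapted means:
# supervector = map_adapt_means(ubm_means, frames, resp).flatten()
```

Gaussians that receive few frames keep their UBM means, while well-populated Gaussians move toward the segment's data, which is what makes the supervector segment specific.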
The Total Variability space is a subspace of the GMM superspace, that captures all the
speaker and channel related information. T is the low rank matrix whose columns span the
Total variability subspace. For the proposed system, the matrix T is trained using the same
speaker labeled dataset used for UBM training. The T matrix training is also a one time
computation. The i-vector of the segment is the projection of the GMM supervectors onto the
Total Variability subspace.
m = M + Tx (4.2)
where M is the UBM supervector, m is the mean-adapted GMM supervector of the segment, and x
is the i-vector. Thus, for every segment, extraction of the i-vector x involves 2 steps:
adapting the UBM to obtain the segment's GMM supervector, and extracting the factors of the
total variability eigenvectors to get x. The algorithm for training the T matrix from
speaker-labeled training data is detailed in [36]. The proposed system uses the MSR Identity
Toolbox [37] for UBM training, training of the TV subspace and i-vector extraction.
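Equation 4.2 can be illustrated with a toy noiseless example. A plain least-squares projection is used here in place of the posterior computation of [36], and the dimensions are scaled far down from the thesis's 30720-dimensional supervectors and ~100-dimensional i-vectors, so this only illustrates the geometry of the extraction:

```python
import numpy as np

def extract_ivector(m, M, T):
    """Toy illustration of eq. 4.2, m = M + Tx: recover the speaker factors
    x of a segment by least-squares projection of (m - M) onto the TV
    subspace spanned by the columns of T. The real extractor computes the
    posterior of x from Baum-Welch statistics against the UBM [36]."""
    return np.linalg.lstsq(T, m - M, rcond=None)[0]
```

In the noiseless case the projection recovers the factors exactly, since m − M lies in the column space of T by construction.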
While evaluating the system using i-vectors, the dimension of the TV subspace and the
choice of distance between i-vectors have been examined. Two distance metrics have been
¹A supervector is the vector obtained by concatenating all the mean vectors of a GMM.
Figure 4.4: Extraction of i-vectors
tested for measuring similarity between i-vectors: the cosine similarity metric (eq. 4.3) and
the Mahalanobis distance metric (eq. 4.4), where W is the within-class covariance matrix
determined from the n training i-vectors of S speakers, as detailed in equation 4.5. The
Mahalanobis distance is hence also called within class covariance normalization (WCCN). In
equation 4.5, w̄_s is the mean of the n_s i-vectors of speaker s.
D(x, y) = 1 − xᵀy / (‖x‖ · ‖y‖)    (4.3)

D(x, y) = (x − y)ᵀ W⁻¹ (x − y)    (4.4)

W = (1/n) Σ_{s=1..S} Σ_{i=1..n_s} (w_i^s − w̄_s)(w_i^s − w̄_s)ᵀ    (4.5)
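Equations 4.3 to 4.5 translate directly into code; the following Python/NumPy sketch (in place of the thesis's MATLAB) computes both distances and the WCCN matrix from labeled training i-vectors:

```python
import numpy as np

def cosine_distance(x, y):
    """Eq. 4.3: one minus the cosine similarity of two i-vectors."""
    return 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def wccn_matrix(ivecs, labels):
    """Eq. 4.5: within-class covariance W from n training i-vectors
    (rows of ivecs) with integer speaker labels."""
    W = np.zeros((ivecs.shape[1], ivecs.shape[1]))
    for s in np.unique(labels):
        c = ivecs[labels == s] - ivecs[labels == s].mean(axis=0)
        W += c.T @ c                      # scatter about the speaker mean
    return W / len(ivecs)

def mahalanobis_distance(x, y, W):
    """Eq. 4.4: Mahalanobis (WCCN) distance between two i-vectors."""
    d = x - y
    return d @ np.linalg.inv(W) @ d
```

Since W is estimated once from the training i-vectors, its inverse can be cached and every pairwise distance then costs only a couple of matrix-vector products.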
4.6.2 Choice of clustering algorithm
The proposed system is equipped with two clustering algorithms – the traditional hierarchical
agglomerative clustering algorithm and a graphical clustering algorithm called the ILP algorithm
that was recently introduced to speaker diarization in 2013 by Rouvier et al. [43].
HAC
In the HAC system, in every iteration the two most similar speaker models are chosen and
merged. In the merging step, a new speaker model is estimated using the data from all
segments of both speaker models being merged, and the similarity matrix is updated with the
similarities of the new speaker model to the other models. In every iteration the number of
clusters reduces by one, and the process continues until only one cluster remains. The
optimum set of clusters is chosen from among the outputs of the iterations based on an
optimality criterion. The HAC can be implemented using either speaker model. In each
iteration of the HAC, the merging requires
extra computations, since the model needs to be retrained and the distance matrix needs to be
updated with entries corresponding to the merged speaker model.

Figure 4.5: ILP clustering on a complete graph of speaker models [44].
Two optimal-cluster criteria have been implemented for the HAC. With the distance threshold
criterion, the iteration whose minimum pairwise similarity is greater than a threshold is
declared the optimal set of clusters (equation 3.5). With the Ts optimal cluster criterion
introduced by Nguyen [38], the set of clusters with minimum intra-cluster similarity and
maximum inter-cluster similarity is declared the optimal set of clusters, using equation 3.6.
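The HAC loop with the distance threshold stopping criterion can be sketched as follows. For simplicity this Python/NumPy sketch compares merged clusters by average linkage over the original distance matrix, instead of retraining a speaker model at each merge as the actual system does:

```python
import numpy as np

def hac(D, threshold):
    """Hierarchical agglomerative clustering on a pairwise distance matrix D.
    Simplification: merged clusters are compared by the average pairwise
    distance of their members rather than by a retrained speaker model.
    Stops when the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([D[i, j] for i in clusters[a] for j in clusters[b]])
                if d < best:
                    best, pair = d, (a, b)
        if best > threshold:            # distance threshold stopping criterion
            break
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge; one fewer cluster per iteration
    return clusters
```

Each pass over the loop performs one merge, mirroring the one-cluster-per-iteration behaviour described above; the retraining cost of the real system is exactly what this average-linkage shortcut hides.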
ILP clustering
In the ILP clustering, the k-centres problem is modified to obtain a set of clusters. The
original k-centres problem is to identify K cities out of N for building warehouses, such
that the longest distance between a city and its nearest warehouse is minimized. In the ILP
formulation here, the N segments play the role of the N cities, and K of those segments are
to be chosen as the best representatives of the K speakers. In diarization, however, the
number of speakers K is unknown; hence the modification appears in the objective of
optimization problem 4.6.
Consider the set of binary decision variables X_ij:
X_ii = 1 indicates that cluster i is a leader cluster.
X_ij = 1 indicates that segment j is assigned to leader cluster i (and hence X_ii = 1 is
necessary).
Note that X_ji = 1 and X_ij = 1 have different meanings, although both indicate that the
i-vectors of
segments i and j belong to the same cluster. Now, consider the optimization problem 4.6
min  Σ_{i=1..N} X_ii + (1/δ) Σ_{i=1..N} Σ_{j=1..N} d_ij X_ij

s.t. Σ_i X_ij = 1        ∀j
     X_ij ≤ X_ii         ∀i, j
     d_ij X_ij ≤ δ       ∀i, j
     X_ij ∈ {0, 1}       ∀i, j        (4.6)
The objective function to be minimized consists of 2 terms: the first is the number of leader
clusters (the number of speakers), and the second is the total dispersion over all K
clusters. The first constraint ensures that every segment is assigned to exactly 1 cluster.
The second constraint ensures that segments are assigned only to leader clusters. The third
constraint prevents assigning a segment to a leader farther than the threshold δ.
Note that the ILP clustering algorithm does not require any information about the objects
being clustered and depends only on the similarity matrix. The integer program needs to be
converted to a 1-D ILP so that the intlinprog solver in MATLAB can generate a set of
clusters.
The only disadvantage of the ILP clustering algorithm is that the speaker models are not
refined iteratively as in HAC; hence, if the segments are small, the i-vectors chosen as
leader i-vectors may not represent the speaker information completely. This clustering
algorithm can be used with either speaker model, GMMs or i-vectors, since only the similarity
matrix is needed to obtain the optimal set of clusters.
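Problem 4.6 can be handed to any MILP solver once X is flattened into N² binary variables. The sketch below uses SciPy's milp (available from SciPy 1.9 onward) in place of MATLAB's intlinprog, and enforces the d_ij X_ij ≤ δ constraint through variable bounds rather than explicit rows:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def ilp_cluster(d, delta):
    """Solve problem 4.6 on a pairwise distance matrix d. X[i, j] = 1
    assigns segment j to leader cluster i; X[i, i] = 1 marks i as a
    leader. Returns the leader index of each segment."""
    N = d.shape[0]
    idx = lambda i, j: i * N + j               # flatten X to N*N variables

    # Objective: number of leaders + (1/delta) * total dispersion.
    c = (d / delta).flatten()
    for i in range(N):
        c[idx(i, i)] += 1.0

    rows, lb, ub = [], [], []
    for j in range(N):                         # sum_i X[i,j] = 1 for all j
        row = np.zeros(N * N)
        row[[idx(i, j) for i in range(N)]] = 1.0
        rows.append(row); lb.append(1.0); ub.append(1.0)
    for i in range(N):                         # X[i,j] <= X[i,i]: leaders only
        for j in range(N):
            if i != j:
                row = np.zeros(N * N)
                row[idx(i, j)], row[idx(i, i)] = 1.0, -1.0
                rows.append(row); lb.append(-np.inf); ub.append(0.0)

    # d_ij X_ij <= delta: forbid too-distant assignments via variable bounds.
    xub = np.ones(N * N)
    for i in range(N):
        for j in range(N):
            if i != j and d[i, j] > delta:
                xub[idx(i, j)] = 0.0

    res = milp(c, constraints=LinearConstraint(np.array(rows), lb, ub),
               integrality=np.ones(N * N), bounds=Bounds(0.0, xub))
    X = np.round(res.x).reshape(N, N)
    return [int(np.argmax(X[:, j])) for j in range(N)]
```

On a toy distance matrix with two well-separated groups, the solver picks one leader per group and assigns each segment to its group's leader, as the objective's trade-off between leader count and dispersion predicts.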
4.7 Evaluation of Speaker Clustering
This section describes the clustering experiments performed using the best outputs from the
previous SAD and segmentation stages. Sections 4.7.1 and 4.7.2 present the experiments on
traditional hierarchical clustering using GMM speaker models and i-vector speaker models
respectively; Sections 4.7.3 and 4.7.4 present the experiments using the ILP clustering
algorithm on the GMM speaker models and the i-vector speaker models, in that order.
The experiments presented below were performed on the NDTV dataset. The DER values presented
are overall diarization error rates, i.e. averages of the per-episode DERs weighted by the
duration of each episode.
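The weighted averaging can be written as a one-line computation; a Python/NumPy sketch:

```python
import numpy as np

def overall_der(ders, durations):
    """Overall DER: per-episode DERs weighted by episode duration."""
    ders, durations = np.asarray(ders), np.asarray(durations)
    return float((ders * durations).sum() / durations.sum())
```

This weighting ensures that a long episode with a high DER is not masked by many short, easy episodes.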
Figure 4.6: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.1 HAC experiments with GMM speaker models
Hierarchical clustering is performed on a similarity matrix between clusters, and requires a
stopping criterion that decides the optimal set of clusters. Two stopping criteria have been
implemented: (i) the distance threshold criterion and (ii) the Ts optimal stopping criterion.
Distance threshold criterion
During HAC, as the iteration number increases, underclustering decreases until the
minimum-DER clustering is reached; in the iterations following the optimal set of clusters,
there is overclustering. This is demonstrated in the graphs below.
Ts optimality criterion
Using the optimality criterion of Nguyen [38] given by equation 3.6, the clustering whose
histograms of inter-cluster and intra-cluster distances are farthest apart was chosen. Using
the NCLR distance a DER of 22.15% was attained, whereas using the CLR resulted in a DER of
19.83%.
4.7.2 HAC with i-vector speaker models
HAC was also performed with i-vector speaker models: new i-vectors were extracted for every
segment obtained in the cluster merging step. The best result was 16.69% DER, for a 75
dimensional TV space with the Mahalanobis distance.
Figure 4.7: DER on NDTV dataset: HAC with distance threshold for GMM speaker models
4.7.3 ILP based experiments with GMM speaker models
ILP clustering was performed using the CLR and NCLR distances to construct the distance
matrices. The best result obtained with the CLR distance was 19.03%, whereas with the NCLR
distance it was 17.27%. The x-axis denotes the threshold δ present in the constraints of the
ILP optimization problem. The better performance of ILP compared to the optimality criterion
of equation 3.6 is in concurrence with [50]. The NCLR is a better representation of the
distance than the CLR; however, it is not suitable for use in HAC since, as the size of the
merged cluster increases, the segment size plays a role in decreasing its NCLR distance from
other segments (equations 3.2 and 3.3).
The Integer Linear Programming formulation, on the other hand, offers a holistic trajectory
to reach the optimum clustering. To verify this, the ILP formulation was implemented for the
CLR and NCLR similarity matrices generated using the GMM speaker models, and it gives an 11%
relative improvement in the error compared to the best error from the GMM-HAC clustering
algorithm. In the literature, the ILP had only been tried using i-vectors.
Figure 4.8: DER on NDTV dataset: ILP with distance threshold for GMM speaker models
4.7.4 ILP clustering with i-vector speaker models
The ILP clustering was implemented with i-vectors trained on the TIMIT+TIFR dataset. The
following experiments indicate the best dimension for the Total Variability subspace and the
best choice of distance.
The Mahalanobis distance offers a background compensation method that enhances the similarity
between segments from the same speaker but with different backgrounds.
GMM v/s i-vector
The GMM based speaker model gives a very high dimensional representation of the segment and
hence also captures background information. Similarity in background could lead to similarity
between segments of different speakers, so background compensation schemes need to be
employed on the feature space. The i-vectors, on the other hand, allow background
compensation through WCCN. Another issue with GMM speaker models is their high computation
time for segment similarity, due to the cross likelihood terms in equations 3.2 and 3.3.
HAC v/s ILP
The HAC, though a greedy clustering algorithm, works as a good approximation. However, if an
erroneous merge occurs during a step of the clustering, it significantly affects the
performance of the later steps. Since a re-estimation of the cluster needs to be done at each
step, HAC is more expensive than ILP. The ILP, in contrast, does a more thorough search,
exploring all Σ_{K=1..N} C(N, K) = 2^N − 1 possible cluster combinations.
Figure 4.9: Performance of ILP clustering with i-vector speaker models with varying dimensions of the Total Variability subspace. Red plot is for the Mahalanobis similarity; blue plot is for the cosine similarity.
Table 4.8: Best results from the 2 speaker models and 2 clustering algorithms

            HAC     ILP
GMM        19.45   17.27
i-vector   17.11   16.18
4.7.5 Results on REPERE corpus
Previously reported results for the dev0 subset of REPERE show a 17.19% DER with GMM speaker
models and a 15.46% DER with i-vector speaker models. For the dev0 subset, we achieved a
23.19% DER with HAC-GMM clustering and a 21.02% DER with ILP-i-vector clustering. The poorer
performance compared to the previously attained results could be due to our smaller UBM (512
Gaussians, against the 2048 used by LIUM [25]).
The overall DER for the 60 hour REPERE corpus is best for the ILP-i-vector clustering
combination, at 24.4%.
Summary
In this chapter the proposed system and its components were described. The system is equipped
with state of the art clustering algorithms and speaker models. It is built using MFCCs as
the primary feature vectors in every component, although other feature vectors may be
explored. A completely unsupervised speech activity detection algorithm has been implemented
that can be ported to other speech processing tasks; it uses an existing music vs speech
discriminator for building the nonspeech models from the recording itself.
Chapter 5
Conclusion and Future work
5.1 Conclusion
The aim of this thesis was to study the state-of-the-art techniques in speaker diarization,
with specific application to broadcast news audio recordings, and to develop a MATLAB based
system for the same. The proposed system has been evaluated using the diarization error rate
metric (detailed in Chapter 2) and incorporates new additions in unsupervised speech activity
detection. The system has 3 main components, viz. a speech activity detector, a ∆BIC based
speaker change detector and a state-of-the-art speaker clustering block. The system has been
evaluated on two news databases, the NDTV dataset and the REPERE dataset.
The general purpose speech activity detector is capable of removing silences as well as audible
nonspeech such as music from a recording. The speaker clustering block allows for state-of-the-
art speaker models for representing segments with i-vectors, which can facilitate further work in
fast cross-show diarization.
Experiments were performed on two broadcast news corpora – Indian news dataset from
NDTV and the French REPERE corpus. The NDTV corpus is a 4h15m dataset from one news
show. This dataset was manually annotated for the diarization experiments. The REPERE
dataset of 60h04m was obtained from the French ELDA.
The system is capable of performing speech activity detection without dependence on ex-
ternal training data for nonspeech and speech models. Frame energy and zero crossing rate
have been used as bootstrapping features to construct silence and music models from the audio
recording being processed. A competitive speech activity detection has been achieved with a
two-stage SAD system – a silence detection, followed by a music detection. The results are
comparable to a state-of-the-art GMM-HMM based speech activity detector which uses external
training data from a large dataset for creating nonspeech models.
The i-vector speaker models, which are now state-of-the-art in speaker verification, provide
a low dimensional representation of the speaker information compared to traditional GMM
speaker models. They also offer a computational advantage since distance computation between
i-vectors is much faster compared to cross-likelihood based similarity computation on GMM
speaker models. Hence for real-time diarization systems, i-vectors seem more appealing.
It has been verified in this thesis, as indicated in [43], that speaker clustering is better
achieved using a global optimization approach to reach the optimum set of speaker clusters
than with the traditional greedy optimization approach of the hierarchical agglomerative
clustering (HAC) algorithm. HAC is computationally very expensive, and an erroneous merge
step during the clustering significantly affects the later iterations, i.e., the error gets
propagated.
linear programming (ILP) clustering formulation on the other hand offers a holistic trajectory
to reach the optimum clustering. It is a graphical approach to clustering adapted from the
prevalent k-centres problem in combinatorial optimization. To verify the better performance of
ILP compared to HAC, the ILP formulation was implemented for the CLR and NCLR similarity
matrix generated using the GMM speaker models and it gives an 11% relative improvement in
the error compared to the best error from the GMM-HAC clustering algorithm. In literature
the ILP had only been tried using i-vectors.
5.2 Future Work
Future work on the system development should focus on the following aspects of speaker diariza-
tion:
Refinement of the diarization output by passing it through a Viterbi decoder should be
attempted.
Cross-show diarization is the task of performing speaker clustering across different
recordings to identify segments of the same speakers in different shows. Much of the current
momentum of diarization research is directed at solving this problem for large databases;
cross-show diarization should be attempted using the proposed MATLAB system.
Improvements to the ILP have yielded substantially faster implementations by reducing the
redundancies in the original ILP, although MATLAB does not support solving these optimization
problems. Solvers such as GUROBI provide support for solving advanced integer linear
programs.
It was observed that, during speaker clustering, segments with background music were unable
to show similarity with segments having a clean background when using the MFCC-GMM speaker
models, owing to the low SNR. Even after using background variability compensation
techniques on the i-vector speaker models, the problem persists. Speech enhancement and
singing voice separation prior to parameterising the audio recording should be attempted, so
that music in the background of a speaker is suppressed.
Bibliography
[1] Inside the secret technology that makes ‘the daily show’ and ‘last week tonight’ work,
http://splitsider.com/2015/03/inside-the-secret-technology-that-makes-the-daily-show-
and-last-night-tonight-work/.
[2] Gerald Friedland, Luke Gottlieb, and Adam Janin. Joke-o-mat: browsing sitcoms punchline
by punchline. In Proceedings of the 17th ACM international conference on Multimedia, pages
1115–1116. ACM, 2009.
[3] Sue E Tranter, Douglas Reynolds, et al. An overview of automatic speaker diarization
systems. Audio, Speech, and Language Processing, IEEE Transactions on, 14(5):1557–1565,
2006.
[4] Xavier Anguera Miro. Robust speaker diarization for meetings. Universitat Politecnica de
Catalunya, 2007.
[5] Pyannote - collaborative annotation of audio-visual documents, http://pyannote.github.io/.
[6] Juliette Kahn, Olivier Galibert, Ludovic Quintard, Matthieu Carre, Aude Giraudel, and
Philippe Joly. A presentation of the repere challenge. In Content-Based Multimedia Indexing
(CBMI), 2012 10th International Workshop on, pages 1–6. IEEE, 2012.
[7] Nist: The nist rich transcription 2009 (rt’09) evaluation.
[8] Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Fried-
land, and Oriol Vinyals. Speaker diarization: A review of recent research. Audio, Speech,
and Language Processing, IEEE Transactions on, 20(2):356–370, 2012.
[9] Gerald Friedland, Adam Janin, David Imseng, Xavier Anguera Miro, Luke Gottlieb, Marijn
Huijbregts, Mary Tai Knox, and Oriol Vinyals. The icsi rt-09 speaker diarization system.
Audio, Speech, and Language Processing, IEEE Transactions on, 20(2):371–381, 2012.
[10] Martin Zelenak, Henrik Schulz, Francisco Javier Hernando Pericas, et al. Albayzin 2010
evaluation campaign: speaker diarization. 2010.
[11] Sylvain Meignier and Teva Merlin. Lium spkdiarization: an open source toolkit for diariza-
tion. In CMU SPUD Workshop, volume 2010, 2010.
[12] Simon Bozonnet, Nicholas WD Evans, and Corinne Fredouille. The lia-eurecom rt’09
speaker diarization system: enhancements in speaker modelling and cluster purification.
In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference
on, pages 4958–4961. IEEE, 2010.
[13] T Nguyen, H Sun, S Zhao, SZK Khine, HD Tran, TLN Ma, B Ma, ES Chng, and H Li.
The iir-ntu speaker diarization systems for rt 2009. In RT’09, NIST Rich Transcription
Workshop, May 28-29, 2009, Melbourne, Florida, USA, volume 14, pages 17–40, 2009.
[14] Arlindo Veiga, Carla Lopes, and Fernando Perdigao. Speaker diarization using gaussian
mixture turns and segment matching. Proc. FALA, 2010.
[15] Hari Krishna Maganti, Petr Motlicek, and Daniel Gatica-Perez. Unsupervised speech/non-
speech detection for automatic speech recognition in meeting rooms. In Acoustics, Speech
and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4,
pages IV–1037. IEEE, 2007.
[16] Wai Nang Chan, Tan Lee, Nengheng Zheng, and Hua Ouyang. Use of vocal source features
in speaker segmentation. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006
Proceedings. 2006 IEEE International Conference on, volume 1, pages I–I. IEEE, 2006.
[17] Sree Harsha Yella, Andreas Stolcke, and Malcolm Slaney. Artificial neural network features
for speaker diarization. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages
402–406. IEEE, 2014.
[18] Neville Ryant, Mark Liberman, and Jiahong Yuan. Speech activity detection on youtube
using deep neural networks. In INTERSPEECH, pages 728–731, 2013.
[19] Xavier Anguera and Jean-Francois Bonastre. Fast speaker diarization based on binary keys.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pages 4428–4431. IEEE, 2011.
[20] Douglas A Reynolds and P Torres-Carrasquillo. The mit lincoln laboratory rt-04f diarization
systems: Applications to broadcast audio and telephone conversations. Technical report,
DTIC Document, 2004.
[21] Scott Chen and Ponani Gopalakrishnan. Speaker, environment and channel change detec-
tion and clustering via the bayesian information criterion. In Proc. DARPA Broadcast News
Transcription and Understanding Workshop, volume 8. Virginia, USA, 1998.
[22] Xavier Anguera and J Hernando. XBIC: Real-time cross probabilities measure for speaker
segmentation. Univ. California Berkeley, ICSI Berkeley Tech. Rep., 2005.
[23] Shih-Sian Cheng, Hsin-Min Wang, and Hsin-Chia Fu. BIC-based speaker segmentation using
divide-and-conquer strategies with application to speaker diarization. Audio, Speech, and
Language Processing, IEEE Transactions on, 18(1):141–157, 2010.
[24] Matthew A Siegler, Uday Jain, Bhiksha Raj, and Richard M Stern. Automatic segmentation,
classification and clustering of broadcast news audio. In Proc. DARPA speech
recognition workshop, volume 1997, 1997.
[25] Mickael Rouvier, Gregor Dupuy, Paul Gay, Elie Khoury, Teva Merlin, and Sylvain Meignier.
An open-source state-of-the-art toolbox for broadcast news diarization. Technical report,
Idiap, 2013.
[26] Herve Bredin and Johann Poignant. Integer linear programming for speaker diarization
and cross-modal identification in TV broadcast. In the 14th Annual Conference of the
International Speech Communication Association, INTERSPEECH, 2013.
[27] Daniel Moraru, Sylvain Meignier, Corinne Fredouille, Laurent Besacier, and Jean-Francois
Bonastre. The ELISA consortium approaches in broadcast news speaker segmentation during
the NIST 2003 rich transcription evaluation. In Acoustics, Speech, and Signal Processing,
2004. Proceedings. (ICASSP'04). IEEE International Conference on, volume 1, pages I–373.
IEEE, 2004.
[28] Costas Panagiotakis and George Tziritas. A speech/music discriminator based on RMS and
zero-crossings. Multimedia, IEEE Transactions on, 7(1):155–166, 2005.
[29] Marijn Anthonius Henricus Huijbregts. Segmentation, diarization and speech transcription:
surprise data unraveled. 2008.
[30] Jordi Luque, Xavier Anguera, Andrey Temko, and Javier Hernando. Speaker diarization
for conference room: The UPC RT07s evaluation system. In Multimodal Technologies for
Perception of Humans, pages 543–553. Springer, 2008.
[31] Olivier Galibert and Juliette Kahn. The first official REPERE evaluation. In
SLAM@INTERSPEECH, pages 43–48, 2013.
[32] Xuan Zhu, Claude Barras, Sylvain Meignier, and Jean-Luc Gauvain. Combining speaker
identification and BIC for speaker diarization. In INTERSPEECH, volume 5, pages 2441–
2444, 2005.
[33] Benoit Favre, Geraldine Damnati, Frederic Bechet, Meriem Bendris, Delphine Charlet,
Remi Auguste, Stephane Ayache, Benjamin Bigot, Alexandre Delteil, Richard Dufour, et al.
Percoli: A person identification system for the 2013 REPERE challenge. In
SLAM@INTERSPEECH, pages 55–60, 2013.
[34] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker verification using
adapted Gaussian mixture models. Digital Signal Processing, 10(1):19–41, 2000.
[35] Jesper Hojvang Jensen, Daniel PW Ellis, Mads G Christensen, and Soren Holdt Jensen.
Evaluation of distance measures between Gaussian mixture models of MFCCs. In ISMIR 2007:
Proceedings of the 8th International Conference on Music Information Retrieval: September
23-27, 2007, Vienna, Austria, pages 107–108. Austrian Computer Society, 2007.
[36] Najim Dehak, Patrick Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end
factor analysis for speaker verification. Audio, Speech, and Language Processing, IEEE
Transactions on, 19(4):788–798, 2011.
[37] Seyed Omid Sadjadi, Malcolm Slaney, and Larry Heck. MSR Identity Toolbox v1.0: A
MATLAB toolbox for speaker recognition research. Speech and Language Processing Technical
Committee Newsletter, 2013.
[38] Trung Hieu Nguyen, Eng Siong Chng, and Haizhou Li. T-test distance and clustering
criterion for speaker diarization. In INTERSPEECH, 2008.
[39] Deepu Vijayasenan, Fabio Valente, and Herve Bourlard. Agglomerative information bottleneck
for speaker diarization of meetings data. In Automatic Speech Recognition &
Understanding, 2007. ASRU. IEEE Workshop on, pages 250–255. IEEE, 2007.
[40] Stephen Shum, Najim Dehak, Ekapol Chuangsuwanich, Douglas A Reynolds, and James R
Glass. Exploiting intra-conversation variability for speaker diarization. In INTERSPEECH,
pages 945–948, 2011.
[41] Patrick Kenny, Douglas Reynolds, and Fabio Castaldo. Diarization of telephone conversations
using factor analysis. IEEE Journal of Selected Topics in Signal Processing, 4(6):
1059–1070, 2010.
[42] Jan Silovsky and Jan Prazak. Speaker diarization of broadcast streams using two-stage
clustering based on i-vectors and cosine distance scoring. In ICASSP, pages 4193–4196.
IEEE, 2012.
[43] Mickael Rouvier and Sylvain Meignier. A global optimization framework for speaker di-
arization. In Odyssey Workshop, Singapore, 2012.
[44] Gregor Dupuy, Sylvain Meignier, Paul Deleglise, and Yannick Esteve. Recent improvements
on ILP-based clustering for broadcast news speaker diarization. In Proceedings of Odyssey,
2014.
[45] John Saunders. Real-time discrimination of broadcast speech/music. In ICASSP, pages 993–
996. IEEE, 1996.
[46] Zhu Liu, Yao Wang, and Tsuhan Chen. Audio feature extraction and analysis for scene
segmentation and classification. Journal of VLSI signal processing systems for signal, image
and video technology, 20(1-2):61–79, 1998.
[47] Sue E Johnson and Philip C Woodland. A method for direct audio search with applications
to indexing and retrieval. In Acoustics, Speech, and Signal Processing, 2000. ICASSP’00.
Proceedings. 2000 IEEE International Conference on, volume 3, pages 1427–1430. IEEE,
2000.
[48] Daben Liu and Francis Kubala. Fast speaker change detection for broadcast news transcription
and indexing. 1999.
[49] Sylvain Meignier, Daniel Moraru, Corinne Fredouille, Jean-Francois Bonastre, and Laurent
Besacier. Step-by-step and integrated approaches in broadcast news speaker diarization.
Computer Speech & Language, 20(2):303–330, 2006.
[50] Hector Delgado, Corinne Fredouille, and Javier Serrano. Towards a complete binary key
system for the speaker diarization task. In Fifteenth Annual Conference of the International
Speech Communication Association, 2014.