CHAPTER 4
VOICE ACTIVITY DETECTION ALGORITHMS
4.1 INTRODUCTION
New frontiers of speech technology demand increased levels
of performance in many areas. With the advent of wireless communications,
new speech services are becoming a reality through the development of modern
robust speech processing technology. Many researchers have discussed the
adverse effect of environmental noise on the performance of speech
processing systems. Abhijeet Sangwan et al (2002) discussed the desirable
aspects of Voice Activity Detection (VAD) algorithms: a good decision rule,
adaptability to background noise, and low computational complexity for
estimating the noise spectrum.
Background noise acoustically added to speech can degrade the
performance of digital voice processors used for applications such as speech
compression, recognition, and authentication (Isrel 2003). Digital voice
systems will be used in a variety of environments, and their performance must
be maintained at a level near that measured using noise-free input speech. To
ensure continued reliability, the effects of background noise can be reduced
either by internal modification of the voice processor algorithms to explicitly
compensate for signal contamination, or by preprocessor noise reduction and
noise-cancelling microphones.
Khaled et al (1997) observed that high-energy voiced speech
segments are detected by all VADs, even under very noisy conditions such
as car, bus, babble, and street noise. However, low-energy unvoiced speech is
commonly missed. The background noise that contaminates the signal
results in either noise-only or speech-plus-noise segments. The VAD
developed by Javier Ramirez et al (2005) defines an
effective endpoint detection algorithm that employs a noise reduction
technique and order statistic filters for the formulation of the decision rule.
This VAD detects word beginnings early and word endings late,
which, in part, avoids the need for additional
hangover schemes. It also provides speech / non-speech
discrimination. It has been observed that low-energy portions of speech
are the first to be falsely rejected, so a hangover scheme is required to lower the
probability of false rejections (Alan Davis et al 2006).
Robustness can be achieved by extracting robust
features in the front-end and/or by adapting the reference to the noise
situation. Noise signals are selected to represent the most probable application
scenarios for telecommunication terminals. Some noises are fairly stationary,
such as the car noise and the recording in the exhibition hall. Other noises
contain non-stationary segments, such as the recordings on the street and at the airport.
A fast noise estimation algorithm proposed by Sundarrajan
Rangachari et al (2004) gave good performance for a single sentence.
The noise estimate was found by averaging past spectral power values using a
smoothing parameter that was adjusted by the signal presence probability in
subbands.
A novel VAD algorithm developed by Dong Kook Kim et al (2007) is
based on the Gaussian distribution and the uniformly most powerful (UMP)
test to detect speech or non-speech in the noisy input signal. This
method provides the decision rule by comparing the magnitude of the noisy
speech signal to an adaptive threshold estimated from the noise statistics.
A conditional Maximum a Posteriori (MAP) criterion decides the
hypothesis with the maximum conditional probability given both the
observation and the voice activity in the previous frame. This criterion leads
to two separate thresholds for the Likelihood Ratio Test (LRT), depending on the
VAD result of the previous frame, as discussed by Jong Won Shin et al (2008).
Several VAD algorithms have been proposed for detecting the
voiced / unvoiced region (Boll Steven et al 1980, Dhananjaya et al 2010, Falk
Tiago et al 2006, Haitian Xu et al 2007, Jongseo Sohn et al 1999, Juan
Manuel Gorriz et al 2008, Matteo Gerosa et al 2007, Plante et al 1998, Qi Li
et al 2002, Richard et al 2000, Yutaka Kaneda et al 1986, Zenton Goh et al
1999, Zhong Lin et al 2007).
In this chapter, the Voice Activity Detection (VAD) developed by
Ramirez et al (2005) is presented along with the noise estimation algorithm
discussed in Sundarrajan Rangachari et al (2004) and Abhijeet Sangwan et al
(2002). Various VAD algorithms are studied, namely Zero Crossing Detection (ZCD),
Weak Fricative Detection (WFD), Pitch Based Detection (PBD), Energy
Based Detection (EBD) and Subband Order Statistics Filter (OSF), and their
performance for Automatic Speech Recognition (ASR) is compared in the presence
of different types of noise: suburban train, babble, car, exhibition
hall, restaurant, street, airport and train-station noise.
4.2 VOICE ACTIVITY DETECTION ALGORITHMS
Voice Activity Detection (VAD) is the process of discriminating
speech from silence or other
background noise. VAD algorithms are based on any combination of
general speech properties such as temporal energy variations, periodicity, and
spectrum. The detection task is not as trivial as it appears, since an increasing
level of background noise degrades the classifier effectiveness. A VAD
indicates the presence or absence of speech (Ramirez 2004).
The signal is classified as speech or silence based on speech
characteristics. The signal is sliced into contiguous frames, and a real-valued
non-negative parameter is associated with each frame. If this parameter
exceeds a certain threshold, the frame is classified as active; otherwise it is
classified as inactive.
The basic principle of a VAD device is that it extracts some
measured features or quantities from the input signal and then compares these
values with thresholds. Voice activity (VAD=1) is declared if the measured
value exceeds the threshold. Otherwise, no speech activity (VAD=0) is
declared. In general, a VAD algorithm outputs a binary decision on a
frame-by-frame basis, where a frame of the input signal is a short unit of time
such as 20-40 ms.
The following are some of the required features of a good VAD
algorithm:
(i) Good Decision Rule: A physical property of speech that can
be exploited to give consistent and accurate judgment in
classifying segments of the signal into silence or otherwise.
(ii) Adaptability to Background Noise: Adapting to non-stationary
background noise improves the robustness, especially in
wireless telephony.
(iii) Low Computational Complexity: The complexity of the VAD
algorithm must be low to suit real-time applications.
A tree diagram that represents the classification of
VAD algorithms is shown in Figure 4.1.
Figure 4.1 Tree diagram for VAD Algorithms
The VAD Algorithm is classified into two types.
(i) Parameter Based VAD Algorithm and
(ii) Frequency Based VAD Algorithm.
[Figure 4.1 node labels: VAD Algorithms; Parameter Based (Thresholding: ZCD, Linear Variance: EBD, Segmentation: PBD); Frequency Based (Transform (Power Spectral Density): WFD, Subband OSF)]
Parameter Based VAD Algorithms are further classified into three
types.
(i) Zero Crossing Detection, which is based on thresholding.
(ii) Energy Based Detection, which is implemented through linear
variance.
(iii) Pitch Based Detection, which is implemented through segmentation.
Frequency Based VAD Algorithms consist of the Weak
Fricative Detector and the Subband Order Statistic Filter, both of which fall
under the transform method.
4.2.1 Zero Crossing Detector (ZCD)
The zero crossing count is defined as the number of
times in a sound sample that the amplitude of the sound wave changes sign, i.e.
the number of times the signal crosses the line of no
disturbance or zero line (Abhijeet Sangwan et al 2002). The number of zero
crossings for a voice signal lies in a fixed range: for a 10 ms duration, it
lies between 5 and 15. The number of zero crossings
for noise is random and unpredictable. This property makes it possible to
formulate a decision rule that is independent of energy and hence able to detect
some low energy phonemes.
If $N_{ZC}(f_j) \in R$, Frame is ACTIVE
else Frame is INACTIVE (4.1)
where $N_{ZC}(f_j)$ is the number of zero crossings detected in frame $f_j$ and
$R$ is the set of values {5, ..., 15}, the number of zero crossings for a speech
duration of 10 ms.
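The decision rule of equation (4.1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 8 kHz sampling rate and the non-overlapping 10 ms framing are assumptions made for the example, not details given in this chapter.

```python
def count_zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings


def zcd_vad(signal, sample_rate=8000, frame_ms=10, zc_range=(5, 15)):
    """Return one ACTIVE (True) / INACTIVE (False) flag per frame.

    sample_rate and the non-overlapping framing are assumptions of this
    sketch; zc_range is the set R = {5, ..., 15} of equation (4.1).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    low, high = zc_range
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        n_zc = count_zero_crossings(signal[start:start + frame_len])
        flags.append(low <= n_zc <= high)
    return flags
```

A constant frame (no crossings) and a frame that alternates sign on every sample (far more than 15 crossings) are both rejected, which illustrates why the rule does not depend on the frame energy.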
4.2.2 Weak Fricatives Detector (WFD)
The main drawback of ZCD is that noise
frames are misclassified as active when their zero crossing counts satisfy
equation (4.1).
The problem of discriminating speech from background noise is not
trivial, except in the case of extremely high Signal to Noise Ratio acoustic
environments. For such high Signal to Noise Ratio (SNR) environments, the
energy of the lowest level speech sounds exceeds the background noise
energy, and thus a simple energy measurement suffices. However, such ideal
recording conditions are not practical for most applications (Rabiner 2004).
Therefore a method is required to distinguish weak fricatives from
noise independently of the SNR or other noise characteristics. This
problem can be overcome by using the autocorrelation function, which
exploits the high correlation found in speech signals.
The unbiased autocorrelation function is defined as
$A[x] = \frac{1}{n - |x|} \sum_{i=0}^{n-|x|-1} y[i]\, y[i+|x|]$ (4.2)
where
A[x] is the autocorrelation vector
y[n] is the vector under consideration
n is the frame length
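The incoming signal is segmented into frames of
duration 20 ms. The autocorrelation vector of each frame is divided into
subframes, and the energy of each subframe is computed as
$E(\text{subframe}) = \sum_{\text{index}=1}^{4} A\big[(\text{subframe}-1)\cdot 4 + \text{index}\big]^2$ (4.3)
where subframe takes values from 1 to the total number of subframes in
the frame and index denotes each sample in the given vector. Thus a vector of
20 such energy values is computed for each frame, which is denoted as
$E_j = \big(E(1), E(2), \ldots, E(20)\big)$ (4.4)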
where j is the frame under consideration. The classification parameter
used is the variance of the above vector. The Autocorrelation Vector Variance
(AVV) is determined as
$AVV_j = \operatorname{var}(E_j)$ (4.5)
A reference value of AVV for silence frames is computed by
assuming the first 20 frames to be inactive
$AVV_{ref} = \operatorname{mean}(AVV_j), \quad j = 1, \ldots, 20$ (4.6)
The AVV of subsequent frames is compared with a scalar multiple of this
reference value to determine speech activity.
If $AVV_j > k \cdot AVV_{ref}$, Frame is ACTIVE
else Frame is INACTIVE (4.7)
The value of k was set to 7 after trial and error. Active frames
are marked as voiced signal and inactive frames as unvoiced signal.
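The weak fricative detector described above can be sketched as follows. The sketch rests on several assumptions that the text does not state explicitly: an 8 kHz sampling rate (so a 20 ms frame holds 160 samples), 80 autocorrelation lags split into 20 subframes of 4 lags each, and a reference AVV taken as the mean over the first 20 frames. All function names are hypothetical.

```python
def unbiased_autocorr(frame, num_lags):
    """Unbiased autocorrelation A[x] for lags x = 0 .. num_lags - 1."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(num_lags)
    ]


def autocorr_vector_variance(frame, num_subframes=20, lags_per_subframe=4):
    """AVV: variance of the subframe energies of the autocorrelation vector."""
    acf = unbiased_autocorr(frame, num_subframes * lags_per_subframe)
    energies = [
        sum(a * a for a in acf[s * lags_per_subframe:(s + 1) * lags_per_subframe])
        for s in range(num_subframes)
    ]
    mean = sum(energies) / len(energies)
    return sum((e - mean) ** 2 for e in energies) / len(energies)


def wfd_vad(signal, sample_rate=8000, frame_ms=20, k=7.0, init_frames=20):
    """Flag frames whose AVV exceeds k times the silence reference.

    The first init_frames frames are assumed inactive and only serve to
    build the reference (taken here as their mean AVV, an assumption).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    avv = [
        autocorr_vector_variance(signal[s:s + frame_len])
        for s in range(0, len(signal) - frame_len + 1, frame_len)
    ]
    if not avv:
        return []
    head = avv[:init_frames]
    reference = sum(head) / len(head)
    return [j >= init_frames and a > k * reference for j, a in enumerate(avv)]
```

Because the decision compares a variance with a multiple of its own silence-time value, the rule is relative rather than tied to an absolute energy level, which is the property the text relies on for weak fricatives.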
4.2.3 Pitch Period Based Detector (PBD)
Pitch period estimation is one of the most important problems in
speech processing. Pitch detectors are used in vocoders and in speaker
identification and verification systems. Pitch period estimation can be done
using the autocorrelation function, which provides a convenient
representation and forms the basis for pitch detection.
One of the major limitations of the autocorrelation
representation is that it retains too much of the information in the speech
signal. As a result, the autocorrelation function has too many peaks. To
overcome this problem, it is useful to process the speech signal so as to make
the periodicity more prominent while suppressing other distracting features of
the signal. Numerous techniques have been proposed; the technique called centre
clipping is used in this thesis.
The centre clipped (Sondhi 1968) speech signal is obtained by the
nonlinear transformation
$y(n) = C[x(n)]$ (4.8)
where C[ ] is shown in Figure 4.2
Figure 4.2 Centre clipper transformation function
The operation of center clipping is depicted in Figure 4.3
Figure 4.3 Centre clipping affects a speech waveform
It can be seen that for samples above CL, the output of the centre
clipper is equal to the input minus the clipping level. For samples below the
clipping level, the output is zero. For high clipping levels, fewer peaks will
exceed the clipping level and thus fewer pulses will appear in the output. If
the clipping level is decreased, more peaks pass through the clipper and the
auto correlation function becomes more complex (Rabiner 2004).
The problem of extraneous peaks in the
autocorrelation function can be eliminated by centre clipping prior to
computing the autocorrelation function. However, another difficulty with the
autocorrelation representation is the large amount of computation required. A
simple modification of the centre clipping function greatly simplifies
the autocorrelation computation: the output of the clipper is +1 if x(n) > +CL
and -1 if x(n) < -CL; otherwise the output is zero. The computation of the
autocorrelation function for a 3-level centre clipped signal is particularly
simple, since every product reduces to +1, 0 or -1. Most of the extraneous
peaks are eliminated, and a clear indication of
periodicity is retained. The three level centre clipping function is shown in
Figure 4.4.
Figure 4.4 Three level center clipping function
An algorithm for estimating the pitch period from the short-time
autocorrelation function was proposed by Dubnowski et al (1976). The steps
in the pitch based VAD algorithm are given below:
i. The speech signal is filtered with a 900 Hz low pass analog
filter and sampled at a rate of 10 kHz.
ii. Segments of length 30msec are selected at 10msec intervals.
iii. Using the clipping level, the speech signal is processed by a 3-
level centre clipper and the correlation function is computed
over a range spanning the expected range of pitch periods.
iv. The largest peak of the autocorrelation function is located and
the peak value is compared to a fixed threshold. If the peak
falls below threshold, the segment is classified as unvoiced
else the segment is voiced.
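Steps ii to iv above can be sketched as follows. The 900 Hz low-pass filter of step i is left out, and several values are assumptions made only for illustration because the text does not give them: the clipping level is taken as a fraction of the segment's peak amplitude, the search range is 60-500 Hz, and the peak is normalised by the zero-lag value and compared with 0.3. All names are hypothetical.

```python
def three_level_clip(frame, clip_level):
    """Map each sample to +1, -1 or 0 around +/- clip_level."""
    return [1 if x > clip_level else -1 if x < -clip_level else 0 for x in frame]


def is_voiced(frame, sample_rate=10000, clip_fraction=0.6,
              min_pitch_hz=60, max_pitch_hz=500, peak_threshold=0.3):
    """Voiced/unvoiced decision from the clipped autocorrelation peak.

    clip_fraction, the pitch range and peak_threshold are assumed values.
    """
    peak_amp = max((abs(x) for x in frame), default=0.0)
    if peak_amp == 0:
        return False
    clipped = three_level_clip(frame, clip_fraction * peak_amp)
    r0 = sum(c * c for c in clipped)          # zero-lag autocorrelation
    min_lag = sample_rate // max_pitch_hz
    max_lag = min(sample_rate // min_pitch_hz, len(frame) - 1)
    if r0 == 0 or max_lag < min_lag:
        return False
    best = max(
        sum(clipped[i] * clipped[i + lag] for i in range(len(clipped) - lag))
        for lag in range(min_lag, max_lag + 1)
    )
    return best / r0 >= peak_threshold


def pbd_vad(signal, sample_rate=10000, seg_ms=30, hop_ms=10):
    """Apply is_voiced to 30 ms segments taken at 10 ms intervals."""
    seg = sample_rate * seg_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [
        is_voiced(signal[s:s + seg], sample_rate)
        for s in range(0, len(signal) - seg + 1, hop)
    ]
```

Since the clipped samples are only +1, 0 or -1, each autocorrelation term is a sign comparison rather than a true multiplication, which is the computational saving the 3-level clipper is meant to provide.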
4.2.4 Energy Based Detector (EBD)
The amplitude of the speech signal varies appreciably with time.
The amplitude of unvoiced segments is generally much lower than that of
voiced segments. The energy of a signal provides a convenient
representation that reflects these amplitude variations. The energy of a frame
indicates the possible presence of voice data and is an important parameter
used in VAD algorithms.
Let X(i) be the ith sample of speech. If the length of the frame
is k samples, then the jth frame can be represented in the time domain by the
sequence
$f_j = \{X(i)\}_{i=(j-1)k+1}^{jk}$ (4.9)
$E_j$ represents the energy of the jth frame,
$E_j = \frac{1}{k} \sum_{i=(j-1)k+1}^{jk} X^2(i)$ (4.10)
The VAD algorithm is trained for a small period on a prerecorded
sample that contains only background noise. The initial thresholds for the
various parameters are computed from these samples. The initial energy
threshold is obtained by taking the mean of the energies ($E_m$) of the frames
$E_r = \frac{1}{v} \sum_{m=1}^{v} E_m$ (4.11)
where $E_r$ is the initial threshold estimate and $v$ is the number of frames in
the prerecorded sample; the initial 20 frames are considered INACTIVE.
The classification rule for speech is as follows,
if $E_j > k E_r$ (k > 1) frame is ACTIVE
else frame is INACTIVE (4.12)
Here, $E_r$ represents the energy of the noise frames, while k is the
threshold multiplier used in the decision making. Active frames are transmitted
while inactive frames are not transmitted.
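A minimal sketch of the energy-based rule of equations (4.10) to (4.12) is given below. The frame length of 160 samples and the multiplier k = 2 are assumed values chosen for the example (the text only requires k > 1), and the function names are hypothetical.

```python
def frame_energies(signal, frame_len):
    """Mean-square energy E_j of each non-overlapping frame, as in (4.10)."""
    return [
        sum(x * x for x in signal[s:s + frame_len]) / frame_len
        for s in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def ebd_vad(signal, frame_len=160, k=2.0, init_frames=20):
    """Flag frame j as ACTIVE when E_j > k * E_r, as in (4.12).

    E_r is the mean energy of the first init_frames frames, which are
    treated as noise-only and always reported INACTIVE.
    """
    energies = frame_energies(signal, frame_len)
    head = energies[:init_frames]
    if not head:
        return []
    e_r = sum(head) / len(head)
    return [j >= init_frames and e > k * e_r for j, e in enumerate(energies)]
```

With a quiet background followed by a loud burst, only the burst frames exceed k times the reference energy, so low-energy speech close to the noise floor would be missed by the same rule.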
Energy based decisions are not good for low energy phonemes.
Weak fricatives are sometimes silenced completely. High energy voiced
speech segments are detected in all VAD algorithms even under noise
conditions. However, low energy unvoiced speech is commonly missed,
reducing speech quality.
4.2.5 Subband OSF Based VAD
Javier Ramirez et al (2005) propose determining the
speech / non-speech divergence by means of a specialized Order Statistics Filter
(OSF) working on the subband log-energies. Filters based on order
statistics have been successfully employed in the restoration of signals and
images corrupted by additive noise. The most common OSF is the median
filter, which is easy to implement and exhibits good performance in removing
impulsive noise.
Figure 4.5 shows the block diagram of the subband based
VAD. The algorithm operates on the subband log-energies. Noise reduction
is performed first and the VAD decision is formulated on the de-noised
signal. The noisy speech signal is decomposed into 25 ms frames with a 10
ms window shift. Let X(m,l) be the spectrum magnitude for the mth band at
frame l. The design of the noise reduction block is based on Wiener Filter
(WF) theory, whereby the attenuation is a function of the Signal to Noise
Ratio (SNR) of the input signal. The subband log-energies of the de-noised
signal are then processed by means of order statistics filters.
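[Figure 4.5 blocks: FFT, Noise Reduction (Spectral Smoothing, Noise Update, WF Design, Frequency Domain Filtering), VAD]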
Figure 4.5 Block diagram of Subband OSF based VAD
The noise reduction block consists of four stages.
i) Spectrum smoothing: The power spectrum is averaged over
two consecutive frames and two adjacent spectral bands.
ii) Noise estimation: The noise spectrum $N_e(m,l)$ is updated by
means of a 1st order IIR filter on the smoothed
spectrum $X_s(m,l)$,
$N_e(m,l) = \lambda N_e(m,l-1) + (1-\lambda) X_s(m,l)$ (4.13)
where $\lambda = 0.99$ and $m = 0, 1, \ldots, NFFT/2$ (NFFT is the FFT length).
iii) Wiener Filter (WF) design: First, the clean signal $S(m,l)$ is
estimated by combining smoothing and spectral subtraction
$S(m,l) = \gamma S'(m,l-1) + (1-\gamma) \max(X_s(m,l) - N_e(m,l), 0)$ (4.14)
where $\gamma = 0.98$.
Then, the WF $H(m,l)$ is designed as
$H(m,l) = \frac{\eta(m,l)}{1 + \eta(m,l)}$ (4.15)
where
$\eta(m,l) = \max\left[\frac{S(m,l)}{N_e(m,l)}, \eta_{min}\right]$ (4.16)
and $\eta_{min}$ is selected so that the filter yields a 20 dB maximum attenuation.
$S'(m,l)$, the spectrum of the cleaned speech signal, is assumed to be zero at
the beginning of the process and is used for designing the WF through
equation (4.14) to equation (4.16). It is given by
$S'(m,l) = H_s(m,l) X(m,l)$ (4.17)
where $H_s(m,l)$ denotes the smoothed filter.
The filter $H(m,l)$ is smoothed in order to eliminate rapid changes
between neighbouring frequencies that may often cause musical noise. Thus, the
variance of the residual noise is reduced and consequently, the robustness
when detecting nonspeech is enhanced. The smoothing is performed by
truncating the impulse response of the corresponding causal FIR filter to 17
taps using a Hanning window. With this operation performed in the time
domain, the frequency response of the Wiener Filter is smoothed and the
performance of the VAD is improved.
iv) Frequency domain filtering: The smoothed filter $H_s(m,l)$ is applied
in the frequency domain to obtain the de-noised spectrum
$Y(m,l) = H_s(m,l) X(m,l)$ (4.18)
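The per-frame arithmetic of stages ii to iv can be sketched as below. This is a simplified illustration, not the published implementation: the 17-tap smoothing of the filter is skipped (the sketch uses the unsmoothed gain in place of the smoothed one), the 20 dB limit is interpreted as a floor of 0.1 on the gain, and the `update_noise` switch and all function and parameter names are hypothetical.

```python
def wiener_frame(x_mag, x_smooth, noise_prev, clean_prev,
                 lam=0.99, gamma=0.98, max_atten_db=20.0, update_noise=True):
    """Run one frame of the noise-reduction stage, bin by bin.

    Returns (noise, gain, denoised). `denoised` is also the cleaned
    spectrum to pass as clean_prev on the next call, because this sketch
    omits the filter smoothing step.
    """
    h_min = 10.0 ** (-max_atten_db / 20.0)   # assumed: 20 dB -> gain floor 0.1
    eta_min = h_min / (1.0 - h_min)          # gives eta_min / (1 + eta_min) = h_min
    noise, gain, denoised = [], [], []
    for x, xs, n_prev, s_prev in zip(x_mag, x_smooth, noise_prev, clean_prev):
        # Noise estimate: first-order IIR on the smoothed spectrum.
        n = lam * n_prev + (1.0 - lam) * xs if update_noise else n_prev
        # Clean-signal estimate: smoothing plus spectral subtraction.
        s = gamma * s_prev + (1.0 - gamma) * max(xs - n, 0.0)
        if n > 0.0:
            eta = max(s / n, eta_min)
            h = eta / (1.0 + eta)
        else:
            h = 1.0                          # no noise estimate: pass through
        noise.append(n)
        gain.append(h)
        denoised.append(h * x)
    return noise, gain, denoised
```

When the smoothed spectrum equals the noise estimate, the gain settles at the floor of 0.1, and when the signal is far above the noise the gain approaches 1, which is the SNR-dependent attenuation described in the text.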
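Once the input speech has been de-noised, the log-energies for the
lth frame, $E(l,k)$, in K subbands ($k = 0, 1, \ldots, K-1$) are computed by
means of
$E(l,k) = \log\left(\frac{K}{NFFT} \sum_{m=m_k}^{m_{k+1}-1} |Y(m,l)|^2\right)$,
$m_k = \left\lfloor \frac{NFFT}{2K} k \right\rfloor$, $k = 0, 1, \ldots, K-1$ (4.19)
where an equally spaced subband assignment is used.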
The algorithm uses two OSFs for the multiband quantile (MBQ)
SNR estimation. The first OSF estimates the subband signal energy by means of
$Q_p(l,k) = (1-f) E_{(s)}(l,k) + f E_{(s+1)}(l,k)$ (4.20)
where $Q_p(l,k)$ is the p sampling quantile, $E_{(s)}(l,k)$ is the sth order
statistic of the log-energies in the 2N+1 frames around frame l,
$s = \lfloor 2pN \rfloor$ and $f = 2pN - s$.
Finally, the SNR in each subband is measured by
$QSNR(l,k) = Q_p(l,k) - E_N(k)$ (4.21)
where $E_N(k)$ is the noise level in the kth band that needs to be estimated. For
the initialization of the algorithm, the first N frames are assumed to be
non-speech frames and the noise level in the kth band, $E_N(k)$, is estimated as
the median of the set $\{E(0,k), E(1,k), \ldots, E(N-1,k)\}$. In order to track
non-stationary noisy environments, the noise references are updated during
non-speech periods by means of a second OSF (a median filter)
$E_N(k) = \alpha E_N(k) + (1-\alpha) Q_{0.5}(l,k)$, $k = 0, 1, \ldots, K-1$ (4.22)
where $Q_{0.5}(l,k)$ is the output of the median filter and $\alpha = 0.97$ was
experimentally selected. On the other hand, the sampling quantile p = 0.9 is
selected as a good estimation of the subband spectral envelope.
The decision rule is then formulated in terms of the average
subband SNR
$SNR(l) = \frac{1}{K} \sum_{k=0}^{K-1} QSNR(l,k)$ (4.23)
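The quantile-based SNR computation of equations (4.20) to (4.23) can be sketched as follows. The sketch assumes that each subband supplies a window of 2N+1 log-energies centred on the current frame and that order statistics are indexed from 0; the function names are hypothetical.

```python
import math


def sampling_quantile(window, p):
    """Q_p = (1 - f) * E_(s) + f * E_(s+1), with s = floor(2pN), f = 2pN - s.

    `window` holds 2N + 1 log-energy values; order statistics are taken
    0-based here, so p = 0.5 returns the median.
    """
    ordered = sorted(window)
    n = (len(ordered) - 1) // 2
    pos = 2 * p * n
    s = math.floor(pos)
    f = pos - s
    upper = ordered[min(s + 1, len(ordered) - 1)]   # clamp at the maximum
    return (1 - f) * ordered[s] + f * upper


def average_subband_snr(windows_by_band, noise_levels, p=0.9):
    """Mean over subbands of QSNR = Q_p - E_N, as in (4.21) and (4.23)."""
    qsnr = [
        sampling_quantile(window, p) - e_n
        for window, e_n in zip(windows_by_band, noise_levels)
    ]
    return sum(qsnr) / len(qsnr)


def update_noise_level(e_n, window, alpha=0.97):
    """Median-filter update of the noise reference, as in (4.22)."""
    return alpha * e_n + (1 - alpha) * sampling_quantile(window, 0.5)
```

For the window [0, 1, 2, 3, 4] (N = 2), p = 0.5 gives the median 2 and p = 0.9 interpolates between the two largest values, which is why p = 0.9 follows the spectral envelope while the median tracks the noise floor.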
If the SNR is greater than a threshold $\eta$, the current frame is
classified as speech, otherwise it is classified as non-speech. It is assumed that
the system will work at different noisy conditions and that an optimal
threshold can be determined for the system working in the cleanest ($\eta_0$) and
noisiest ($\eta_1$) conditions. Thus, the threshold is adaptive to the measured
full-band noise energy E
$\eta = \eta_0$ for $E \le E_0$; $\eta = \eta_0 + \frac{\eta_1 - \eta_0}{E_1 - E_0}(E - E_0)$ for $E_0 < E < E_1$; $\eta = \eta_1$ for $E \ge E_1$ (4.24)
thus enabling the VAD to select the optimum working point for different
SNR conditions. Note that the threshold is linearly decreased as the noise
level is increased between $(E_0, \eta_0)$ and $(E_1, \eta_1)$, which represent
the optimum thresholds for the cleanest and noisiest conditions defined by the
noise energies $E_0$ and $E_1$, respectively.
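The adaptive threshold is a clamped linear interpolation between the two working points, as the following sketch shows. The function name is hypothetical and the numeric values in the usage note are illustrative only; they are not taken from this chapter.

```python
def adaptive_threshold(noise_energy, e0, e1, eta0, eta1):
    """Decision threshold as a function of the full-band noise energy.

    Returns eta0 at or below e0 (cleanest condition), eta1 at or above
    e1 (noisiest condition), and a linear interpolation in between.
    """
    if noise_energy <= e0:
        return eta0
    if noise_energy >= e1:
        return eta1
    return eta0 + (eta1 - eta0) * (noise_energy - e0) / (e1 - e0)
```

For example, with hypothetical working points (30, 6.0) and (50, 2.0), a noise energy of 40 lies halfway between them and yields a threshold of 4.0.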
4.3 DRAWBACKS OF EXISTING ALGORITHMS
The existing algorithm is based on the assumption that the noise
spectrum does not vary significantly within the N-frame neighbourhood of
the lth frame. However, this is not true in the case of highly non-stationary
noise. The noise estimate of the first frame is used to de-noise the following
8 frames, and the noise estimate is very low for the first frame. So the
algorithm fails at the beginning to evaluate the noise spectrum, and the
detection afterwards could be totally erroneous. The existing algorithm also
fails to update the threshold in low noise conditions. This degrades the
performance of the VAD.
4.3.1 Proposed Algorithm
The proposed algorithm does not depend on the feedback loop for
noise spectrum estimation. Instead it uses a noise estimation algorithm which
updates noise for every frame. This method of noise estimation is best suited
for highly non stationary environments, thus increasing the robustness as
discussed in Sundarrajan Rangachari et al (2004).
Figure 4.6 Block diagram of proposed VAD
[Figure 4.6 blocks: FFT, Noise Reduction (Spectral Smoothing, Noise Update, WF Design, Frequency Domain Filtering), VAD]
The noise estimate is updated by averaging the noisy speech power
spectrum using a time and frequency dependent smoothing factor, which is
adjusted based on the signal presence probability in subbands. This improves the
speech/non-speech discriminability and the speech recognition performance in
noisy environments. Two problems are addressed by the proposed VAD: the first is
the performance of the VAD in low noise conditions and the second is its
performance in noisy environments. The block diagram of the proposed VAD is
shown in Figure 4.6.
The noise estimation algorithm is as follows.
The smoothed power spectrum of the noisy speech signal is
estimated using a first-order recursive formula as
$P(\lambda,k) = \eta P(\lambda-1,k) + (1-\eta)|Y(\lambda,k)|^2$ (4.25)
where $|Y(\lambda,k)|^2$ is an estimate of the short-time power spectrum of the
noisy speech, $\eta$ is the smoothing constant, $\lambda$ is the frame index and
k is the frequency bin index.
Since the noisy speech power spectrum in the speech-absent frames
is equal to the power spectrum of the noise, the estimate of the
noise spectrum can be updated by tracking the speech-absent frames. To do so,
the ratio of the energy of the noisy speech power spectrum in three different
frequency bands (low: 0-1 kHz, middle: 1-3 kHz, high: 3 kHz and above) to the
energy of the corresponding frequency band in the previous noise estimate is
computed. The following three ratios are computed:
$\zeta_{low}(\lambda) = \frac{\sum_{k=1}^{LF} P(\lambda,k)}{\sum_{k=1}^{LF} N(\lambda-1,k)}$ (4.26)
$\zeta_{medium}(\lambda) = \frac{\sum_{k=LF+1}^{MF} P(\lambda,k)}{\sum_{k=LF+1}^{MF} N(\lambda-1,k)}$ (4.27)
$\zeta_{high}(\lambda) = \frac{\sum_{k=MF+1}^{Fs/2} P(\lambda,k)}{\sum_{k=MF+1}^{Fs/2} N(\lambda-1,k)}$ (4.28)
where $N(\lambda,k)$ is the estimate of the noise power spectrum at frame
$\lambda$, and LF, MF and Fs correspond to the frequency bins of
1 kHz, 3 kHz and the sampling frequency respectively. The frame is
classified as speech present or speech absent in the following manner. The
incoming frame is classified as a speech-absent frame if the following condition
on the three band ratios is satisfied
$\zeta_{low}(\lambda) < T$ and $\zeta_{medium}(\lambda) < T$ and $\zeta_{high}(\lambda) < T$ (4.29)
where T is a threshold. For a speech-absent frame, the noise estimate is
updated according to
$N(\lambda,k) = \epsilon N(\lambda-1,k) + (1-\epsilon)|Y(\lambda,k)|^2$ (4.30)
where $\epsilon$ is a smoothing constant. If any or all of the above three ratios
are larger than the threshold T, then a different algorithm is used for updating
and estimating the noise spectrum.
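The speech-absent test and the corresponding noise update can be sketched as below. The constants (smoothing constants of 0.7 and 0.9, threshold of 2.0) are placeholder values chosen for illustration because the text does not give them, the bin indices `lf_bin` and `mf_bin` are assumed to be supplied by the caller, and the previous noise estimate is assumed to be non-zero in every band. All names are hypothetical.

```python
def smooth_power(y_power, p_prev, eta=0.7):
    """First-order recursive smoothing of the noisy power spectrum."""
    return [eta * pp + (1.0 - eta) * y for y, pp in zip(y_power, p_prev)]


def band_ratios(p_smooth, noise_prev, lf_bin, mf_bin):
    """Energy ratios in the low, middle and high bands.

    lf_bin and mf_bin are the bin indices of 1 kHz and 3 kHz; noise_prev
    must have non-zero energy in each band.
    """
    edges = [(0, lf_bin), (lf_bin, mf_bin), (mf_bin, len(p_smooth))]
    return [sum(p_smooth[a:b]) / sum(noise_prev[a:b]) for a, b in edges]


def update_if_speech_absent(y_power, p_prev, noise_prev, lf_bin, mf_bin,
                            eta=0.7, eps=0.9, threshold=2.0):
    """Return (smoothed spectrum, noise estimate, speech_absent flag).

    The noise estimate is refreshed only when all three band ratios stay
    below the threshold; otherwise it is returned unchanged so that a
    separate speech-present update can be applied.
    """
    p = smooth_power(y_power, p_prev, eta)
    ratios = band_ratios(p, noise_prev, lf_bin, mf_bin)
    absent = all(r < threshold for r in ratios)
    if absent:
        noise = [eps * n + (1.0 - eps) * y for n, y in zip(noise_prev, y_power)]
    else:
        noise = list(noise_prev)
    return p, noise, absent
```

Requiring all three bands to stay below the threshold means that a burst confined to a single band is still enough to treat the frame as speech present.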
For speech-present frames, the noise update is as follows.
Frequency bins are classified as speech present or absent by
tracking the local minimum of the noisy speech; speech presence in each
frequency bin is then decided separately using the ratio of the noisy speech
power to its local minimum. A non-linear rule is used for tracking the
minimum of the noisy speech by continuously averaging the past spectral
values.
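if $P_{min}(\lambda-1,k) < P(\lambda,k)$
then
$P_{min}(\lambda,k) = \gamma P_{min}(\lambda-1,k) + \frac{1-\gamma}{1-\beta}\big(P(\lambda,k) - \beta P(\lambda-1,k)\big)$ (4.31)
else
$P_{min}(\lambda,k) = P(\lambda,k)$
where $P_{min}(\lambda,k)$ is the local minimum of the noisy speech power spectrum
and $\beta$ and $\gamma$ are constants whose values are determined experimentally.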
Let $S_r(\lambda,k) = P(\lambda,k) / P_{min}(\lambda,k)$ denote the ratio between
the energy of the noisy speech and its local minimum. This ratio is compared
against a frequency-dependent threshold and, if it is found to be larger than
that threshold, the corresponding frequency bin is considered to contain
speech.
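The minimum-tracking rule and the speech-presence ratio can be sketched per frequency bin as follows. The default values of the two constants are placeholders, since the text only says they are determined experimentally, and the function names are hypothetical.

```python
def track_local_minimum(p_cur, p_prev, p_min_prev, beta=0.8, gamma=0.998):
    """Update the local minimum of the smoothed noisy power spectrum.

    While the previous minimum lies below the current value, the minimum
    is raised slowly by the recursive rule; otherwise it drops at once to
    the current value. beta and gamma are assumed placeholder constants.
    """
    p_min = []
    for p, pp, pm in zip(p_cur, p_prev, p_min_prev):
        if pm < p:
            p_min.append(gamma * pm + (1.0 - gamma) / (1.0 - beta) * (p - beta * pp))
        else:
            p_min.append(p)
    return p_min


def presence_ratio(p_cur, p_min):
    """Ratio of the noisy speech power to its local minimum, per bin."""
    return [p / m for p, m in zip(p_cur, p_min)]
```

A bin whose power has just risen shows a ratio well above 1 because the tracked minimum lags behind, whereas a bin whose power has fallen resets the minimum and gives a ratio of exactly 1.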
Using the above ratio $S_r(\lambda,k)$, the new frequency-dependent
smoothing constant $\alpha_s(\lambda,k)$ can be estimated as follows:
( ) =( ) ( )
(4.32)
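where the two remaining parameters are smoothing constants and $\delta(k)$ is a
frequency-dependent threshold given as
$\delta(k) = 1.3$ for $1 \le k \le LF$; $\delta(k) = 3$ for $LF < k \le MF$; $\delta(k) = 5$ for $MF < k \le Fs/2$ (4.33)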
Finally, after computing the frequency-dependent smoothing factor
$\alpha_s(\lambda,k)$, the noise spectrum estimate is updated according to
$N(\lambda,k) = \alpha_s(\lambda,k) N(\lambda-1,k) + (1 - \alpha_s(\lambda,k)) |Y(\lambda,k)|^2$ (4.34)
4.4 RESULTS AND DISCUSSIONS
The proposed structure for increasing the recognition accuracy of
the robust speech recognition system using VAD algorithms is shown in the
Figure 4.7. The system consists of two main parts, preprocessor and ASR.
The preprocessor includes Voice Activity Detector (VAD). VAD identifies
the presence or absence of speech and extracts the speech from the noise
corrupted speech.
Figure 4.7 Structure of speech recognition system
[Figure 4.7 blocks: Input speech, Noise Estimation, VAD, ASR, Recognition Accuracy]
Figure 4.8 shows the original clean speech signal. Figure 4.9 shows
the output of the existing algorithm. Original signal corrupted by airport noise
of SNR 0 dB is given as input. Due to false estimation of noise spectrum the
algorithm fails at the beginning of the utterance itself. So most of the noise
only frames are classified as speech present frames. Figure 4.10 shows the
output of the proposed algorithm. The speech frames are extracted correctly
from the noisy speech signal.
One hundred words were taken for speech recognition (using
isolated word recognition with statistical modeling - Hidden Markov Model)
after adding various noise environments. The input word
utterances were analyzed under the most commonly encountered noise
environments: suburban train, babble, car, exhibition hall, restaurant, street,
airport and train-station noise, taken from the AURORA database.
In the training phase, 100 samples of each of the digits
0-9, uttered by both male and female speakers (aged 15-25), were recorded using
8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz from a single
channel input and saved as wave files using sound recorder software.
The proposed framework uses a speech processing module that includes
Hidden Markov Model (HMM)-based classification and noise language
modeling to achieve effective noise knowledge estimation, as
discussed in Chapter 2.
The performance of the ASR was analyzed under noisy conditions, and
again with the VAD in place; the accuracy in percentage is shown in
Figure 4.11. The Subband Order Statistics Filter (OSF) algorithm
performs better than the other VAD algorithms. The recognition accuracy of
all VAD algorithms can be improved if noise estimation for
non-stationary environments is taken into account. This chapter presents a
structure for speech recognition systems with a Subband Order Statistics Filter
(OSF) that improves speech detection robustness in noisy environments. The
approach is based on an effective endpoint detection algorithm employing noise
reduction techniques and order statistic filters for the formulation of the
decision rule.
Automatic speech recognition systems work reasonably well
under clean conditions but become fragile in practical applications involving
real-world environments.
Figure 4.8 Original clean speech signal
Figure 4.9 Output of existing VAD
[Plot panels: "Original signal without noise", "Noisy input signal"]
Figure 4.10 Output of proposed algorithm
Table 4.1 through Table 4.8 show the performance of the Subband
Order Statistics Filter (OSF) based Voice Activity Detection of Ramirez and the
proposed algorithm under various noise conditions in terms of improvement
in Recognition Accuracy (RA).
From Tables 4.1 and 4.2 it was observed that, at an SNR of 0 dB, the ASR with
VAD in the presence of the Babble noise source performed better, with a 20.81%
improvement in RA compared to the existing algorithm. The
speech recognition accuracy of the proposed algorithm shows an improvement of
13.54% in RA when compared to the algorithm proposed by Ramirez et al
(2004) in the presence of various noise sources.
From Tables 4.3 and 4.4 it was found that, in the presence of
Exhibition noise at a 5 dB SNR, the proposed algorithm
performed better, with an 11.71% improvement in RA. The proposed
algorithm shows an improvement in RA of 8.07% when
compared to the existing algorithm (Ramirez et al).
From Tables 4.5 and 4.6 it was observed that the proposed
algorithm at a 10 dB noise level for the Train noise source shows an
improvement in RA of 8.27%. The existing algorithm has an average RA of
80.01%, and the proposed algorithm has an average RA of 85.18%.
From Tables 4.7 and 4.8 it was inferred that, in the presence of the Airport
noise source at the 15 dB level, the proposed algorithm performed better, with a
5.64% improvement in RA. The proposed algorithm showed an
improvement in RA of 3.67% when compared to the existing algorithm.
Table 4.9 shows the performance of the ASR. The proposed method
performs better, with a maximum improvement of 20.81% in RA for Babble noise
and a minimum improvement of 2.26% in RA for Street noise. The overall
performance analysis of the existing VAD algorithms and the proposed algorithm
is shown in Table 4.10.
Table 4.9 Overall performance analysis of proposed VAD algorithm in
terms of % improvement in RA
Percentage Improvement   0 dB               5 dB                   10 dB             15 dB
Better                   Babble (20.81 %)   Exhibition (11.71 %)   Train (8.27 %)    Airport (5.64 %)
Least                    Airport (6.33 %)   Airport (6.13 %)       Babble (4.21 %)   Street (2.26 %)
Table 4.10 Overall performance analysis of VAD Algorithms
VAD Method      0 dB (% Accuracy)   5 dB (% Accuracy)   10 dB (% Accuracy)   15 dB (% Accuracy)
EBD             18.20               28.80               29.35                39.66
ZCD             20.30               25.00               30.62                41.92
WFD             19.40               29.42               31.88                40.12
PBD             17.45               22.25               34.82                42.25
Ramirez et al   61.23               73.59               80.08                90.2
Proposed        70.89               80.05               85.18                93.64
The recognition accuracy at different SNR values for the
Subband OSF based VAD and the proposed method is shown in Figure 4.11.
It was observed that the best recognition occurred for Restaurant noise
(84.225%) and the poorest recognition for Exhibition noise (78.625%).
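Figure 4.11 Comparison of Ramirez et al and proposed VAD method for
various noise environments
[Chart: "Overall % of RA for Proposed and Existing VAD"; horizontal axis: Noise Sources; vertical axis: % of RA (65 to 90); series: Proposed VAD, Existing VAD]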
The proposed VAD works well for non-stationary signals. In most
speech enhancement schemes the noise signal is suppressed and the speech
signal is enhanced. In the proposed VAD algorithm a new noise estimation
algorithm is used along with the OSF, which improves the quality as well
as the RA of the speech recognition system.
4.5 CONCLUSION
The algorithms based solely on energy did not give an acceptable
Speech Recognition Accuracy with all the test templates. The other
techniques (Autocorrelation function and Zero Crossing Detection) gave
better Speech Recognition Accuracy. The ZCD was used to recover some low
energy phonemes that were rejected by the energy-based detector. However, it
also picked up certain noise frames that matched the Zero Crossing criteria.
The WFD technique performed better than ZCD in the detection of weak fricatives.
A pitch based detection algorithm is designed to estimate the pitch
or fundamental frequency of a quasi-periodic or virtually periodic signal. The
PBD works differently from the other techniques, and its
performance is comparable to that of the WFD.
The proposed method combines the noise estimation
algorithm with the VAD algorithm, so that improved speech recognition
accuracy can be obtained under these noise conditions.
This chapter presented a proposed structure of Speech Recognition
Systems with a Subband Order Statistics Filter (OSF) that improves speech
detection robustness in noisy environments. The approach is based on an
effective endpoint detection algorithm employing noise reduction techniques
and order statistic filters for the formulation of the decision rule. The
proposed algorithm performs better in the case of non stationary noise than
the existing algorithm.