voice command recognition
7/27/2019 Voice Command Recognition
http://slidepdf.com/reader/full/voice-command-recognition 1/40
VOICE COMMAND
RECOGNITION
METHODOLOGY
Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command
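A Simple Model of Speech Production
Voiced excitation: pulse train P(f)
Unvoiced excitation: white noise N(f)
Vocal tract: spectral shaping H(f)
Lips: emission R(f)
$S(f) = \big(v \cdot P(f) + u \cdot N(f)\big) \cdot H(f) \cdot R(f)$
$S(f) = X(f) \cdot H(f) \cdot R(f)$
where $X(f) = v \cdot P(f) + u \cdot N(f)$ is the excitation spectrum, with v and u weighting the voiced and unvoiced parts.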
Spectral Shaping H(f)
• Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds.
• The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time.
• Direct computation of the power spectrum from the speech signal results in a spectrum containing “ripples” caused by the excitation spectrum X(f).
• A smooth spectral shape, without the ripples that represent X(f), has to be estimated.
Cepstral Transformation
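$S(f) = X(f) \cdot H(f) \cdot R(f) = H(f) \cdot U(f)$, with $U(f) = X(f) \cdot R(f)$
$\log S(f) = \log\big(H(f) \cdot U(f)\big) = \log H(f) + \log U(f)$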
• Interpret this log-spectrum as a time signal.
• The “ripples” caused by the excitation spectrum X(f) would then have a “high frequency”.
• Hence, by using a kind of low-pass filtering we can get the smooth spectral shape.
• The inverse Fourier transform of the log spectrum brings us back to the time domain, giving the so-called cepstrum.
• Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain.
• The process of filtering in the cepstral domain is called “liftering”.
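The liftering steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the VI's implementation: the names `dft`, `lifter_log_spectrum` and `n_keep` are hypothetical, a naive DFT from the standard library stands in for an FFT, and the input is a toy log spectrum built from one slow and one fast cosine.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); enough for a short sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * math.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def lifter_log_spectrum(log_spec, n_keep):
    """Smooth a log spectrum by zeroing the higher-order cepstral coefficients."""
    n = len(log_spec)
    # The inverse transform of the log spectrum gives the cepstrum.
    cepstrum = [v.real for v in dft(log_spec, inverse=True)]
    # Keep the low-order coefficients and their mirror images; zero the rest.
    liftered = [
        c if (q < n_keep or q > n - n_keep) else 0.0
        for q, c in enumerate(cepstrum)
    ]
    # Transform back to the frequency domain to get the smooth spectral shape.
    return [v.real for v in dft(liftered)]


# Toy log spectrum: a slow "vocal tract" shape plus fast excitation ripples.
N = 64
shape = [math.cos(2 * math.pi * k / N) for k in range(N)]
ripples = [0.5 * math.cos(2 * math.pi * 16 * k / N) for k in range(N)]
smoothed = lifter_log_spectrum([s + r for s, r in zip(shape, ripples)], n_keep=8)
```

In this toy case the ripple component sits at cepstral index 16, above the assumed cut-off of 8, so `smoothed` follows the slow `shape` component.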
Mel frequency Cepstral Coefficients
• The human ear does not have a linear frequency resolution; instead it forms several groups of frequencies and integrates the spectral energy within each group.
• The mid frequencies and bandwidths of these groups are non-linearly distributed.
• The non-linear warping of the frequency axis is modeled by the mel scale, on which the frequency groups are assumed to be linearly distributed.
$f_{mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$, with f in Hz
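As a quick illustration, the mel mapping and its inverse can be written in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are hypothetical; the inverse is obtained by solving the formula for f.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to roughly 1000 mel, which is the usual anchor point of the scale.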
• A common way to do mel frequency warping is to use triangle-shaped filters in the spectral domain to build a weighted sum over the power spectrum coefficients that lie within each window.
• This gives us a new set of coefficients known as the mel spectral coefficients.
• Perform the cepstral transformation on them to extract the Mel Frequency Cepstral Coefficients (MFCC).
• The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
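A minimal Python sketch of the triangular filter bank and the cepstral transformation described above. Everything named here is an assumption for illustration: the function names, the number of filters, the number of spectrum bins, the use of a DCT-II as the cepstral transform, and the small floor added before the logarithm.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular weights over n_bins power-spectrum bins spanning 0..sample_rate/2."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_from_power_spectrum(power, bank, n_coeffs):
    """Weighted sums give mel spectral coefficients; log + DCT-II gives MFCC."""
    mel_energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    log_mel = [math.log(e + 1e-10) for e in mel_energies]  # floor avoids log(0)
    m = len(log_mel)
    return [
        sum(log_mel[j] * math.cos(math.pi * q * (j + 0.5) / m) for j in range(m))
        for q in range(n_coeffs)
    ]


# Hypothetical sizes: 20 filters over 111 bins at the 11025 Hz sampling rate.
bank = mel_filterbank(n_filters=20, n_bins=111, sample_rate=11025)
mfcc = mfcc_from_power_spectrum([1.0] * 111, bank, n_coeffs=13)
```

Because the triangles overlap by half, the weights of neighbouring filters sum to one between the first and last centre frequencies, so no part of that band is counted twice.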
Feature Matching
Dynamic Time Warping (DTW)
• Distance calculation using Dynamic Time
Warping
• Each utterance is divided into frames of 20 ms.
• The MFCC of each frame are computed and represented by a vector.
• Hence each utterance is represented by a vector sequence:
$X = \{x_0, x_1, \ldots, x_{T_x - 1}\}$
• The distance between individual vectors is found using the Euclidean distance formula.
DTW Algorithm
• Finding the optimal alignment path
DTW Algorithm
Key points to find the optimal path
• A grid point (i,j) in the optimal path can have the predecessors
(i-1,j), (i-1,j-1) and (i,j-1)
• Bellman’s Principle: If Popt is the optimal path through the matrix of grid points beginning at (0,0) and ending at (Tw-1,Tx-1), and grid point (i,j) is part of Popt, then the optimal partial path from (0,0) to (i,j) is also part of Popt.
• An accumulated distance matrix is therefore built up from the local distances and the three possible predecessors of each grid point.
• The accumulated distance at the point (Tw-1,Tx-1) is the distance between the vector sequences W and X.
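A minimal Python sketch of the accumulated distance computation. It assumes the common recurrence D(i,j) = d(w_i, x_j) + min{D(i-1,j), D(i-1,j-1), D(i,j-1)} with equal weights on the three predecessors; the weighting used in the VI may differ. The names `euclidean` and `dtw_distance` are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(W, X):
    """Accumulated distance at (Tw-1, Tx-1) between vector sequences W and X."""
    tw, tx = len(W), len(X)
    inf = float("inf")
    D = [[inf] * tx for _ in range(tw)]
    for i in range(tw):
        for j in range(tx):
            d = euclidean(W[i], X[j])
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            # Best of the three allowed predecessors (i-1,j), (i-1,j-1), (i,j-1).
            best = min(
                D[i - 1][j] if i > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
            )
            D[i][j] = d + best
    return D[tw - 1][tx - 1]
```

Filling the matrix row by row is what Bellman's principle permits: each cell only needs the already computed optimal costs of its predecessors.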
Iteration steps in finding the optimal path
VOICE COMMAND RECOGNITION VI
Front Panel
Block Diagram
Step 1: Acquiring the Speech Signal
• The input speech signal is acquired using the LabVIEW “Acquire Sound” Express VI for 3 s at a sampling rate of 11025 Hz.
• An array of LEDs on the front panel indicates the progress of the acquisition.
Step 2: Pre-processing
• Preprocessing of the input speech signal consists of the steps described in 2.1 to 2.5.
Block Diagram of the Preprocessing sub VI
2.1 Pre-Emphasis
• The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that is suppressed by the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which boosts the magnitude of the higher frequencies relative to the lower ones and hence improves the overall signal-to-noise ratio:
$y[n] = x[n] - 0.95\,x[n-1]$
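The filter is a one-line difference equation; a minimal Python sketch follows. The function name `pre_emphasis` is hypothetical, and the first output sample assumes x[-1] = 0.

```python
def pre_emphasis(x, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1], assuming x[-1] = 0."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
```

A constant (zero-frequency) input is attenuated to about 5% of its level after the first sample, which shows the high-pass behaviour.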
2.2 Framing
• The input speech signal is segmented into short frames of 20 ms, each overlapping the adjoining frames by 50% to preserve continuity.
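A minimal Python sketch of the framing step, using the 11025 Hz sampling rate, 20 ms frame length and 50% overlap stated in the slides. The function name `frame_signal` is hypothetical, and dropping the incomplete frame at the end of the signal is an assumption.

```python
def frame_signal(x, sample_rate=11025, frame_ms=20.0, overlap=0.5):
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000.0)   # 220 samples at 11025 Hz
    hop = max(1, int(frame_len * (1.0 - overlap)))     # 110 samples for 50% overlap
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]
```

Since 20 ms at 11025 Hz is 220.5 samples, the frame length is truncated to 220 samples here.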
2.3 Windowing
• Each frame is multiplied by the Hamming window in the time domain. This helps to reduce the discontinuity at the start and end of each frame.
$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$
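For illustration, the window can be generated and applied in Python as below. The names `hamming` and `apply_window` are hypothetical, and N is taken to be the frame length.

```python
import math


def hamming(n_samples):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame):
    """Multiply a frame sample by sample with a window of the same length."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]
```

The window tapers to 0.08 at both ends and reaches 1 in the middle of the frame.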
2.4 Noise threshold detection
• To detect the start of the utterance in the 3 s input speech signal, the energy of each frame is calculated and stored in an array, so the size of the energy array equals the total number of frames. The energy array is sorted in ascending order, and the mean of its first 15 elements is taken as the noise energy. The threshold was set to 10 times the noise energy.
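A minimal Python sketch of the threshold estimate. The function names are hypothetical, and frame energy is assumed to be the sum of squared samples, which the slides do not spell out.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(s * s for s in frame)


def noise_threshold(energies, n_quietest=15, factor=10.0):
    """Threshold = factor times the mean energy of the quietest frames."""
    if not energies:
        raise ValueError("need at least one frame energy")
    quietest = sorted(energies)[:n_quietest]
    noise_energy = sum(quietest) / len(quietest)
    return factor * noise_energy
```

Taking the quietest frames as the noise estimate relies on the recording containing at least that many frames of silence, which is plausible for short commands in a 3 s window.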
2.5 Utterance detection
• Once the threshold has been calculated, all elements of the energy array that are greater than the threshold are replaced by 1 and the rest by 0. This yields a Boolean array.
• Sometimes spikes due to external noise cross the threshold and contribute a 1 to the Boolean array. To remove these spikes, the Median Filter VI in LabVIEW is used with left and right rank 3. The median filter replaces the i-th element of the Boolean array with the median of the elements {i−3, i−2, i−1, i, i+1, i+2, i+3}.
Hence the median filter smooths the Boolean array.
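The thresholding and median smoothing can be sketched in Python as follows. This is not the LabVIEW Median Filter VI itself: the function names are hypothetical, and padding the array edges with zeros is an assumption about how the boundaries are handled.

```python
def threshold_flags(energies, threshold):
    """1 where the frame energy exceeds the threshold, 0 elsewhere."""
    return [1 if e > threshold else 0 for e in energies]


def median_filter(flags, rank=3):
    """Replace each element by the median of itself and `rank` neighbours per side."""
    padded = [0] * rank + list(flags) + [0] * rank   # zero padding at the edges
    out = []
    for i in range(len(flags)):
        window = sorted(padded[i:i + 2 * rank + 1])
        out.append(window[rank])                     # middle of 2*rank + 1 values
    return out
```

With rank 3 the window holds 7 frames, so a run of ones survives only if at least 4 of the 7 frames around a position are above the threshold; isolated spikes are removed.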
• Now the Peak Detector VI in LabVIEW is used to find the indices of the start and end of the utterance. Using these indices, the corresponding frames containing the utterance are extracted.
• N.B.: In my project, all commands were shorter than 0.6 s. Sometimes spikes due to noise remained even after median filtering, so the end index was not detected accurately. The start index, however, was detected accurately most of the time, so I extracted 0.6 s of sound after the start index.
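The fixed-length extraction described in the note can be sketched in Python as below. It replaces the Peak Detector VI with a simple search for the first 1 in the smoothed Boolean array, which is a swapped-in simplification; the function names and the 10 ms hop between frames (20 ms frames with 50% overlap) are assumptions.

```python
def utterance_start(flags):
    """Index of the first frame flagged as speech, or None if there is none."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return None


def extract_utterance(frames, flags, hop_s=0.010, duration_s=0.6):
    """Return the frames covering duration_s seconds after the detected start."""
    start = utterance_start(flags)
    if start is None:
        return []
    n_frames = int(round(duration_s / hop_s))
    return frames[start:start + n_frames]
```

If the utterance starts too close to the end of the recording, fewer frames are returned, so a caller would need to handle sequences of varying length.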
Step 3: Feature Extraction
Block Diagram of Feature Extraction VI
• An FFT is computed for each frame of the utterance and half of it is kept, since the spectrum of a real signal is symmetric.
• The spectrum of each frame is warped onto the mel scale, giving the mel spectral coefficients.
• A discrete cosine transform is applied to the mel spectral coefficients of each frame, giving the MFCC.
• The first 2 coefficients of the obtained MFCC are removed, as they varied significantly between different utterances of the same word.
• Liftering is done by replacing all MFCC except the first 14 by zero.
• The first coefficient of the MFCC of each frame was replaced by the log energy of that frame.
• Delta and acceleration coefficients are computed from the MFCC to increase the dimension of the feature vector of each frame, thereby increasing the accuracy.
• Delta coefficients are found from the following equation. The value of p chosen was 1.
• Acceleration coefficients are found by replacing the MFCC in the above equation with the delta coefficients.
• The feature vector is normalized by subtracting its mean from each element.
Thus each frame of the utterance is converted into a feature vector of dimension 35.
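A minimal Python sketch of delta and acceleration coefficients. It assumes the widely used regression formula d_t = Σ_{k=1..p} k (c_{t+k} − c_{t−k}) / (2 Σ_{k=1..p} k²), which for p = 1 reduces to half the difference between the next and previous frame; the equation on the slide may differ. The function name `delta` and the repetition of the first and last frames at the edges are also assumptions.

```python
def delta(features, p=1):
    """Regression-based delta coefficients over a sequence of feature vectors."""
    n = len(features)
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(features[t])):
            num = 0.0
            for k in range(1, p + 1):
                nxt = features[min(t + k, n - 1)][d]   # repeat last frame at the end
                prv = features[max(t - k, 0)][d]       # repeat first frame at the start
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


# Toy sequence of one-dimensional "MFCC" vectors.
mfcc = [[0.0], [1.0], [4.0], [9.0]]
deltas = delta(mfcc, p=1)
accelerations = delta(deltas, p=1)   # same formula applied to the deltas
```

The per-frame feature vector would then be the concatenation of the static, delta and acceleration coefficients.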
Step 4 : Feature Matching
• A dictionary with six sets has been created.
• In each set, the feature vector sequences of the words to be recognized are stored.
• The feature vector sequence of the test utterance is compared with each word in each set using DTW, and the best match in each set is output.
• The mode of the six best matches is taken as the recognized command.
• A threshold is set so that a random speech signal doesn't result in a match with the commands.
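The voting scheme can be sketched in Python as follows. The data layout (a list of sets, each mapping a word to its stored template), the function name `recognize`, and the way the threshold is applied (a set votes only if its best distance is within the threshold) are all assumptions; `distance` stands for any sequence distance such as DTW.

```python
from collections import Counter


def recognize(test_seq, dictionary_sets, distance, threshold):
    """Return the most frequent best match across the sets, or None."""
    votes = []
    for templates in dictionary_sets:            # one dict per set: word -> template
        word, dist = min(
            ((w, distance(test_seq, tmpl)) for w, tmpl in templates.items()),
            key=lambda pair: pair[1],
        )
        if dist <= threshold:                    # reject matches that are too far
            votes.append(word)
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]   # mode of the per-set winners
```

Voting over several recordings of each word makes the result less sensitive to one poorly recorded template.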
Front Panel of the Dictionary sub VI
Block Diagram of the Dictionary sub VI
Block Diagram of the DTW sub VI
Video Demonstration of the VI
http://www.youtube.com/watch?v=aEqa-t_TWiY
Limitations
• Environment dependent: The input speech feature vector is compared with a set of feature vectors in the dictionary that were recorded in a particular environment. When the VI is used in a different environment, its recognition accuracy decreases unless the threshold and the dictionary are updated accordingly.
• Speaker dependent: As the dictionary is trained by a particular user, the VI gives consistent results only when used by that user.
Questions..?
Thank You