voice command recognition

7/27/2019 Voice Command Recognition


VOICE COMMAND RECOGNITION



METHODOLOGY

Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command



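A Simple Model of Speech Production

• Voiced excitation: pulse train $P(f)$

• Unvoiced excitation: white noise $N(f)$

• Vocal tract spectral shaping: $H(f)$

• Lips emission: $R(f)$

$$S(f) = \big(v \cdot P(f) + u \cdot N(f)\big) \cdot H(f) \cdot R(f)$$

$$S(f) = X(f) \cdot H(f) \cdot R(f)$$

Here $S(f)$ is the speech spectrum, $X(f)$ is the combined excitation spectrum, and $v$, $u$ are the gains of the voiced and unvoiced excitation.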



Spectral Shaping $H(f)$

• Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds.

• The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time.

• Computing the power spectrum directly from the speech signal yields a spectrum containing “ripples” caused by the excitation spectrum $X(f)$.

• A smooth spectral shape, free of the ripples that represent $X(f)$, therefore has to be estimated.



Cepstral Transformation

$$S(f) = X(f) \cdot H(f) \cdot R(f) = H(f) \cdot U(f), \qquad U(f) = X(f) \cdot R(f)$$

$$\log |S(f)|^2 = \log\big(|H(f)|^2 \cdot |U(f)|^2\big) = \log |H(f)|^2 + \log\big(|U(f)|^2\big)$$



• Interpret this log-spectrum as a time signal.

• The “ripples” caused by $U(f)$ then appear as a “high-frequency” component of that signal.

• Hence a kind of low-pass filtering yields the smooth spectral shape.

• The inverse Fourier transform of the log-spectrum brings us back to the time domain, giving the so-called cepstrum.

• Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain.

• Filtering in the cepstral domain is called “liftering”.
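The liftering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a naive DFT is used so the example needs nothing beyond the standard library, and a small constant is added before the logarithm to avoid log(0).

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def smoothed_log_spectrum(frame, n_keep):
    """Log power spectrum -> cepstrum -> zero high-order terms -> back to frequency."""
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in dft(frame)]
    cepstrum = [c.real for c in idft(log_power)]
    n = len(cepstrum)
    # The cepstrum of a real frame is symmetric, so the low-order terms sit at
    # both ends of the array; everything in between is set to zero.
    liftered = [c if (q < n_keep or q > n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    return [c.real for c in dft(liftered)]

frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(32)]
smooth = smoothed_log_spectrum(frame, n_keep=6)
print(len(smooth))
```

Keeping fewer coefficients (`n_keep`) gives a smoother envelope; keeping only the first one collapses the spectrum to its mean log power.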



Mel frequency Cepstral Coefficients

• The human ear does not show a linear frequency resolution; it builds several groups of frequencies and integrates the spectral energies within each group.

• The mid-frequency and bandwidth of these groups are non-linearly distributed.

• The non-linear warping of the frequency axis is modeled by the mel scale, on which the frequency groups are assumed to be linearly distributed:

$$f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$$
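As a quick check of the mel formula, the mapping and its inverse can be written as below. The function names are hypothetical, and the logarithm is taken to be base 10, which is the convention that goes with the constant 2595.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel: float) -> float:
    """Inverse mapping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

for f in (100.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

With these constants 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to increasingly wide steps in Hz.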



• A common way to do mel frequency warping is to use triangle-shaped filters in the spectral domain to build a weighted sum over the power spectrum coefficients that lie within each window.

• This gives us a new set of coefficients known as the mel spectral coefficients.

• Perform the cepstral transformation on them to extract the Mel Frequency Cepstral Coefficients (MFCC).

• The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
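The triangular filter bank described above might look like the following sketch. The names, the number of filters (20) and the number of spectrum bins (111) are assumptions chosen for illustration and are not taken from the slides.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular weights over n_bins power-spectrum bins covering 0..sample_rate/2."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 points equally spaced on the mel axis: filter edges and centres.
    mel_points = [top * k / (n_filters + 1) for k in range(n_filters + 2)]
    bin_points = [mel_to_hz(m) / nyquist * (n_bins - 1) for m in mel_points]
    bank = []
    for k in range(1, n_filters + 1):
        left, centre, right = bin_points[k - 1], bin_points[k], bin_points[k + 1]
        row = []
        for b in range(n_bins):
            if left < b <= centre:
                row.append((b - left) / (centre - left))    # rising edge
            elif centre < b < right:
                row.append((right - b) / (right - centre))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_spectrum(power_spectrum, bank):
    """Weighted sum of power-spectrum coefficients inside each triangular window."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]

bank = mel_filterbank(n_filters=20, n_bins=111, sample_rate=11025)
print(len(bank), len(bank[0]))
```

Because the filter centres are equally spaced in mel, the triangles are narrow at low frequencies and wide at high frequencies.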



Feature Matching



Dynamic Time Warping (DTW)

• Distance calculation using Dynamic Time Warping



• Each utterance is divided into frames of 20 ms.

• The MFCCs of each frame are computed and represented by a vector.

• Hence each utterance is represented by a vector sequence:

$$X = \{x_0, x_1, \ldots, x_{T_x - 1}\}$$

• The distance between individual vectors is found using the Euclidean distance formula.



DTW Algorithm

• Finding the optimal alignment path



DTW Algorithm

Key points to find the optimal path

• A grid point $(i,j)$ on the optimal path can have the predecessors $(i-1,j)$, $(i-1,j-1)$ and $(i,j-1)$.



• Bellman’s Principle: If $P_{opt}$ is the optimal path through the matrix of grid points beginning at $(0,0)$ and ending at $(T_w-1, T_x-1)$, and grid point $(i,j)$ is part of path $P_{opt}$, then the partial path from $(0,0)$ to $(i,j)$ is also part of $P_{opt}$.

• An accumulated distance matrix is created according to the formula.

• The accumulated distance at the point $(T_w-1, T_x-1)$ is the distance between the vector sequences $W$ and $X$.
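The slide's own accumulation formula is not reproduced in the text, so the sketch below assumes the plain, unweighted recurrence that follows from the three predecessors listed above: D(i,j) = d(i,j) + min{D(i-1,j), D(i-1,j-1), D(i,j-1)}. The function names are hypothetical, and any path weights the original VI may have used are not modeled.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(W, X):
    """Accumulated distance at (Tw-1, Tx-1) between vector sequences W and X."""
    Tw, Tx = len(W), len(X)
    INF = float("inf")
    D = [[INF] * Tx for _ in range(Tw)]
    for i in range(Tw):
        for j in range(Tx):
            d = euclidean(W[i], X[j])
            if i == 0 and j == 0:
                D[i][j] = d          # the path starts at (0, 0)
                continue
            best = min(
                D[i - 1][j] if i > 0 else INF,
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
            )
            D[i][j] = d + best       # assumed unweighted recurrence
    return D[Tw - 1][Tx - 1]

W = [[0.0], [1.0], [2.0]]
X = [[0.0], [0.0], [1.0], [1.0], [2.0]]   # same shape, spoken more slowly
print(dtw_distance(W, X))
```

A time-stretched copy of a sequence gets distance zero, which is the property that makes DTW suitable for comparing utterances spoken at different speeds.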



Iteration steps in finding the optimal path



VOICE COMMAND RECOGNITION VI



Front Panel



Block Diagram



Step 1: Acquiring the Speech Signal

• The input speech signal is acquired with the LabVIEW “Acquire Sound Express VI” for 3 s at a sampling rate of 11025 Hz.

• An array of LEDs on the front panel indicates the progress of the acquisition.



Step 2: Pre-processing

• Preprocessing of the input speech signal consists of the following steps:



Block Diagram of the Preprocessing sub VI



2.1 Pre-Emphasis

• The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that was suppressed by the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which increases the magnitude of the higher frequencies relative to the other frequencies and hence improves the overall signal-to-noise ratio:

$$y[n] = x[n] - 0.95\, x[n-1]$$
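The pre-emphasis filter is a one-line difference equation. In this minimal sketch the function name is hypothetical, and passing the first sample through unchanged (it has no predecessor) is an assumption.

```python
def pre_emphasis(x, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples."""
    if not x:
        return []
    # Assumption: the first sample has no predecessor and is kept as-is.
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))    # slowly varying input is attenuated
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))  # rapidly varying input is boosted
```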



2.2 Framing

• The input speech signal is segmented into short frames of 20 ms length, each overlapping its neighbouring frames by 50% to preserve continuity.
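A framing sketch under stated assumptions: at 11025 Hz a 20 ms frame is 220.5 samples, which is rounded down to 220 here, giving a hop of 110 samples for 50% overlap. The function name and the rounding are illustrative choices, not taken from the VI.

```python
def split_frames(signal, sample_rate=11025, frame_ms=20.0, overlap=0.5):
    """Cut a sample list into fixed-length, overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000.0)   # 220 samples (assumed rounding)
    hop = max(1, int(frame_len * (1.0 - overlap)))     # 110 samples for 50% overlap
    # Trailing samples that do not fill a whole frame are dropped.
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

signal = [0.0] * (3 * 11025)      # 3 s of silence as a stand-in recording
frames = split_frames(signal)
print(len(frames), len(frames[0]))
```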



2.3 Windowing

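• Each frame is multiplied by the Hamming window in the time domain. This reduces the discontinuity at the start and end of each frame.

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

where $N$ is the number of samples in a frame.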
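The window formula translates directly into code. The function names below are hypothetical.

```python
import math

def hamming(N):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window of the same length."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]

w = hamming(5)
print([round(v, 2) for v in w])
```

The window tapers to 0.08 at both ends and reaches 1.0 in the middle, so the frame edges contribute little to the spectrum.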



2.4 Noise threshold detection

• To detect the start of the utterance within the 3 s input speech signal, the energy of each frame is calculated and stored in an array, whose size equals the total number of frames. The energy array is sorted in ascending order, and the mean of its first 15 elements gives the noise energy. The threshold was set to 10 times the noise energy.
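The threshold rule can be sketched as follows. Frame energy is taken to be the sum of squared samples, which is an assumption since the slides do not define it, and the function names are hypothetical.

```python
def frame_energies(frames):
    """Energy of each frame, assumed here to be the sum of squared samples."""
    return [sum(s * s for s in frame) for frame in frames]

def noise_threshold(energies, n_quiet=15, factor=10.0):
    """Mean of the n_quiet smallest energies, scaled by factor."""
    quiet = sorted(energies)[:n_quiet]
    return factor * sum(quiet) / len(quiet)

energies = [1.0] * 20 + [100.0] * 5   # made-up values: 20 quiet frames, 5 loud ones
print(noise_threshold(energies))
```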



2.5 Utterance detection

• Once the threshold has been calculated, every element of the energy array greater than the threshold is replaced by 1 and the rest by 0. Thus a Boolean array of the following form is obtained.

• Sometimes spikes due to external noise cross the threshold and contribute a 1 to the Boolean array. To remove these spikes, a Median Filter VI in LabVIEW with left and right rank 3 is used. The median filter replaces the $i$-th element of the Boolean array with the median of the elements $\{i-3,\ i-2,\ i-1,\ i,\ i+1,\ i+2,\ i+3\}$.

Hence the median filter smooths the Boolean array.
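The spike removal can be imitated with a rank-3 median filter in plain Python. This is not the LabVIEW Median Filter VI; in particular, zero-padding at the array ends is an assumption, and the function name is hypothetical.

```python
from statistics import median

def median_filter(flags, rank=3):
    """Replace each element by the median of itself and `rank` neighbours per side."""
    # Assumption: positions beyond the array ends count as 0.
    padded = [0] * rank + list(flags) + [0] * rank
    return [int(median(padded[i:i + 2 * rank + 1])) for i in range(len(flags))]

flags = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(median_filter(flags))
```

In the example the isolated 1 at index 4 is removed, while the run of seven 1s that stands for the utterance is kept.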



• Now the Peak Detector VI in LabVIEW is used to find the indices of the start and end of the utterance. Using these indices, the corresponding frames containing the utterance are extracted.

• N.B.: In my project, all commands were shorter than 0.6 s. Sometimes spikes due to noise remained even after the median filter, so the end index was not detected accurately. The start index, however, was detected accurately most of the time, so I extracted 0.6 s of sound after the start index.
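The workaround in the note, taking a fixed 0.6 s after the detected start, can be sketched as below. A simple scan for the first 1 stands in for the LabVIEW Peak Detector VI, and the hop of 110 samples per frame is an assumption that corresponds to 220-sample frames with 50% overlap.

```python
def extract_utterance(frames, flags, sample_rate=11025, duration_s=0.6, hop=110):
    """Return the frames covering duration_s seconds from the first flagged frame."""
    if 1 not in flags:
        return []                      # no frame exceeded the threshold
    start = flags.index(1)             # stand-in for the Peak Detector VI
    n_frames = int(duration_s * sample_rate / hop)   # 60 frames for the defaults
    return frames[start:start + n_frames]

frames = list(range(299))              # frame numbers as placeholders for frame data
flags = [0] * 50 + [1] * 20 + [0] * 229
utterance = extract_utterance(frames, flags)
print(utterance[0], len(utterance))
```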



Step 3: Feature Extraction



Block Diagram of Feature Extraction VI



• An FFT is computed for each frame of the utterance, and only half of the (symmetric) spectrum is kept.

• The spectrum of each frame is warped onto the mel scale, giving the mel spectral coefficients.

• A discrete cosine transform is applied to the mel spectral coefficients of each frame, giving the MFCCs.

• The first 2 coefficients of the obtained MFCCs are removed, as they varied significantly between different utterances of the same word.

• Liftering is done by replacing all MFCCs except the first 14 with zero.

• The first MFCC of each frame was replaced by the log energy of that frame.

• Delta and acceleration coefficients are computed from the MFCCs to increase the dimension of the feature vector of each frame, thereby increasing the accuracy.
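The DCT and truncation steps can be sketched as follows. The sketch assumes an unnormalised DCT-II applied to the logarithm of the mel spectral coefficients, which is the usual cepstral convention but is not spelled out on the slide. The function names are hypothetical, and the removal of the first two coefficients and the log-energy substitution are deliberately left out.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II: c[k] = sum_i v[i] * cos(pi * k * (2i + 1) / (2n))."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]

def mfcc_from_mel(mel_energies, n_keep=14):
    """Log of the mel energies, DCT, then keep only the first n_keep coefficients."""
    log_mel = [math.log(e + 1e-12) for e in mel_energies]  # small offset avoids log(0)
    return dct_ii(log_mel)[:n_keep]

coeffs = mfcc_from_mel([1.0, 2.0, 4.0, 8.0] * 5)   # 20 made-up mel energies
print(len(coeffs))
```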



• Delta coefficients are found from the following equation. The value of $p$ chosen was 1.

• Acceleration coefficients are found by replacing the MFCCs in the above equation with the delta coefficients.

• The feature vector is normalized by subtracting its mean from each element.

Thus each frame of the utterance is converted into a feature vector of dimension 35.
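The delta equation itself is not reproduced in the text, so the sketch below assumes the commonly used regression form d_t = Σ_{k=1..p} k (c_{t+k} − c_{t−k}) / (2 Σ_{k=1..p} k²), which for p = 1 reduces to (c_{t+1} − c_{t−1}) / 2. Repeating the first and last frame at the sequence edges is a further assumption, and the function name is hypothetical.

```python
def delta(features, p=1):
    """Delta coefficients of a sequence of feature vectors (assumed regression form)."""
    T = len(features)
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, p + 1):
                nxt = features[min(t + k, T - 1)][d]   # edge frames are repeated
                prv = features[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

mfcc = [[0.0], [1.0], [4.0], [9.0]]     # made-up one-dimensional feature sequence
d1 = delta(mfcc)                         # delta coefficients
d2 = delta(d1)                           # acceleration: the same formula applied to d1
print(d1, d2)
```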



Step 4 : Feature Matching

• A dictionary with six sets has been created.

• Each set stores the feature vector sequences of the words to be recognized.

• The feature sequence of the test utterance is compared with each word in each set using DTW, and the best match in each set is output.

• The mode of the six per-set results is taken as the recognized command.

• A threshold is set so that a random speech signal does not result in a match with any command.
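The voting scheme can be sketched as below. The slides do not say how the threshold is applied, so rejecting a set's best match when its distance exceeds the threshold is an assumption. The names and the toy scalar "templates" are hypothetical; in practice `distance` would be a DTW function over feature vector sequences.

```python
from collections import Counter

def recognise(test_seq, dictionary_sets, distance, threshold):
    """Best match per set, then the most frequent word across sets (the mode)."""
    votes = []
    for templates in dictionary_sets:          # each set maps word -> stored template
        word, dist = min(((w, distance(t, test_seq)) for w, t in templates.items()),
                         key=lambda pair: pair[1])
        if dist <= threshold:                  # assumed rejection rule
            votes.append(word)
    if not votes:
        return None                            # nothing close enough: no command
    return Counter(votes).most_common(1)[0][0]

# Toy example: scalar "templates" and absolute difference as the distance.
sets = [{"start": 1.0, "stop": 5.0},
        {"start": 1.2, "stop": 5.1},
        {"start": 0.9, "stop": 4.8}]
dist = lambda a, b: abs(a - b)
print(recognise(1.1, sets, dist, threshold=0.5))
print(recognise(100.0, sets, dist, threshold=0.5))
```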



Front Panel of the Dictionary sub VI



Block Diagram of the Dictionary

sub VI



Block Diagram of the DTW sub VI



Video Demonstration of the VI

http://www.youtube.com/watch?v=aEqa-t_TWiY



Limitations

• Environment dependent: The input speech feature vector is compared with a set of feature vectors in the dictionary that were recorded in a particular environment. When the system is used in a different environment, its efficiency decreases unless the threshold and the dictionary are updated accordingly.

• Speaker dependent: As the dictionary is trained by a particular user, the VI outputs consistent results when used by the trainer.



Questions..?



Thank You
