voice command recognition

Upload: moises-jaber

Post on 14-Apr-2018

Page 1: Voice Command Recognition

7/27/2019 Voice Command Recognition

http://slidepdf.com/reader/full/voice-command-recognition 1/40

VOICE COMMAND RECOGNITION

METHODOLOGY

Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command

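A Simple Model of Speech Production

[Figure: block diagram. Voiced excitation (pulse train, $P(f)$) and unvoiced excitation (white noise, $N(f)$) feed the vocal tract spectral shaping $H(f)$, followed by the lips emission $R(f)$.]

$$X(f) = [v \cdot P(f) + (1 - v) \cdot N(f)] \cdot H(f) \cdot R(f)$$

$$X(f) = U(f) \cdot H(f) \cdot R(f)$$

where $v$ weights the voiced against the unvoiced excitation and $U(f)$ is the combined excitation spectrum.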

Spectral Shaping $H(f)$

• Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds.

• The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time.

• Computing the power spectrum directly from the speech signal gives a spectrum containing “ripples” caused by the excitation spectrum $U(f)$.

• A smooth spectral shape, without the ripples that represent $U(f)$, has to be estimated.

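Cepstral Transformation

$$|X(f)|^2 = |U(f)|^2 \cdot |H(f)|^2 \cdot |R(f)|^2 = |H'(f)|^2 \cdot |U(f)|^2$$

$$\log |X(f)|^2 = \log\left(|H'(f)|^2 \cdot |U(f)|^2\right) = \log |H'(f)|^2 + \log |U(f)|^2$$

with $H'(f) = H(f) \cdot R(f)$ and $U(f)$ the excitation spectrum.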

• Interpret this log-spectrum as a time signal.

• The “ripples” caused by the excitation spectrum $U(f)$ would then have a “high frequency”.

• Hence, a kind of low-pass filtering yields the smooth spectral shape.

• The inverse Fourier transform of the log spectrum brings us back to the time domain, giving the so-called cepstrum.

• Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain.

• The process of filtering in the cepstral domain is called “liftering”.
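
The liftering idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the LabVIEW implementation: the function name `smooth_log_spectrum`, the default of 14 retained coefficients and the small offset inside the logarithm are assumptions made for the example.

```python
import numpy as np

def smooth_log_spectrum(frame, n_keep=14):
    """Return the log power spectrum of a real frame and a liftered (smoothed) copy.

    n_keep is a hypothetical default; the small offset avoids log(0).
    """
    frame = np.asarray(frame, dtype=float)
    log_power = np.log(np.abs(np.fft.fft(frame)) ** 2 + 1e-12)
    # Inverse FFT of the log spectrum gives the (real) cepstrum.
    cepstrum = np.fft.ifft(log_power).real
    # Keep the low-order coefficients and their mirror images, zero the rest.
    lifter = np.zeros(len(frame))
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0
    # Transform back to the frequency domain to obtain the smooth envelope.
    smoothed = np.fft.fft(cepstrum * lifter).real
    return log_power, smoothed
```

Keeping the mirrored coefficients as well preserves the symmetry of the cepstrum, so the smoothed log spectrum stays real.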

Mel Frequency Cepstral Coefficients

• The human ear does not have a linear frequency resolution; it forms several groups of frequencies and integrates the spectral energies within each group.

• The mid-frequencies and bandwidths of these groups are non-linearly distributed.

• The non-linear warping of the frequency axis is modeled by the mel scale, on which the frequency groups are assumed to be linearly distributed:

$$f_{mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700\,\text{Hz}}\right)$$
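
As a quick check of the mel formula, the conversion and its inverse can be written as two small functions. The names `hz_to_mel` and `mel_to_hz` are chosen for this sketch only.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving the formula for the frequency."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 mel, which is the usual anchor point of the scale.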

• A common way to do mel frequency warping is to use triangle-shaped filters in the spectral domain to build a weighted sum over the power spectrum coefficients that lie within each window.

• This gives a new set of coefficients known as the mel spectral coefficients.

• Perform the cepstral transformation on them to extract the Mel Frequency Cepstral Coefficients (MFCC).

• The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
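
A triangular mel filterbank of the kind described above might look as follows. This is a sketch under stated assumptions: the function name, the number of filters and the FFT size are not given in the slides and are hypothetical parameters here.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters of shape (n_filters, n_fft // 2 + 1).

    Filter edges are equally spaced on the mel scale between 0 Hz and
    sample_rate / 2. All three parameters are assumptions of this sketch.
    """
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, len(bin_freqs)))
    for m in range(n_filters):
        left, centre, right = hz_edges[m:m + 3]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank
```

The mel spectral coefficients of a frame are then the matrix product `bank @ power_spectrum`, where `power_spectrum` holds the first `n_fft // 2 + 1` power spectrum values of that frame.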

Feature Matching

Dynamic Time Warping (DTW)

• Distance calculation using Dynamic Time Warping

• Each utterance is divided into frames of 20 ms.

• The MFCC of each frame are computed and represented by a vector.

• Hence each utterance is represented by a vector sequence:

$$X = \{x_0, x_1, \ldots, x_{T_x - 1}\}$$

• The distance between individual vectors is found using the Euclidean distance formula.
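
The local distances between every pair of frame vectors from two utterances can be collected in one matrix. A minimal NumPy sketch, with `local_distances` as an illustrative name and the two sequences passed as arrays of shape (frames, features):

```python
import numpy as np

def local_distances(W, X):
    """d[i, j] is the Euclidean distance between frame vectors W[i] and X[j]."""
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    diff = W[:, None, :] - X[None, :, :]   # shape (Tw, Tx, n_features)
    return np.sqrt((diff ** 2).sum(axis=2))
```

The two sequences may have different numbers of frames; only the feature dimension has to agree.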

DTW Algorithm

• Finding the optimal alignment path

DTW Algorithm

Key points to find the optimal path

• A grid point $(i, j)$ in the optimal path can have the predecessors $(i-1, j)$, $(i-1, j-1)$ and $(i, j-1)$.

• Bellman's Principle: if $P_{opt}$ is the optimal path through the matrix of grid points beginning at $(0, 0)$ and ending at $(T_w - 1, T_x - 1)$, and grid point $(i, j)$ is part of $P_{opt}$, then the partial path from $(0, 0)$ to $(i, j)$ is also part of $P_{opt}$.

• An accumulated distance matrix is created according to the formula [equation shown on slide].

• The accumulated distance at the point $(T_w - 1, T_x - 1)$ is the distance between the vector sequences $W$ and $X$.
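
One common, unweighted form of the accumulated-distance recurrence that is consistent with the three predecessors listed above is $\delta(i, j) = d(w_i, x_j) + \min\{\delta(i-1, j), \delta(i-1, j-1), \delta(i, j-1)\}$. The sketch below assumes this form; the formula used in the VI may weight the transitions differently, and the name `dtw_distance` is illustrative.

```python
import numpy as np

def dtw_distance(W, X):
    """Accumulated DTW distance between two vector sequences.

    Assumes the unweighted recurrence
    acc[i, j] = d(i, j) + min(acc[i-1, j], acc[i-1, j-1], acc[i, j-1]).
    """
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    Tw, Tx = len(W), len(X)
    # Euclidean distance between every pair of frame vectors.
    local = np.sqrt(((W[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    acc = np.full((Tw, Tx), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(Tw):
        for j in range(Tx):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best
    # The value at the end point is the distance between the sequences.
    return acc[Tw - 1, Tx - 1]
```

Because each cell only needs its three predecessors, the matrix is filled in a single pass, which is the practical consequence of Bellman's principle.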

Iteration steps in finding the optimal path

VOICE COMMAND RECOGNITION VI

Front Panel

Block Diagram

Step 1: Acquiring the Speech Signal

• The input speech signal is acquired using the LabVIEW “Acquire Sound” Express VI for 3 s at a sampling rate of 11025 Hz.

• An array of LEDs on the front panel indicates the progress of the acquisition.

Step 2: Pre-processing

• Preprocessing of the input speech signal consists of the following steps: pre-emphasis, framing, windowing, noise threshold detection and utterance detection.

Block Diagram of the Preprocessing sub VI

2.1 Pre-Emphasis

• The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that is suppressed by the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which increases the magnitude of the higher frequencies relative to the lower ones and hence improves the overall signal-to-noise ratio.

$$y[n] = x[n] - 0.95\,x[n-1]$$
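
The pre-emphasis filter is a one-line difference equation. A minimal sketch, assuming a non-empty input and treating the sample before the first one as zero (the slides do not say how the first sample is handled):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1] to a non-empty signal."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # assumption: x[-1] is taken as 0
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

A constant input is attenuated to 5% of its value after the first sample, while fast sample-to-sample changes pass through almost unchanged.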

2.2 Framing

• The input speech signal is segmented into short frames of 20 ms length, each overlapping its neighbours by 50% to preserve continuity.
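
Framing with overlap can be sketched as index arithmetic. The function name and the truncation of 220.5 samples to 220 are assumptions of this example; trailing samples that do not fill a whole frame are dropped.

```python
import numpy as np

def frame_signal(x, sample_rate=11025, frame_ms=20.0, overlap=0.5):
    """Split x into overlapping frames, returned as shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000.0)        # 220 samples at 11025 Hz
    hop = max(1, frame_len - int(frame_len * overlap))      # 110 samples for 50% overlap
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row t of idx holds the sample indices of frame t.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

For a 3 s recording at 11025 Hz (33075 samples), this yields 299 frames of 220 samples each.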

2.3 Windowing

• Each frame is multiplied by the Hamming window in the time domain. This helps to reduce the discontinuity at the start and end of each frame.

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$
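
The window formula translates directly into NumPy. The sketch assumes a frame length of at least two samples, and `hamming_window` is an illustrative name.

```python
import numpy as np

def hamming_window(N):
    """Hamming window of length N (N >= 2), from the formula above."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))
```

The result matches NumPy's built-in `np.hamming(N)`; a framed signal is windowed with `frames * hamming_window(frames.shape[1])`.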

2.4 Noise threshold detection

• To detect the start of the utterance in the 3 s input speech signal, the energy of each frame is calculated and stored in an array, whose size equals the total number of frames. This energy array is sorted in ascending order, and the mean of its first 15 elements gives the energy of the noise. The threshold was set to 10 times the noise energy.
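
The threshold estimate can be sketched as follows. The slides do not define the frame energy, so the sum of squared samples is assumed here; the function name is illustrative.

```python
import numpy as np

def noise_threshold(frames, n_quietest=15, factor=10.0):
    """Return per-frame energies and the detection threshold.

    Energy is assumed to be the sum of squared samples of each frame.
    """
    energies = (np.asarray(frames, dtype=float) ** 2).sum(axis=1)
    # The quietest frames are taken to contain only background noise.
    noise_energy = np.sort(energies)[:n_quietest].mean()
    return energies, factor * noise_energy
```

Using the quietest frames rather than the first frames means the estimate does not depend on where in the recording the speaker starts talking.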

2.5 Utterance detection

• Once the threshold has been calculated, all elements of the energy array that are greater than the threshold are replaced by 1 and the rest by 0. Thus a Boolean array of 0s and 1s is obtained.

• Sometimes spikes due to external noise cross the threshold and contribute a 1 to the Boolean array. To remove these spikes, the Median Filter VI in LabVIEW is used with left and right rank set to 3. The median filter replaces the $i$-th element of the Boolean array with the median of the elements $\{i-3, i-2, i-1, i, i+1, i+2, i+3\}$.

Hence the median filter smooths the Boolean array.
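
The effect of the rank-3 median filter can be reproduced with NumPy. This is a sketch, not the LabVIEW Median Filter VI: zero padding at both ends is an assumption, and the VI may treat the edges differently.

```python
import numpy as np

def median_smooth(flags, rank=3):
    """Median-filter a 0/1 array with a window of 2 * rank + 1 elements."""
    flags = np.asarray(flags, dtype=int)
    # Assumption: pad with zeros so that every element has a full window.
    padded = np.pad(flags, rank, mode="constant", constant_values=0)
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * rank + 1)
    return np.median(windows, axis=1).astype(int)
```

With a 7-element window, a run of ones survives only if at least four of the seven neighbouring elements are 1, so isolated spikes of up to three frames are removed.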

• The Peak Detector VI in LabVIEW is then used to find the indices of the start and end of the utterance. Using these indices, the corresponding frames containing the utterance are extracted.

• N.B.: In my project, all commands were shorter than 0.6 s. Sometimes spikes due to noise remained even after median filtering, so the end index was not detected accurately. The start index, however, was detected accurately most of the time, so I extracted 0.6 s of sound after the start index.
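
The fixed-length extraction described in the note can be sketched as below. Instead of the Peak Detector VI, this example simply takes the first 1 in the smoothed Boolean array as the start index; the function name and the 10 ms hop (20 ms frames with 50% overlap) are assumptions.

```python
import numpy as np

def extract_utterance(frames, flags, hop_s=0.010, duration_s=0.6):
    """Return the frames covering duration_s seconds after the first active frame.

    Returns None when no frame is flagged as speech.
    """
    active = np.flatnonzero(np.asarray(flags))
    if active.size == 0:
        return None
    start = int(active[0])
    n_frames = int(round(duration_s / hop_s))   # 60 frames for 0.6 s at a 10 ms hop
    return frames[start:start + n_frames]
```

If the recording ends less than 0.6 s after the start index, the slice is simply shorter.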

Step 3: Feature Extraction

Block Diagram of Feature Extraction VI

• An FFT is computed on each frame of the utterance and half of it is kept.

• The spectrum of each frame is warped onto the mel scale, giving the mel spectral coefficients.

• A discrete cosine transform is applied to the mel spectral coefficients of each frame, giving the MFCC.

• The first 2 coefficients of the MFCC are removed, as they varied significantly between different utterances of the same word.

• Liftering is done by replacing all MFCC except the first 14 by zero.

• The first coefficient of the MFCC of each frame is replaced by the log energy of that frame.

• Delta and acceleration coefficients are computed from the MFCC to increase the dimension of the feature vector of the frames, thereby increasing the accuracy.
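
The step from mel spectral coefficients to cepstral coefficients can be sketched as a logarithm followed by a DCT. The example covers only the log, DCT and truncation steps; it assumes an unnormalised DCT-II and a logarithm taken before the transform (as in the cepstral transformation), and `mfcc_from_mel_spectrum` is an illustrative name.

```python
import numpy as np

def mfcc_from_mel_spectrum(mel_spectrum, n_keep=14):
    """Log, unnormalised DCT-II, then keep the first n_keep coefficients."""
    mel_spectrum = np.asarray(mel_spectrum, dtype=float)
    M = len(mel_spectrum)
    log_mel = np.log(mel_spectrum + 1e-12)       # small offset avoids log(0)
    k = np.arange(M)[:, None]
    m = np.arange(M)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / M)    # DCT-II basis, row k = coefficient k
    return (basis @ log_mel)[:n_keep]
```

A flat mel spectrum puts all of its energy into coefficient 0, which illustrates why the low-order coefficients carry the overall level rather than the spectral shape.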

• Delta coefficients are found from the following equation [equation shown on slide]. The value of p chosen was 1.

• Acceleration coefficients are found by replacing the MFCC in the above equation by the delta coefficients.

• The feature vector is normalized by subtracting its mean from each element.

Thus each frame of the utterance is converted into a feature vector of dimension 35.
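
A widely used regression formula for delta coefficients is $d_t = \sum_{k=1}^{p} k\,(c_{t+k} - c_{t-k}) \,/\, \left(2 \sum_{k=1}^{p} k^2\right)$. The sketch below assumes this form, which may differ from the equation used in the VI, and repeats the first and last frames at the edges.

```python
import numpy as np

def delta(features, p=1):
    """Delta coefficients of a (frames, coefficients) array.

    Assumes the standard regression formula and edge padding.
    """
    features = np.asarray(features, dtype=float)
    T = len(features)
    padded = np.pad(features, ((p, p), (0, 0)), mode="edge")
    num = np.zeros_like(features)
    for k in range(1, p + 1):
        # padded[p + t] is features[t], so these slices are c[t+k] and c[t-k].
        num += k * (padded[p + k:p + k + T] - padded[p - k:p - k + T])
    return num / (2.0 * sum(k * k for k in range(1, p + 1)))
```

With p = 1 this reduces to half the difference between the following and the preceding frame. Acceleration coefficients follow as `delta(delta(features))`.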

Step 4: Feature Matching

• A dictionary with six sets has been created.

• In each set, the feature vector sequences of the words to be recognized are stored.

• The feature sequence of the test utterance is compared with each word in each set using DTW, and the best match in each set is output.

• The mode of the six best matches is taken as the recognized command.

• A threshold is set so that a random speech signal doesn't result in a match with the commands.
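
The voting scheme can be sketched in plain Python. How the threshold is applied is not stated in the slides; this example assumes that a set casts no vote when its best distance exceeds the threshold. The function name, the dictionary layout and the `distance` argument (for instance a DTW function) are hypothetical.

```python
from collections import Counter

def recognise(test_seq, dictionary_sets, distance, threshold):
    """Return the most frequent best match over all sets, or None if rejected.

    dictionary_sets is a list of {word: template} mappings, one per set.
    """
    votes = []
    for word_templates in dictionary_sets:
        scores = {word: distance(test_seq, tpl) for word, tpl in word_templates.items()}
        best_word = min(scores, key=scores.get)
        # Assumption: a set abstains when even its best match is too far away.
        if scores[best_word] <= threshold:
            votes.append(best_word)
    if not votes:
        return None
    # The mode of the per-set winners is the recognized command.
    return Counter(votes).most_common(1)[0][0]
```

Taking the mode over several recordings of each word makes the result less sensitive to one unusual template.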

Front Panel of the Dictionary sub VI

Block Diagram of the Dictionary sub VI

Block Diagram of the DTW sub VI

Video Demonstration of the VI

http://www.youtube.com/watch?v=aEqa-t_TWiY

Limitations

• Environment Dependent
The input speech feature vector is compared with a set of feature vectors in the dictionary which were recorded in a particular environment. So when used in a different environment, the efficiency decreases unless the threshold and the dictionary are updated accordingly.

• Speaker Dependent
As the dictionary is trained by a particular user, the VI outputs consistent results when used by the trainer.

Questions..?

Thank You