speaker and speech recognition for secured smart home applications

International Conference on Advance Research in Computer Science, Electrical and Electronics Engineering

Sep 7, 2013 Pattaya


SPEAKER AND SPEECH RECOGNITION FOR SECURED SMART

HOME APPLICATION

R. Gomes¹, S. Shaji², L. Nadar², V. Vincent²

Dept. of Electronics and Telecommunication

Xavier Institute of Engineering, University of Mumbai

Mahim (W), Mumbai-400016, Maharashtra, India

S. Patnaik

Dept. of Electronics and Telecommunication

Xavier Institute of Engineering, University of Mumbai

Mahim (W), Mumbai-400016, Maharashtra, India


ABSTRACT

The concept of a smart home refers to having intelligent devices around us that respond to our needs as situations arise, e.g. switching lights and fans on or off when an individual enters or leaves a room, or automatically adjusting the temperature of a room according to the ambient temperature. In a smart home, an individual's interaction with the electrical appliances is crucial, since it gives him complete freedom to control all the devices at home. With this control, however, a question of security arises: an individual would want access to the devices restricted to his family members and friends. To address this simultaneous demand for security (e.g. operation by family members only) and automation (remote operation of multiple devices), this paper presents speaker recognition for security and speech recognition for home appliance automation. The goal is the design and implementation of text-independent speaker recognition based on Mel-Frequency Cepstrum Coefficients (MFCCs) and the Vector Quantization (VQ) algorithm for security, integrated with speaker-independent speech recognition using the Dynamic Time Warping (DTW) algorithm for home appliance automation.

KEYWORDS: Automation, Security, Speaker

Recognition, Speech Recognition, Mel Frequency

Cepstrum Coefficients (MFCCs), Vector Quantization

(VQ), Dynamic Time Warping (DTW)

I. INTRODUCTION

The human speech signal contains many discriminative features. These features are unique to every individual and serve as a biometric parameter that robust voice-based biometric systems can use to verify an individual's identity [1]. Unlike other biometric parameters such as fingerprint and iris, voice-based biometrics offers the advantage of accessing systems remotely through the telephone network, which makes it valuable in real-time applications of authentication and authorization over a large distance [2]. Speaker recognition is the process of automatically recognizing who is speaking on the basis of information obtained from his speech. This technique makes it possible to verify the identity of a person accessing the system [2]. In the context of automation in a smart home, only an authorized user must be given access to control the devices and appliances at home. For authenticating a user we use text-independent speaker recognition. Once access to the system has been granted to the authenticated user, all the appliances and devices connected to the system must be under his control. To accomplish this we use isolated-word speech recognition, which identifies the uttered words by matching them with the reference templates stored in the database.

The system proposed in this paper involves three phases. The first is the speaker recognition phase, which authenticates the user. The second is the speech recognition phase, which identifies the word spoken by the user for the purpose of automation. The third is the device control phase, which serially communicates the results of identification to a PIC16F676 microcontroller to toggle the status of the devices connected to it.

II. SPEAKER RECOGNITION

Speaker recognition is the method of automatically identifying who is speaking on the basis of individual information contained in speech waves [2]. The process involves two phases, the training phase and the testing phase. Both phases involve extracting feature vectors and matching them. Feature extraction is done with the MFCC algorithm, and feature matching with VQ, optimized with the Linde, Buzo and Gray (LBG) algorithm.

Fig. 1 Block Diagram of MFCC Processor [3]

A. Mel-frequency Cepstrum Coefficients

The Mel-Frequency Cepstrum (MFC) is a representation of the short-term power spectrum of a sound. The MFCCs are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum") [3]. The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum [1].

1) Frame Blocking: A speech signal is not stationary over a long interval of time, but over a sufficiently short interval, say 10-30 ms, it can be considered stationary. In frame blocking, the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples [3]. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps the first frame by N - 2M samples. Typical values are N = 256 (which corresponds to roughly 30 ms of speech and, being a power of two, facilitates the fast radix-2 FFT) and M = 100 [1, 3].
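The frame blocking step can be sketched in a few lines. The example below is in Python rather than the Matlab used in this work, and the function name `frame_signal` and the choice to drop trailing samples that do not fill a whole frame are assumptions made for illustration.

```python
def frame_signal(signal, frame_len=256, hop=100):
    """Split a sequence of samples into overlapping frames.

    Frame i starts at sample i * hop, so adjacent frames share
    frame_len - hop samples. Trailing samples that do not fill a
    whole frame are dropped (an assumption; padding is another option).
    """
    if not 0 < hop < frame_len:
        raise ValueError("hop must satisfy 0 < hop < frame_len")
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop
    return [list(signal[i * hop: i * hop + frame_len]) for i in range(n_frames)]


samples = list(range(1000))        # stand-in for real speech samples
frames = frame_signal(samples)     # N = 256, M = 100
print(len(frames), "frames of", len(frames[0]), "samples")
```

With N = 256 and M = 100, the second frame shares its first 156 samples (N - M) with the first frame, and the third frame shares its first 56 samples (N - 2M) with it.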

2) Windowing: Each frame is windowed to minimize the signal discontinuities at its beginning and end. The aim is to reduce spectral distortion by tapering the signal toward zero at the frame edges. In other words, the Fourier Transform assumes that the signal repeats, and the end of one repetition of the frame does not connect smoothly with the beginning of the next. To window a frame, we multiply it by a so-called window function [3, 11]. Many 'soft windows' can be used; our system uses the Hamming window, which has the form

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{1}$$
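As a minimal sketch, the standard Hamming window w(n) = 0.54 - 0.46 cos(2πn/(N-1)) can be generated and applied to a frame as follows. The function names are illustrative and are not taken from the authors' Matlab scripts.

```python
import math


def hamming(n_samples):
    """Return the Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if n_samples < 2:
        raise ValueError("window needs at least 2 samples")
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of equal length."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]


window = hamming(256)
print(window[0], window[-1])   # both edge values are close to 0.08, not exactly zero
```

Note that the Hamming window tapers the frame edges to about 0.08 rather than fully to zero, which is what distinguishes it from the Hann window.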

3) Fast Fourier Transform (FFT): The next

processing step is the Fast Fourier Transform,

which converts each frame of N samples from the

time domain into the frequency domain [3]. The

FFT is a fast algorithm to implement the Discrete

Fourier Transform (DFT) which is defined on the

set of N samples

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1 \tag{2}$$

The result after this step is often referred to as spectrum

or periodogram [5, 3].
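To make the relation between the DFT and the FFT concrete, the sketch below implements the DFT sum directly and a recursive radix-2 FFT that computes the same values, then forms the power spectrum of a frame. It is written in plain Python for illustration only; a practical system would use an optimized FFT routine, and the function names here are assumptions.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Squared magnitude of each FFT bin (the periodogram up to scaling)."""
    return [abs(v) ** 2 for v in fft_radix2(frame)]


# A cosine with exactly 4 cycles in a 32-sample frame peaks in bin 4.
frame = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = power_spectrum(frame)
print(max(range(17), key=lambda k: spectrum[k]))
```

The radix-2 recursion is the reason a power-of-two frame length such as N = 256 is convenient.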

4) Mel-frequency wrapping: Psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels [1, 3]. We can therefore use the following approximate formula to compute the mels for a given frequency f in Hz:

$$\mathrm{mel}(f) = 2595 \, \log_{10}\left(1 + \frac{f}{700}\right) \tag{3}$$
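The mel mapping mel(f) = 2595 log10(1 + f/700) and its inverse are easy to check numerically. The sketch below also computes frequencies that are equally spaced on the mel scale, which is how the centres of a mel-spaced filterbank are usually placed; the helper `mel_spaced_frequencies` and the 0-4000 Hz range are assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Approximate mel value for a frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_low, f_high, count):
    """Return `count` frequencies in Hz equally spaced on the mel scale."""
    if count < 2:
        raise ValueError("count must be at least 2")
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + i * step) for i in range(count)]


print(hz_to_mel(1000.0))                        # very close to 1000 mels
freqs = mel_spaced_frequencies(0.0, 4000.0, 10)
print([round(f) for f in freqs])                # gaps in Hz widen with frequency
```

The widening gaps between successive frequencies reflect the roughly logarithmic behaviour of the scale above 1 kHz.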

5) Cepstrum: In this final step, we convert the log mel spectrum back to the time domain. The result is called the mel-frequency cepstrum coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given analysis frame. Because the mel spectrum coefficients (and so their logarithms) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). If we denote the mel power spectrum coefficients that result from the last step by

$$\tilde{S}_k, \qquad k = 1, 2, \ldots, K \tag{4}$$

we calculate the MFCCs, $\tilde{c}_n$, as

$$\tilde{c}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\left[ n \left(k - \frac{1}{2}\right) \frac{\pi}{K} \right] \tag{5}$$

By applying the procedure described above, a set of mel-frequency cepstrum coefficients is computed for each overlapping speech frame of around 30 ms [3, 4]. These coefficients are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors.
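The cosine transform of the log mel spectrum can be sketched directly from its definition, c_n = Σ_k log(S_k) cos[n (k - 1/2) π / K]. In the example below, the use of the natural logarithm and the choice to start at n = 1 (skipping the zeroth coefficient, which mainly carries the overall energy) are assumptions, as is the function name.

```python
import math


def mfcc_from_mel_spectrum(mel_power, n_coeffs):
    """Cosine transform of the log mel power spectrum.

    mel_power : K positive mel filterbank energies S_1..S_K
    n_coeffs  : number of coefficients c_1..c_n to return (assumed to start at n = 1)
    """
    big_k = len(mel_power)
    log_s = [math.log(s) for s in mel_power]      # natural log (an assumption)
    return [
        sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / big_k)
            for k in range(1, big_k + 1))
        for n in range(1, n_coeffs + 1)
    ]


# A perfectly flat mel spectrum has no spectral shape, so every coefficient vanishes.
flat = mfcc_from_mel_spectrum([2.0] * 20, 12)
print(max(abs(c) for c in flat))
```

One acoustic vector is obtained per frame by calling the function on that frame's mel filterbank energies.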

B. Feature matching using VQ

The state-of-the-art in feature matching techniques used

in speaker recognition includes DTW, Hidden Markov

Modelling (HMM), and VQ. In this paper, the VQ

approach is used, due to ease of implementation and

high accuracy [2]. Vector Quantization is the classical

quantization technique from signal processing which

allows the modelling of probability density functions by

the distribution of prototype vectors. It works by

dividing a large set of points into groups having

approximately the same number of points closest to

them. Each group is represented by its centroid point.

The density matching property of vector quantization is

powerful, especially for identifying the density of large

and high-dimensioned data. Since data points are

represented by the index of their closest centroid,

commonly occurring data have low error [1].


A vector quantizer maps k-dimensional vectors in the vector space $\mathbb{R}^k$ into a finite set of vectors $Y = \{y_i : i = 1, 2, \ldots, N\}$. Each vector $y_i$ is called a code vector or a codeword, and the set of all the codewords is called a codebook. Associated with each codeword $y_i$ is a nearest-neighbour region called the Voronoi region, defined by

$$V_i = \left\{ x \in \mathbb{R}^k : \lVert x - y_i \rVert \le \lVert x - y_j \rVert \ \text{for all } j \ne i \right\} \tag{6}$$

Given an input vector, the codeword chosen to represent it is the one in the same Voronoi region.

Fig. 2 Codewords in 2-dimensional space. Input

vectors are marked with an x, codewords are

marked with circles, and the Voronoi regions are

separated with boundary lines [1]

The representative codeword is the one closest in Euclidean distance to the input vector. The Euclidean distance is defined by

$$d(x, y_i) = \sqrt{\sum_{j=1}^{k} \left(x_j - y_{ij}\right)^2} \tag{7}$$

where $x_j$ is the jth component of the input vector and $y_{ij}$ is the jth component of the codeword $y_i$ [1].
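A nearest-codeword search with the Euclidean distance can be sketched as below. The decision rule shown, picking the enrolled speaker whose codebook gives the lowest average distortion over the test vectors, is the common VQ approach and is stated here as an assumption, since the text does not spell out the rule. The speaker names and two-dimensional vectors are made-up illustrative data.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_codeword(x, codebook):
    """Index of the codeword whose Voronoi region contains x."""
    return min(range(len(codebook)), key=lambda i: euclidean(x, codebook[i]))


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    total = sum(euclidean(v, codebook[nearest_codeword(v, codebook)]) for v in vectors)
    return total / len(vectors)


def identify_speaker(vectors, codebooks):
    """Name of the codebook with the lowest average distortion (assumed rule)."""
    return min(codebooks, key=lambda name: average_distortion(vectors, codebooks[name]))


# Hypothetical codebooks for two enrolled speakers.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 6.0]],
}
test_vectors = [[0.1, 0.0], [0.9, 1.1]]
print(identify_speaker(test_vectors, codebooks))
```

For authentication rather than identification, the lowest distortion would additionally be compared against a threshold so that unknown speakers can be rejected.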

C. Clustering of Training Vectors using LBG algorithm

After the enrolment session, the acoustic vectors

extracted from input speech of a speaker provide a set

of training vectors. As described above, the next

important step is to build a speaker-specific VQ

codebook for this speaker using those training vectors.

There is a well-known algorithm, namely LBG

algorithm [Linde, Buzo and Gray, 1980], for clustering

a set of L training vectors into a set of M codebook

vectors [3]. The algorithm is formally implemented by

the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of

the entire set of training vectors (hence, no iteration

is required here).

2. Double the size of the codebook by splitting each current codeword $y_n$ according to the rule

$$y_n^{+} = y_n (1 + \varepsilon)$$
$$y_n^{-} = y_n (1 - \varepsilon)$$

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).

3. Nearest-Neighbor Search: for each training vector,

find the codeword in the current codebook that is

closest (in terms of similarity measurement), and

assign that vector to the corresponding cell

(associated with the closest codeword).

4. Centroid Update: update the codeword in each cell

using the centroid of the training vectors assigned

to

that cell.

5. Iteration 1: repeat steps 3 and 4 until the average

distance falls below a preset threshold.

6. Iteration 2: repeat steps 2, 3 and 4 until a codebook

size of M is designed [3].
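The six steps above can be sketched as a short Python function. This is an illustration under stated assumptions, not the authors' Matlab implementation: the inner loop stops when the relative improvement in average distortion drops below `tol` (a common way to realize the threshold test of step 5), squared distances are used for the nearest-neighbour search, an empty cell keeps its old codeword, and the requested codebook size is assumed to be a power of two.

```python
import math


def _dist2(x, y):
    """Squared Euclidean distance (gives the same nearest codeword as the distance itself)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def _centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def lbg(training, size, eps=0.01, tol=1e-3, max_iter=100):
    """Train a VQ codebook with `size` codewords (a power of two) by binary splitting."""
    codebook = [_centroid(training)]                          # step 1
    while len(codebook) < size:                               # step 6
        codebook = [[c * (1 + sign * eps) for c in y]         # step 2
                    for y in codebook for sign in (1, -1)]
        previous = math.inf
        for _ in range(max_iter):                             # step 5
            cells = [[] for _ in codebook]
            total = 0.0
            for v in training:                                # step 3
                d, i = min((_dist2(v, y), i) for i, y in enumerate(codebook))
                cells[i].append(v)
                total += d
            codebook = [_centroid(cell) if cell else y        # step 4
                        for cell, y in zip(cells, codebook)]
            average = total / len(training)
            if previous - average <= tol * average:
                break
            previous = average
    return codebook


# Two well-separated clusters of hypothetical 2-D acoustic vectors.
training = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
            [10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]
print(sorted(lbg(training, 2)))
```

On this toy data the two codewords settle on the two cluster centroids. Because the split is multiplicative, a codeword at the origin would produce two identical children, so real implementations sometimes use an additive perturbation instead.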

III. SPEECH RECOGNITION

Speech recognition is the ability of a computer to recognize general, naturally flowing utterances from a wide variety of users [10]. This paper describes speaker-independent isolated word recognition for the purpose of automation in a smart home. The process of isolated word recognition begins with acquisition of the speech sequence of the word uttered by the user. This is followed by extraction of the MFCCs, or acoustic feature vectors, using the same process as in the speaker recognition described in the section above. The DTW algorithm is then applied to identify the uttered word.

A. Dynamic Time Warping

The DTW algorithm is based on dynamic programming techniques, as described in [10]. It measures the similarity between two time series that may vary in time or speed. The technique is also used to find the optimal alignment between two time series when one of them may be "warped" non-linearly by stretching or shrinking it along its time axis. This warping can then be used to find corresponding regions between the two time series or to determine the similarity between them [11]. The principle of DTW is to compare two dynamic patterns and measure their similarity by calculating a minimum distance between them. The classic DTW is computed as follows. Suppose we have two time series Q and C, of length n and m respectively, where:

$$Q = q_1, q_2, q_3, \ldots, q_i, \ldots, q_n \tag{8}$$

$$C = c_1, c_2, c_3, \ldots, c_j, \ldots, c_m \tag{9}$$

To align the two sequences using DTW, an n-by-m matrix is constructed whose (i, j)th element contains the distance $d(q_i, c_j)$ between the two points $q_i$ and $c_j$ [10]. The distance between the values of the two sequences is calculated as the squared Euclidean distance:

$$d(q_i, c_j) = (q_i - c_j)^2 \tag{10}$$

Each matrix element (i, j) corresponds to the alignment between the points $q_i$ and $c_j$. The accumulated distance is then measured by:

$$D(i, j) = \min\left[ D(i-1, j-1),\ D(i-1, j),\ D(i, j-1) \right] + d(i, j) \tag{11}$$

where $d(i, j)$ is shorthand for $d(q_i, c_j)$. Using dynamic programming techniques, the search for the minimum distance path can be done in polynomial time P(t):

$$P(t) = O(N^2 V) \tag{12}$$

where N is the length of the sequence and V is the number of templates to be considered [11].
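The accumulated-distance recurrence D(i, j) = min[D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j) translates almost line for line into code. The sketch below uses scalar sequences and made-up word templates for readability; with MFCC features each element would be an acoustic vector and `dist` would be replaced by a vector distance. The function and template names are assumptions.

```python
import math


def dtw_distance(q, c, dist=lambda a, b: (a - b) ** 2):
    """Accumulated DTW distance between sequences q (length n) and c (length m)."""
    n, m = len(q), len(c)
    # acc[i][j] is the accumulated distance D(i, j); row and column 0 are borders.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = dist(q[i - 1], c[j - 1]) + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[n][m]


def recognize(query, templates):
    """Return the label of the stored template closest to the query under DTW."""
    return min(templates, key=lambda word: dtw_distance(query, templates[word]))


# Hypothetical one-dimensional "templates" for two command words.
templates = {"light": [1, 2, 3, 2, 1], "fan": [5, 5, 1, 1, 5]}
print(recognize([1, 2, 2, 3, 3, 2, 1], templates))
```

The nested loops over one query and one template cost O(nm), and repeating this for each of the V templates gives the O(N²V) behaviour quoted above.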

The major optimizations to the DTW algorithm arise from observations on the nature of good paths through the grid. These are outlined in Sakoe and Chiba [11, 12] and can be summarized as the monotonic condition, continuity condition, boundary condition, adjustment window condition and slope constraint condition.
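Of these conditions, the adjustment window is the simplest to show in code: the warping path is only allowed to visit cells with |i - j| no larger than a window width. The sketch below is an illustrative assumption of how such a band can be added to a basic DTW loop; widening the window to at least |n - m| so that the end point stays reachable is a design choice made here, not something stated in the text.

```python
import math


def dtw_banded(q, c, window):
    """DTW restricted to cells with |i - j| <= window (adjustment window condition)."""
    n, m = len(q), len(c)
    w = max(window, abs(n - m))          # keep the end point (n, m) reachable
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0                      # boundary condition: paths start at (1, 1)
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # Only steps from (i-1, j-1), (i-1, j) and (i, j-1) are allowed,
            # which enforces the monotonic and continuity conditions.
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    return acc[n][m]


q = [0, 0, 0, 1]
c = [0, 1, 1, 1]
for width in (0, 1, 3):
    print(width, dtw_banded(q, c, width))
```

A narrow window forbids extreme warping and reduces the number of cells to evaluate, at the price of a larger distance when the two utterances really are badly misaligned.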

IV. SYSTEM ARCHITECTURE

The application of speaker and speech recognition in our proposed smart home system is shown in Fig. 3.

Fig. 3 Process flow of the proposed smart home

system

As described in Fig. 3, a prospective user must first be authenticated to use the system. His speech sequences are acquired and analyzed using MFCC and VQ-LBG; if they match the stored speaker templates, the user is granted access. The next phase is the automation phase: the authenticated user utters the name of the device or appliance he wants to use, provided that the reference template of the word is stored and the device is connected to the system. The DTW algorithm ensures robust matching with the reference templates and, on correct recognition, passes the result to the PIC16F676 microcontroller using the RS232 standard communication protocol. On receiving the signal for the correctly recognized device or appliance, the microcontroller toggles its current status.

A. Experimental Setup

As can be seen from Fig. 4, the basic experimental setup consists of a mic that captures the utterances from the user. The speech is processed by Matlab scripts, which perform feature extraction using MFCC, feature matching and optimization using VQ and LBG respectively, and isolated word recognition using DTW. Both the speaker and speech recognition phases are carried out in Matlab, after which the results of authentication and identification are serially communicated to the PIC16F676 microcontroller.

[Fig. 4 labels: Computer, mic, Light Bulb, PIC16F676 based RS232 Relay board]

Fig. 4 Experimental set up for speaker and speech

recognition based device control

B. PIC16F676 based RS232 Relay Board

The PIC16F676 microcontroller has been used in our

system for communicating with Matlab to acquire the

results of the recognized word using the RS232

communications protocol. Interfacing with various

devices in our system has been accomplished by

making provisions for an array of relays.

[Fig. 3 flowchart text, in order of appearance:]
Acquisition of Speech Sequence from the prospective user
Analysis of the Speech Sequence for Authentication
Speech Feature Extraction using MFCC
Speech Feature matching with the models in the database using VQ LBG
Perform Speech Recognition using DTW
Grant of access to the authenticated user for controlling devices using speech recognition
Acquire uttered speech sequence and extraction of acoustic feature vectors (MFCC)
Recognition of the uttered word using DTW
Serially communicate the recognized word to PIC16F676 using RS232 communication protocol
Toggle the current status of the corresponding device connected to the microcontroller via a relay


[Fig. 5 labels: Light Bulb, 8 Relays, ULN2803, PIC16F676, LM7805]

Fig. 5 PIC16F676 based RS232 Relay Board

As shown in Fig. 5, our system provides for 8 devices, as 8 relays are connected to the PIC16F676 microcontroller. The relays are driven by ULN2803 high-voltage, high-current Darlington arrays, which provide the necessary switching signals.
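The toggle behaviour of the eight relay channels can be modelled on the host side with a single status byte. Everything in the sketch below is hypothetical: the paper does not describe the serial message format, so the one-byte frame, the class name `RelayBoard` and the mapping of words to channels are assumptions made purely for illustration.

```python
class RelayBoard:
    """Host-side model of up to 8 relay channels addressed by recognized words."""

    def __init__(self, device_names):
        if len(device_names) > 8:
            raise ValueError("the board provides only 8 relay channels")
        # Hypothetical mapping: the first word controls relay 0, the next relay 1, ...
        self.channels = {name: i for i, name in enumerate(device_names)}
        self.state = 0                   # bit i set means relay i is energised

    def toggle(self, word):
        """Flip the relay for `word` and return the byte that would be sent serially."""
        self.state ^= 1 << self.channels[word]
        return bytes([self.state])       # assumed one-byte frame, not the real protocol


board = RelayBoard(["light", "fan"])
print(board.toggle("light"), board.toggle("fan"), board.toggle("light"))
```

In a real setup the returned byte would be written to the serial port, and the microcontroller firmware would drive the ULN2803 inputs accordingly.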

V. RESULTS

The speaker and speech recognition algorithms were successfully implemented in Matlab. Speech feature vector extraction using MFCC and feature matching using VQ-LBG were implemented for speaker recognition, fulfilling the objective of authenticating a user. The figures below describe the results obtained.

Fig. 6 Plot of mel-spaced filterbanks

Fig. 7 Plot of VQ codewords

Fig. 8 Results of successful Authentication

Fig. 9 Results of successful word Identification

VI. CONCLUSION

The implemented speaker recognition system was found to have an accuracy of 80%. Accuracy is compromised if conditions such as the duration of silence, the ambient noise content, and the emotional and physical health of the speaker vary between the training and testing periods. Thus we have to ensure that these conditions remain the same during both phases. The accuracy of speaker recognition could be improved by using a larger database of samples for training. These samples may be taken under varying conditions and can thus present a more complete representation of the speaker during training.

The implemented DTW-based speech recognition system was found to have a high accuracy of 90%. Recognition was followed by serial communication of the results to the PIC16F676 microcontroller, which switched the connected device on or off. Thus the objectives of security in a smart home, by authenticating a user using speaker recognition, and automation in a smart home, using speech recognition, have been achieved and presented in this paper.

REFERENCES

1) Vibha Tiwari, "MFCC and its Application in Speaker Recognition", International Journal on Emerging Technologies, ISSN: 0975-8364, Feb 2010

2) S. J. Abdallaha, I. M. Osman, M. E. Mustafa, "Text-Independent Speaker Identification Using Hidden Markov Model", World of Computer Science and Information Technology Journal (WCSIT), ISSN: 2221-0741, Vol. 2, No. 6, 203-208, 2012

3) Ch. Srinivasa Kumar et al., "Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3, No. 8, August 2011

4) Srinivasan, "Speaker Identification and Verification using Vector Quantization and Mel Frequency Cepstral Coefficients", Research Journal of Applied Sciences, Engineering and Technology, ISSN: 2040-7467, 4(1): 33-40, 2012

5) Anjali Bala et al., "Voice Command recognition system on MFCC and DTW", International Journal of Engineering Science and Technology, ISSN: 0975-5462, Vol. 2 (12), 2010

6) D. Subudhi, A. K. Patra, N. Bhattacharya, and P. Kuanar, "Embedded System Design of a Remote Voice Control and Security System", TENCON 2008 - 2008 Region 10 Conference

7) Ian McLoughlin, "Applied Speech and Audio Signal Processing", Cambridge University Press, 2009

8) Jacob Benesty, M. Mohan Sondhi, Yiteng Huang (Eds.), "Springer Handbook of Speech Processing"

9) A Thakur, "Design of a Matlab based Automatic Speaker Recognition and Control System", International Journal of Advanced Engineering Sciences and Technologies, ISSN: 2230-7818, Vol. 8, Issue 1, 100-1

10) B Plannener, "Introduction to Speech Recognition", March 2005, www.speech-recognition.de, accessed on 25th April 2013

11) L Muda, M Begam and L Elamvazuthi, "Voice Recognition Algorithms using MFCC and DTW Techniques", Journal of Computing, Volume 2, Issue 3, March 2010

12) Steve Cassidy, "Speech Recognition: Chapter 11: Pattern Matching in Time", http://web.science.mq.edu.au/~cassidy/comp449/html/ch11s02.html, accessed on 24th April 2013