
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, MAY 2011, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/

WWW.JOURNALOFCOMPUTING.ORG 13

Voice Recognition System Using Wavelet Transform and Neural Networks

Bayan Alsaaidah, Abdulsalam Alarabeyyat, Moh'd Rasoul Al-Hadidi

Abstract: Speech is the natural way people interact with each other, and by voice they can perform and remotely control many tasks. This study aims to make the voice recognition system more efficient by transforming the original data through seven levels of the wavelet transform and then examining which of these levels gives the best result. The system is applied to 40 samples that represent eight words. The research is based on recognizing spoken words with Neural Networks using a limited dictionary. The paper begins with an introduction, then presents related work, explains the experiment, and finally presents the conclusion and future work.

Index Terms — Speech Recognition, Wavelet transform, Neural Networks, Resampling.


1 INTRODUCTION

Every day many people come into this world and make their first sound to begin their life, not knowing that through this sound they can make their life comfortable and communicate easily with people and machines.

Speech recognition is the process by which a computer identifies spoken words. That is, when you talk to your computer, the computer correctly recognizes what you are saying.

Moreover, voice recognition [1] “is the technology by which sounds, words or phrases are spoken by humans that are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned”. Sound recognition can be more general than voice recognition, but in this paper we focus on the human voice because it is the most common and most natural way to communicate with humans and machines.

Speech generation and recognition are used for communication between humans and machines. Rather than using our hands and eyes, we can use our mouth and ears. This is very convenient when our hands and eyes should be doing something else, such as driving a car, performing surgery, or firing weapons at the enemy [2].

These days we need to do everything quickly, to save time, and to work without using our hands, especially when they are busy with something else. To achieve this, we can give an order to a system simply by speaking it, and it will be carried out. This concept can be realized with voice recognition, a process by which human words are converted into electrical signals and these signals are transformed into coding patterns to which meaning has been assigned [3].

Using the voice as an input to a computer is difficult because of the differences between human speech and the traditional forms of computer input. Each human has a different voice, and the same words can have different meanings when they are spoken in different contexts. To overcome these difficulties, many techniques and methods can be used for a voice recognition system; one of them is the artificial neural network.

Artificial Neural Networks (ANNs) are computer systems made from collections of artificial neurons, modeled on what we know about the nervous system in the human body. They accept a vector of inputs and produce a vector of outputs, and they compute their results in constant time [4]. They are trained by presenting them with input datasets and the corresponding correct outputs, and by adjusting the weights of the network to minimize the recognition errors [4]. Neural networks are used to design many applications, and speech recognition is one of the most important of them.

Abdulsalam Alarabeyyat is with Al-balqa Applied University, Salt, Jordan.
Moh'd Rasoul Al-Hadidi is with Al-balqa Applied University, Salt, Jordan.
Bayan Alsaaidah is with Al-balqa Applied University, Salt, Jordan.

The main contribution of this research is the use of neural networks to design a system based on voice recognition technology, where the voice is processed using the wavelet transform. Neural networks are applicable to many problems and have become popular in recent years. They are also a convenient tool in the MATLAB environment when the grammar rules are not known.

Accordingly, the main objective is to build a speech recognition system that reduces some features of the voice, trains the neural networks to identify the spoken words, and then finds the wavelet transform level that gives the best recognition. In a speech recognition system, each input typically represents one feature of the captured speech signal.

The combination of the voice feature strengths results in an output vector that shows, for example, the likelihood that these inputs represent various phonemes under consideration [4]. The neural network technique is based on training a model to recognize certain voice patterns, so that any word presented to the model with the same pattern will be recognized.

Pattern recognition is the basis of today's voice recognition software. For any application the voice is converted into digital data, which is then compared to information stored in the program's database [5].

The comparison process of the recognition system uses algorithms based on statistical techniques for predictive modeling, such as the Hidden Markov Model (HMM), neural networks, or other approaches. The process makes educated guesses about the sound pattern of the voice to predict the words that the user might have spoken [5].

The Discrete Wavelet Transform (DWT) is an orthogonal function that can be applied to a finite group of data. The DWT and the Discrete Fourier Transform (DFT) are similar in several respects: the function is orthogonal, a signal passed twice through the transformation is unchanged, the input signal is assumed to be a set of discrete-time samples, and both transforms are convolutions [6].

The Discrete Wavelet Transform gives information about where in the signal each frequency component occurs, which is a weakness of the DFT. A wavelet is a little piece of a wave. While the Fourier transform uses a sinusoidal wave that repeats itself to infinity, a wavelet exists only within a finite domain, and its value is zero elsewhere [7].

2 RELATED WORKS

Many studies have addressed voice recognition systems using Artificial Neural Networks (ANNs); the following paragraphs introduce some of them.

In 1988 Murdock et al. improved speech recognition and synthesis for disabled individuals using a fuzzy neural network. Their system involves three stages: dynamic word wrap matching is used to detect and align candidate words; fuzzy neural-net word recognition is applied to input spectrogram patterns; and a voice synthesizer is used to complete the interactive loop. The system has a recognition accuracy of 95-98% [8].

In 1989 Nakamura and Shikano proposed a speaker-dependent system that builds on previous work. The algorithm was applied to Hidden Markov Models (HMMs) and Neural Networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers. The HMM speaker-adapted recognition rate for b, d, g was 79.5%. The average recognition rate for the three choices was about 91%. Applying the algorithm to neural networks resulted in almost the same performance [9].

In 1990 Hampshire and Waibel proposed single-speaker and multispeaker recognition systems for the voiced-stop consonants b, d, g using Time-Delay Neural Networks (TDNNs) with a number of enhancements, including a new objective function for training these networks called the Classification Figure of Merit (CFM) [10].

In 1994 speech recognition using neural networks was used to control a robot, as described in [11]. Zhou et al. activated a robot arm controller by using the VoiceCommander, which is based on neural networks.

In 1996 Nava and Taylor proposed a system with a Neuro-Fuzzy Classifier (NFC) with excellent classification accuracy to address the problems of speaker-independent systems. According to their results, the NFC performs better than several existing methods [12].

In 2003 a 2-D phoneme sequence pattern recognition method using a fuzzy neural network was proposed by Kwan and Dong. They used the self-organizing map and learning vector quantization to organize the phoneme feature vectors of short and long phonemes segmented from speech samples to obtain the phoneme maps. They formed the 2-D phoneme response sequences of the speech samples on the phoneme maps with the Viterbi search algorithm. They then used these 2-D phoneme response sequence curves as inputs to the fuzzy neural network for training and recognition of 0-9 digit-voice utterances [13].

Toyoda et al. proposed a multilayered perceptron NN system for environmental sound recognition. Environmental sound recognition depends largely on the task of the robot's computer system. The input data was the one-dimensional combination of the instantaneous spectrum at the power peak and the power pattern in the time domain. Two experiments were conducted using an original database and a created database. The recognition rate for 45 environmental sound data sets was about 92%. They found that the new method is fast and simple compared to HMM-based methods, and suitable for an on-board system of a robot for home use, e.g. a security monitoring robot or a home-helper robot [14].

In 2007, Soltani and Ainon proposed an experimental study on six emotions: happiness, sadness, anger, fear, neutral and boredom. The experiment used speech fundamental frequency, formants, energy and voicing rate as extracted features. The features were selected manually for different experiments in order to get the best results. These features were combined into feature vectors of different sizes as input for different neural network classifiers. The database used for this experiment is the Berlin Database of Emotional Speech [15].

Al-Alaoui et al. implemented a new pattern classification method using Neural Networks trained with the Al-Alaoui Algorithm. The proposed speech recognition system was part of the Teaching and Learning Using Information Technology (TLIT) project, which would implement a set of reading lessons to assist adult illiterates in developing better reading capabilities. They compared two different methods for automatic Arabic speech recognition for isolated words and sentences. The results showed that the Al-Alaoui Algorithm performed better than HMM in the prediction of both words and sentences [16].

Onishi et al. proposed their system in 2009. They constructed an individual identification system with three-layered neural networks. The voice signals were preprocessed by the Fast Fourier Transform (FFT) and then used as input data for the neural networks with a backpropagation learning algorithm. The study concluded that the performance of the neural network depended on pronunciation, and that the three-layered neural networks were effective for individual identification using voice patterns [17].

In 2010 Shahgoshtasbi proposed a system that improves the quality of speech recognition. The system has two parts. The first part filters the input signal and packs it, then takes the average of three packets as an identification of the signal and sends it to the second part. The second part, which is based on the human auditory cortex, is an associative neural network that maps the input set to a desired output set. Experiments show that this system is able to recognize a word, even a noisy one [18].

3 EXPERIMENT

The design of the proposed system is based on preprocessing the wave signal with the wavelet transform, and on ANNs that are trained to recognize the samples.

3.1 Recording the Voice

The first step of the voice recognition system is recording the words that will be recognized by the system. Recording can be done by several methods: the Sound Recorder in the Windows accessories; an audio recorder in any program that takes the inputs (N, Fs, CH), that is, one that records N audio samples at Fs Hertz from CH input channels and produces a WAVE recording as output; and recording in the MATLAB environment, using a list of commands written in the command window to record the desired voice. In this system we record 8 words; each one is recorded 5 times, so we have 40 voice samples. Table 1 summarizes the recorded words in both Arabic and English.

TABLE 1: THE WORDS IN THE RECOGNITION SYSTEM

English words | Arabic words
Open | eftah
Close | egleg
Right | yameen
Left | yasar

3.2 Analysis of voice signal

The designed system works on a limited vocabulary of 8 words. Each word was recorded with the parameters (44100 Hz, 16-bit, stereo) and read back with the same parameters, giving two channels per recording. The analysis process begins with resampling the voice signals to reduce the size of the samples without affecting their audible content. Humans can hear sound up to about 20 kHz, so by the sampling relation fs ≥ 2 fm between the sampling rate fs and the maximum signal frequency fm, a rate of 2 x 20 kHz = 40 kHz is sufficient. Resampling here converts 44100 Hz to 40000 Hz, a reduction of roughly 10%; this step aims to reduce the amount of data and therefore the processing time of the system. Only one channel, the left one, is used in the resampling process.

© 2011 Journal of Computing Press, NY, USA, ISSN 2151-9617

Resampling converts a sampled signal from one sampling frequency to another without changing its duration, whereas simply playing the original samples at the new rate would change the duration. Figure 1 shows a resampled voice signal from our system (egleg).

Figure 1: Resampled voice signal
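The rate arithmetic and the idea of resampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nyquist_rate` and `resample_linear` are hypothetical, the input data is synthetic, and plain linear interpolation stands in for whatever resampling routine the system actually used (a production resampler would also apply an anti-aliasing filter).

```python
from fractions import Fraction


def nyquist_rate(f_max_hz):
    """Minimum sampling rate fs >= 2 * fm for a signal band-limited to f_max_hz."""
    return 2 * f_max_hz


def resample_linear(samples, fs_in, fs_out):
    """Resample by linear interpolation (illustrative; no anti-aliasing filter).

    The duration is preserved: the output has len(samples) * fs_out / fs_in points
    (rounded down).
    """
    if len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * Fraction(fs_out, fs_in))
    step = Fraction(fs_in, fs_out)  # input samples advanced per output sample
    out = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)                      # index of the left neighbour
        frac = float(pos - i)             # fractional distance to the right neighbour
        j = min(i + 1, len(samples) - 1)  # clamp at the last sample
        out.append(samples[i] * (1.0 - frac) + samples[j] * frac)
    return out


fs_target = nyquist_rate(20_000)      # 40000 Hz for a 20 kHz hearing limit
ratio = Fraction(fs_target, 44_100)   # reduces to 400/441
reduction = 1 - float(ratio)          # fraction of samples removed

# One second of a synthetic stereo recording; keep only the left channel, as in the text.
stereo = [(0.001 * n, -0.001 * n) for n in range(44_100)]
left = [l for l, _ in stereo]
resampled = resample_linear(left, 44_100, fs_target)
```

The exact ratio 400/441 shows that the conversion removes a little under a tenth of the samples while one second of audio stays one second long.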

3.3 Wavelet Transform

The Discrete Wavelet Transform (DWT) is an orthogonal function that can be applied to a finite group of data. The DWT behaves like the Discrete Fourier Transform (DFT) in several respects: the function is orthogonal, a signal passed twice through the transformation is unchanged, the input signal is assumed to be a set of discrete-time samples, and both transforms are convolutions [6].

In the voice recognition system the data was compressed using the DWT to make it smaller and to reduce some features. In the proposed system the resampled data were converted to other forms using the wavelet transform. This process was applied to the data seven times; after that the data become too small, so seven levels are enough.

Table 2 shows the seven levels of the wavelet transform applied to the data. The original data is also retained in order to see which of them is best.

TABLE 2: THE TRANSFORMED LEVELS

Level | Audio rate (Hz)
0 | 40000
1 | 20000
2 | 10000
3 | 5000
4 | 2500
5 | 1250
6 | 625
7 | 313

The proposed system used the discrete wavelet transform, which can be calculated by Equation (1), where the range of the summation is determined by the specified number of nonzero coefficients M. The number of nonzero coefficients is arbitrary and is referred to as the order of the wavelet. The value of the coefficients is not arbitrary; it is determined by constraints of orthogonality and normalization [6].
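For orientation, one standard formulation that is consistent with this description is the dilation equation for the scaling function, together with its normalization and orthogonality constraints. The symbols \phi (scaling function), c_k (filter coefficients) and m (a nonzero integer shift index) are assumptions introduced here for illustration, with c_k taken as zero outside 0 ≤ k ≤ M-1:

```latex
\phi(x) = \sum_{k=0}^{M-1} c_k \, \phi(2x - k), \qquad
\sum_{k=0}^{M-1} c_k = 2, \qquad
\sum_{k=0}^{M-1} c_k \, c_{k+2m} = 2\,\delta_{0,m}
```

Under this convention the first Daubechies wavelet (M = 2, the Haar wavelet) has c_0 = c_1 = 1, which satisfies both constraints.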

The processing of the wavelet transform begins with loading all the samples, including the original sample. Then for each sample the wavelet filter is constructed using the first Daubechies wavelet and the Discrete Wavelet Transform (DWT) is applied. The Daubechies wavelet is used as the mother wavelet function because it is better at removing noise from the signal. For each sample the DWT is applied seven times to obtain seven levels of transformation, so the total is 40 samples x 7 levels = 280 wavelet transformations. The results of these transformations are saved in the database to be used as input to the Artificial Neural Networks (ANNs).
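As a rough illustration of the seven-level decomposition described above, the following Python sketch applies the Haar transform (the first Daubechies wavelet) repeatedly. It assumes that only the approximation coefficients are carried to the next level and that odd-length inputs are padded by repeating the last sample; the function names and the synthetic one-second test tone are hypothetical and are not taken from the original system.

```python
import math


def haar_dwt_step(signal):
    """One level of the Haar (first Daubechies) DWT.

    Returns (approximation, detail); each has half the input length.
    An odd-length input is padded by repeating its last sample (an assumption).
    """
    x = list(signal)
    if len(x) % 2:
        x.append(x[-1])
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail


def haar_levels(signal, levels=7):
    """Return the approximation at levels 0..levels (level 0 is the input itself)."""
    out = [list(signal)]
    for _ in range(levels):
        approx, _detail = haar_dwt_step(out[-1])
        out.append(approx)
    return out


# Synthetic one-second tone at the resampled rate of 40000 Hz.
sample = [math.sin(2 * math.pi * 440 * n / 40_000) for n in range(40_000)]
levels = haar_levels(sample, levels=7)
lengths = [len(level) for level in levels]
```

For a one-second input, `lengths` follows the halving pattern of Table 2, ending at 313 coefficients after the seventh level.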

3.4 Neural Network Topology

The topology of the designed ANN used in the system is shown in Figure 2. The system has 8 networks, one for the original data and one for each level of the wavelet transform. The following subsections describe each component of this topology in detail.

Figure 2: The topology of one ANN






3.5 Input Layer

As shown in Figure 2, the input layer has 40 neurons that represent the audio signals entering the network. The minimum and maximum values of this input depend on the input audio.

3.6 Output Layer

The output layer has eight neurons, corresponding to the original sample and the seven levels of the wavelet transformation. For each sample to be recognized there are eight outputs that show which signal gives the best recognition; for this reason there are eight neurons in the output layer.

3.7 Hidden Layer

In this neural network there is a single hidden layer with 25 neurons. There is no exact rule that gives the best number of neurons in the hidden layer. The number was determined by running many training experiments and choosing the value that gave the best training, which is 25 in this system. This layer is important in determining the performance of the ANN.

3.8 Recognizing Process

The most important part of the system is the recognition of the given patterns. Many methods can perform this process; the most commonly used are the Hidden Markov Model (HMM) and ANNs. The system here has a limited dictionary, so ANNs are the better choice, while HMM is better suited to speech recognition with a large dictionary.

The recognition process uses the neural networks offered by the MATLAB software. Recognition is done with the function sim, which simulates the trained network on the given sample; the resulting outputs indicate how closely the sample matches the trained samples.

3.9 The Training Stage

This system builds the neural networks using the MultiLayer Perceptron (MLP) structure. Each network of the system has three layers, with a sigmoidal function in the hidden layer, and is trained with the function traingdx, which is gradient descent backpropagation with momentum and an adaptive learning rate. Momentum is used in the backpropagation algorithm to achieve faster global convergence [19]. traingdx can train any network as long as its weight, net input, and transfer functions have derivative functions.
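The idea of combining momentum with an adaptive learning rate can be sketched as follows. This is a simplified illustration, not MATLAB's traingdx implementation: the function name `train_gdx_like`, its parameter names, the numeric values and the toy quadratic loss are all assumptions chosen for the example.

```python
def train_gdx_like(grad, loss, w, lr=0.01, momentum=0.9,
                   lr_inc=1.05, lr_dec=0.7, max_loss_inc=1.04, epochs=200):
    """Gradient descent with momentum and an adaptive learning rate (illustrative)."""
    velocity = [0.0] * len(w)
    current = loss(w)
    history = [current]
    for _ in range(epochs):
        g = grad(w)
        # Momentum blends the previous update with the new gradient step.
        step = [momentum * v - (1.0 - momentum) * lr * gi
                for v, gi in zip(velocity, g)]
        candidate = [wi + si for wi, si in zip(w, step)]
        new = loss(candidate)
        if new > current * max_loss_inc:
            # The step made things clearly worse: discard it and shrink the rate.
            lr *= lr_dec
            velocity = [0.0] * len(w)
        else:
            if new < current:
                lr *= lr_inc  # progress: cautiously grow the rate
            w, velocity, current = candidate, step, new
        history.append(current)
    return w, history


# Toy problem: minimise (w0 - 3)^2 + 10 * (w1 + 1)^2.
loss = lambda w: (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2
grad = lambda w: [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]
w_final, history = train_gdx_like(grad, loss, [0.0, 0.0])
```

The learning rate grows while the loss keeps falling and is cut back when a step increases the loss by more than the allowed margin, which is the behaviour the adaptive rule is meant to provide.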

After the network architecture has been established, it is necessary to determine which weight values must be assigned to the network to minimize the error rate. To assign these weights, the backpropagation training algorithm is used to train the network. Both the weights and the biases of the networks are established in this way.

The sum-squared error goal, the maximum number of epochs to train, and the momentum constant are presented to the neural networks. After the input X and the desired output Y are presented to the network, the network uses the input X to calculate the output O, which differs from the desired (target) output Y. The difference between the desired output and the actual output is computed and is called the error.

The error is then computed using the mean squared error (MSE) and propagated backward to change the weights in order to minimize the error rate. This process is repeated for a series of experimental data during training until the error rate goal is reached. This summarizes the learning process in the proposed ANN.
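The learning process summarized above can be sketched for a small three-layer network with logistic activations. This is a minimal pure-Python illustration, not the MATLAB implementation used in the system: the helper names, the learning rate, the per-sample update scheme and the tiny 4-3-2 layout with synthetic data are assumptions (the networks in this paper use 40 inputs, 25 hidden neurons and 8 outputs).

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Weights (n_out x n_in) and biases (n_out) with small random values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def forward(x, layer):
    """Weighted sum of the inputs plus bias, passed through the logistic function."""
    weights, biases = layer
    return [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def train_sample(x, y, hidden, output, lr):
    """One forward pass and one backward pass for a single (X, Y) pair."""
    h = forward(x, hidden)
    o = forward(h, output)
    # Output deltas: the error (Y - O) scaled by the logistic derivative O * (1 - O).
    delta_o = [(yi - oi) * oi * (1.0 - oi) for yi, oi in zip(y, o)]
    # Hidden deltas: output deltas propagated back through the output weights.
    delta_h = [hj * (1.0 - hj) * sum(delta_o[k] * output[0][k][j] for k in range(len(o)))
               for j, hj in enumerate(h)]
    for k, row in enumerate(output[0]):
        for j in range(len(row)):
            row[j] += lr * delta_o[k] * h[j]
        output[1][k] += lr * delta_o[k]
    for j, row in enumerate(hidden[0]):
        for i in range(len(row)):
            row[i] += lr * delta_h[j] * x[i]
        hidden[1][j] += lr * delta_h[j]


def mse(samples, hidden, output):
    """Squared error per sample, averaged over the n training samples."""
    total = 0.0
    for x, y in samples:
        o = forward(forward(x, hidden), output)
        total += sum((yi - oi) ** 2 for yi, oi in zip(y, o))
    return total / len(samples)


rng = random.Random(0)
hidden = init_layer(4, 3, rng)
output = init_layer(3, 2, rng)
samples = [([0.9, 0.1, 0.8, 0.2], [1.0, 0.0]),
           ([0.1, 0.9, 0.2, 0.7], [0.0, 1.0])]
mse_before = mse(samples, hidden, output)
for _ in range(500):
    for x, y in samples:
        train_sample(x, y, hidden, output, lr=0.5)
mse_after = mse(samples, hidden, output)
```

Comparing `mse_before` with `mse_after` shows the effect of repeating the two passes over the training samples.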

The backpropagation algorithm is used to train the ANNs. The backpropagation algorithm has two passes in the training process:

1. Forward pass: begins with initializing the weights and biases. The input array X is then applied to the input layer, and the inputs and outputs of the hidden layer are calculated. This is done by summing the products of the input values and the weights for each neuron in the hidden layer. The output of each neuron is calculated by applying the activation function to the computed summation; in this system, the logistic activation function is used. The inputs and outputs of the output layer are then calculated in the same way, by summing the products of the hidden-layer outputs and their corresponding weights for each neuron in the output layer and applying the logistic activation function to this summation. The result is the network output O. The calculated network output O is compared with the desired output Y. If there is a difference, the error is computed using Equation (2).

Error = Y − O (2)






2. Backward pass: the backward step adjusts the weights of the output layer and of the hidden layer. The forward pass and the backward pass are repeated for each sample in the training set. After all samples are trained, the mean squared error (MSE) is computed using Equation (3):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - O_i)^2 \qquad (3)$$

where n represents the size of the training set, and Y_i and O_i are the desired and actual outputs for sample i. The training set is enlarged by adding noise, which provides more input data and improves the precision of training; the number of samples increases from 40 to 160.
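The enlargement of the training set from 40 to 160 samples can be illustrated with a short Python sketch. The paper does not state the type or level of noise, so additive Gaussian noise, three noisy copies per original, the value of `sigma`, the function name `augment_with_noise` and the stand-in data below are all assumptions.

```python
import random


def augment_with_noise(samples, copies=3, sigma=0.01, seed=0):
    """Return the originals plus `copies` noisy versions of each sample."""
    rng = random.Random(seed)
    augmented = []
    for signal, label in samples:
        augmented.append((list(signal), label))
        for _ in range(copies):
            noisy = [value + rng.gauss(0.0, sigma) for value in signal]
            augmented.append((noisy, label))
    return augmented


# Hypothetical stand-in data: 8 words x 5 recordings, 40 feature values each.
training = [([0.0] * 40, word) for word in range(8) for _ in range(5)]
augmented = augment_with_noise(training)
```

Each original sample keeps its label and contributes four entries (itself plus three noisy copies), which gives 160 training samples in total.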

The developed neural network is trained well, as indicated by Figure 3, which shows the performance and the target of the NN for the best level of the DWT (level 7).

Figure 3: The performance of ANN

The training process of this ANN reaches the target output with a fit of approximately 99%, as shown in Figure 4.

Figure 4: The Regression of ANN

3.10 The Testing Stage

The testing stage is the final stage of the proposed system. It aims to examine the results and determine the accuracy of the system.

To determine the performance of the voice recognition system, the accuracy is calculated using the following equation:

$$AR = \frac{CR}{S} \times 100\%$$

where
• AR: Accuracy Rate.
• CR: number of samples Correctly Recognized.
• S: number of Samples examined.

By using this formula, the accuracy of this system is 90%.
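The accuracy formula is simple enough to state directly in code. The function name and the counts below are hypothetical; the counts are chosen only to show one combination of CR and S that gives a rate of 90%.

```python
def accuracy_rate(correctly_recognized, samples):
    """AR = CR / S * 100, returned as a percentage."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return 100.0 * correctly_recognized / samples


# Hypothetical counts for illustration only.
ar = accuracy_rate(36, 40)
```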

After the results are presented and analyzed, the recognition percentage of each level of the wavelet transformation shows that level seven gives the best recognition percentage, as shown in Table 3.

TABLE 3: RECOGNITION RATES

Level | Recognition Rate
0 | 12.51%
1 | 12.49%
2 | 12.46%
3 | 12.53%
4 | 12.50%
5 | 12.44%
6 | 12.51%
7 | 12.57%

This suggests that the best way to reach an efficient voice recognition system is to apply the wavelet transformation to the voice sample seven times and then recognize it. Figures 5 and 6 show the difference between the original sample (of the word yasar) and the same sample after the DWT is applied to it seven times. It is clear that the DWT compressed the size of this sample.

Figure 5: The Original Sample

Figure 6: The 7th Transformation






4 CONCLUSION

In this study, the multilayer perceptron is used as the structure of the ANN and the backpropagation training algorithm is used to train the developed ANN.

The proposed system applied the DWT to the original data seven times to convert it to smaller data. Each level of this transformation has an individual network with the same parameters except the input value. The recognition process is performed with the function sim, which simulates the trained network on the given sample; the resulting outputs indicate how closely the sample matches the trained samples.

The testing process shows that the level giving the highest recognition rate is the seventh level. This indicates that the DWT is an efficient approach: it compresses the data and reduces its features, which makes the recognition process faster and improves the accuracy of the system. The accuracy of this system is 80%-100% depending on the sample, and the overall accuracy is 90%. The performance of the ANN at the seventh level is approximately 99%.

5 FUTURE WORKS

Several directions are recommended to enhance the voice recognition system using ANNs:

• Improving the accuracy of the voice recognition system by training the ANNs on more data, or by taking a specific duration of the voice sample to reduce the data and eliminate all unnecessary durations.
• Improving the recognition system by using sentences, not only words, in the training and recognition processes.
• Comparing the accuracy of the system when applied to female voices with the same system applied to male voices.
• Applying the voice recognition system to another type of ANN.

REFERENCES

[1] B. Juang, L. Rabiner, “Fundamentals of speech recognition”, PTR Prentice-Hall, Inc., A Simon and Schuster Company, 1993.

[2] www.physics.otago.ac.nz/internal/elec401/dsp-smith/ch01.pdf. Accessed on 10-12-2010.

[3] R. Adams, “Sourcebook of automatic identification and data collection”, Van Nostrand Reinhold, New York, 1990.

[4] D. Colton, “Automatic speech recognition tutorial”, 2003.

[5] Clareity, “Voice Recognition Technology: The Perfect Computer Interface for the Real Estate Industry”, Clareity Consulting & Communications, Inc., 2004. Accessed on 22-1-2011.

[6] T. Edwards, “Discrete wavelet transforms: Theory and implementation”, Technical report, Stanford University, 1991.

[7] www.thepolygoners.com/tutorials/dwavelet/dwttut.html. Accessed on 20-2-2011.

[8] R. Murdock, J. Husseiny, A. Liang, E. Abolrous, S. Rodriguez, “Improvement on speech recognition and synthesis for disabled individuals using fuzzy neural net retrofits”, IEEE International Conference on Neural Networks, 24-27 Jul, 1988.

[9] S. Shikano, K. Nakamura, “Speaker adaptation applied to HMM and neural networks”, International Conference on Acoustics, Speech, and Signal Processing (ICASSP-89), 23-26 May, vol. 1, 1989.

[10] J.B. Hampshire II, A.H. Waibel, “A novel objective function for improved phoneme recognition using time-delay neural networks”, IEEE Transactions on Neural Networks, no. 2, pp. 216-228, 1990.

[11] K. Ng, Y. Zhou, R. Ng, “A voice controlled robot using neural network”, Second Australian and New Zealand Conference on Intelligent Information Systems, 29 Nov-2 Dec, 1994.

[12] P.A. Taylor, J.M. Nava, “Speaker independent voice recognition with a fuzzy neural network”, Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, 8-11 Sep, vol. 3, 1996.

[13] H. Dong, X. Kwan, “Phoneme sequence pattern recognition using fuzzy neural network”, Proceedings of the 2003 International Conference on Neural Networks and Signal Processing, 14-17 Dec., vol. 1, 2003.

[14] S. Ding, Y. Liu, Y. Toyoda, J. Huang, “Environmental sound recognition by multilayered neural networks”, The Fourth International Conference on Computer and Information Technology (CIT '04), 14-16 Sept., 2004.

[15] K. Ainon, R. Soltani, “Speech emotion detection based on neural networks”, 9th International Symposium on Signal Processing and Its Applications (ISSPA 2007), 12-15 Feb., 2007.

[16] J. Azar, E. Yaacoub, M. Al-Alaoui, L. Al-Kanj, “Speech recognition using artificial neural networks and hidden markov models”, In IMCL2008 Conference, 2008.

[17] A. Hasegawa, H. Kinoshita, K. Kishida, S. Onishi, S. Tanaka, “Construction of individual identification system using voice in three-layered neural networks”, International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2009), 7-9 Jan., 2009.

[18] D. Shahgoshtasbi, “A biological speech recognition system by using associative neural networks”, World Automation Congress (WAC), 2010.

[19] V. Sandrasegaran, K. Venayagamoorthy, G.K. Moonasar, “Voice recognition using neural networks”, Proceedings of the 1998 South African Symposium on Communications and Signal Processing (COMSIG '98), 7-8 Sep, 1998.
