2016
Master’s thesis
Voice Recognition using
Distributed Artificial Neural
Network for Multiresolution
Wavelet Transform Decomposition
1187001 Bandhit Suksiri
Advisor Prof. Masahiro Fukumoto
August 2016
Course of Information Systems Engineering
Graduate School of Engineering, Kochi University of Technology
Abstract
Voice Recognition using Distributed Artificial Neural
Network for Multiresolution Wavelet Transform
Decomposition
Bandhit Suksiri
This paper presents a new voice recognition method, named the Signal Clustering
Neural Network, which combines a simple Artificial Neural Network (ANN) model with a
single-channel microphone and Wavelet Transform feature extraction. It achieves high
recognition rates of up to 95 per cent, compared with Short-time Fourier Transform
feature extraction, under background noise of up to 70 dB, as in normal conversation.
The performance evaluation is presented in terms of correct recognition rate, maximum
noise power of interfering sounds, and Receiver Operating Characteristic and Detection
Error Tradeoff curves. The proposed method offers a potential alternative for intelligent
voice recognition systems in computational linguistics and speech-controlled robot
applications.
Key words: Discrete Wavelet Transform, Voice Recognition, Feature Extractions,
Artificial Neural Network, Signal Clustering Neural Network.
Contents
Chapter 1 Preface 1
Chapter 2 Introduction of Human Speech and Speaker Recognition 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Speaker recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Summary of the Technology Progress . . . . . . . . . . . . . . . . . . . . 5
Chapter 3 Wavelet Theory 9
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Literature Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Frequency Analysis using Fourier Transform . . . . . . . . . . . . 10
3.2.2 Time-Frequency Analysis using Fourier Transform . . . . . . . . 11
3.3 Fundamental of Wavelet Transform . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Continuous Wavelet Transform . . . . . . . . . . . . . . . . . . . 13
3.3.2 Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . 14
3.4 Heisenberg's Uncertainty Principles . . . . . . . . . . . . . . . . . . . . 16
Chapter 4 Neural Network Theory 19
4.1 Biological Inspiration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Neuron Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 General Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Transfer Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 A Layer of Neurons . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.2 Multiple Layers of Neurons . . . . . . . . . . . . . . . . . . . . . 29
4.3.3 Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Performance Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Optimization Method: Backpropagation . . . . . . . . . . . . . . . . . . 33
4.5.1 Performance Index . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.2 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.3 Sensitivities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 5 Implementation of Artificial Neural Network and Multilevel of Discrete Wavelet Transform for Voice Recognition 40
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Proposed Voice Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.2 Feature Normalization . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.3 Artificial Neural Network Model . . . . . . . . . . . . . . . . . . 44
5.2.4 Decision Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 6 Reinforced Voice Recognition using Distributed Artificial Neural Network with Time-Scale Wavelet Transform Feature Extraction 56
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Proposed Voice Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.1 New Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.2 Distributed Artificial Neural Network Model . . . . . . . . . . . 59
6.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 7 Conclusions 70
Acknowledgement 73
References 75
List of Figures
3.1 Schematic representation of the discrete wavelet transform. . . . . . . . 17
3.2 Heisenberg boxes of wavelets (left) and STFT (right). . . . . . . . . . . 18
4.1 Schematic Drawing of Biological Neurons. . . . . . . . . . . . . . . . . . 21
4.2 Single Input Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Multiple Input Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Abbreviated Notation of Neuron. . . . . . . . . . . . . . . . . . . . . . . 25
4.5 Log-Sigmoid Transfer Function. . . . . . . . . . . . . . . . . . . . . . . . 26
4.6 Soft-Max Transfer Function. . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.7 Single Layer of Neural Networks. . . . . . . . . . . . . . . . . . . . . . . 29
4.8 Three Layer of Neural Networks. . . . . . . . . . . . . . . . . . . . . . . 30
5.1 The proposed voice recognition overview. . . . . . . . . . . . . . . . . . 42
5.2 The Proposed DWT Filter Bank representation. . . . . . . . . . . . . . 45
5.3 The comparison of STFT (left) and CWT (right). . . . . . . . . . . . . 46
5.4 All-features-connecting topology. . . . . . . . . . . . . . . . . . . . . . . 47
6.1 The Proposed TSDWT Filter Bank representation. . . . . . . . . . . . . 60
6.2 The comparison CWT (left) and TSDWT (right). . . . . . . . . . . . . . 61
6.3 Double-features-connecting topology. . . . . . . . . . . . . . . . . . . . . 62
6.4 Performance Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Tables
2.1 State-of-the-art Level of ASR techniques . . . . . . . . . . . . . . . . . . 7
4.1 List of Transfer Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 The Proposed Neuron Network Architectures . . . . . . . . . . . . . . . 46
5.2 Features Set for Experimentation . . . . . . . . . . . . . . . . . . . . . . 48
5.3 The First Experimental Configuration . . . . . . . . . . . . . . . . . . . 50
5.4 The First Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 50
5.5 The Second Experimental Configuration . . . . . . . . . . . . . . . . . . 51
5.6 The Second Experimental Results . . . . . . . . . . . . . . . . . . . . . . 51
5.7 The Third Experimental Configuration . . . . . . . . . . . . . . . . . . . 52
5.8 The Third Experimental Results . . . . . . . . . . . . . . . . . . . . . . 52
5.9 The parameter optimization results . . . . . . . . . . . . . . . . . . . . . 53
5.10 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 The Proposed Neuron Network Architectures . . . . . . . . . . . . . . . 61
6.2 The First Experimental Configuration . . . . . . . . . . . . . . . . . . . 63
6.3 The First Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 The Second Experimental Configuration . . . . . . . . . . . . . . . . . . 65
6.5 The Second Experimental Results . . . . . . . . . . . . . . . . . . . . . . 66
6.6 The parameter optimization results. . . . . . . . . . . . . . . . . . . . . 67
Chapter 1
Preface
This dissertation presents an implementation of a simple Artificial Neural Network
model with multilevel Discrete Wavelet Transform feature extraction, which achieves
high recognition rates of up to 95%, compared with Short-time Fourier Transform
feature extraction, under conversational background noise of up to 65 dB. The
performance evaluation is presented in terms of correct recognition rate, maximum
noise power of interfering sounds, hit rate, false alarm rate and miss rate. Furthermore,
this research presents a new voice recognition method, named the Signal Clustering
Neural Network, which combines a simple Artificial Neural Network model with a
single-channel microphone and Wavelet Transform feature extraction; it achieves high
recognition rates of up to 95%, compared with Short-time Fourier Transform feature
extraction and the previously proposed feature extraction, under background noise of
up to 70 dB, as in normal conversation. This performance evaluation is presented in
terms of correct recognition rate, maximum noise power of interfering sounds, and
Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed
methods offer a potential alternative for intelligent voice recognition systems in
computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications.
This dissertation is organized as follows. First, a brief history of research in
automatic speech and speaker recognition over the past 65 years is presented in order
to provide a technological perspective and an appreciation of the fundamental progress
that has been accomplished in this important area of speech communication, as
described in chapter 2 on the following page. Many techniques have been developed
and are sufficient for robust recognition; however, many challenges have yet to be
overcome before the ultimate goal of creating machines that can communicate
naturally with humans is achieved, and satisfactory performance under a broad range
of operating conditions is required to reach that goal. This research focuses on the
improvement of feature extraction by using the Wavelet transform instead of the
Fourier transform; the reasons for this replacement are described in chapter 3 on
page 9. Speaker adaptation and speech understanding are improved by a well-known
machine learning model, the Artificial Neural Network, which is described in chapter 4
on page 19. That chapter gives an introduction to the fundamentals of neural
networks, learning rules, network architectures, the mathematical analysis of these
networks, and their application to practical engineering problems, such as nonlinear
regression, pattern recognition, signal processing, data mining, control systems and
other real-world problems. Finally, the proposed feature extraction characterizes the
input signal by means of the Wavelet transform instead of the Fourier transform. This
research aims to develop the ability to classify command input signals, improve the
accuracy of voice command recognition and support intelligent voice recognition
systems in computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications by using a Distributed Artificial Neural
Network, as shown in chapter 5 on page 40 and chapter 6 on page 56.
Chapter 2
Introduction of Human Speech
and Speaker Recognition
In this chapter, we present a brief history of research in automatic speech and speaker
recognition. The chapter surveys the major themes and advances of the past 65 years
of research in order to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of speech
communication. On the one hand, outstanding techniques have been developed; on
the other hand, many challenges have yet to be overcome before the ultimate goal of
creating machines that can communicate naturally with humans is achieved. Such
machines have to deliver satisfactory performance under a broad range of operating
conditions.
2.1 Introduction
Speech is the primary means of communication between humans. Driven both by
researchers' technological curiosity about the mechanisms underlying the mechanical
realization of human speech capabilities and by the desire to automate simple tasks
that necessitate human-machine interaction, research in automatic speech and speaker
recognition by machines has attracted a great deal of attention. Statistical modeling of
speech, automatic speech recognition systems and many extensive voice-based
applications require a human-to-machine interface, e.g., automatic call processing in
telephone networks and query-based information systems that provide updated travel
information, stock price quotations and weather reports.
With reference to the chapter title, “Human Speech and Speaker Recognition” refers
to two types of voice recognition: “Speaker Recognition”, which determines who is
speaking, and “Speech Recognition”, which determines what is being said; both are
explained in the following sections.
2.2 Speech recognition
Speech recognition is also known as Automatic Speech Recognition (ASR), computer
speech recognition or Speech-to-Text. It is an interdisciplinary subfield of
computational linguistics that incorporates knowledge and research from linguistics,
computer science and electrical engineering in order to develop methodologies and
technologies enabling the recognition and translation of spoken language into text,
words or sentences by computers, computerized devices or robots.
On the one hand, some types of speech recognition use training sets in which an
individual speaker enrolls his or her words or an isolated vocabulary into the system.
The recognizer analyzes the person's voice and uses the result to identify that person's
speech and increase recognition accuracy. On the other hand, speech recognition that
does not use such training sets is known as a speaker-independent system.
From a technology perspective, the speech recognition field has a long history with
several waves of major innovations. Over the past 65 years of research, the field has
benefited from advances in machine learning and big data processing. These advances
are evidenced not only by the academic papers published in the field, but also by the
worldwide industry adoption of a variety of machine learning methods for designing
and deploying speech recognition systems. Industry competitors include Google,
Microsoft, Hewlett Packard Enterprise, IBM, Baidu, Apple, Amazon, Nuance,
iFlyTek, etc. Many of these competitors base the core technology of their speech
recognition systems on fundamental digital signal processing and information theory
[1].
2.3 Speaker recognition
Speaker recognition, or voice recognition, is the identification of a person from the
characteristics of his or her voice (voice biometrics). The difference between the two
terms is that speaker recognition determines who is speaking, i.e. the act of
authenticating a person, whereas speech recognition determines what the person said.
Unfortunately, these two terms are frequently confused, and “voice recognition” is
used for both. Speaker recognition can simplify the task of translating speech in
systems that have been trained on a specific person's voice, and it is used to
authenticate or verify the identity of a speaker as part of a security process. The long
history of speaker recognition is described in [1].
2.4 Summary of the Technology Progress
Brief summaries of the progress of research in speech and speaker recognition are
given in [1]. It can be seen that systems have been developed intensively worldwide,
spurred on by technological advances in signal processing, algorithms, architectures
and hardware. The technological progress of the past 65 years can be summarized by
the following changes:
1. Template matching to corpus-base statistical modeling, e.g. HMM and n-grams
2. Filter bank/spectral resonance to Cepstral features
3. Heuristic time-normalization to DTW/DP matching
4. “Distance”-based to likelihood-based methods
5. Maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI
6. Isolated word to continuous speech recognition
7. Small vocabulary to large vocabulary recognition
8. Context-independent units to context-dependent units for recognition
9. Clean speech to noisy/telephone speech recognition
10. Single speaker to speaker-independent/adaptive recognition
11. Monologue to dialogue/conversation recognition
12. Read speech to spontaneous speech recognition
13. Recognition to understanding
14. Single-modality to multimodal speech recognition
15. Hardware recognizer to software recognizer, and
16. No commercial application to many practical commercial applications.
The majority of these technological changes have been directed toward increasing
recognition robustness, together with many other significant techniques. Most of the
items above cover both the speech and speaker recognition fields. These recognizers
have been developed for a wide variety of applications, ranging from small-vocabulary
keyword recognition over dialed-up telephone lines to medium-vocabulary
voice-interactive command and control systems for business automation,
large-vocabulary speech transcription, spontaneous speech understanding and
limited-domain speech translation.
Table 2.1 State-of-the-art Level of ASR techniques
Processing Techniques State-of-the-art Level
Signal Conditioning 1
Speech Enhancement 3
Digital Signal Transformation 1
Analog Signal Transformation 1
Digital Parameter 2
Feature Extraction 3
Re-Synthesis 1
Orthographic Synthesis 3
Speaker Normalization 3
Speaker Adaptation 3
Situation Adaptation 3
Time Normalization 2
Segmentation And Labeling 2
Language Statistics 3
Syntax 2
Semantics 3
Speaker and Situation Pragmatics 3
Lexical Matching 3
Speech Understanding 2 – 3
Speaker Verification 1
Speaker Recognition 3
System Organization And Realization 1 – 3
Performance Evaluation 3
Many new technological methods have been endorsed by researchers. Nevertheless, a
number of practical limitations that hinder the widespread deployment of applications
and services remain unsolved, as shown in [1] and summarized in Table 2.1, where
state-of-the-art level 1 means that various useful methods have already overcome the
issue, level 2 means that some methods can possibly resolve or overcome it, and
level 3 means that there is still no hope of solving it; in other words, “a long way
to go”.
This research focuses on improving feature extraction by using the Wavelet transform
instead of the Fourier transform; the reasons for this replacement are described in
chapter 3. Moreover, the behavior of the proposed method can be characterized by
the fundamentals of the Wavelet transform of an input signal. Speaker adaptation and
speech understanding are improved by combining this transform with a well-known
deep learning algorithm, the multi-layer Artificial Neural Network, described in
chapter 4. The proposed research aims to develop the ability to classify command
input signals, improve the accuracy of voice command recognition and support
computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications.
Chapter 3
Wavelet Theory
A spectrogram, also known as a spectral waterfall, voiceprint or voice-gram, is a visual
representation of the spectrum of a sound, voice or signal as it varies with time or
another dependent parameter. Spectrograms are used to identify spoken words
phonetically and to analyze the various calls of animals, and they are utilized
extensively in the fields of music, sonar, radar, speech processing, seismology, etc.
In order to construct a spectrogram, the Fourier transform is an effective method for
analyzing the frequency components of a voice. However, when the Fourier transform
is taken over the whole time axis, the result is only an imprecise approximation of the
instant at which a particular frequency arises. A Short-time Fourier transform uses a
sliding window function to produce a spectrogram, which gives information on both
time and frequency. Nonetheless, the length of the window function limits the
frequency resolution. The Wavelet transform is proposed as a solution to this
problem. It is based on small waves of limited duration, known as the Mother
Wavelet. Translated versions of the wavelet are applied so as to use computational
resources as efficiently as the Fast Fourier transform, as described in [2], [3], [4], [5],
[6], [7] and [8].
3.1 Overview
The Wavelet transform and the theoretical framework for wavelets are widely
developed and offer a potential alternative in a variety of areas of science, such as
intelligent voice recognition in computational linguistics and voice-controlled
applications. Before investigating this valuable tool, we start by describing the
classical frequency analysis tool, the Fourier transform. Analyzing a signal with the
Fourier transform yields information about each frequency component. On the one
hand, the standard Fourier analysis explores the behavior of the magnitude output
|f(ω)|, which refers to no specific time, so strange frequency peak phenomena can
occur; moreover, short transient outbursts make no noticeable contribution to the
frequency spectrum. On the other hand, Wavelet analysis supplies both time and
frequency information, although the two parameters cannot be determined exactly
and simultaneously because of the Heisenberg uncertainty relation. Analytically, the
Wavelet transform is defined as a continuous mathematical transform; however,
computer analysis requires sampled signals and consequently the discrete version of
the transform. Therefore, both the continuous and the discrete transforms are
investigated.
3.2 Literature Reviews
3.2.1 Frequency Analysis using Fourier Transform
The Fourier transform is an effective method for analyzing frequency content. It is
named after its inventor, Joseph Fourier, in the early 1800s [8]. The Continuous
Fourier Transform (CFT) of a function f is defined as follows:
f(ω) = ∫_{−∞}^{∞} f(x) e^{−iωx} dx    (3.1)
and its inverse transform is defined as follows:
f(x) = (1/2π) ∫_{−∞}^{∞} f(ω) e^{iωx} dω    (3.2)
where f(ω) is the amplitude of the sinusoidal wave e^{iωx} in the function f(x). In
addition, the frequency analysis properties of the Fourier transform provide useful
mathematical relations; e.g., convolution in the time domain corresponds to
multiplication in the frequency domain. Thus, the CFT is used in standard
mathematical derivations.
In contrast, the Discrete Fourier Transform (DFT) is the discretized form of (3.1);
the frequency output f[k] of a periodic discrete signal f[n] is defined as follows [8]:
f[k] = (1/N) Σ_{n=0}^{N−1} f[n] e^{−i2πkn/N}    (3.3)
and its inverse transform is defined as follows:
f[n] = Σ_{k=0}^{N−1} f[k] e^{i2πkn/N}    (3.4)
where f[k] is the amplitude of the sinusoidal wave e^{i2πkn/N} in the discretized
function f[n]. In fact, the DFT is suited to signals from real applications, which
always have finite length. Moreover, there is an outstanding algorithm for computing
the DFT, called the Fast Fourier Transform (FFT).
Above all, the Fourier transform is a suitable method for stationary signals.
Nonetheless, for non-stationary signals the strange frequency peak phenomenon
occurs.
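The DFT pair (3.3) and (3.4) can be checked numerically. The sketch below (Python with NumPy; all names are illustrative, not from this thesis) implements both sums directly. Note that (3.3) places the 1/N factor on the forward transform, whereas NumPy's fft places it on the inverse, so the two outputs differ by exactly that factor.

```python
import numpy as np

def dft(f):
    """Direct DFT with the 1/N normalization of equation (3.3)."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # f[k] = (1/N) * sum_n f[n] * exp(-i 2 pi k n / N)
    return (np.exp(-2j * np.pi * k * n / N) @ f) / N

def idft(F):
    """Inverse DFT of equation (3.4): f[n] = sum_k F[k] * exp(i 2 pi n k / N)."""
    N = len(F)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return np.exp(2j * np.pi * n * k / N) @ F

# A stationary two-tone test signal: the DFT resolves both frequencies.
fs = N = 64
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

F = dft(x)
# NumPy puts the 1/N factor on the inverse transform, so divide to match (3.3).
assert np.allclose(F, np.fft.fft(x) / N)
assert np.allclose(idft(F), x)  # perfect reconstruction
```

Because both tones fall exactly on DFT bins, the spectrum shows two clean peaks, consistent with the claim above that the Fourier transform suits stationary signals.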
3.2.2 Time-Frequency Analysis using Fourier Transform
A Short-time Fourier Transform (STFT) estimates the frequency content of a
function f(x) at an arbitrary time x = t by cutting out a piece of the function around
t and computing its Fourier transform. The name is easily justified, since the STFT is
a Fourier transform of a short piece of the function during a short period of time; the
restriction in time is known as a translated window, or window function [8]. The
STFT of a function f with respect to g is defined as follows:
V_g f(t, ξ) = ∫_{−∞}^{∞} f(x) g_{t,ξ}(x) dx    (3.5)

where

g_{t,ξ}(x) = e^{iξx} g(x − t)    (3.6)
and g is a real, symmetric window function, e.g., the Rectangular, Hamming or
Blackman–Harris window. It can be seen that the STFT uses this sliding window to
produce a spectrogram, which offers information on both time and frequency.
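As a sketch of (3.5) and (3.6) in discrete form (Python; the function names, window choice and test signal are illustrative assumptions, not taken from this thesis), a real, symmetric Hamming window slides along the signal and a DFT is taken of each windowed piece:

```python
import numpy as np

def stft(f, win_len=64, hop=16):
    """Discretized STFT of (3.5): slide a real, symmetric window g along
    the signal and take the Fourier transform of each windowed piece."""
    g = np.hamming(win_len)                   # window function g
    frames = []
    for start in range(0, len(f) - win_len + 1, hop):
        piece = f[start:start + win_len] * g  # f(x) * g(x - t)
        frames.append(np.fft.rfft(piece))     # spectrum of this short piece
    return np.array(frames)                   # shape: (time frames, freq bins)

# A two-part signal: a 5 Hz tone for the first second, then 20 Hz.
fs = 256
t = np.arange(2 * fs) / fs
x = np.where(t < 1.0, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t))

S = np.abs(stft(x))
# The dominant frequency bin moves upward between the first and last frames,
# which a whole-axis Fourier transform could not localize in time.
assert S[0].argmax() < S[-1].argmax()
```

The frequency resolution of every frame is fixed at fs/win_len, which is exactly the window-length limitation discussed below.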
Next, we investigate the localization, i.e. the time and frequency spread, of the
window function g. The time spread σx(g) of g is given as follows:

σ²x(g) = ∫_{−∞}^{∞} (x − t)² |g_{t,ξ}(x)|² dx = ∫_{−∞}^{∞} x² |g(x)|² dx    (3.7)
and the frequency spread σω(g) is given as follows:

σ²ω(g) = (1/2π) ∫_{−∞}^{∞} (ω − ξ)² |g_{t,ξ}(ω)|² dω = (1/2π) ∫_{−∞}^{∞} ω² |g(ω)|² dω    (3.8)
where Vgf(t, ξ) is thought of as a measure of the frequency content of f at time t
and frequency ξ.
The STFT offers simple analytic calculation with simple sine and cosine orthogonal
functions. Moreover, recursive algorithms are applicable, which led to a revolution in
scientific computing. However, the window length limits the resolution in each
frequency band; the algorithm does not handle fast variations or discontinuities of the
signal, such as overshoot and undershoot, and the STFT cannot guarantee a correct
analysis of non-sinusoidal input signals. Hence, a new algorithm for time-frequency
analysis, named the Wavelet Transform, is proposed.
3.3 Fundamental of Wavelet Transform
The wavelet transform is an alternative, effective measurement tool for
time-frequency analysis and the highlight of interest in this dissertation. The STFT
time-frequency window g_{t,ξ} is replaced by a time-scale window ψ_{a,b} with similar
properties but important differences in resolution, as explained in [2], [3], [4], [5], [6],
[7], [8] and the next section.
3.3.1 Continuous Wavelet Transform
A function ψ satisfying the condition ∫_{−∞}^{∞} ψ(x) dx = 0 is called a Wavelet; for
every f, such a ψ defines the continuous wavelet transform. The Continuous Wavelet
Transform (CWT) is given as follows:

f_ψ(a, b) = ∫_{−∞}^{∞} f(x) ψ_{a,b}(x) dx    (3.9)
where

ψ_{a,b}(x) = (1/√a) ψ((x − b)/a)    (3.10)
where the function ψ is known as the Mother Wavelet, chosen to be localized at x = 0
in time and at some ω = ω₀ > 0 in frequency. The parameter a is the input scale,
which plays the role of the frequency variable, and b is the input time variable.
The function ψ_{a,b} has time spread σx(ψ_{a,b}) and frequency spread σω(ψ_{a,b})
around ω₀/a, defined as follows:

σ²x(ψ_{a,b}) = ∫_{−∞}^{∞} x² |ψ(x)|² dx    (3.11)

σ²ω(ψ_{a,b}) = (1/2π) ∫_{0}^{∞} (ω − ω₀)² |ψ(ω)|² dω    (3.12)
Moreover, Parseval's formula gives a time-frequency interpretation:

∫_{−∞}^{∞} f(x) ψ_{a,b}(x) dx = (1/2π) ∫_{−∞}^{∞} f(ω) ψ_{a,b}(ω) dω    (3.13)
It can be seen that the CWT measures the frequency content of f at the frequency
ω₀/a and time b, just as the CFT does. In the frequency domain,
ψ_{a,b}(ω) = √a e^{−iωb} ψ(aω), so ψ_{a,b} is a dilated copy of ψ. Changing the
dilation parameter a changes the support of ψ_{a,b} in time and rescales ψ; changing
the translation parameter b, on the other hand, changes the location of ψ_{a,b}.
Hence, by varying the parameter set (a, b), the transform is computed on the entire
time-frequency plane. It can be seen that small scales correspond to high frequencies.
This establishes the link with the expressions of Fourier analysis, and it is the reason
the term time-frequency plane is used for Wavelet analysis instead of the more
natural time-scale plane.
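Equations (3.9) and (3.10) can be evaluated numerically at a single point (a, b). The sketch below (Python; the Mexican-hat wavelet, test tone and function names are illustrative assumptions, since no particular mother wavelet is fixed at this point in the text) shows that the coefficient magnitude is large when the scale a matches the frequency of the analyzed tone, and small when it does not:

```python
import numpy as np

def psi(x):
    """Mexican-hat mother wavelet: zero mean, localized at x = 0."""
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def cwt_point(f, x, a, b):
    """Equation (3.9) at a single point (a, b), by direct numerical integration."""
    psi_ab = psi((x - b) / a) / np.sqrt(a)  # equation (3.10)
    dx = x[1] - x[0]
    return np.sum(f * psi_ab) * dx

# A 2 Hz cosine tone sampled on [0, 4].
x = np.linspace(0.0, 4.0, 4096)
f = np.cos(2 * np.pi * 2 * x)

# The Mexican hat is centred near omega_0 = sqrt(2) rad, so the matching
# scale for an angular frequency of 2*pi*2 rad/s is about a = sqrt(2)/(4*pi).
a_match = np.sqrt(2) / (4 * np.pi)
coef_match = abs(cwt_point(f, x, a_match, b=2.0))
coef_off = abs(cwt_point(f, x, a_match / 8, b=2.0))  # badly mismatched scale
assert coef_match > 10 * coef_off
```

This illustrates the statement above: small scales probe high frequencies, and the coefficient at (a, b) is the correlation of f with a dilated, translated copy of ψ.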
3.3.2 Discrete Wavelet Transform
Under certain restrictions on the Mother Wavelet ψ, all information in the
transformed signal is preserved when the Wavelet transform is sampled on certain
discrete subsets of the time-frequency plane. Precisely, the values of the continuous
transform at these points are the coefficients of a corresponding wavelet basis series
expansion. Referring to the CWT equation (3.9), consider the case a = 2^{−j},
b = 2^{−j}k for integers j, k. Then:
ψ_{2^{−j},2^{−j}k}(x) = (1/√(2^{−j})) ψ((x − 2^{−j}k)/2^{−j}) = 2^{j/2} ψ(2^{j}x − k)    (3.14)
where w_{j,k} represents the values of the CWT, known as the wavelet coefficients.
The coordinates (2^{−j}k, 2^{−j}) represent a dyadic grid in the time-scale plane, and
the values correspond to the correlation between f and ψ_{a,b} at the specific points
(a, b). This sampling offers sufficient information to make a perfect reconstruction of
the signal possible, provided that special conditions on the wavelet function ψ are
fulfilled. Moreover, it is possible to construct a function ψ such that (ψ_{j,k})_{j,k}
forms an orthonormal basis, which leads to the Discrete Wavelet Transform (DWT).
The first to construct a smooth wavelet basis was Jan-Olov Strömberg, who has since
been followed by several others, e.g., Daubechies and Meyer. A wavelet decomposition
of a function f with orthonormal wavelet basis functions is given as follows:
f = Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} w_{j,k} ψ_{j,k}    (3.15)

where

w_{j,k} = ⟨f, ψ_{j,k}⟩    (3.16)
In equations (3.15) and (3.16), the DWT is a doubly infinite summation over both the
time index k and the scale index j. However, the DWT allows the summation to be
truncated to finitely many terms within an acceptable tolerance. For infinitely
supported wavelets, the wavelet energy is concentrated within a certain interval;
overall, the finite summation over k is a valid approximation, as explained in the next
paragraph.
The decomposition of the signal into different frequency bands is obtained simply by
successive high-pass and low-pass filtering of the time domain signal in three steps.
First, the original signal is passed through a one-band high-pass filter g and a
one-band low-pass filter h. Second, half of the samples are eliminated by
downsampling in accordance with the Nyquist rule: the filtered signal now has a
highest frequency of π/2 radians instead of π, so it is subsampled by 2 by discarding
every other sample. Last, the next level of decomposition is executed recursively from
steps one and two. One level of decomposition is expressed as follows:
φ(x) = Σ_{k=−∞}^{∞} h_k φ(2x − k)    (3.17)

ψ(x) = Σ_{k=−∞}^{∞} g_k φ(2x − k)    (3.18)
where φ(x) is the scaling (dilation) function of the DWT, ψ(x) is the wavelet
function, and h_k and g_k are the coefficient sequences obtained from each wavelet
function; (3.17) and (3.18) are known as the scaling equation and the wavelet
equation, and their coefficients as the scaling coefficients and wavelet coefficients,
respectively. This method is also known as multiresolution analysis.
The coefficients h_k and g_k from the scaling and wavelet equations (3.17) and (3.18) operate as low pass filters or “approximations” and high pass filters or “details”, respectively. These filters are utilized in a fast filter bank algorithm known as Mallat's algorithm, whose computational cost is lower than that of the FFT. The algorithm is briefly illustrated in figure 3.1 by [2].
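One decomposition level of this filter bank can be sketched as follows, using the Haar pair as an assumed stand-in for a general quadrature mirror filter pair; each branch filters and then subsamples by 2, and orthonormality preserves the signal energy across the two bands.

```python
import numpy as np

# One filter bank level: low pass h (approximation) and high pass g
# (detail), each followed by downsampling by 2. The Haar pair below is
# an assumed stand-in for any quadrature mirror filter pair.
h = np.array([1.0, 1.0]) / np.sqrt(2)   # low pass (scaling) filter
g = np.array([1.0, -1.0]) / np.sqrt(2)  # high pass (wavelet) filter

def dwt_level(x):
    """Filter, then keep every other sample (subsample by 2)."""
    approx = np.convolve(x, h)[1::2]
    detail = np.convolve(x, g)[1::2]
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = dwt_level(x)  # each half the length of x
```

Applying `dwt_level` again to the approximation band gives the next level of the recursive decomposition.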
3.4 Heisenberg ’s Uncertainty Principles
The wavelet functions are localized in both time and frequency. However, wavelet functions are unable to achieve exact localization due to Heisenberg's uncertainty principle. The
[Figure: a multilevel filter bank; at each level the signal passes through a low pass filter and a high pass filter, each followed by downsampling by 2.]
Fig. 3.1 Schematic representation of the discrete wavelet transform.
localization measures σx(ψa,b) and σω(ψa,b) are illustrated as the sides of non-spaced
rectangles or boxes in the time-frequency plane shown in figure 3.2 by [6].
Fig. 3.2 Heisenberg boxes of wavelets (left) and STFT (right).
All of these boxes have the same area. Their sides are stretched and contracted by the same factors a and a⁻¹ as the corresponding wavelet functions. On the one hand, the wavelet transform offers a higher time resolution at higher frequencies, which makes it advantageous for analyzing signals that contain both low and high frequencies. Accordingly, the DWT filter bank is a suitable algorithm for realizing this optimized scheme of Heisenberg box shapes. On the other hand, the STFT offers equal resolution across the entire time-frequency plane. A short window allows the analysis of transient components of a signal, such as high frequencies, while a broader window allows the analysis of low frequencies. It can be seen that the STFT is unable to analyze high and low frequencies simultaneously.
Chapter 4
Neural Network Theory
Neural networks are motivated and initiated by the recognition problem in brain science, where the brain computes in an entirely different way from the conventional digital computer. Ramon y Cajal introduced the idea of neurons as structural constituents of the brain in 1911. Basically, neurons are five to six orders of magnitude slower than computer logic gates. However, the brain compensates for the relatively slow operation of a neuron by constructing a countless number of nerve cells with massive interconnections between the cells. The estimated number of nerve cells is approximately 10 billion neurons in the human cortex with 60 trillion connections, making the brain an enormously efficient structure.
Research in the field of artificial neural networks has attracted increasing attention since 1943, when the first model of artificial neurons was presented by Warren McCulloch and Walter Pitts. When Minsky and Papert published their book Perceptrons in 1969, in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued, such as Teuvo Kohonen, Stephen Grossberg, James Anderson and Kunihiko Fukushima. Sophisticated proposals have been made from decade to decade. Finally, mathematical analysis of the new models solved some of the networks' mysteries but still left many questions open for future investigation. In other words, the study of neurons, interconnections, and the brain's elementary building blocks is one of the most dynamic and important research fields in modern biology; the relevance of this endeavor is illustrated by the fact that between 1901 and 1991 approximately 10% of the Nobel Prizes for Physiology or Medicine were awarded to scientists who contributed to the understanding of the brain. It can be seen that the artificial neural network model has been continuously researched and developed by many researchers from decade to decade.
This chapter gives an introduction to the fundamentals of neural networks, learning rules, network architectures, the mathematical analysis of these networks and their application to practical engineering problems, such as nonlinear regression, pattern recognition, signal processing, data mining, control systems and real world problems, with reference to [9], [10], [11], [12], [13] and [14].
4.1 Biological Inspiration
The brain consists of a large number of highly connected elements known as neurons. Proposed neuron models have three principal components: the dendrites, the cell body and the axon. The dendrites are tree-like receptive networks of nerve fibers, which carry electrical signals into the cell body. The cell body effectively sums and thresholds these incoming signals. The axon is a single long fiber, which carries the signal from the cell body out to other neurons. The point of contact between an axon of one cell and a dendrite of another cell is called a synapse; the arrangement of neurons and the strengths of the individual synapses are determined by a complex chemical process. A simplified schematic diagram of two biological neurons is illustrated in figure 4.1 by [10], [11] and [12].
Regularly, some of the neural structure is defined at birth. Other parts are developed through learning, as new connections are constructed and unused connections are removed. It can be seen that development is most noticeable in the early stages of
[Figure: two neurons, with dendrites, cell body, axon and synapse labeled.]
Fig. 4.1 Schematic Drawing of Biological Neurons.
life. Neural structures continue to change throughout life. These later changes tend to consist mainly of the strengthening or weakening of synaptic junctions; e.g., it is believed that new memories are formed by the modification of these synaptic strengths. Thus, the process of learning a new friend's face consists of altering various synapses. As another example, the hippocampus of London taxi drivers is significantly larger than average because they must memorize a large amount of navigational information, a process of learning and adaptation that takes more than two years.
Artificial neural networks have the ability to simulate the complexity of the brain, with two key similarities between biological and artificial neural networks. First, the building blocks of both networks are simple computational devices, although artificial neurons are much simpler than biological neurons. An artificial neural network is highly
interconnected, as is a biological neural network. Second, the connections between neurons determine the function of the network. The determination of the appropriate connections to solve particular problems is the primary objective of this chapter. Biological neurons are very slow when compared to electrical circuits, yet the brain is able to perform many tasks much faster than any conventional computer because of the massively parallel structure of biological neural networks, in which all of the neurons operate at the same time. Artificial neural networks share this parallel structure with the human brain. Moreover, their parallel structure makes them ideally suited to implementation using VLSI, optical devices and parallel processors.
4.2 Neuron Model
4.2.1 General Neuron
A single input neuron is illustrated in figure 4.2 by [12]. The scalar input p is multiplied by the scalar weight w to form wp, one of the terms sent to the summation. The other input, 1, is multiplied by a bias b and then passed to the summation. The summation output n is referred to as the net input, which goes into a transfer function f to produce the scalar neuron output a. Other books use the term activation function rather than transfer function and offset rather than bias. Generally, this simple model relates to the biological neuron shown in the previous section: the weight corresponds to the strength of a synapse, the cell body is represented by the summation and the transfer function, and the neuron output represents the signal on the axon.
All in all, the neuron output is calculated as follows:

a = f(wp + b)   (4.1)
Fig. 4.2 Single Input Neuron.
Fig. 4.3 Multiple Input Neuron.
The actual output depends on the particular transfer function, which is selected by the designer. The bias is like a weight, except that it has a constant input of 1; a particular neuron may also omit this bias. The weight w and bias b are both adjustable scalar parameters of the neuron. Typically, the transfer function is chosen
by the designer. The parameters w and b are adjusted by a learning rule so that the neuron input-output relationship achieves some specific goal.
As a rule, a neuron may also have more than one input; a neuron with R inputs is illustrated in figure 4.3 by [12]. The individual elements of the input vector p are each weighted by the corresponding elements of the weight vector w.
The neuron has a bias b, which is summed with the weighted inputs to form the net input, as given below:

n = b + Σ_{i=1}^{R} w_i p_i   (4.2)
This expression can be written in vector form:

n = w^T p + b   (4.3)

The output is expressed as:

a = f(w^T p + b)   (4.4)
On the whole, neural networks are described with a weight matrix W. Each row of the weight matrix represents one neuron's weight vector connections, e.g., the set of the 1st neuron's weight connections is {w_{1,1}, w_{1,2}, w_{1,3}, . . . , w_{1,i}, . . . , w_{1,R}} and the 2nd neuron's weight connections are {w_{2,1}, w_{2,2}, w_{2,3}, . . . , w_{2,i}, . . . , w_{2,R}}.
Thus, the matrix form is expressed as:

n = Wp + b   (4.5)

Likewise, the output can be expressed as:

a = f(Wp + b)   (4.6)
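A minimal numerical sketch of (4.5) and (4.6), with illustrative weight, bias and input values assumed for a layer of two neurons with three inputs:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# A layer of S = 2 neurons with R = 3 inputs, per (4.5)-(4.6);
# row i of W holds the weights of neuron i.
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])
b = np.array([0.1, -0.3])
p = np.array([1.0, 0.5, 2.0])

n = W @ p + b   # net input, (4.5)
a = logsig(n)   # layer output, (4.6)
```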
These equations adopt a particular convention in assigning the indices of the elements of the weight matrix. The first index indicates the particular destination neuron for that weight. The second index indicates the source of the signal fed to the neuron. Thus, w_{1,2} represents the connection to the first neuron from the second source. This kind of matrix expression is utilized throughout this chapter and this research. Moreover, the simple abbreviated notation of a neuron can be illustrated as in figure 4.4, noting that the bias b is already included at output a [12], below:
Fig. 4.4 Abbreviated Notation of Neuron.
4.2.2 Transfer Functions
The transfer function in figure 4.2 or figure 4.3 may be a linear function or a nonlinear function of n, as given by [10], [11] and [12]. A particular transfer function is chosen to satisfy the specification of the problem that the neuron is attempting to solve. A variety of transfer functions are included here. In this section, the log-sigmoid and soft-max transfer functions are described. First, the log-sigmoid transfer function is described in figure 4.5.
The log-sigmoid transfer function takes an input, which may have any value between plus and minus infinity, and squashes the output into the range 0 to 1, according to the following expression:
a = 1 / (1 + e^(−n))   (4.7)

Fig. 4.5 Log-Sigmoid Transfer Function.
Fig. 4.6 Soft-Max Transfer Function.
The log-sigmoid transfer function is commonly utilized in multilayer neural networks, which are trained using the backpropagation algorithm or learning rule, because this function is differentiable. The reason differentiability matters is described in the last section.
Second, the soft-max transfer function is a generalization of the logistic function that squashes a vector n of arbitrary real values to a vector a of real values in the range 0 to 1, as described in figure 4.6. In neural network
simulations, the soft-max function is generally implemented at the final layer of a network for classification problems. Such networks are trained under a log loss or cross-entropy regime, which gives a non-linear variant of logistic regression.
Table 4.1 List of Transfer Functions

Transfer Function Name         Input-Output Relation                                        Short Name
Hard Limit                     a = 0 if n < 0;  a = 1 if n ≥ 0                              hardlim
Symmetrical Hard Limit         a = −1 if n < 0;  a = 1 if n ≥ 0                             hardlims
Linear                         a = n                                                        purelin
Saturating Linear              a = 0 if n < 0;  a = n if 0 ≤ n ≤ 1;  a = 1 if n > 1         satlin
Symmetric Saturating Linear    a = −1 if n < −1;  a = n if −1 ≤ n ≤ 1;  a = 1 if n > 1      satlins
Log-Sigmoid                    a = 1/(1 + e^(−n))                                           logsig
Hyperbolic Tangent Sigmoid     a = (e^n − e^(−n))/(e^n + e^(−n))                            tansig
Positive Linear                a = 0 if n < 0;  a = n if n ≥ 0                              poslin
Soft Max                       a_i = e^(n_i) / Σ_j e^(n_j)                                  softmax

The soft-max transfer function is given as follows:

a_i = e^(n_i) / Σ_j e^(n_j)   (4.8)
The soft-max transfer function is the gradient-log-normalizer of the categorical
probability distribution. Consequently, the soft-max transfer function is implemented in various probabilistic multiclass classification methods, including multinomial logistic regression, multiclass linear discriminant analysis and Bayes classifiers in artificial neural networks.
Most of the transfer functions implemented in networks worldwide are summarized in table 4.1 by [10].
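A few of the transfer functions in table 4.1 can be sketched directly (the max-shift inside the soft-max is a standard numerical-stability device, not part of (4.8)):

```python
import numpy as np

# Sketches of a few transfer functions from table 4.1.
def hardlim(n):
    return np.where(n >= 0, 1.0, 0.0)

def purelin(n):
    return n

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def softmax(n):
    # (4.8); subtracting max(n) leaves the result unchanged but
    # avoids overflow in exp for large net inputs
    e = np.exp(n - np.max(n))
    return e / e.sum()

n = np.array([1.0, 2.0, 3.0])
a = softmax(n)  # components sum to 1; the largest input gets the largest output
```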
4.3 Network Architectures
4.3.1 A Layer of Neurons
A single-layer network of S neurons is shown in figure 4.7; each of the R inputs is connected to each of the neurons, and the weight matrix now has S rows.
The layer includes the weight matrix, the output vector a and the transfer function boxes, as well as the bias vector and the summations. Each element of the input vector p is connected to each neuron through the weight matrix W. Each neuron has a bias b, a summation, a transfer function f and an output. Altogether, the outputs form the output vector a. Generally, the input vector elements enter the network through the weight matrix W, given below:
W = [ w_{1,1}  w_{1,2}  · · ·  w_{1,R}
      w_{2,1}  w_{2,2}  · · ·  w_{2,R}
        ...      ...    . . .    ...
      w_{S,1}  w_{S,2}  · · ·  w_{S,R} ]   (4.9)
The row indices of the elements of matrix W indicate the destination neuron as-
sociated with that weight, while the column indices indicate the source of the input for
that weight, e.g., the indices in w2,3 say that this weight represents the connection to
the second neuron from the third source.
Fig. 4.7 Single Layer of Neural Networks.
4.3.2 Multiple Layers of Neurons
In the case of a network with several layers, each layer has its own weight matrix W, its own bias vector b, a net input vector n and an output vector a. The number of the layer is appended as a superscript to the name of each of these variables, e.g., the weight matrix for the first layer is written as W^1 and the weight matrix for the second layer is written as W^2. This notation is used in the three-layer network illustrated in figure 4.8.
There are R inputs, S^1 neurons in the first layer, S^2 neurons in the second layer, etc. Above all, different layers are allowed to have different numbers of neurons. The outputs of the first and second layers are the inputs for layers two and three.
[Figure: a feedforward network with an input layer, two hidden layers and an output layer.]
Fig. 4.8 Three Layer of Neural Networks.
Thus, the second layer can be viewed as a one-layer network with R = S^1 inputs, S = S^2 neurons and an S^2 × S^1 weight matrix W^2. The input to the second layer is a^1 and the output is a^2. The layer whose output is the network output is known as the “output layer”; the other layers are known as “hidden layers”. The network shown above has an output layer at the third layer and two hidden layers at the first and second layers.
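The layered forward pass described above can be sketched as follows; the sizes R = 4, S^1 = 3, S^2 = 3, S^3 = 2 and the random weights, biases and input are illustrative assumptions, and a log-sigmoid transfer function is assumed in every layer:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# Forward pass of a three-layer network as in figure 4.8; the sizes
# and the random weights/biases are illustrative assumptions.
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]                       # R, S^1, S^2, S^3
Ws = [rng.standard_normal((sizes[m + 1], sizes[m])) for m in range(3)]
bs = [rng.standard_normal(sizes[m + 1]) for m in range(3)]

a = rng.standard_normal(sizes[0])          # network input p
for m in range(3):
    a = logsig(Ws[m] @ a + bs[m])          # a^{m+1} = f(W^{m+1} a^m + b^{m+1})
```

The output of each layer becomes the input of the next, exactly as the text describes.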
It can be seen that multilayer networks are more powerful than single-layer networks; e.g., a two-layer network having a sigmoid first layer and a linear second layer can be trained to approximate most functions arbitrarily well, whereas a single-layer network cannot offer acceptable performance. All things considered, the number of choices to be made in specifying a network must be carefully considered. The number of inputs to the network and the number of outputs from the network are defined by the external problem specification, e.g., if there are four external
variables to be utilized as inputs, then there are four inputs to the network. Similarly, if there are to be seven outputs from the network, then there have to be seven neurons in the output layer. The desired characteristics of the output signal also help to select the transfer function for the output layer, e.g., if an output is to be either −1 or 1, then a symmetrical hard limit transfer function is suggested. Thus, the architecture of a single-layer network is almost completely determined by the problem specification, including the specific numbers of inputs and outputs and the particular output signal characteristics.
Of course, the networks allow the designer to choose neurons with or without biases. The bias gives the network an extra variable, so networks with biases are more powerful than those without; a neuron without a bias will always have a net input of zero when the network inputs are zero. In the end, biases should be implemented without any confusion.
4.3.3 Recurrent Networks
A recurrent neural network is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, recurrent neural networks utilize their internal memory to process arbitrary sequences of inputs, which makes them applicable to tasks such as unsegmented connected handwriting recognition or speech recognition. However, this network class is not the focus of this research because of the problem specifications.
4.4 Performance Optimization
A suitable weight matrix W is sought that minimizes the chosen error function F(W). Generating a small step in weight space from W to W + δW produces the change in the error function δF ≅ δW^T ∇F(W), where ∇F(W) points in the direction of the greatest rate of increase of the error function. The error F(W) is a smooth continuous function of W, and its smallest value occurs at a point in weight space where the gradient of the error function vanishes, leading to the condition ∇F(W) = 0; otherwise we can take a small step in the direction of −∇F(W) and thereby further reduce the error. Points at which the gradient vanishes are known as stationary points and are further classified into minima, maxima and saddle points. The objective is to find a matrix W such that F(W) takes its smallest value. However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, so there are many points in weight space at which the gradient vanishes. For this reason, an analytical solution to the equation ∇F(W) = 0 cannot in general be found. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on performance optimization. Most techniques involve choosing some initial value W(0) for the weight matrix and then moving through weight space in a succession of steps of the form:
W(k + 1) = W(k) + ∆W(k)   (4.10)
where k is the iteration step. Different algorithms involve different choices for the weight update ∆W(k), and many algorithms make use of gradient information. The simplest approach to using gradient information is to choose the weight update to comprise a small step in the direction of the negative gradient, as shown below:

W(k + 1) = W(k) − α∇F(W(k))   (4.11)
where the parameter α > 0 is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight matrix and the process is repeated. Note that the error function is defined with respect to a training set, so each step requires that the entire training set be processed in order to evaluate ∇F. This simple approach is known as gradient descent or steepest descent, as shown in [9].
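The update rule (4.11) can be sketched on a toy quadratic error function whose minimizer is known (the target W_star and learning rate are assumptions chosen for illustration):

```python
import numpy as np

# Steepest descent (4.11) on a toy quadratic F(W) = ||W - W_star||^2 / 2,
# whose gradient is simply W - W_star; W_star and alpha are assumptions.
W_star = np.array([2.0, -1.0])
W = np.zeros(2)
alpha = 0.1

for _ in range(200):
    grad = W - W_star          # gradient of F at the current W
    W = W - alpha * grad       # W(k + 1) = W(k) - alpha * grad F(W(k))
```

The distance to the minimizer shrinks by a factor of (1 − α) per step, so W converges to W_star.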
There are more efficient methods, such as Conjugate Gradient and Quasi-Newton algorithms, which are much more robust and much faster than simple gradient descent. This research utilized the Scaled Conjugate Gradient algorithm [13], which makes use of second-order information from the neural network. The performance of the Scaled Conjugate Gradient algorithm has been benchmarked against the performance of Conjugate Gradient backpropagation and the one-step Broyden-Fletcher-Goldfarb-Shanno memoryless Quasi-Newton algorithm.
4.5 Optimization Method: Backpropagation
Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses the gradient information to update the weights in order to minimize the loss function. Importantly, backpropagation requires a desired output for each input value in order to calculate the loss function gradient.
The backpropagation algorithm was originally introduced in the 1970s. However, its importance was not fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton and Ronald Williams, which describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural networks to solve problems which had previously been insoluble. Backpropagation is described in [9], [10], [11], [12], [13] and [14].
4.5.1 Performance Index
The backpropagation algorithm for multilayer networks is a generalization of the least-mean-squares algorithm. The training set consists of input/target pairs:

{p_1, t_1}, {p_2, t_2}, {p_3, t_3}, . . . , {p_Q, t_Q}   (4.12)

where p_q is an input to the network and t_q is the corresponding target output. As each input is applied to the network, the network output is compared to the target. The algorithm adjusts the network parameters in order to minimize the mean square error, as given below:
F(x) = E[e^2] = E[(t − a)^2]   (4.13)
where x is the vector of network weights and biases. If the network has multiple outputs, this generalizes as follows:

F(x) = E[e^T e] = E[(t − a)^T (t − a)]   (4.14)
Thus, we approximate the mean square error by:

F(x) = (t(k) − a(k))^T (t(k) − a(k)) = e(k)^T e(k)   (4.15)
where the expectation of the squared error has been replaced by the squared error at iteration k. The steepest descent algorithm for the approximate mean square error is then given as follows:

w^m_{i,j}(k + 1) = w^m_{i,j}(k) − α ∂F/∂w^m_{i,j}   (4.16)
b^m_i(k + 1) = b^m_i(k) − α ∂F/∂b^m_i   (4.17)

where α is the learning rate. The computation of the partial derivatives of F is required, which is described in the next section.
4.5.2 Chain Rule
For the multilayer network the error is not an explicit function of the weights in the hidden layers. Therefore, these derivatives cannot be computed directly from equations (4.13) and (4.14).
Because the error is an indirect function of the weights in the hidden layers, the chain rule of calculus is employed to calculate the derivatives. To review the chain rule, suppose that we have a function f that is an explicit function only of the variable n. We want to take the derivative of f with respect to a third variable w. The chain rule is then:

df(n(w))/dw = df(n)/dn × dn(w)/dw   (4.18)
This concept is applied to find the derivatives in (4.16) and (4.17):

∂F/∂w^m_{i,j} = ∂F/∂n^m_i × ∂n^m_i/∂w^m_{i,j}   (4.19)

∂F/∂b^m_i = ∂F/∂n^m_i × ∂n^m_i/∂b^m_i   (4.20)
The second term in each of these equations can be easily computed, since the net input to layer m is an explicit function of the weights and bias in that layer:

n^m_i = b^m_i + Σ_{j=1}^{S^{m−1}} w^m_{i,j} a^{m−1}_j   (4.21)
Therefore:

∂n^m_i/∂w^m_{i,j} = a^{m−1}_j   (4.22)

∂n^m_i/∂b^m_i = 1   (4.23)
Now define the sensitivity:

s^m_i ≡ ∂F/∂n^m_i   (4.24)

which is the sensitivity of F to changes in the ith element of the net input at layer m. Then (4.19) and (4.20) can be simplified to:
∂F/∂w^m_{i,j} = s^m_i a^{m−1}_j   (4.25)

∂F/∂b^m_i = s^m_i   (4.26)
The approximate steepest descent algorithm is then described as follows:

w^m_{i,j}(k + 1) = w^m_{i,j}(k) − α s^m_i a^{m−1}_j   (4.27)

b^m_i(k + 1) = b^m_i(k) − α s^m_i   (4.28)

All in all, in matrix and vector form this becomes:

W^m(k + 1) = W^m(k) − α s^m (a^{m−1})^T   (4.29)

b^m(k + 1) = b^m(k) − α s^m   (4.30)
where s^m is the vector of sensitivities:

s^m = [ ∂F/∂n^m_1   ∂F/∂n^m_2   ∂F/∂n^m_3   · · ·   ∂F/∂n^m_{S^m} ]^T   (4.31)
4.5.3 Sensitivities
The sensitivities s^m are of central importance, and computing them requires another application of the chain rule. It is this process that gives us the term backpropagation, because it describes a recurrence relationship in which the sensitivity at layer m is computed from the sensitivity at layer m + 1. In order to derive the recurrence relationship for the sensitivities, a Jacobian matrix is defined as follows:
∂n^{m+1}/∂n^m =
[ ∂n^{m+1}_1/∂n^m_1           ∂n^{m+1}_1/∂n^m_2           · · ·   ∂n^{m+1}_1/∂n^m_{S^m}
  ∂n^{m+1}_2/∂n^m_1           ∂n^{m+1}_2/∂n^m_2           · · ·   ∂n^{m+1}_2/∂n^m_{S^m}
  ...                          ...                         . . .   ...
  ∂n^{m+1}_{S^{m+1}}/∂n^m_1   ∂n^{m+1}_{S^{m+1}}/∂n^m_2   · · ·   ∂n^{m+1}_{S^{m+1}}/∂n^m_{S^m} ]   (4.32)
In order to find an expression for the Jacobian matrix, consider the i, j element of the matrix:

∂n^{m+1}_i/∂n^m_j = ∂( b^{m+1}_i + Σ_{l=1}^{S^m} w^{m+1}_{i,l} a^m_l ) / ∂n^m_j
                  = w^{m+1}_{i,j} ∂a^m_j/∂n^m_j
                  = w^{m+1}_{i,j} ∂f^m(n^m_j)/∂n^m_j
                  = w^{m+1}_{i,j} ḟ^m(n^m_j)   (4.33)
Therefore, the Jacobian matrix can be written as:

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m)   (4.34)
where:

Ḟ^m(n^m) =
[ ḟ^m(n^m_1)   0            · · ·   0
  0            ḟ^m(n^m_2)   · · ·   0
  ...          ...          . . .   ...
  0            0            · · ·   ḟ^m(n^m_{S^m}) ]   (4.35)
The recurrence relation for the sensitivity can now be derived by using the chain rule in matrix form:

s^m = ∂F/∂n^m
    = (∂n^{m+1}/∂n^m)^T ∂F/∂n^{m+1}
    = Ḟ^m(n^m) (W^{m+1})^T ∂F/∂n^{m+1}
    = Ḟ^m(n^m) (W^{m+1})^T s^{m+1}   (4.36)
The starting point s^M of the recurrence is obtained at the final layer:

s^M_i = ∂F/∂n^M_i
      = ∂((t − a)^T (t − a))/∂n^M_i
      = ∂( Σ_{j=1}^{S^M} (t_j − a_j)^2 )/∂n^M_i
      = −2(t_i − a_i) ∂a^M_i/∂n^M_i
      = −2(t_i − a_i) ∂f^M(n^M_i)/∂n^M_i
      = −2(t_i − a_i) ḟ^M(n^M_i)   (4.37)
Likewise, this can be expressed in matrix form as:

s^M = −2 Ḟ^M(n^M)(t − a)   (4.38)
This is where the backpropagation algorithm derives its name: the sensitivities are propagated backward through the network from the last layer to the first layer:

s^M → s^{M−1} → · · · → s^2 → s^1   (4.39)
At this point it is worth emphasizing that the backpropagation algorithm uses the same approximate steepest descent technique that we used in the least mean squares algorithm. The only complication is that in order to compute the gradient we need to first back-propagate the sensitivities. The beauty of backpropagation is that it gives us a very efficient implementation of the chain rule.
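The full forward/backward cycle of (4.36)-(4.38) with the updates (4.29)-(4.30) can be sketched for a small 2-3-1 network; all numerical values are illustrative assumptions, and a single update is shown to reduce the squared error (4.15):

```python
import numpy as np

# One backpropagation update for a 2-3-1 network with a logsig hidden
# layer and a linear (purelin) output layer, following the sensitivity
# recursion (4.36)-(4.38) and the updates (4.29)-(4.30). All weight,
# bias and data values are illustrative assumptions.
def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
b1 = np.array([0.1, 0.1, 0.1])
W2 = np.array([[0.2, 0.3, 0.4]])
b2 = np.array([0.1])
p, t = np.array([1.0, -1.0]), np.array([1.0])
alpha = 0.1

# forward pass
n1 = W1 @ p + b1; a1 = logsig(n1)
n2 = W2 @ a1 + b2; a2 = n2                  # linear output layer
e_old = float((t - a2) @ (t - a2))          # squared error, (4.15)

# backward pass: output sensitivity (4.38), then recursion (4.36);
# the derivative of logsig is a(1 - a), and of purelin is 1
s2 = -2.0 * (t - a2)
s1 = (a1 * (1.0 - a1)) * (W2.T @ s2)

# steepest descent updates (4.29)-(4.30)
W2 = W2 - alpha * np.outer(s2, a1); b2 = b2 - alpha * s2
W1 = W1 - alpha * np.outer(s1, p);  b1 = b1 - alpha * s1
```

Re-running the forward pass with the updated weights gives a smaller squared error than `e_old`, as gradient descent with a small step should.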
In the end, this research utilized the Scaled Conjugate Gradient algorithm. Consequently, the backpropagation algorithm is required within the Scaled Conjugate Gradient algorithm in order to minimize the error with respect to the weight matrix W.
Chapter 5
Implementation of Artificial
Neural Network and
Multilevel of Discrete Wavelet
Transform for Voice
Recognition
This chapter presents an implementation of a simple Artificial Neural Network model with multilevel Discrete Wavelet Transform feature extraction, which achieves high recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, in conversation background noise of up to 65 dB. The performance evaluation is demonstrated in terms of correct recognition rate, maximum noise power of interfering sounds, hit rates, false alarm rates and miss rates. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
5.1 Introduction
During the past 65 years, voice recognition has been extensively implemented for the classification of sound types. A variety of voice recognition techniques have been developed to increase recognition accuracy and recognition rates, drawing on statistical pattern recognition and signal processing, as shown in [1].
A number of algorithms have been proposed and suggested as potential solutions for recognizing human speech, i.e., simple probability distribution fitting methods such as Structural Maximum A Posteriori, Parallel Model Composition and Maximum Likelihood Linear Regression. However, the issue of sequential voice input remained unsolved.
Ferguson et al. proposed the Hidden Markov Model (HMM) in order to solve the issue of sequential voice input. HMM employs a doubly stochastic process using an embedded stochastic function in order to determine the values of the hidden states, as shown in [15]. A high recognition rate design essentially required a state-of-the-art HMM architecture using the Gaussian Mixture Model (GMM), as shown in [15] and [16]. GMM has traditionally been utilized as the voice model for voice recognition using two feature extractions: a power logarithm of the FFT spectrum in order to create Log-power spectrum feature vectors, and Mel-Scale Filter Bank Inverse FFT Dimension Reduction in order to create Mel Frequency Cepstral Coefficient feature vectors. GMM offered high voice recognition rates from 60 to 95% in a static environment in comparison with other machine learning models such as the Support Vector Machine and the Dual Penalized Logistic Regression Machine, as shown in [15]. Nonetheless, GMM requires large amounts of computational resources.
The Pitch-Cluster-Maps (PCMs) model was proposed by Yoko et al. [17] in order to replace the complex training sets with a Binarized Frequency Spectrum, resulting in simple codebook sets using the Short-time Fourier Transform [18], [19] and [20]. The Vector Quantization Approach employed leads to more suitable real-time computation than GMM. Nonetheless, PCMs offered a voice recognition rate of only up to 60% for a 6 sound source environment under low frequency resolution.
This chapter aims to propose an alternative voice recognition method utilizing an Artificial Neural Network and multilevel Discrete Wavelet Transform, with three main advantages. First, the Discrete Wavelet Transform resolves the low frequency prediction issue in order to increase low frequency prediction accuracy. Second, the proposed voice recognition resolves the issue of normal conversation background noise. Last, the proposed voice recognition improves recognition rates up to 95% in comparison with other models.
5.2 Proposed Voice Recognition
The overview of the proposed voice recognition consists of feature extraction, feature normalization, machine learning using an ANN and a decision model, as summarized in figure 5.1.
[Figure: Input → Feature Extraction → Feature Normalization → Machine Learning → Decision/Selection → Output.]
Fig. 5.1 The proposed voice recognition overview.
5.2.1 Feature Extraction
The proposed voice recognition utilized the feature extraction as the pre-processing
methods in order to transform the voices or signals to the time-frequency represented
data. Three pre-processing methods were implemented for voices feature extraction
– 42 –
5.2 Proposed Voice Recognition
consisted of Short-time Fourier Transform (STFT) and Discrete Wavelet Transform
(DWT). In general case, Continuous Wavelet Transform (CWT) can be expressed as
(3.9) and (3.10) where ψa,b(x) is the conjugate of Wavelet function, a is input scales
which represented as frequency variable, b is input time variable, f(t) is the continuous
signal to be transformed and fψ(a, b) is the CWT of a complex function represented the
magnitude of the continuous signal over time and frequency based on specified Wavelet
function.
In particular, the DWT decomposes the signal into a mutually orthogonal set of wavelets, which is the main difference from the CWT, or from its implementation for discrete time series, as shown in the previous chapter. The DWT provides sufficient information in both time and frequency with a significant reduction in computation time compared to the CWT. The DWT can be constructed from the convolution of the signal with the impulse response of the filter expressed in (3.17) and (3.18), redefined as:
\phi(x) = \sum_{k=-\infty}^{\infty} w_k\, \phi(Sx - k) \qquad (5.1)
where ϕ(x) is the dilation reference equation mapping discrete signals from input to output states, S is a scaling factor set to 2, x is the time index, and wk consists of the scaling and wavelet functions obtained from each Mother Wavelet, known as the Quadrature Mirror Filter. The DWT can be represented as a binary hierarchical tree of Low Pass Filters (LPF) and High Pass Filters (HPF); in other words, it can be defined as a Filter Bank, as shown in figure 5.2. In Filter Bank analysis, the length of the discrete signal is halved at each level. The shifting and scaling processes of (3.9) and (5.1) produce a time-scale representation, as shown in figure 5.3. The graphs show the signal amplitude in both the time and frequency domains, using the STFT for the left-hand graph and the CWT for the right-hand graph; the vertical axis represents the frequency band and the horizontal axis the time. Comparing the STFT and the CWT, the Wavelet Transform offers superior time resolution at high frequency components and superior scale resolution at low frequency components, which usually carry the main characteristics or identity of a voice signal.
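As a concrete illustration of (5.1) and the Filter Bank of figure 5.2, the sketch below runs one decomposition level with the Haar (Daubechies-1) filter pair. The thesis itself uses higher-order Daubechies, Symlet and Coiflet filters, so the filter coefficients and the circular boundary handling here are simplifying assumptions.

```python
import math

def dwt_level(signal, lo, hi):
    """One level of a DWT filter bank: convolve with the low-pass (scaling)
    and high-pass (wavelet) filters, then downsample each output by 2."""
    def filt(h):
        n, m = len(signal), len(h)
        # circular convolution keeps the halved length exact (a common choice)
        full = [sum(h[k] * signal[(i - k) % n] for k in range(m)) for i in range(n)]
        return full[::2]  # downsample by 2: lengths halve at every level
    return filt(lo), filt(hi)

# Haar (db1) quadrature mirror filter pair as a stand-in
s = 1 / math.sqrt(2)
lo, hi = [s, s], [s, -s]

# a constant signal keeps all energy in the approximation band
approx, detail = dwt_level([1.0, 1.0, 1.0, 1.0], lo, hi)
```

Feeding `approx` back into `dwt_level` yields the next level of the binary tree, exactly as in the Filter Bank of figure 5.2.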
5.2.2 Feature Normalization
To increase the convergence speed of the machine learning algorithm, a Feature Normalization method is applied. Its simplest form is

\bar{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (5.2)

where \bar{x} is the normalized vector, x is the original vector obtained from the feature extraction, and \min(x) is its offset from zero. Feature Normalization rescales the original vector to the range between 0 and 1.
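A minimal sketch of the normalization in (5.2); the sample feature values are illustrative only.

```python
def minmax_normalize(x):
    """Feature Normalization (5.2): rescale a feature vector to [0, 1]
    by removing its offset min(x) and dividing by its range."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0] * len(x)  # degenerate case: constant feature
    return [(v - lo) / (hi - lo) for v in x]

norm = minmax_normalize([2.0, 4.0, 6.0, 10.0])
```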
5.2.3 Artificial Neural Network Model
An Artificial Neural Network (ANN) is an adaptive system that changes its structure based on external and internal information flowing through the network. ANNs are nonlinear statistical data modeling tools in which complex relationships between inputs and outputs are modeled or patterns are found. The proposed voice recognition therefore uses an ANN to recognize the characteristics or identity of human speech.
The novel network topology, named the nth-order All-features-connecting topology and denoted Hn, is illustrated in figure 5.4, where xf is the input vector in each frequency band, calculated by the feature extraction model, and y is the class probability vector calculated by the ANN. The Hn model combines networks of the A, B and C classes into a simple topology; these network classes are listed in table 5.1, The Proposed Neuron Network Architectures. The four main conditions of
[Figure: binary tree of HPF/LPF pairs, each output downsampled by 2 before the next level.]
Fig. 5.2 The Proposed DWT Filter Bank representation.
[Figure: two time-frequency plots over 0–0.4 s and 0–4 kHz.]
Fig. 5.3 The comparison of STFT (left) and CWT (right).
the novel network topology are defined as follows. First, the number of layers is defined by the order of Hn, where n > 0. Second, the number of input networks is related to the number of input time indices, i.e., the size of the scales vector for the CWT or the number of levels for the DWT. Third, each input network must connect its single output to the first block of the network series. Last, the input, middle and output networks are A, B and C-class networks, respectively.
Table 5.1 The Proposed Neuron Network Architectures
Class Name Control Input Control Output Transfer Function
A-class Single Single Log-sigmoid
B-class Multiple Single Log-sigmoid
C-class Single Single Softmax
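The two transfer functions named in table 5.1 can be sketched as follows; the vector arguments are illustrative only, and the node counts of the actual networks are omitted.

```python
import math

def log_sigmoid(v):
    """Log-sigmoid transfer function of the A- and B-class networks."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    """Softmax transfer function of the C-class output network,
    turning raw scores into class probabilities that sum to 1."""
    m = max(v)                        # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    return [x / sum(e) for x in e]

probs = softmax([2.0, 1.0, 0.1])      # hypothetical output-layer scores
acts = log_sigmoid([0.0, 2.0])        # hypothetical hidden-layer activations
```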
To train the specified ANN, the Scaled Conjugate Gradient Backpropagation supervised learning rule is employed. Additionally, the network uses an Autoassociator pre-learning rule to initialize the weights near the final solution, which accelerates the convergence of the error Backpropagation learning algorithm and reduces the dimensionality of the wavelet packet series.

[Figure: eight A-class input networks feeding B-class networks and a C-class output network across three layers.]
Fig. 5.4 All-features-connecting topology.
5.2.4 Decision Model
The output of the ANN is a vector of class probability values based on the feature set. The decision model selects the class with the maximum probability:

c = \arg\max_{i \in \aleph}\, y_i \qquad (5.3)

where c is the class with the maximum probability, \aleph is the set of class indices, and y_i is the ith element of y = (y_1, y_2, \ldots, y_n)^T calculated by the ANN.
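The argmax decision of (5.3) can be sketched directly; the probability vector below is a hypothetical three-class example.

```python
def decide(y):
    """Decision model (5.3): return the index of the class with the
    highest probability in the ANN output vector y."""
    return max(range(len(y)), key=lambda i: y[i])

y = [0.05, 0.70, 0.25]   # hypothetical 3-class probability vector
c = decide(y)
```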
5.3 Experiment Setup
The proposed voice recognition is implemented in MATLAB®. The recording setup uses an Audio-Technica® AT-VD3 microphone and a ROLAND® UA-101 Hi-speed USB audio capture device. The samples come from five Japanese speakers: two young male speakers, two young female speakers and one middle-aged male speaker. To perform word classification, each speaker pronounces the reference words from the International Phonetic Alphabet (IPA) [21] dataset described in table 5.2. Each voice input is recorded at an 8 kHz sampling frequency with 16-bit resolution and 8000 sample points. The feature set contains 450 elements, obtained from the reference words in the dataset repeated 5 times. For the performance evaluation, 20% of the set is used for testing and 80% for training.
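The 80%/20% split above can be sketched as follows; the shuffling, the fixed seed and the stand-in data are assumptions, since the thesis does not state how the split was drawn.

```python
import random

def split_dataset(features, test_fraction=0.2, seed=0):
    """Shuffle and split a feature set into training and test portions,
    mirroring the 80%/20% split of the experiment setup."""
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    n_test = int(len(features) * test_fraction)
    test = [features[i] for i in idx[:n_test]]
    train = [features[i] for i in idx[n_test:]]
    return train, test

data = list(range(450))   # stand-in for the 450 feature vectors
train, test = split_dataset(data)
```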
Table 5.2 Features Set for Experimentation

Class  Words   IPA
1      パン    pað
2      番      bað
3      先ず    mazu
4      太陽    taijo
5      段々    daðdað
6      通知    tsu:tsi
7      何      nani
8      蘭      óað
9      数字    su:si
10     雑      zatsuzi
11     山      jama
12     脈      mjaku
13     風      kaze
14     外套    gaito:
15     医学    igaku
16     善意    zeði
17     鼻      hana
18     わ      wa
5.4 Experimental Results
The performance evaluation was established in terms of the correct recognition rate, calculated from the sum of the true positive and true negative rates in each class. Moreover, the maximum noise power of interfering sounds is defined on a logarithmic scale as

P_{noise,dB} = 10 \log_{10}\!\left(\frac{P_{noise}}{P_{ref}}\right) \qquad (5.4)

where P_{noise,dB} is the noise power level in decibels (dB), P_{noise} is the noise power in watts and P_{ref} is the reference power in watts (W). The experiments assign P_{ref} = 10^{-12} W, the reference ambient noise level, in order to map voice signal conditions over a spatial regime.
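Equation (5.4) with P_ref = 1e-12 W can be sketched directly; the interfering-sound power passed in below is a hypothetical value.

```python
import math

P_REF = 1e-12   # reference power in watts, as assigned in the experiments

def noise_power_db(p_noise):
    """Noise power level in decibels relative to P_ref, as in (5.4)."""
    return 10.0 * math.log10(p_noise / P_REF)

level = noise_power_db(1e-5)   # hypothetical interfering-sound power in watts
```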
Three experiments were conducted on word classification in order to determine appropriate values of the Wavelet and ANN parameters. The first experiment examined the Wavelet function category and its order, using the set of static parameters shown in table 5.3. The candidates comprised three Wavelet function families, Daubechies, Symlet and Coiflet, with orders from 1 to 16. Table 5.4 shows that several Wavelet functions achieved word classification with correct recognition rates greater than 80% and tolerated noise powers of interfering sounds greater than 50 dB. The Wavelet function was selected on two criteria: the maximum noise power of interfering sound and the maximum correct recognition rate. On this basis, the Daubechies 15 Wavelet function was selected, with a maximum noise power of interfering sound of 65.5 dB and a correct recognition rate of 96.22%.
However, the cost functions of the proposed voice recognition were clearly influenced by the Wavelet function, the Wavelet level and the ANN network topology. Hence, the second experiment was designed to optimize the Wavelet level and the ANN network topology, using the set of static parameters shown in table 5.5. Table 5.6 shows
Table 5.3 The First Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Discrete Wavelet Transform (DWT)
Wavelet Level 6
Wavelet Function variable parameter
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 5.4 The First Experimental Results
Order   Daubechies (db)          Symlet (sym)             Coiflet (coif)
        Pnoise,dB   Rate (%)     Pnoise,dB   Rate (%)     Pnoise,dB   Rate (%)
1 24.50 90.44 none none 64.50 94.89
2 60.38 93.11 0.00 88.44 62.50 93.33
3 63.63 94.00 63.75 94.22 63.50 96.00
4 64.25 92.44 55.50 94.00 42.50 92.00
5 55.13 94.67 63.25 94.67 none none
6 67.75 94.89 65.25 96.00 none none
7 56.00 93.78 61.00 94.67 none none
8 57.50 95.56 35.50 91.11 none none
9 34.50 92.44 67.50 94.22 none none
10 36.25 93.78 64.25 94.44 none none
11 59.00 94.00 0.00 84.00 none none
12 61.50 95.56 61.25 95.11 none none
13 67.50 95.11 67.75 95.78 none none
14 58.00 95.33 65.25 94.67 none none
15 65.50 96.22 27.75 91.33 none none
16 61.50 96.88 65.75 94.44 none none
Table 5.5 The Second Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Discrete Wavelet Transform (DWT)
Wavelet Level variable parameter
Wavelet Function Symlet 7 (sym7)
Network Topology variable parameter
Node Size in Each Layer variable parameter
Table 5.6 The Second Experimental Results
Wavelet Level Network Topology Node Size in Each Layer Pnoise,dB Recognition Rate (%)
1 H1 {1000, 18} 0.00 90.89
2 H1 {1000, 18} 33.00 93.56
1 H2 {1000, 1000, 18} 0.00 90.44
2 H2 {1000, 1000, 18} 29.75 93.33
3 H2 {1000, 1000, 18} 39.75 94.00
4 H2 {1000, 1000, 18} 48.50 94.67
1 H3 {1000, 4000, 1000, 18} 0.00 90.22
2 H3 {1000, 4000, 1000, 18} 28.25 92.22
3 H3 {1000, 4000, 1000, 18} 38.25 94.89
4 H3 {1000, 4000, 1000, 18} 52.25 94.44
5 H3 {1000, 4000, 1000, 18} 54.00 94.00
6 H3 {1000, 4000, 1000, 18} 61.00 94.67
7 H3 {1000, 4000, 1000, 18} 60.00 95.33
8 H3 {1000, 4000, 1000, 18} 62.25 95.78
Table 5.7 The Third Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method variable parameter
Wavelet Level 6
Wavelet Function Symlet 7 (sym7)
STFT Windows Hamming
STFT Time Slot 1 millisecond
STFT Frequency Separation 8
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 5.8 The Third Experimental Results
Feature Extraction Method Pnoise,dB Recognition Rate (%)
Discrete Wavelet Transform (DWT) 61.00 94.67
Short-time Fourier Transform (STFT) 0.00 88.67
that the H3 model with Wavelet levels 4 to 8 achieved word classification with correct recognition rates greater than 94% and noise powers of interfering sounds approaching or exceeding 60 dB at the higher levels. The H3 model with Wavelet level 6 was selected on two criteria: minimizing computation while covering the region of interest in human speech, from 130 Hz to 4 kHz. This configuration gives a correct recognition rate of 94.67% and a noise power of interfering sound of 61 dB.
Finally, the last experiment was designed to verify the hypothesis that Wavelet Transform feature extraction suits the voice recognition application better than the STFT, as shown in table 5.7 and table 5.8. The correct recognition rate and tolerated noise power of interfering sounds with the DWT clearly exceed those of the STFT, because the DWT employs multiresolution analysis and therefore captures the main characteristics or identity of the voice at the low frequency boundary, which depends on the Wavelet function and the length of the input signal.
5.5 Discussions
The optimized parameters for both word and gender classification are summarized in table 5.9. With the optimized parameters, the proposed voice recognition achieved a correct recognition rate of 96.22% and a tolerated noise power of 65.5 dB, which is sufficient for word classification. Moreover, for gender classification it achieved a correct recognition rate of 99.8% and a tolerated noise power of 72.25 dB, which is acceptable.
Table 5.9 The parameter optimization results
Parameter Name               Word Classification      Gender Classification
Feature Extraction Method DWT DWT
Wavelet Level 6 6
Wavelet Function db15 db15
Network Topology H3 H3
Node Size in Each Layer {1000, 4000, 1000, 18} {1000, 4000, 1000, 2}
Pnoise,dB 65.50 72.25
Recognition Rate (%) 96.22 99.80
The performance of the proposed voice recognition was established in terms of the boundaries of the hit rate, false alarm rate and miss rate for gender classification, in order to compare it with other models, i.e., the simple sound database named Pitch-Cluster-Maps (PCMs). The performance of the PCMs model was established in terms of Detection Error Tradeoff (DET) curves for gender classification; in other words, it can be described by upper and lower boundaries of both the false alarm rate and the miss rate. The best hit-rate performance requires the predicted data to approach a true positive rate of 100%. In contrast, the best false alarm and miss rate performance requires the false positive rate and the false negative rate to approach 0%. Therefore, the lower boundary of the hit rate and the upper boundaries of the false alarm rate and the miss rate are the important quantities for the performance evaluation. The proposed voice recognition performance is shown in table 5.10.
Table 5.10 Performance evaluation
Gender   Hit Rate (%), Lower Boundary   False Alarm Rate (%), Upper Boundary   Miss Rate (%), Upper Boundary
Male 99.63 2.78 0.37
Female 97.22 0.37 2.78
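The three rates in table 5.10 follow from confusion-matrix counts; the counts below are hypothetical, chosen only to show the arithmetic.

```python
def rates(tp, fn, fp, tn):
    """Hit, false alarm and miss rates from confusion-matrix counts:
    hit = TP/(TP+FN), false alarm = FP/(FP+TN), miss = FN/(TP+FN)."""
    hit = tp / (tp + fn)
    false_alarm = fp / (fp + tn)
    miss = fn / (tp + fn)
    return hit, false_alarm, miss

# hypothetical counts for one gender class
hit, fa, miss = rates(tp=95, fn=5, fp=2, tn=98)
```

Note that the hit rate and miss rate are complementary by construction, which is why the table reports a lower boundary for one and an upper boundary for the other.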
5.6 Conclusions
This chapter presented an alternative voice recognition method that combines an Artificial Neural Network with a multilevel Discrete Wavelet Transform. The experimental results showed that Wavelet Transform feature extraction achieves recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 65 dB, as found in normal conversation background noise. The performance evaluation was demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and the hit, false alarm and miss rates. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
Chapter 6
Reinforced Voice Recognition
using Distributed Artificial
Neural Network with
Time-Scale Wavelet Transform
Feature Extraction
This chapter presents a new voice recognition method, named the Signal Clustering Neural Network, based on a Distributed Artificial Neural Network model with a single-channel microphone and an enhanced Wavelet Transform feature extraction. It achieves recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 70 dB, as found in normal conversation background noise. The performance evaluation is demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
6.1 Introduction
Voice recognition is extensively used for the classification of sound types. Over the past 65 years, voice recognition techniques have been developed to improve robustness, statistical pattern recognition, signal processing and recognition rates, as shown in [1]. A number of algorithms have repeatedly been proposed as potential solutions for recognizing human voice patterns, i.e., simple probability-distribution fitting methods such as Maximum Likelihood Linear Regression (MLLR), Structural Maximum A Posteriori (SMAP), Parallel Model Composition (PMC), the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM).
In 2009, Yoko et al. proposed the Pitch-Cluster-Maps (PCMs) model based on a binarized Short-time Fourier Transform. In particular, PCMs employed a simple Vector Quantization approach, making them better suited to real-time computation than GMM. Nonetheless, PCMs offered a voice recognition rate of only 50 to 60% in a six-sound-source environment with low frequency resolution.
In the previous chapter, an alternative voice recognition method using an Artificial Neural Network and a multilevel Discrete Wavelet Transform was proposed. On the one hand, that method improved recognition rates to as high as 95% within the human speech frequency range: the Discrete Wavelet Transform resolved the low frequency prediction issue, and robustness to normal conversation background noise was achieved. On the other hand, the length of the discrete signal is halved at each level of the Discrete Wavelet Transform Filter Bank, leading to unbalanced recognition across the frequency band approximations, and the previous method requires a large amount of computational resources because its network topology uses an additional learning rule, the Autoassociator learning rule.
This chapter proposes a new voice recognition method, named the Sound Clustering Neural Network (SCNN), which resolves two main issues. SCNN is implemented with a Distributed Artificial Neural Network, which enables intelligent management of computational resources. SCNN resolves the unbalanced frequency prediction issue by using the Time-scale Wavelet Transform, and it handles normal conversation background noise as well as the previously proposed method. Last, SCNN improves recognition rates to as high as 95% compared with the previously proposed method and other models.
6.2 Proposed Voice Recognition
The proposed voice recognition consists of a new feature extraction, feature normalization, machine learning with a Distributed Artificial Neural Network, and a decision model, as already summarized in figure 5.1 of the previous chapter. The new feature extraction and the Distributed Artificial Neural Network are explained in the following sections.
6.2.1 New Feature Extraction
The proposed voice recognition uses feature extraction as a pre-processing step that transforms the voice signal into a time-frequency representation. Three pre-processing methods were implemented for voice feature extraction: the Short-time Fourier Transform (STFT), the Discrete Wavelet Transform (DWT) and the Time-scaled Discrete Wavelet Transform (TSDWT).

The DWT can be constructed by convolving the signal with the impulse response of a filter, as expressed in (5.1), where ϕ(x) is the dilation reference equation mapping discrete signals from input to output states, S is a scaling factor set to 2, x is the time index, and wk consists of the scaling and wavelet functions obtained from each Mother Wavelet, known as the Quadrature Mirror Filter. Moreover, the DWT can be defined as a Filter Bank, as illustrated in figure 5.2.
In Filter Bank analysis, the length of the discrete signal is halved at each level. In order to preserve the information identity of the discrete signal, a time scale modification is proposed, named the Time-scaled Discrete Wavelet Transform (TSDWT), expressed as

\tilde{\phi}_{out}(x) = \begin{cases} \phi_{out}(x/2) & \text{if } x \text{ is even,} \\ \phi_{out}\!\left((x+1)/2\right) & \text{otherwise,} \end{cases} \qquad (6.1)

where \phi_{out}(x) is the dilation equation after evaluation of the discrete signal in (5.1) and \tilde{\phi}_{out}(x) is the time-scaled discrete signal. Compared with the STFT, the TSDWT offers superior temporal resolution of the low and high frequency components, as does the DWT. Likewise, the TSDWT can be defined as a Filter Bank, as shown in figure 6.1. The proposed model preserves the low frequency components, which usually carry the main characteristics or identity of a voice signal, as shown in figure 6.2. The graphs show the Wavelet coefficients as signal amplitudes in both time and frequency, using the CWT for the left-hand graph and the TSDWT for the right-hand graph; the CWT scales and the TSDWT levels both represent the frequency band.
6.2.2 Distributed Artificial Neural Network Model
The proposed voice recognition uses an ANN with multiple hidden layers between the input and output layers, built from the three previous network classes and two new ones, as shown in table 6.1.
The new network topology, named the nth-order Double-features-connecting topology and denoted Pn, is illustrated in figure 6.3, where xf is the input vector in each frequency band, calculated by the feature extraction model, and y is the class probability vector calculated by the ANN. On the one hand, the Hn model
[Figure: DWT filter bank with a multi-order time-scale stage at each level output.]
Fig. 6.1 The Proposed TSDWT Filter Bank representation.
[Figure: two time-scale plots over 0–0.4 s; CWT scales 1–120 on the left, TSDWT levels 1–7 on the right.]
Fig. 6.2 The comparison of CWT (left) and TSDWT (right).
Table 6.1 The Proposed Neuron Network Architectures
Class Name Control Input Control Output Transfer Function
A-class Single Single Log-sigmoid
B-class Multiple Single Log-sigmoid
C-class Single Single Softmax
D-class Double Single Log-sigmoid
E-class Double Single Softmax
utilizes A, B and C-class networks to construct a simple network topology from four conditions. First, the number of layers is defined by the order of Hn, where n > 0. Second, the number of input networks is related to the number of input time indices, i.e., the size of the scales vector for the CWT or the number of levels for the TSDWT. Third, each input network must connect its single output to the first block of the network series. Last, the input, middle and output networks are A, B and C-class networks, respectively, as shown in figure 5.4. On the other hand, the Pn model utilizes
A, D and E-class networks to construct a binary hierarchical network topology, which enables parallel computing and minimizes computation, from four conditions. First, the number of layers is defined by the order of Pn, where n > 0; moreover, the number of input networks is 2^n. Second, the number of input networks is related to the number of input time indices. Third, the layer networks must be connected in a binary tree structure. Last, the input, middle and output networks are defined as A, D and E-class networks, respectively.
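The four Pn conditions above can be sketched as a per-layer layout; the 2^n-leaf binary arrangement is an assumption read from figure 6.3, and per-subnetwork node counts are omitted.

```python
def pn_structure(n):
    """Sketch of the Pn binary-tree layout: 2**n A-class input networks
    feed a binary tree of D-class networks, with a single E-class softmax
    network at the root (an assumption based on figure 6.3)."""
    if n <= 0:
        raise ValueError("Pn requires n > 0")
    layers = [["A"] * (2 ** n)]            # input layer of A-class networks
    for k in range(n - 1, 0, -1):
        layers.append(["D"] * (2 ** k))    # middle layers halve at each step
    layers.append(["E"])                   # single E-class output network
    return layers

topology = pn_structure(3)                 # the P3 model used in the experiments
```

Because each D-class network only merges its two children, the subtrees can be evaluated independently, which is the source of the parallel-computing ability claimed above.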
[Figure: binary tree of eight A-class input networks feeding D-class networks across three layers, with a single E-class output network at the root.]
Fig. 6.3 Double-features-connecting topology.
6.3 Experiment Setup
The proposed voice recognition uses the same development tools as the previous experiments. Likewise, the samples come from five Japanese speakers: two young male speakers, two young female speakers and one middle-aged male speaker. Each speaker pronounces the reference words from the International Phonetic Alphabet (IPA) dataset described in table 5.2. Each voice input is recorded at an 8 kHz sampling frequency with 16-bit resolution and 8000 sample points. The feature set contains 450 elements, obtained from the reference words in the dataset repeated 5 times. For the performance evaluation, 20% of the set is used for testing and 80% for training.
6.4 Experimental Results
The performance evaluation was established in terms of the correct recognition rate, calculated from the sum of the true positive and true negative rates in each class. Moreover, the maximum noise power of interfering sounds was measured on the logarithmic scale defined in (5.4).
Table 6.2 The First Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Time-scale Discrete Wavelet Transform (TSDWT)
Wavelet Level variable parameter
Wavelet Function Daubechies 16 (db16)
Network Topology variable parameter
Node Size in Each Layer variable parameter
Two word classification experiments were conducted in order to determine appropriate values of the proposed voice recognition parameters. The cost functions of the proposed voice recognition were clearly influenced by the Wavelet function, the Wavelet level and the ANN network topology. Hence, the first experiment was designed
Table 6.3 The First Experimental Results
Wavelet Level Network Topology Node Size in Each Layer Pnoise,dB Recognition Rate (%)
1 H1 {1000, 18} 0.00 90.89
2 H1 {1000, 18} 33.00 93.56
1 H2 {1000, 1000, 18} 0.00 90.44
2 H2 {1000, 1000, 18} 29.75 93.33
3 H2 {1000, 1000, 18} 39.75 94.00
4 H2 {1000, 1000, 18} 48.50 94.67
1 H3 {1000, 4000, 1000, 18} 0.00 90.22
2 H3 {1000, 4000, 1000, 18} 28.25 92.22
3 H3 {1000, 4000, 1000, 18} 38.25 94.89
4 H3 {1000, 4000, 1000, 18} 52.25 94.44
5 H3 {1000, 4000, 1000, 18} 54.00 94.00
6 H3 {1000, 4000, 1000, 18} 61.00 94.67
7 H3 {1000, 4000, 1000, 18} 60.00 95.33
8 H3 {1000, 4000, 1000, 18} 62.25 95.78
2 P1 {1000, 18} 36.75 93.78
4 P2 {1000, 500, 18} 52.75 94.67
8 P3 {1000, 1000, 500, 18} 65.50 94.00
to optimize the Wavelet level and the ANN network topology, using the set of static parameters shown in table 6.2. The candidates comprised two network topologies, the nth-order All-features and Double-features connecting topologies, with Wavelet levels from 1 to 8. Table 6.3 shows that the H3 model with Wavelet levels 4 to 8 and the P3 model achieved word classification with correct recognition rates greater than 94% and noise powers of interfering sounds greater than 50 dB. Hence, the H3 model with Wavelet level 6 and the P3 model were selected on three criteria: minimizing computation, enabling parallel computing, and covering the
region of interest in the male and female human speech frequencies, from 130 Hz to 3.5 kHz and from 250 Hz to 4 kHz, respectively. The H3 model with Wavelet level 6 gives the maximum values, with a correct recognition rate of 95.56% and a noise power of interfering sound of 67.5 dB.
The second experiment was designed to verify that Wavelet Transform feature extraction suits the voice recognition application better than the STFT, as shown in table 6.4 and table 6.5. The correct recognition rates and tolerated noise powers of interfering sounds of both the DWT and the TSDWT clearly exceed those of the STFT, as hypothesized. The TSDWT and the DWT employ multiresolution analysis, which offers higher accuracy in the low frequency band, depending on the Wavelet function and the length of the input signal. In addition, the TSDWT normalizes the priority of the frequency ranges by applying the time scale modification (6.1) to the original DWT (5.1).
Table 6.4 The Second Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method variable parameter
Wavelet Level 6
Wavelet Function Symlet 7 (sym7)
STFT Windows Hamming
STFT Time Slot 1 millisecond
STFT Frequency Separation 8
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 6.5 The Second Experimental Results
Feature Extraction Method Pnoise,dB Recognition Rate (%)
Time-scale Discrete Wavelet Transform (TSDWT) 68.60 96.22
Discrete Wavelet Transform (DWT) 61.00 94.67
Short-time Fourier Transform (STFT) 0.00 88.67
6.5 Discussions
The optimized parameters for the two classification subjects are summarized in table 6.6. With the optimized parameters, the proposed voice recognition achieved a correct recognition rate of 95.56% and a tolerated noise power of 71.25 dB, which is sufficient for word classification. For gender classification, it achieved a correct recognition rate of 99.33% and a tolerated noise power of 62.5 dB.
In comparison with other models, i.e., the simple sound database named Pitch-Cluster-Maps (PCMs) [17] based on a Vector Quantization approach, the performance of the models was established in terms of Receiver Operating Characteristic (ROC) and Detection Error Tradeoff (DET) curves for gender classification. The best ROC performance requires the predicted data to approach the point (0, 1) on the false positive rate and true positive rate axes. The best DET performance requires the predicted data to approach the point (0, 0) on the false alarm probability and miss probability axes. The ROC graph of the proposed voice recognition shows a performance sufficient for gender classification, as shown in figure 6.4 (a). A comparison of the DET curves of the proposed voice recognition, named the Signal Clustering Neural Network (SCNN), with those of PCMs in figure 6.4 (b) shows that the accuracy of voice recognition is improved by SCNN.
On the one hand, PCMs offered a miss probability range from 2 to 20% and a false alarm probability range from 1 to 20% for female classification, which is appropriate for identifying the female speaker from several word utterances. Likewise, for male classification, PCMs offered a miss probability range from 2 to 12% and a false alarm probability range from 1 to 10%. On the other hand, SCNN offered a miss probability range from 0.5 to 15% and a false alarm probability range from 0.4 to 1% for female classification; the reduced false alarm rate leads to high accuracy in non-female classification. For the male subjects, SCNN offered a miss probability range from 0.3 to 1% and a false alarm probability range from 0.5 to 50%; the reduced miss rate leads to high accuracy in male classification. In contrast, the accuracy of non-male classification decreased because speech phase shifts occurred, and greater variation in the feature sets is required to train the proposed voice recognition effectively.
Table 6.6 The parameter optimization results.
Parameter Name               Word Classification        Gender Classification
Feature Extraction Method    TSDWT                      TSDWT
Wavelet Level                6                          6
Wavelet Function             db15                       db15
Network Topology             H3                         H3
Node Size in Each Layer      {1000, 4000, 1000, 18}     {1000, 4000, 1000, 2}
Pnoise,dB                    71.25                      62.50
Recognition Rate (%)         95.56                      99.33
[Figure: (a) ROC curves (hit rate vs. false alarm rate) and (b) DET curves (miss probability vs. false alarm probability) for the male and female subjects.]
(a) Receiver Operating Characteristic. (b) Detection Error Tradeoff curves.
Fig. 6.4 Performance Evaluation.
6.6 Conclusions
This chapter presented a new voice recognition method, named the Signal Clustering Neural Network, based on a simple Artificial Neural Network model with a single-channel microphone and Wavelet Transform feature extraction. The experimental results showed that the Wavelet Transform achieves high recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 70 dB, as found in normal conversation background noise. The performance evaluation was demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
Chapter 7
Conclusions
This dissertation consisted of an introduction to human speech and speaker recognition, Wavelet theory, Neural Network theory, a paper on the implementation of an artificial neural network with a multilevel discrete wavelet transform for voice recognition, and a paper on reinforced voice recognition using a distributed artificial neural network with time-scale wavelet transform feature extraction, presented in chapters 2, 3, 4, 5 and 6, respectively.
First, chapter 2 presented a brief history of research in automatic speech and speaker recognition over the past 65 years, in order to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. Many techniques have been developed that are sufficient to exhibit robust recognition. However, many challenges have yet to be overcome before we can achieve the ultimate goal of creating machines that communicate naturally with humans; satisfactory performance under a broad range of operating conditions is required to reach that goal. This research focuses on improving feature extraction by using the Wavelet transform instead of the Fourier transform, for the reasons described in chapter 3. Speaker adaptation and speech understanding are improved by the well-known deep learning model, the Artificial Neural Network, described in chapter 4. The behavior of the proposed system can be characterized by the new spectrum of the input signal. This research aims to develop the ability to classify command input signals, improve the accuracy of voice command recognition and support the voice recognition of economical robots, as shown in chapters 5 and 6.
In chapter 3, the spectrogram, a visual representation of the spectrum of frequencies in a sound or other signal, was described, and the principal tools for analyzing the frequency components of a signal, the Fourier transform and the Short-time Fourier transform, were discussed. The Short-time Fourier transform uses a sliding window to calculate the spectrogram, which gives information in both time and frequency. Nonetheless, the length of the window limits the resolution in frequency, leading to uncertain information in each frequency band, as expressed by Heisenberg's uncertainty principle. To address this resolution limit, the Wavelet transform offers a solution. Wavelet transforms are based on small wavelets of limited duration: the translated wavelets locate the region of interest, whereas the scaled wavelets allow us to analyze the signal at different scales within the bounds of Heisenberg's uncertainty principle.
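The multilevel decomposition described above can be sketched in a few lines. A minimal illustration using the Haar wavelet for brevity (the thesis itself uses db15 at level 6), showing how each level halves the number of coefficients in time while narrowing the frequency band:

```python
def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: pairwise
    low-pass (approximation) and high-pass (detail) coefficients."""
    approx = [(signal[i] + signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def multilevel_dwt(signal, levels):
    """Multilevel decomposition: re-apply the transform to the
    approximation, halving the time resolution at each level."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details

# 16-sample toy signal, 3 decomposition levels
sig = [1.0, 2.0, 3.0, 4.0] * 4
a, ds = multilevel_dwt(sig, 3)
print([len(d) for d in ds], len(a))  # coefficient counts: [8, 4, 2] 2
```

Because the Haar transform is orthonormal, the signal energy is preserved across the levels; the detail coefficients at each level capture one octave of the frequency axis, which is the multiresolution behavior the chapter exploits for feature extraction.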
In chapter 4, an introduction to the fundamentals of neural networks, learning rules, network architectures, the mathematical analysis of these networks and their application to practical engineering problems was presented, motivated by the recognition that the brain computes in an entirely different way from the conventional digital computer. These models offer a potential alternative solution for problems such as nonlinear regression, pattern recognition, signal processing, data mining, control systems and real-world problems, including human voice recognition.
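As a minimal illustration of the kind of feedforward model that chapter discusses (a toy sketch with hypothetical layer sizes, not the SCNN architecture itself):

```python
import math
import random

def forward(x, layers):
    """Forward pass through fully connected layers with sigmoid
    activations; each layer is a (weights, biases) pair."""
    for weights, biases in layers:
        x = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

def make_layer(n_in, n_out, rng):
    """Small random initial weights, as used before gradient training."""
    return ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

rng = random.Random(0)
# toy topology: 4 inputs -> 8 hidden -> 3 outputs
net = [make_layer(4, 8, rng), make_layer(8, 3, rng)]
out = forward([0.1, 0.5, -0.2, 0.9], net)
print(len(out))  # 3 sigmoid outputs, each in (0, 1)
```

Training such a network adjusts the weights by a gradient rule (e.g. the scaled conjugate gradient method used in this work); the sketch above shows only the inference step.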
Chapter 5 presented an alternative voice recognition method using a combination of an Artificial Neural Network and a multilevel Discrete Wavelet Transform. The experimental results showed that Wavelet Transform feature extraction achieved high recognition rates, up to 95%, where Short-time Fourier Transform feature extraction did not, at noise levels up to 65 dB, comparable to normal conversation background noise. The performance evaluation was demonstrated in terms of correct recognition rate, maximum noise power of interfering sounds, hit rate, false alarm rate and miss rate. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
Finally, chapter 6 presented a new voice recognition method, named Signal Clustering Neural Network, built from a simple Artificial Neural Network model with a single-channel microphone and Wavelet Transform feature extraction. The experimental results showed that Wavelet Transform feature extraction achieved high recognition rates, up to 95%, where Short-time Fourier Transform feature extraction did not, at noise levels up to 70 dB, comparable to normal conversation background noise. The performance evaluation was demonstrated in terms of correct recognition rates, maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
Acknowledgement
Firstly, the author is most grateful to his advisor, Prof. Dr. Masahiro Fukumoto, for his valuable supervision, support and encouragement throughout the study, and for mentoring me over the course of my graduate studies. His insight led to the original proposal to examine the possibility of the Wavelet transform for feature extraction in voice recognition, and ultimately led to publication in the honorable Springer book series "Computer and Information Science", "Studies in Computational Intelligence 656", in 2016. He has helped me through extremely difficult times over the course of the analysis and the revising of the dissertation, and for that I sincerely thank him for his confidence in me. I would additionally like to thank Assoc. Prof. Shinichi Yoshida for his support in both the research and, especially, life in Japan. His knowledge and understanding of the machine learning field has allowed me to fully express the concepts behind this research. Grateful acknowledgements are also made to Prof. Dr. Toru Kurihara and the members of the dissertation committee for their valuable suggestions and comments.
The author wishes to acknowledge Prof. Lawrie Hunter and Prof. Paul Daniels for their valuable guidance in research writing, and Ms. Sonoko Fukudome and Ms. Kubo Mariko, members of the International Relations Center, for their administrative support and Japanese language instruction. The author also wishes to acknowledge Kochi University of Technology, the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and the Japan Student Services Organization (JASSO) for the great opportunity of financial support.
This research would not have been possible without the assistance of the laboratory members who constructed the experimental apparatus and built the foundations for the data analysis. Sincere appreciation is also extended to all Japanese friends in Kochi and to colleagues in the Signal Processing & New Generation Network Laboratory for their useful technical experience sharing and kind technical assistance. The author sincerely appreciates all of his Thai friends in Kochi for their friendship and goodwill.
Finally, I would like to thank my best friend, who is just like a brother; he provided me with the tools that I needed to choose the right direction and successfully complete my dissertation and research papers. Thank you to Mr. Chiramathe Nami, who shared this journey with me; without him, I could never have completed so many research papers. Also, I would like to thank Thomas J. Bergersen and Nick Phoenix of "Two Steps from Hell" for their immeasurably brave music during these two years of study at university. Thank you for making me braver when I have fallen in the dark. I would like to extend my deepest gratitude to my mother and my cute little brother, without whose love, support and understanding I could never have completed this master's degree. Thank you, Mom and brother, always and forever.