2016
Master’s thesis
Voice Recognition using
Distributed Artificial Neural
Network for Multiresolution
Wavelet Transform Decomposition
1187001 Bandhit Suksiri
Advisor Prof. Masahiro Fukumoto
August 2016
Course of Information Systems Engineering
Graduate School of Engineering, Kochi University of Technology
Abstract
Voice Recognition using Distributed Artificial Neural
Network for Multiresolution Wavelet Transform
Decomposition
Bandhit Suksiri
This paper presents a new voice recognition method, named the Signal Clustering
Neural Network, which combines a simple Artificial Neural Network (ANN) model with a
single-channel microphone and Wavelet Transform feature extraction. It achieves high
recognition rates of up to 95 per cent, compared with Short-time Fourier Transform
feature extraction, under background noise of up to 70 dB, as in normal conversation.
The performance evaluation is presented in terms of correct recognition rate, maximum
noise power of interfering sounds, and Receiver Operating Characteristic and Detection
Error Tradeoff curves. The proposed method offers a potential alternative for intelligent
voice recognition systems in computational linguistics and speech-controlled robot
applications.
Key words: Discrete Wavelet Transform, Voice Recognition, Feature Extractions,
Artificial Neural Network, Signal Clustering Neural Network.
Contents
Chapter 1 Preface 1
Chapter 2 Introduction of Human Speech and Speaker Recognition 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Speaker recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Summary of the Technology Progress . . . . . . . . . . . . . . . . . . . . 5
Chapter 3 Wavelet Theory 9
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Literature Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Frequency Analysis using Fourier Transform . . . . . . . . . . . . 10
3.2.2 Time-Frequency Analysis using Fourier Transform . . . . . . . . 11
3.3 Fundamental of Wavelet Transform . . . . . . . . . . . . . . . . . . . . . 13
3.3.1 Continuous Wavelet Transform . . . . . . . . . . . . . . . . . . . 13
3.3.2 Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . 14
3.4 Heisenberg's Uncertainty Principles . . . . . . . . . . . . . . . . . . . . 16
Chapter 4 Neural Network Theory 19
4.1 Biological Inspiration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Neuron Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 General Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Transfer Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 A Layer of Neurons . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.2 Multiple Layers of Neurons . . . . . . . . . . . . . . . . . . . . . 29
4.3.3 Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Performance Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Optimization Method: Backpropagation . . . . . . . . . . . . . . . . . . 33
4.5.1 Performance Index . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.2 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.3 Sensitivities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 5 Implementation of Artificial Neural Network and Multilevel of Discrete Wavelet Transform for Voice Recognition 40
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Proposed Voice Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.2 Feature Normalization . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.3 Artificial Neural Network Model . . . . . . . . . . . . . . . . . . 44
5.2.4 Decision Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 6 Reinforced Voice Recognition using Distributed Artificial Neural Network with Time-Scale Wavelet Transform Feature Extraction 56
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Proposed Voice Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.1 New Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.2 Distributed Artificial Neural Network Model . . . . . . . . . . . 59
6.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 7 Conclusions 70
Acknowledgement 73
References 75
List of Figures
3.1 Schematic representation of the discrete wavelet transform. . . . . . . . 17
3.2 Heisenberg boxes of wavelets (left) and STFT (right). . . . . . . . . . . 18
4.1 Schematic Drawing of Biological Neurons. . . . . . . . . . . . . . . . . . 21
4.2 Single Input Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Multiple Input Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Abbreviated Notation of Neuron. . . . . . . . . . . . . . . . . . . . . . . 25
4.5 Log-Sigmoid Transfer Function. . . . . . . . . . . . . . . . . . . . . . . . 26
4.6 Soft-Max Transfer Function. . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.7 Single Layer of Neural Networks. . . . . . . . . . . . . . . . . . . . . . . 29
4.8 Three Layer of Neural Networks. . . . . . . . . . . . . . . . . . . . . . . 30
5.1 The proposed voice recognition overview. . . . . . . . . . . . . . . . . . 42
5.2 The Proposed DWT Filter Bank representation. . . . . . . . . . . . . . 45
5.3 The comparison of STFT (left) and CWT (right). . . . . . . . . . . . . 46
5.4 All-features-connecting topology. . . . . . . . . . . . . . . . . . . . . . . 47
6.1 The Proposed TSDWT Filter Bank representation. . . . . . . . . . . . . 60
6.2 The comparison CWT (left) and TSDWT (right). . . . . . . . . . . . . . 61
6.3 Double-features-connecting topology. . . . . . . . . . . . . . . . . . . . . 62
6.4 Performance Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Tables
2.1 State-of-the-art Level of ASR techniques . . . . . . . . . . . . . . . . . . 7
4.1 List of Transfer Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 The Proposed Neuron Network Architectures . . . . . . . . . . . . . . . 46
5.2 Features Set for Experimentation . . . . . . . . . . . . . . . . . . . . . . 48
5.3 The First Experimental Configuration . . . . . . . . . . . . . . . . . . . 50
5.4 The First Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 50
5.5 The Second Experimental Configuration . . . . . . . . . . . . . . . . . . 51
5.6 The Second Experimental Results . . . . . . . . . . . . . . . . . . . . . . 51
5.7 The Third Experimental Configuration . . . . . . . . . . . . . . . . . . . 52
5.8 The Third Experimental Results . . . . . . . . . . . . . . . . . . . . . . 52
5.9 The parameter optimization results . . . . . . . . . . . . . . . . . . . . . 53
5.10 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 The Proposed Neuron Network Architectures . . . . . . . . . . . . . . . 61
6.2 The First Experimental Configuration . . . . . . . . . . . . . . . . . . . 63
6.3 The First Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 The Second Experimental Configuration . . . . . . . . . . . . . . . . . . 65
6.5 The Second Experimental Results . . . . . . . . . . . . . . . . . . . . . . 66
6.6 The parameter optimization results. . . . . . . . . . . . . . . . . . . . . 67
Chapter 1
Preface
This dissertation presents an implementation of a simple Artificial Neural Network
model with multilevel Discrete Wavelet Transform feature extraction, which achieves
high recognition rates of up to 95%, compared with Short-time Fourier Transform
feature extraction, under conversational background noise of up to 65 dB. The
performance evaluation is presented in terms of correct recognition rate, maximum
noise power of interfering sounds, hit rate, false alarm rate and miss rate. Furthermore,
this research presents a new voice recognition method, named the Signal Clustering
Neural Network, which combines a simple Artificial Neural Network model with a
single-channel microphone and Wavelet Transform feature extraction; it achieves high
recognition rates of up to 95%, compared with Short-time Fourier Transform feature
extraction and the previously proposed feature extraction, under background noise of
up to 70 dB, as in normal conversation. This performance evaluation is presented in
terms of correct recognition rate, maximum noise power of interfering sounds, and
Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed
methods offer a potential alternative for intelligent voice recognition systems in
computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications.
This dissertation is organized as follows. First, a brief history of research in
automatic speech and speaker recognition over the past 65 years is presented in order
to provide a technological perspective and an appreciation of the fundamental progress
that has been accomplished in this important area of speech communication, as
described in chapter 2 on the following page. Many techniques have been developed
and are sufficient for robust recognition; however, many challenges have yet to be
overcome before the ultimate goal of creating machines that can communicate
naturally with humans is achieved, and satisfactory performance under a broad range
of operating conditions is required to reach that goal. This research focuses on the
improvement of feature extraction by using the Wavelet transform instead of the
Fourier transform; the reasons for this replacement are described in chapter 3 on
page 9. Speaker adaptation and speech understanding are improved by a well-known
machine learning model, the Artificial Neural Network, which is described in chapter 4
on page 19. That chapter gives an introduction to the fundamentals of neural
networks, learning rules, network architectures, the mathematical analysis of these
networks, and their application to practical engineering problems, such as nonlinear
regression, pattern recognition, signal processing, data mining, control systems and
other real-world problems. Finally, the proposed feature extraction characterizes the
input signal by means of the Wavelet transform instead of the Fourier transform. This
research aims to develop the ability to classify command input signals, improve the
accuracy of voice command recognition and support intelligent voice recognition
systems in computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications by using a Distributed Artificial Neural
Network, as shown in chapter 5 on page 40 and chapter 6 on page 56.
Chapter 2
Introduction of Human Speech
and Speaker Recognition
In this chapter, we present a brief history of research in automatic speech and speaker
recognition. The chapter surveys the major themes and advances of the past 65 years
of research in order to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of speech
communication. On the one hand, outstanding techniques have been developed; on
the other hand, many challenges have yet to be overcome before the ultimate goal of
creating machines that can communicate naturally with humans is achieved. Such
machines have to deliver satisfactory performance under a broad range of operating
conditions.
2.1 Introduction
Speech is the primary means of communication between humans. Driven both by
researchers' technological curiosity about the mechanisms underlying the mechanical
realization of human speech capabilities and by the desire to automate simple tasks
that necessitate human-machine interaction, research in automatic speech and speaker
recognition by machines has attracted a great deal of attention. Statistical modeling of
speech, automatic speech recognition systems and many extensive voice-based
applications require a human-to-machine interface, e.g., automatic call processing in
telephone networks and query-based information systems that provide updated travel
information, stock price quotations and weather reports.
With reference to the chapter title, “Human Speech and Speaker Recognition” refers
to two types of voice recognition: “Speaker Recognition”, which determines who is
speaking, and “Speech Recognition”, which determines what is being said; both are
explained in the following sections.
2.2 Speech recognition
Speech recognition is also known as Automatic Speech Recognition (ASR), computer
speech recognition or Speech-to-Text. It is an interdisciplinary subfield of
computational linguistics that incorporates knowledge and research from linguistics,
computer science and electrical engineering in order to develop methodologies and
technologies enabling the recognition and translation of spoken language into text,
words or sentences by computers, computerized devices or robots.
On the one hand, some types of speech recognition use training sets in which an
individual speaker enrolls his or her words or an isolated vocabulary into the system.
The recognizer analyzes the person's voice and uses the result to identify that person's
speech and increase recognition accuracy. On the other hand, speech recognition that
does not use such training sets is known as a speaker-independent system.
From a technology perspective, the speech recognition field has a long history with
several waves of major innovations. Over the past 65 years of research, the field has
benefited from advances in machine learning and big data processing. These advances
are evidenced not only by the academic papers published in the field, but also by the
worldwide industry adoption of a variety of machine learning methods for designing
and deploying speech recognition systems. Industry competitors include Google,
Microsoft, Hewlett Packard Enterprise, IBM, Baidu, Apple, Amazon, Nuance,
iFlyTek, etc. Many of these competitors base the core technology of their speech
recognition systems on fundamental digital signal processing and information theory
[1].
2.3 Speaker recognition
Speaker recognition, or voice recognition, is the identification of a person from the
characteristics of his or her voice (voice biometrics). The difference between the two
terms is that speaker recognition determines who is speaking, i.e. the act of
authenticating a person, whereas speech recognition determines what the person said.
Unfortunately, these two terms are frequently confused, and “voice recognition” is
used for both. Speaker recognition can simplify the task of translating speech in
systems that have been trained on a specific person's voice, and it is used to
authenticate or verify the identity of a speaker as part of a security process. The long
history of speaker recognition is described in [1].
2.4 Summary of the Technology Progress
Brief summaries of the progress of research in speech and speaker recognition are
given in [1]. It can be seen that systems have been developed intensively worldwide,
spurred on by technological advances in signal processing, algorithms, architectures
and hardware. The technological progress of the past 65 years can be summarized by
the following changes:
1. Template matching to corpus-base statistical modeling, e.g. HMM and n-grams
2. Filter bank/spectral resonance to Cepstral features
3. Heuristic time-normalization to DTW/DP matching
4. “Distance”-based to likelihood-based methods
5. Maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI
6. Isolated word to continuous speech recognition
7. Small vocabulary to large vocabulary recognition
8. Context-independent units to context-dependent units for recognition
9. Clean speech to noisy/telephone speech recognition
10. Single speaker to speaker-independent/adaptive recognition
11. Monologue to dialogue/conversation recognition
12. Read speech to spontaneous speech recognition
13. Recognition to understanding
14. Single-modality to multimodal speech recognition
15. Hardware recognizer to software recognizer, and
16. No commercial application to many practical commercial applications.
The majority of these technological changes have been directed toward increasing
recognition robustness, together with many other significant techniques. Most of the
items above cover both the speech and speaker recognition fields. These recognizers
have been developed for a wide variety of applications, ranging from small-vocabulary
keyword recognition over dialed-up telephone lines to medium-vocabulary
voice-interactive command and control systems for business automation,
large-vocabulary speech transcription, spontaneous speech understanding and
limited-domain speech translation.
Table 2.1 State-of-the-art Level of ASR techniques
Processing Techniques State-of-the-art Level
Signal Conditioning 1
Speech Enhancement 3
Digital Signal Transformation 1
Analog Signal Transformation 1
Digital Parameter 2
Feature Extraction 3
Re-Synthesis 1
Orthographic Synthesis 3
Speaker Normalization 3
Speaker Adaptation 3
Situation Adaptation 3
Time Normalization 2
Segmentation And Labeling 2
Language Statistics 3
Syntax 2
Semantics 3
Speaker and Situation Pragmatics 3
Lexical Matching 3
Speech Understanding 2 – 3
Speaker Verification 1
Speaker Recognition 3
System Organization And Realization 1 – 3
Performance Evaluation 3
Many new technological methods have been endorsed by researchers. Nevertheless, a
number of practical limitations that hinder the widespread deployment of applications
and services remain unsolved, as shown in [1] and summarized in Table 2.1, where
state-of-the-art level 1 means that various useful methods have already overcome the
issue, level 2 means that some methods can possibly resolve or overcome it, and
level 3 means that there is still no hope of solving it; in other words, “a long way
to go”.
This research focuses on improving feature extraction by using the Wavelet transform
instead of the Fourier transform; the reasons for this replacement are described in
chapter 3. Moreover, the behavior of the proposed method can be characterized by
the fundamentals of the Wavelet transform of an input signal. Speaker adaptation and
speech understanding are improved by combining this transform with a well-known
deep learning algorithm, the multi-layer Artificial Neural Network, described in
chapter 4. The proposed research aims to develop the ability to classify command
input signals, improve the accuracy of voice command recognition and support
computational linguistics, speech-controlled robot applications, and speech
analysis-synthesis and recognition applications.
Chapter 3
Wavelet Theory
A spectrogram, also known as a spectral waterfall, voiceprint or voice-gram, is a visual
representation of the spectrum of a sound, voice or signal as it varies with time or
another dependent parameter. Spectrograms are used to identify spoken words
phonetically and to analyze the various calls of animals, and they are utilized
extensively in the fields of music, sonar, radar, speech processing, seismology, etc.
In order to construct a spectrogram, the Fourier transform is an effective method for
analyzing the frequency components of a voice. However, when the Fourier transform
is taken over the whole time axis, the result is only an imprecise approximation of the
instant at which a particular frequency arises. A Short-time Fourier transform uses a
sliding window function to produce a spectrogram, which gives information on both
time and frequency. Nonetheless, the length of the window function limits the
frequency resolution. The Wavelet transform is proposed as a solution to this
problem. It is based on small waves of limited duration, known as the Mother
Wavelet. Translated versions of the wavelet are applied so as to use computational
resources as efficiently as the Fast Fourier transform, as described in [2], [3], [4], [5],
[6], [7] and [8].
3.1 Overview
The Wavelet transform and the theoretical framework for wavelets are widely
developed and offer a potential alternative in a variety of areas of science, such as
intelligent voice recognition in computational linguistics and voice-controlled
applications. Before investigating this valuable tool, we start by describing the
classical frequency analysis tool, the Fourier transform. Analyzing a signal with the
Fourier transform yields information about each frequency component. On the one
hand, the standard Fourier analysis explores the behavior of the magnitude output
|f(ω)|, which refers to no specific time, so strange frequency peak phenomena can
occur; moreover, short transient outbursts make no noticeable contribution to the
frequency spectrum. On the other hand, Wavelet analysis supplies both time and
frequency information, although the two parameters cannot be determined exactly
and simultaneously because of the Heisenberg uncertainty relation. Analytically, the
Wavelet transform is defined as a continuous mathematical transform; however,
computer analysis requires sampled signals and consequently the discrete version of
the transform. Therefore, both the continuous and the discrete transforms are
investigated.
3.2 Literature Reviews
3.2.1 Frequency Analysis using Fourier Transform
The Fourier transform is an effective method for analyzing frequency content. It is
named after its inventor, Joseph Fourier, in the early 1800s [8]. The Continuous
Fourier Transform (CFT) of a function f is defined as follows:
f(ω) = ∫_{−∞}^{∞} f(x) e^{−iωx} dx    (3.1)
and its inverse transform is defined as follows:
f(x) = (1/2π) ∫_{−∞}^{∞} f(ω) e^{iωx} dω    (3.2)
where f(ω) is the amplitude of the sinusoidal wave e^{iωx} in the function f(x). In
addition, the frequency analysis properties of the Fourier transform provide useful
mathematical relations; e.g., convolution in the time domain corresponds to
multiplication in the frequency domain. Thus, the CFT is used in standard
mathematical derivations.
In contrast, the Discrete Fourier Transform (DFT) is the discretized form of (3.1);
the frequency output f[k] of a periodic discrete signal f[n] is defined as follows [8]:
f[k] = (1/N) Σ_{n=0}^{N−1} f[n] e^{−i2πkn/N}    (3.3)
and its inverse transform is defined as follows:
f[n] = Σ_{k=0}^{N−1} f[k] e^{i2πkn/N}    (3.4)
where f[k] is the amplitude of the sinusoidal wave e^{i2πkn/N} in the discretized
function f[n]. In fact, the DFT is suited to signals from real applications, which
always have finite length. Moreover, there is an outstanding algorithm for computing
the DFT, called the Fast Fourier Transform (FFT).
Above all, the Fourier transform is a suitable method for stationary signals.
Nonetheless, for non-stationary signals the strange frequency peak phenomenon
occurs.
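The DFT pair (3.3) and (3.4) can be checked numerically. The sketch below (Python with NumPy; all names are illustrative, not from this thesis) implements both sums directly. Note that (3.3) places the 1/N factor on the forward transform, whereas NumPy's fft places it on the inverse, so the two outputs differ by exactly that factor.

```python
import numpy as np

def dft(f):
    """Direct DFT with the 1/N normalization of equation (3.3)."""
    N = len(f)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # f[k] = (1/N) * sum_n f[n] * exp(-i 2 pi k n / N)
    return (np.exp(-2j * np.pi * k * n / N) @ f) / N

def idft(F):
    """Inverse DFT of equation (3.4): f[n] = sum_k F[k] * exp(i 2 pi n k / N)."""
    N = len(F)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return np.exp(2j * np.pi * n * k / N) @ F

# A stationary two-tone test signal: the DFT resolves both frequencies.
fs = N = 64
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

F = dft(x)
# NumPy puts the 1/N factor on the inverse transform, so divide to match (3.3).
assert np.allclose(F, np.fft.fft(x) / N)
assert np.allclose(idft(F), x)  # perfect reconstruction
```

Because both tones fall exactly on DFT bins, the spectrum shows two clean peaks, consistent with the claim above that the Fourier transform suits stationary signals.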
3.2.2 Time-Frequency Analysis using Fourier Transform
A Short-time Fourier Transform (STFT) estimates the frequency content of a
function f(x) at an arbitrary time x = t by cutting out a piece of the function around
t and computing its Fourier transform. The name is easily justified, since the STFT is
a Fourier transform of a short piece of the function during a short period of time; the
restriction in time is known as a translated window, or window function [8]. The
STFT of a function f with respect to g is defined as follows:
V_g f(t, ξ) = ∫_{−∞}^{∞} f(x) g_{t,ξ}(x) dx    (3.5)

where

g_{t,ξ}(x) = e^{iξx} g(x − t)    (3.6)
and g is a real, symmetric window function, e.g., the Rectangular, Hamming or
Blackman–Harris window. It can be seen that the STFT uses this sliding window to
produce a spectrogram, which offers information on both time and frequency.
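As a sketch of (3.5) and (3.6) in discrete form (Python; the function names, window choice and test signal are illustrative assumptions, not taken from this thesis), a real, symmetric Hamming window slides along the signal and a DFT is taken of each windowed piece:

```python
import numpy as np

def stft(f, win_len=64, hop=16):
    """Discretized STFT of (3.5): slide a real, symmetric window g along
    the signal and take the Fourier transform of each windowed piece."""
    g = np.hamming(win_len)                   # window function g
    frames = []
    for start in range(0, len(f) - win_len + 1, hop):
        piece = f[start:start + win_len] * g  # f(x) * g(x - t)
        frames.append(np.fft.rfft(piece))     # spectrum of this short piece
    return np.array(frames)                   # shape: (time frames, freq bins)

# A two-part signal: a 5 Hz tone for the first second, then 20 Hz.
fs = 256
t = np.arange(2 * fs) / fs
x = np.where(t < 1.0, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t))

S = np.abs(stft(x))
# The dominant frequency bin moves upward between the first and last frames,
# which a whole-axis Fourier transform could not localize in time.
assert S[0].argmax() < S[-1].argmax()
```

The frequency resolution of every frame is fixed at fs/win_len, which is exactly the window-length limitation discussed below.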
Next, we investigate the localization, i.e. the time and frequency spread, of the
window function g. The time spread σx(g) of g is given as follows:

σ²x(g) = ∫_{−∞}^{∞} (x − t)² |g_{t,ξ}(x)|² dx = ∫_{−∞}^{∞} x² |g(x)|² dx    (3.7)
and the frequency spread σω(g) is given as follows:

σ²ω(g) = (1/2π) ∫_{−∞}^{∞} (ω − ξ)² |g_{t,ξ}(ω)|² dω = (1/2π) ∫_{−∞}^{∞} ω² |g(ω)|² dω    (3.8)
where Vgf(t, ξ) is thought of as a measure of the frequency content of f at time t
and frequency ξ.
The STFT offers simple analytic calculation with simple sine and cosine orthogonal
functions. Moreover, recursive algorithms are applicable, which led to a revolution in
scientific computing. However, the window length limits the resolution in each
frequency band; the algorithm does not handle fast variations or discontinuities of the
signal, such as overshoot and undershoot, and the STFT cannot guarantee a correct
analysis of non-sinusoidal input signals. Hence, a new algorithm for time-frequency
analysis, named the Wavelet Transform, is proposed.
3.3 Fundamental of Wavelet Transform
The wavelet transform is an alternative, effective measurement tool for
time-frequency analysis and the highlight of interest in this dissertation. The STFT
time-frequency window g_{t,ξ} is replaced by a time-scale window ψ_{a,b} with similar
properties but important differences in resolution, as explained in [2], [3], [4], [5], [6],
[7], [8] and the next section.
3.3.1 Continuous Wavelet Transform
A function ψ satisfying the condition ∫_{−∞}^{∞} ψ(x) dx = 0 is called a Wavelet; for
every f, such a ψ defines the continuous wavelet transform. The Continuous Wavelet
Transform (CWT) is given as follows:

f_ψ(a, b) = ∫_{−∞}^{∞} f(x) ψ_{a,b}(x) dx    (3.9)
where

ψ_{a,b}(x) = (1/√a) ψ((x − b)/a)    (3.10)
where the function ψ is known as the Mother Wavelet, chosen to be localized at x = 0
in time and at some ω = ω₀ > 0 in frequency. The parameter a is the input scale,
which plays the role of the frequency variable, and b is the input time variable.
The function ψ_{a,b} has time spread σx(ψ_{a,b}) and frequency spread σω(ψ_{a,b})
around ω₀/a, defined as follows:

σ²x(ψ_{a,b}) = ∫_{−∞}^{∞} x² |ψ(x)|² dx    (3.11)

σ²ω(ψ_{a,b}) = (1/2π) ∫_{0}^{∞} (ω − ω₀)² |ψ(ω)|² dω    (3.12)
Moreover, Parseval's formula gives a time-frequency interpretation:

∫_{−∞}^{∞} f(x) ψ_{a,b}(x) dx = (1/2π) ∫_{−∞}^{∞} f(ω) ψ_{a,b}(ω) dω    (3.13)
It can be seen that the CWT measures the frequency content of f at the frequency
ω₀/a and time b, just as the CFT does. In the frequency domain,
ψ_{a,b}(ω) = √a e^{−iωb} ψ(aω), so ψ_{a,b} is a dilated copy of ψ. Changing the
dilation parameter a changes the support of ψ_{a,b} in time and rescales ψ; changing
the translation parameter b, on the other hand, changes the location of ψ_{a,b}.
Hence, by varying the parameter set (a, b), the transform is computed on the entire
time-frequency plane. It can be seen that small scales correspond to high frequencies.
This establishes the link with the expressions of Fourier analysis, and it is the reason
the term time-frequency plane is used for Wavelet analysis instead of the more
natural time-scale plane.
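Equations (3.9) and (3.10) can be evaluated numerically at a single point (a, b). The sketch below (Python; the Mexican-hat wavelet, test tone and function names are illustrative assumptions, since no particular mother wavelet is fixed at this point in the text) shows that the coefficient magnitude is large when the scale a matches the frequency of the analyzed tone, and small when it does not:

```python
import numpy as np

def psi(x):
    """Mexican-hat mother wavelet: zero mean, localized at x = 0."""
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def cwt_point(f, x, a, b):
    """Equation (3.9) at a single point (a, b), by direct numerical integration."""
    psi_ab = psi((x - b) / a) / np.sqrt(a)  # equation (3.10)
    dx = x[1] - x[0]
    return np.sum(f * psi_ab) * dx

# A 2 Hz cosine tone sampled on [0, 4].
x = np.linspace(0.0, 4.0, 4096)
f = np.cos(2 * np.pi * 2 * x)

# The Mexican hat is centred near omega_0 = sqrt(2) rad, so the matching
# scale for an angular frequency of 2*pi*2 rad/s is about a = sqrt(2)/(4*pi).
a_match = np.sqrt(2) / (4 * np.pi)
coef_match = abs(cwt_point(f, x, a_match, b=2.0))
coef_off = abs(cwt_point(f, x, a_match / 8, b=2.0))  # badly mismatched scale
assert coef_match > 10 * coef_off
```

This illustrates the statement above: small scales probe high frequencies, and the coefficient at (a, b) is the correlation of f with a dilated, translated copy of ψ.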
3.3.2 Discrete Wavelet Transform
Under certain restrictions on the Mother Wavelet ψ, all information in the
transformed signal is preserved when the Wavelet transform is sampled on certain
discrete subsets of the time-frequency plane. Precisely, the values of the continuous
transform at these points are the coefficients of a corresponding wavelet basis series
expansion. Referring to the CWT equation (3.9), consider the case a = 2^{−j},
b = 2^{−j}k for integers j, k. Then:
ψ_{2^{−j},2^{−j}k}(x) = (1/√(2^{−j})) ψ((x − 2^{−j}k)/2^{−j}) = 2^{j/2} ψ(2^{j}x − k)    (3.14)
where w_{j,k} represents the values of the CWT, known as the wavelet coefficients.
The coordinates (2^{−j}k, 2^{−j}) represent a dyadic grid in the time-scale plane, and
the values correspond to the correlation between f and ψ_{a,b} at the specific points
(a, b). This sampling offers sufficient information to make a perfect reconstruction of
the signal possible, provided that special conditions on the wavelet function ψ are
fulfilled. Moreover, it is possible to construct a function ψ such that (ψ_{j,k})_{j,k}
forms an orthonormal basis, which leads to the Discrete Wavelet Transform (DWT).
The first to construct a smooth wavelet basis was Jan-Olov Strömberg, who has since
been followed by several others, e.g., Daubechies and Meyer. A wavelet decomposition
of a function f with orthonormal wavelet basis functions is given as follows:
f = Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} w_{j,k} ψ_{j,k}    (3.15)

where

w_{j,k} = ⟨f, ψ_{j,k}⟩    (3.16)
In equations (3.15) and (3.16), the DWT is a doubly infinite summation over both the
time index k and the scale index j. However, the DWT allows the summation to be
truncated to finitely many terms within an acceptable tolerance. For infinitely
supported wavelets, the wavelet energy is concentrated within a certain interval;
overall, the finite summation over k is a valid approximation, as explained in the next
paragraph.
The decomposition of the signal into different frequency bands is obtained simply by
successive high-pass and low-pass filtering of the time domain signal in three steps.
First, the original signal is passed through a one-band high-pass filter g and a
one-band low-pass filter h. Second, half of the samples are eliminated by
downsampling in accordance with the Nyquist rule: the filtered signal now has a
highest frequency of π/2 radians instead of π, so it is subsampled by 2 by discarding
every other sample. Last, the next level of decomposition is executed recursively from
steps one and two. One level of decomposition is expressed as follows:
φ(x) = Σ_{k=−∞}^{∞} h_k φ(2x − k)    (3.17)

ψ(x) = Σ_{k=−∞}^{∞} g_k φ(2x − k)    (3.18)
where φ(x) is the scaling (dilation) function of the DWT, ψ(x) is the wavelet
function, and h_k and g_k are the coefficient sequences obtained from each wavelet
function; (3.17) and (3.18) are known as the scaling equation and the wavelet
equation, and their coefficients as the scaling coefficients and wavelet coefficients,
respectively. This method is also known as multiresolution analysis.
The coefficients h_k and g_k from the scaling and wavelet equations (3.17) and (3.18) operate as low pass filters or “approximations” and high pass filters or “details”, respectively. These filters are utilized in a fast filter bank algorithm known as Mallat's algorithm, whose computational cost is lower than that of the FFT. The algorithm is briefly illustrated in figure 3.1 by [2].
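One decomposition level of this filter bank can be sketched as follows, using the Haar pair as an assumed stand-in for a general quadrature mirror filter pair; each branch filters and then subsamples by 2, and orthonormality preserves the signal energy across the two bands.

```python
import numpy as np

# One filter bank level: low pass h (approximation) and high pass g
# (detail), each followed by downsampling by 2. The Haar pair below is
# an assumed stand-in for any quadrature mirror filter pair.
h = np.array([1.0, 1.0]) / np.sqrt(2)   # low pass (scaling) filter
g = np.array([1.0, -1.0]) / np.sqrt(2)  # high pass (wavelet) filter

def dwt_level(x):
    """Filter, then keep every other sample (subsample by 2)."""
    approx = np.convolve(x, h)[1::2]
    detail = np.convolve(x, g)[1::2]
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = dwt_level(x)  # each half the length of x
```

Applying `dwt_level` again to the approximation band gives the next level of the recursive decomposition.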
3.4 Heisenberg ’s Uncertainty Principles
The wavelet functions are localized in both time and frequency. However, wavelet functions are unable to achieve exact localization due to Heisenberg's uncertainty principle. The
[Figure: a multilevel filter bank; at each level the signal passes through a low pass filter and a high pass filter, each followed by downsampling by 2.]
Fig. 3.1 Schematic representation of the discrete wavelet transform.
localization measures σx(ψa,b) and σω(ψa,b) are illustrated as the sides of non-spaced
rectangles or boxes in the time-frequency plane shown in figure 3.2 by [6].
Fig. 3.2 Heisenberg boxes of wavelets (left) and STFT (right).
All of these boxes have the same area. Their sides are stretched and contracted by the same factors a and a⁻¹ as the corresponding wavelet functions. On the one hand, the wavelet transform offers a higher time resolution at higher frequencies, which makes it advantageous for analyzing signals that contain both low and high frequencies. Accordingly, the DWT filter bank is a suitable algorithm for realizing this optimized scheme of Heisenberg box shapes. On the other hand, the STFT offers equal resolution across the entire time-frequency plane. A short window allows the analysis of transient components of a signal, such as high frequencies, while a broader window allows the analysis of low frequencies. It can be seen that the STFT is unable to analyze high and low frequencies simultaneously.
Chapter 4
Neural Network Theory
Neural networks are motivated and initiated by the recognition problem in brain science, where the brain computes in an entirely different way from the conventional digital computer. Ramon y Cajal introduced the idea of neurons as structural constituents of the brain in 1911. Basically, neurons are five to six orders of magnitude slower than computer logic gates. However, the brain compensates for the relatively slow operation of a neuron by constructing a countless number of nerve cells with massive interconnections between the cells. The estimated number of nerve cells is approximately 10 billion neurons in the human cortex with 60 trillion connections, making the brain an enormously efficient structure.
Research in the field of artificial neural networks has attracted increasing attention since 1943, when the first model of artificial neurons was presented by Warren McCulloch and Walter Pitts. When Minsky and Papert published their book Perceptrons in 1969, in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued, such as Teuvo Kohonen, Stephen Grossberg, James Anderson and Kunihiko Fukushima. Sophisticated proposals have been made from decade to decade. Finally, mathematical analysis of the new models solved some of the networks' mysteries but still left many questions open for future investigation. In other words, the study of neurons, interconnections, and the brain's elementary building blocks is one of the most dynamic and important research fields in modern biology; the relevance of this endeavor is illustrated by the fact that between 1901 and 1991 approximately 10% of the Nobel Prizes for Physiology or Medicine were awarded to scientists who contributed to the understanding of the brain. It can be seen that the artificial neural network model has been continuously researched and developed by many researchers from decade to decade.
This chapter gives an introduction to the fundamentals of neural networks, learning rules, network architectures, the mathematical analysis of these networks and their application to practical engineering problems, such as nonlinear regression, pattern recognition, signal processing, data mining, control systems and real world problems, with reference to [9], [10], [11], [12], [13] and [14].
4.1 Biological Inspiration
The brain consists of a large number of highly connected elements known as neurons. Proposed neuron models have three principal components: the dendrites, the cell body and the axon. The dendrites are tree-like receptive networks of nerve fibers, which carry electrical signals into the cell body. The cell body effectively sums and thresholds these incoming signals. The axon is a single long fiber, which carries the signal from the cell body out to other neurons. The point of contact between an axon of one cell and a dendrite of another cell is called a synapse; the arrangement of neurons and the strengths of the individual synapses are determined by a complex chemical process. A simplified schematic diagram of two biological neurons is illustrated in figure 4.1 by [10], [11] and [12].
Regularly, some of the neural structure is defined at birth. Other parts are developed through learning, as new connections are constructed and unused connections are removed. It can be seen that development is most noticeable in the early stages of
[Figure: two neurons, with dendrites, cell body, axon and synapse labeled.]
Fig. 4.1 Schematic Drawing of Biological Neurons.
life. Neural structures continue to change throughout life. These later changes tend to consist mainly of the strengthening or weakening of synaptic junctions; e.g., it is believed that new memories are formed by the modification of these synaptic strengths. Thus, the process of learning a new friend's face consists of altering various synapses. As another example, the hippocampus of London taxi drivers is significantly larger than average because they must memorize a large amount of navigational information, a process of learning and adaptation that takes more than two years.
Artificial neural networks have the ability to simulate the complexity of the brain, with two key similarities between biological and artificial neural networks. First, the building blocks of both networks are simple computational devices, although artificial neurons are much simpler than biological neurons. An artificial neural network is highly
interconnected, as is a biological neural network. Second, the connections between neurons determine the function of the network. The determination of the appropriate connections to solve particular problems is the primary objective of this chapter. Biological neurons are very slow when compared to electrical circuits, yet the brain is able to perform many tasks much faster than any conventional computer because of the massively parallel structure of biological neural networks, in which all of the neurons operate at the same time. Artificial neural networks share this parallel structure with the human brain. Moreover, their parallel structure makes them ideally suited to implementation using VLSI, optical devices and parallel processors.
4.2 Neuron Model
4.2.1 General Neuron
A single input neuron is illustrated in figure 4.2 by [12]. The scalar input p is multiplied by the scalar weight w to form wp, one of the terms sent to the summation. The other input, 1, is multiplied by a bias b and then passed to the summation. The summation output n is referred to as the net input, which goes into a transfer function f to produce the scalar neuron output a. Other books use the term activation function rather than transfer function and offset rather than bias. Generally, this simple model relates to the biological neuron shown in the previous section: the weight corresponds to the strength of a synapse, the cell body is represented by the summation and the transfer function, and the neuron output represents the signal on the axon.
All in all, the neuron output is calculated as follows:

a = f(wp + b)   (4.1)
Fig. 4.2 Single Input Neuron.
Fig. 4.3 Multiple Input Neuron.
The actual output depends on the particular transfer function, which is selected by the designer. The bias is like a weight, except that it has a constant input of 1; a particular neuron may also omit this bias. The weight w and bias b are both adjustable scalar parameters of the neuron. Typically, the transfer function is chosen
by the designer. The parameters w and b are adjusted by a learning rule so that the neuron input-output relationship achieves some specific goal.
As a rule, a neuron may also have more than one input; a neuron with R inputs is illustrated in figure 4.3 by [12]. The individual elements of the input vector p are each weighted by the corresponding elements of the weight vector w.
The neuron has a bias b, which is summed with the weighted inputs to form the net input, as given below:

n = b + Σ_{i=1}^{R} w_i p_i   (4.2)
This expression can be written in vector form:

n = w^T p + b   (4.3)

The output is expressed as:

a = f(w^T p + b)   (4.4)
On the whole, neural networks are described with a weight matrix W. Each row of the weight matrix represents one neuron's weight vector connections, e.g., the set of the 1st neuron's weight connections is {w_{1,1}, w_{1,2}, w_{1,3}, . . . , w_{1,i}, . . . , w_{1,R}} and the 2nd neuron's weight connections are {w_{2,1}, w_{2,2}, w_{2,3}, . . . , w_{2,i}, . . . , w_{2,R}}.
Thus, the matrix form is expressed as:

n = Wp + b   (4.5)

Likewise, the output can be expressed as:

a = f(Wp + b)   (4.6)
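A minimal numerical sketch of (4.5) and (4.6), with illustrative weight, bias and input values assumed for a layer of two neurons with three inputs:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# A layer of S = 2 neurons with R = 3 inputs, per (4.5)-(4.6);
# row i of W holds the weights of neuron i.
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])
b = np.array([0.1, -0.3])
p = np.array([1.0, 0.5, 2.0])

n = W @ p + b   # net input, (4.5)
a = logsig(n)   # layer output, (4.6)
```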
These equations adopt a particular convention in assigning the indices of the elements of the weight matrix. The first index indicates the particular destination neuron for that weight. The second index indicates the source of the signal fed to the neuron. Thus, w_{1,2} represents the connection to the first neuron from the second source. This kind of matrix expression is utilized throughout this chapter and this research. Moreover, the simple abbreviated notation of a neuron can be illustrated as in figure 4.4, noting that the bias b is already included at output a [12], below:
Fig. 4.4 Abbreviated Notation of Neuron.
4.2.2 Transfer Functions
The transfer function in figure 4.2 or figure 4.3 may be a linear function or a nonlinear function of n, as given by [10], [11] and [12]. A particular transfer function is chosen to satisfy the specification of the problem that the neuron is attempting to solve. A variety of transfer functions are included here. In this section, the log-sigmoid and soft-max transfer functions are described. First, the log-sigmoid transfer function is described in figure 4.5.
The log-sigmoid transfer function takes an input, which may have any value between plus and minus infinity, and squashes the output into the range 0 to 1, according to the following expression:
a = 1 / (1 + e^(−n))   (4.7)

Fig. 4.5 Log-Sigmoid Transfer Function.
Fig. 4.6 Soft-Max Transfer Function.
The log-sigmoid transfer function is commonly utilized in multilayer neural networks, which are trained using the backpropagation algorithm or learning rule, because this function is differentiable. The reason differentiability matters is described in the last section.
Second, the soft-max transfer function is a generalization of the logistic function that squashes a vector n of arbitrary real values to a vector a of real values in the range 0 to 1, as described in figure 4.6. In neural network
simulations, the soft-max function is generally implemented at the final layer of a network for classification problems. Such networks are trained under a log loss or cross-entropy regime, which gives a non-linear variant of logistic regression.
Table 4.1 List of Transfer Functions

Transfer Function Name         Input-Output Relation                                        Short Name
Hard Limit                     a = 0 if n < 0;  a = 1 if n ≥ 0                              hardlim
Symmetrical Hard Limit         a = −1 if n < 0;  a = 1 if n ≥ 0                             hardlims
Linear                         a = n                                                        purelin
Saturating Linear              a = 0 if n < 0;  a = n if 0 ≤ n ≤ 1;  a = 1 if n > 1         satlin
Symmetric Saturating Linear    a = −1 if n < −1;  a = n if −1 ≤ n ≤ 1;  a = 1 if n > 1      satlins
Log-Sigmoid                    a = 1/(1 + e^(−n))                                           logsig
Hyperbolic Tangent Sigmoid     a = (e^n − e^(−n))/(e^n + e^(−n))                            tansig
Positive Linear                a = 0 if n < 0;  a = n if n ≥ 0                              poslin
Soft Max                       a_i = e^(n_i) / Σ_j e^(n_j)                                  softmax

The soft-max transfer function is given as follows:

a_i = e^(n_i) / Σ_j e^(n_j)   (4.8)
The soft-max transfer function is the gradient-log-normalizer of the categorical
probability distribution. Consequently, the soft-max transfer function is implemented in various probabilistic multiclass classification methods, including multinomial logistic regression, multiclass linear discriminant analysis and Bayes classifiers in artificial neural networks.
Most of the transfer functions implemented in networks worldwide are summarized in table 4.1 by [10].
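A few of the transfer functions in table 4.1 can be sketched directly (the max-shift inside the soft-max is a standard numerical-stability device, not part of (4.8)):

```python
import numpy as np

# Sketches of a few transfer functions from table 4.1.
def hardlim(n):
    return np.where(n >= 0, 1.0, 0.0)

def purelin(n):
    return n

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def softmax(n):
    # (4.8); subtracting max(n) leaves the result unchanged but
    # avoids overflow in exp for large net inputs
    e = np.exp(n - np.max(n))
    return e / e.sum()

n = np.array([1.0, 2.0, 3.0])
a = softmax(n)  # components sum to 1; the largest input gets the largest output
```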
4.3 Network Architectures
4.3.1 A Layer of Neurons
A single-layer network of S neurons is shown in figure 4.7; each of the R inputs is connected to each of the neurons, and the weight matrix now has S rows.
The layer includes the weight matrix, the output vector a and the transfer function boxes, as well as the bias vector and the summations. Each element of the input vector p is connected to each neuron through the weight matrix W. Each neuron has a bias b, a summation, a transfer function f and an output. Altogether, the outputs form the output vector a. Generally, the input vector elements enter the network through the weight matrix W, given below:
W = [ w_{1,1}  w_{1,2}  · · ·  w_{1,R}
      w_{2,1}  w_{2,2}  · · ·  w_{2,R}
        ...      ...    . . .    ...
      w_{S,1}  w_{S,2}  · · ·  w_{S,R} ]   (4.9)
The row indices of the elements of matrix W indicate the destination neuron as-
sociated with that weight, while the column indices indicate the source of the input for
that weight, e.g., the indices in w2,3 say that this weight represents the connection to
the second neuron from the third source.
Fig. 4.7 Single Layer of Neural Networks.
4.3.2 Multiple Layers of Neurons
In the case of a network with several layers, each layer has its own weight matrix W, its own bias vector b, a net input vector n and an output vector a. The number of the layer is appended as a superscript to the name of each of these variables, e.g., the weight matrix for the first layer is written as W^1 and the weight matrix for the second layer is written as W^2. This notation is used in the three-layer network illustrated in figure 4.8.
There are R inputs, S^1 neurons in the first layer, S^2 neurons in the second layer, etc. Above all, different layers are allowed to have different numbers of neurons. The outputs of the first and second layers are the inputs for layers two and three.
[Figure: a feedforward network with an input layer, two hidden layers and an output layer.]
Fig. 4.8 Three Layer of Neural Networks.
Thus, the second layer can be viewed as a one-layer network with R = S^1 inputs, S = S^2 neurons and an S^2 × S^1 weight matrix W^2. The input to the second layer is a^1 and the output is a^2. The layer whose output is the network output is known as the “output layer”; the other layers are known as “hidden layers”. The network shown above has an output layer at the third layer and two hidden layers at the first and second layers.
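The layered forward pass described above can be sketched as follows; the sizes R = 4, S^1 = 3, S^2 = 3, S^3 = 2 and the random weights, biases and input are illustrative assumptions, and a log-sigmoid transfer function is assumed in every layer:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# Forward pass of a three-layer network as in figure 4.8; the sizes
# and the random weights/biases are illustrative assumptions.
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]                       # R, S^1, S^2, S^3
Ws = [rng.standard_normal((sizes[m + 1], sizes[m])) for m in range(3)]
bs = [rng.standard_normal(sizes[m + 1]) for m in range(3)]

a = rng.standard_normal(sizes[0])          # network input p
for m in range(3):
    a = logsig(Ws[m] @ a + bs[m])          # a^{m+1} = f(W^{m+1} a^m + b^{m+1})
```

The output of each layer becomes the input of the next, exactly as the text describes.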
It can be seen that multilayer networks are more powerful than single-layer networks; e.g., a two-layer network having a sigmoid first layer and a linear second layer can be trained to approximate most functions arbitrarily well, whereas a single-layer network cannot offer acceptable performance. All things considered, the number of choices to be made in specifying a network must be carefully considered. The number of inputs to the network and the number of outputs from the network are defined by the external problem specification, e.g., if there are four external
variables to be utilized as inputs, then there are four inputs to the network. Similarly, if there are to be seven outputs from the network, then there have to be seven neurons in the output layer. The desired characteristics of the output signal also help to select the transfer function for the output layer, e.g., if an output is to be either −1 or 1, then a symmetrical hard limit transfer function is suggested. Thus, the architecture of a single-layer network is almost completely determined by the problem specification, including the specific numbers of inputs and outputs and the particular output signal characteristics.
Of course, the networks allow the designer to choose neurons with or without biases. The bias gives the network an extra variable, so networks with biases are more powerful than those without; a neuron without a bias will always have a net input of zero when the network inputs are zero. In the end, biases should be implemented without any confusion.
4.3.3 Recurrent Networks
A recurrent neural network is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, recurrent neural networks utilize their internal memory to process arbitrary sequences of inputs, which makes them applicable to tasks such as unsegmented connected handwriting recognition or speech recognition. However, this network class is not the focus of this research because of the problem specifications.
4.4 Performance Optimization
A suitable weight matrix W is sought that minimizes the chosen error function F(W). Generating a small step in weight space from W to W + δW produces the change in the error function δF ≅ δW^T ∇F(W), where ∇F(W) points in the direction of the greatest rate of increase of the error function. The error F(W) is a smooth continuous function of W, and its smallest value occurs at a point in weight space where the gradient of the error function vanishes, leading to the condition ∇F(W) = 0; otherwise we can take a small step in the direction of −∇F(W) and thereby further reduce the error. Points at which the gradient vanishes are known as stationary points and are further classified into minima, maxima and saddle points. The objective is to find a matrix W such that F(W) takes its smallest value. However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, so there are many points in weight space at which the gradient vanishes. For this reason, an analytical solution to the equation ∇F(W) = 0 cannot in general be found. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on performance optimization. Most techniques involve choosing some initial value W(0) for the weight matrix and then moving through weight space in a succession of steps of the form:
W(k + 1) = W(k) + ∆W(k)   (4.10)
where k is the iteration step. Different algorithms involve different choices for the weight update ∆W(k), and many algorithms make use of gradient information. The simplest approach to using gradient information is to choose the weight update to comprise a small step in the direction of the negative gradient, as shown below:

W(k + 1) = W(k) − α∇F(W(k))   (4.11)
where the parameter α > 0 is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight matrix and the process is repeated. Note that the error function is defined with respect to a training set, so each step requires that the entire training set be processed in order to evaluate ∇F. This simple approach is known as gradient descent or steepest descent, as shown in [9].
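The update rule (4.11) can be sketched on a toy quadratic error function whose minimizer is known (the target W_star and learning rate are assumptions chosen for illustration):

```python
import numpy as np

# Steepest descent (4.11) on a toy quadratic F(W) = ||W - W_star||^2 / 2,
# whose gradient is simply W - W_star; W_star and alpha are assumptions.
W_star = np.array([2.0, -1.0])
W = np.zeros(2)
alpha = 0.1

for _ in range(200):
    grad = W - W_star          # gradient of F at the current W
    W = W - alpha * grad       # W(k + 1) = W(k) - alpha * grad F(W(k))
```

The distance to the minimizer shrinks by a factor of (1 − α) per step, so W converges to W_star.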
There are more efficient methods, such as Conjugate Gradient and Quasi-Newton algorithms, which are much more robust and much faster than simple gradient descent. This research utilized the Scaled Conjugate Gradient algorithm [13], which makes use of second-order information from the neural network. The performance of the Scaled Conjugate Gradient algorithm has been benchmarked against the performance of Conjugate Gradient backpropagation and the one-step Broyden-Fletcher-Goldfarb-Shanno memoryless Quasi-Newton algorithm.
4.5 Optimization Method: Backpropagation
Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses the gradient information to update the weights in order to minimize the loss function. Importantly, backpropagation requires a desired output for each input value in order to calculate the loss function gradient.
The backpropagation algorithm was originally introduced in the 1970s. However, its importance was not fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton and Ronald Williams, which describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural networks to solve problems which had previously been insoluble. Backpropagation is described in [9], [10], [11], [12], [13] and [14].
4.5.1 Performance Index
The backpropagation algorithm for multilayer networks is a generalization of the least-mean-squares algorithm. The training set consists of input/target pairs:

{p_1, t_1}, {p_2, t_2}, {p_3, t_3}, . . . , {p_Q, t_Q}   (4.12)

where p_q is an input to the network and t_q is the corresponding target output. As each input is applied to the network, the network output is compared to the target. The algorithm adjusts the network parameters in order to minimize the mean square error, as given below:
F(x) = E[e^2] = E[(t − a)^2]   (4.13)
where x is the vector of network weights and biases. If the network has multiple outputs, this generalizes as follows:

F(x) = E[e^T e] = E[(t − a)^T (t − a)]   (4.14)
Thus, we approximate the mean square error by:

F(x) = (t(k) − a(k))^T (t(k) − a(k)) = e(k)^T e(k)   (4.15)
where the expectation of the squared error has been replaced by the squared error at iteration k. The steepest descent algorithm for the approximate mean square error is then given as follows:

w^m_{i,j}(k + 1) = w^m_{i,j}(k) − α ∂F/∂w^m_{i,j}   (4.16)
b^m_i(k + 1) = b^m_i(k) − α ∂F/∂b^m_i   (4.17)

where α is the learning rate. The computation of the partial derivatives of F is required, which is described in the next section.
4.5.2 Chain Rule
For the multilayer network the error is not an explicit function of the weights in the hidden layers. Therefore, these derivatives cannot be computed directly from equations (4.13) and (4.14).
Because the error is an indirect function of the weights in the hidden layers, the chain rule of calculus is employed to calculate the derivatives. To review the chain rule, suppose that we have a function f that is an explicit function only of the variable n. We want to take the derivative of f with respect to a third variable w. The chain rule is then:

df(n(w))/dw = df(n)/dn × dn(w)/dw   (4.18)
This concept is applied to find the derivatives in (4.16) and (4.17):

∂F/∂w^m_{i,j} = ∂F/∂n^m_i × ∂n^m_i/∂w^m_{i,j}   (4.19)

∂F/∂b^m_i = ∂F/∂n^m_i × ∂n^m_i/∂b^m_i   (4.20)
The second term in each of these equations can be easily computed, since the net input to layer m is an explicit function of the weights and bias in that layer:

n^m_i = b^m_i + Σ_{j=1}^{S^{m−1}} w^m_{i,j} a^{m−1}_j   (4.21)
Therefore:

∂n^m_i/∂w^m_{i,j} = a^{m−1}_j   (4.22)

∂n^m_i/∂b^m_i = 1   (4.23)
Now define the sensitivity:

s^m_i ≡ ∂F/∂n^m_i   (4.24)

which is the sensitivity of F to changes in the ith element of the net input at layer m. Then (4.19) and (4.20) can be simplified to:
∂F/∂w^m_{i,j} = s^m_i a^{m−1}_j   (4.25)

∂F/∂b^m_i = s^m_i   (4.26)
The approximate steepest descent algorithm is then described as follows:

w^m_{i,j}(k + 1) = w^m_{i,j}(k) − α s^m_i a^{m−1}_j   (4.27)

b^m_i(k + 1) = b^m_i(k) − α s^m_i   (4.28)

All in all, in matrix and vector form this becomes:

W^m(k + 1) = W^m(k) − α s^m (a^{m−1})^T   (4.29)

b^m(k + 1) = b^m(k) − α s^m   (4.30)
where s^m is the vector of sensitivities:

s^m = [ ∂F/∂n^m_1   ∂F/∂n^m_2   ∂F/∂n^m_3   · · ·   ∂F/∂n^m_{S^m} ]^T   (4.31)
4.5.3 Sensitivities
The sensitivities s^m are of central importance, and computing them requires another application of the chain rule. It is this process that gives us the term backpropagation, because it describes a recurrence relationship in which the sensitivity at layer m is computed from the sensitivity at layer m + 1. In order to derive the recurrence relationship for the sensitivities, a Jacobian matrix is defined as follows:
∂n^{m+1}/∂n^m =
[ ∂n^{m+1}_1/∂n^m_1           ∂n^{m+1}_1/∂n^m_2           · · ·   ∂n^{m+1}_1/∂n^m_{S^m}
  ∂n^{m+1}_2/∂n^m_1           ∂n^{m+1}_2/∂n^m_2           · · ·   ∂n^{m+1}_2/∂n^m_{S^m}
  ...                          ...                         . . .   ...
  ∂n^{m+1}_{S^{m+1}}/∂n^m_1   ∂n^{m+1}_{S^{m+1}}/∂n^m_2   · · ·   ∂n^{m+1}_{S^{m+1}}/∂n^m_{S^m} ]   (4.32)
In order to find an expression for the Jacobian matrix, consider the i, j element of the matrix:

∂n^{m+1}_i/∂n^m_j = ∂( b^{m+1}_i + Σ_{l=1}^{S^m} w^{m+1}_{i,l} a^m_l ) / ∂n^m_j
                  = w^{m+1}_{i,j} ∂a^m_j/∂n^m_j
                  = w^{m+1}_{i,j} ∂f^m(n^m_j)/∂n^m_j
                  = w^{m+1}_{i,j} ḟ^m(n^m_j)   (4.33)
Therefore, the Jacobian matrix can be written as:

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m)   (4.34)
where:

Ḟ^m(n^m) =
[ ḟ^m(n^m_1)   0            · · ·   0
  0            ḟ^m(n^m_2)   · · ·   0
  ...          ...          . . .   ...
  0            0            · · ·   ḟ^m(n^m_{S^m}) ]   (4.35)
The recurrence relation for the sensitivity can now be derived by using the chain rule in matrix form:

s^m = ∂F/∂n^m
    = (∂n^{m+1}/∂n^m)^T ∂F/∂n^{m+1}
    = Ḟ^m(n^m) (W^{m+1})^T ∂F/∂n^{m+1}
    = Ḟ^m(n^m) (W^{m+1})^T s^{m+1}   (4.36)
The starting point s^M of the recurrence is obtained at the final layer:

s^M_i = ∂F/∂n^M_i
      = ∂((t − a)^T (t − a))/∂n^M_i
      = ∂( Σ_{j=1}^{S^M} (t_j − a_j)^2 )/∂n^M_i
      = −2(t_i − a_i) ∂a^M_i/∂n^M_i
      = −2(t_i − a_i) ∂f^M(n^M_i)/∂n^M_i
      = −2(t_i − a_i) ḟ^M(n^M_i)   (4.37)
Likewise, this can be expressed in matrix form as:

s^M = −2 Ḟ^M(n^M)(t − a)   (4.38)
This is where the backpropagation algorithm derives its name: the sensitivities are propagated backward through the network from the last layer to the first layer:

s^M → s^{M−1} → · · · → s^2 → s^1   (4.39)
At this point it is worth emphasizing that the backpropagation algorithm uses the same approximate steepest descent technique that we used in the least mean squares algorithm. The only complication is that in order to compute the gradient we need to first back-propagate the sensitivities. The beauty of backpropagation is that it gives us a very efficient implementation of the chain rule.
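The full forward/backward cycle of (4.36)-(4.38) with the updates (4.29)-(4.30) can be sketched for a small 2-3-1 network; all numerical values are illustrative assumptions, and a single update is shown to reduce the squared error (4.15):

```python
import numpy as np

# One backpropagation update for a 2-3-1 network with a logsig hidden
# layer and a linear (purelin) output layer, following the sensitivity
# recursion (4.36)-(4.38) and the updates (4.29)-(4.30). All weight,
# bias and data values are illustrative assumptions.
def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

W1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
b1 = np.array([0.1, 0.1, 0.1])
W2 = np.array([[0.2, 0.3, 0.4]])
b2 = np.array([0.1])
p, t = np.array([1.0, -1.0]), np.array([1.0])
alpha = 0.1

# forward pass
n1 = W1 @ p + b1; a1 = logsig(n1)
n2 = W2 @ a1 + b2; a2 = n2                  # linear output layer
e_old = float((t - a2) @ (t - a2))          # squared error, (4.15)

# backward pass: output sensitivity (4.38), then recursion (4.36);
# the derivative of logsig is a(1 - a), and of purelin is 1
s2 = -2.0 * (t - a2)
s1 = (a1 * (1.0 - a1)) * (W2.T @ s2)

# steepest descent updates (4.29)-(4.30)
W2 = W2 - alpha * np.outer(s2, a1); b2 = b2 - alpha * s2
W1 = W1 - alpha * np.outer(s1, p);  b1 = b1 - alpha * s1
```

Re-running the forward pass with the updated weights gives a smaller squared error than `e_old`, as gradient descent with a small step should.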
In the end, this research utilized the Scaled Conjugate Gradient algorithm. Consequently, the backpropagation algorithm is required within the Scaled Conjugate Gradient algorithm in order to minimize the error with respect to the weight matrix W.
Chapter 5
Implementation of Artificial
Neural Network and
Multilevel of Discrete Wavelet
Transform for Voice
Recognition
This chapter presents an implementation of a simple Artificial Neural Network model with multilevel Discrete Wavelet Transform feature extraction, which achieves high recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, in conversation background noise of up to 65 dB. The performance evaluation is demonstrated in terms of correct recognition rate, maximum noise power of interfering sounds, hit rates, false alarm rates and miss rates. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
5.1 Introduction
During the past 65 years, voice recognition has been extensively implemented for the classification of sound types. A variety of voice recognition techniques have been developed to increase recognition accuracy and recognition rates, drawing on statistical pattern recognition and signal processing, as shown in [1].
A number of algorithms have been proposed and suggested as potential solutions for recognizing human speech, i.e., simple probability distribution fitting methods such as Structural Maximum A Posteriori, Parallel Model Composition and Maximum Likelihood Linear Regression. However, the issue of sequential voice input remained unsolved.
Ferguson et al. proposed the Hidden Markov Model (HMM) in order to solve the issue of sequential voice input. HMM employs a doubly stochastic process using an embedded stochastic function in order to determine the values of the hidden states, as shown in [15]. A high recognition rate design essentially required a state-of-the-art HMM architecture using the Gaussian Mixture Model (GMM), as shown in [15] and [16]. GMM has traditionally been utilized as the voice model for voice recognition using two feature extractions: a power logarithm of the FFT spectrum in order to create Log-power spectrum feature vectors, and Mel-Scale Filter Bank Inverse FFT Dimension Reduction in order to create Mel Frequency Cepstral Coefficient feature vectors. GMM offered high voice recognition rates from 60 to 95% in a static environment in comparison with other machine learning models such as the Support Vector Machine and the Dual Penalized Logistic Regression Machine, as shown in [15]. Nonetheless, GMM requires large amounts of computational resources.
The Pitch-Cluster-Maps (PCMs) model was proposed by Yoko et al. [17] in order to replace the complex training sets with a Binarized Frequency Spectrum, resulting in simple codebook sets using the Short-time Fourier Transform [18], [19] and [20]. The Vector Quantization Approach employed leads to more suitable real-time computation than GMM. Nonetheless, PCMs offered a voice recognition rate of only up to 60% for a 6 sound source environment under low frequency resolution.
This chapter aims to propose an alternative voice recognition method utilizing an Artificial Neural Network and multilevel Discrete Wavelet Transform, with three main advantages. First, the Discrete Wavelet Transform resolves the low frequency prediction issue in order to increase low frequency prediction accuracy. Second, the proposed voice recognition resolves the issue of normal conversation background noise. Last, the proposed voice recognition improves recognition rates up to 95% in comparison with other models.
5.2 Proposed Voice Recognition
The overview of the proposed voice recognition consists of feature extraction, feature normalization, machine learning using an ANN and a decision model, as summarized in figure 5.1.
[Figure: Input → Feature Extraction → Feature Normalization → Machine Learning → Decision/Selection → Output.]
Fig. 5.1 The proposed voice recognition overview.
5.2.1 Feature Extraction
The proposed voice recognition utilized the feature extraction as the pre-processing
methods in order to transform the voices or signals to the time-frequency represented
data. Three pre-processing methods were implemented for voices feature extraction
– 42 –
5.2 Proposed Voice Recognition
consisted of Short-time Fourier Transform (STFT) and Discrete Wavelet Transform
(DWT). In general case, Continuous Wavelet Transform (CWT) can be expressed as
(3.9) and (3.10) where ψa,b(x) is the conjugate of Wavelet function, a is input scales
which represented as frequency variable, b is input time variable, f(t) is the continuous
signal to be transformed and fψ(a, b) is the CWT of a complex function represented the
magnitude of the continuous signal over time and frequency based on specified Wavelet
function.
In particular, the DWT decomposes the signal into a mutually orthogonal set of wavelets, which is the main difference from the CWT, or from its implementation for discrete time series, as shown in the previous chapter. The DWT provides sufficient information in both time and frequency with a significant reduction in computation time compared to the CWT. The DWT can be constructed from the convolution of the signal with the impulse response of the filter expressed in (3.17) and (3.18), redefined as:
\phi(x) = \sum_{k=-\infty}^{\infty} w_k\, \phi(Sx - k) \qquad (5.1)
where ϕ(x) is the dilation reference equation mapping discrete signals from input to output states, S is a scaling factor set to 2, x is the time index, and wk consists of the scaling and wavelet functions obtained from each Mother Wavelet, known as the Quadrature Mirror Filter. The DWT can be represented as a binary hierarchical tree of Low Pass Filters (LPF) and High Pass Filters (HPF); in other words, it can be defined as a Filter Bank, as shown in figure 5.2. In Filter Bank analysis, the length of the discrete signal is halved at each level. The shifting and scaling processes of (3.9) and (5.1) produce a time-scale representation, as shown in figure 5.3. The graphs show the signal amplitude in both the time and frequency domains, using the STFT for the left-hand graph and the CWT for the right-hand graph; the vertical axis represents the frequency band and the horizontal axis the time. Comparing the STFT and the CWT, the Wavelet Transform offers superior time resolution at high frequency components and superior scale resolution at low frequency components, which usually carry the main characteristics or identity of a voice signal.
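As a concrete illustration of (5.1) and the Filter Bank of figure 5.2, the sketch below runs one decomposition level with the Haar (Daubechies-1) filter pair. The thesis itself uses higher-order Daubechies, Symlet and Coiflet filters, so the filter coefficients and the circular boundary handling here are simplifying assumptions.

```python
import math

def dwt_level(signal, lo, hi):
    """One level of a DWT filter bank: convolve with the low-pass (scaling)
    and high-pass (wavelet) filters, then downsample each output by 2."""
    def filt(h):
        n, m = len(signal), len(h)
        # circular convolution keeps the halved length exact (a common choice)
        full = [sum(h[k] * signal[(i - k) % n] for k in range(m)) for i in range(n)]
        return full[::2]  # downsample by 2: lengths halve at every level
    return filt(lo), filt(hi)

# Haar (db1) quadrature mirror filter pair as a stand-in
s = 1 / math.sqrt(2)
lo, hi = [s, s], [s, -s]

# a constant signal keeps all energy in the approximation band
approx, detail = dwt_level([1.0, 1.0, 1.0, 1.0], lo, hi)
```

Feeding `approx` back into `dwt_level` yields the next level of the binary tree, exactly as in the Filter Bank of figure 5.2.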
5.2.2 Feature Normalization
To increase the convergence speed of the machine learning algorithm, a Feature Normalization method is applied. Its simplest form is

\bar{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (5.2)

where \bar{x} is the normalized vector, x is the original vector obtained from the feature extraction, and \min(x) is its offset from zero. Feature Normalization rescales the original vector to the range between 0 and 1.
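A minimal sketch of the normalization in (5.2); the sample feature values are illustrative only.

```python
def minmax_normalize(x):
    """Feature Normalization (5.2): rescale a feature vector to [0, 1]
    by removing its offset min(x) and dividing by its range."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0] * len(x)  # degenerate case: constant feature
    return [(v - lo) / (hi - lo) for v in x]

norm = minmax_normalize([2.0, 4.0, 6.0, 10.0])
```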
5.2.3 Artificial Neural Network Model
An Artificial Neural Network (ANN) is an adaptive system that changes its structure based on external and internal information flowing through the network. ANNs are nonlinear statistical data modeling tools in which complex relationships between inputs and outputs are modeled or patterns are found. The proposed voice recognition therefore uses an ANN to recognize the characteristics or identity of human speech.
The novel network topology, named the nth-order All-features-connecting topology and denoted Hn, is illustrated in figure 5.4, where xf is the input vector in each frequency band, calculated by the feature extraction model, and y is the class probability vector calculated by the ANN. The Hn model combines networks of the A, B and C classes into a simple topology; these network classes are listed in table 5.1, The Proposed Neuron Network Architectures. The four main conditions of
[Figure: binary tree of HPF/LPF pairs, each output downsampled by 2 before the next level.]
Fig. 5.2 The Proposed DWT Filter Bank representation.
[Figure: two time-frequency plots over 0–0.4 s and 0–4 kHz.]
Fig. 5.3 The comparison of STFT (left) and CWT (right).
the novel network topology are defined as follows. First, the number of layers is defined by the order of Hn, where n > 0. Second, the number of input networks is related to the number of input time indices, i.e., the size of the scales vector for the CWT or the number of levels for the DWT. Third, each input network must connect its single output to the first block of the network series. Last, the input, middle and output networks are A, B and C-class networks, respectively.
Table 5.1 The Proposed Neuron Network Architectures
Class Name Control Input Control Output Transfer Function
A-class Single Single Log-sigmoid
B-class Multiple Single Log-sigmoid
C-class Single Single Softmax
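The two transfer functions named in table 5.1 can be sketched as follows; the vector arguments are illustrative only, and the node counts of the actual networks are omitted.

```python
import math

def log_sigmoid(v):
    """Log-sigmoid transfer function of the A- and B-class networks."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    """Softmax transfer function of the C-class output network,
    turning raw scores into class probabilities that sum to 1."""
    m = max(v)                        # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    return [x / sum(e) for x in e]

probs = softmax([2.0, 1.0, 0.1])      # hypothetical output-layer scores
acts = log_sigmoid([0.0, 2.0])        # hypothetical hidden-layer activations
```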
To train the specified ANN, the Scaled Conjugate Gradient Backpropagation supervised learning rule is employed. Additionally, the network uses an Autoassociator pre-learning rule to initialize the weights near the final solution, which accelerates the convergence of the error Backpropagation learning algorithm and reduces the dimensionality of the wavelet packet series.

[Figure: eight A-class input networks feeding B-class networks and a C-class output network across three layers.]
Fig. 5.4 All-features-connecting topology.
5.2.4 Decision Model
The output of the ANN is a vector of class probability values based on the feature set. The decision model selects the class with the maximum probability:

c = \arg\max_{i \in \aleph}\, y_i \qquad (5.3)

where c is the class with the maximum probability, \aleph is the set of class indices, and y_i is the ith element of y = (y_1, y_2, \ldots, y_n)^T calculated by the ANN.
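The argmax decision of (5.3) can be sketched directly; the probability vector below is a hypothetical three-class example.

```python
def decide(y):
    """Decision model (5.3): return the index of the class with the
    highest probability in the ANN output vector y."""
    return max(range(len(y)), key=lambda i: y[i])

y = [0.05, 0.70, 0.25]   # hypothetical 3-class probability vector
c = decide(y)
```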
5.3 Experiment Setup
The proposed voice recognition is implemented in MATLAB®. The recording setup uses an Audio-Technica® AT-VD3 microphone and a ROLAND® UA-101 Hi-speed USB audio capture device. The samples come from five Japanese speakers: two young male speakers, two young female speakers and one middle-aged male speaker. To perform word classification, each speaker pronounces the reference words from the International Phonetic Alphabet (IPA) [21] dataset described in table 5.2. Each voice input is recorded at an 8 kHz sampling frequency with 16-bit resolution and 8000 sample points. The feature set contains 450 elements, obtained from the reference words in the dataset repeated 5 times. For the performance evaluation, 20% of the set is used for testing and 80% for training.
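The 80%/20% split above can be sketched as follows; the shuffling, the fixed seed and the stand-in data are assumptions, since the thesis does not state how the split was drawn.

```python
import random

def split_dataset(features, test_fraction=0.2, seed=0):
    """Shuffle and split a feature set into training and test portions,
    mirroring the 80%/20% split of the experiment setup."""
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    n_test = int(len(features) * test_fraction)
    test = [features[i] for i in idx[:n_test]]
    train = [features[i] for i in idx[n_test:]]
    return train, test

data = list(range(450))   # stand-in for the 450 feature vectors
train, test = split_dataset(data)
```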
Table 5.2 Features Set for Experimentation

Class  Words   IPA
1      パン    pað
2      番      bað
3      先ず    mazu
4      太陽    taijo
5      段々    daðdað
6      通知    tsu:tsi
7      何      nani
8      蘭      óað
9      数字    su:si
10     雑      zatsuzi
11     山      jama
12     脈      mjaku
13     風      kaze
14     外套    gaito:
15     医学    igaku
16     善意    zeði
17     鼻      hana
18     わ      wa
5.4 Experimental Results
The performance evaluation was established in terms of the correct recognition rate, calculated from the sum of the true positive and true negative rates in each class. Moreover, the maximum noise power of interfering sounds is defined on a logarithmic scale as

P_{noise,dB} = 10 \log_{10}\!\left(\frac{P_{noise}}{P_{ref}}\right) \qquad (5.4)

where P_{noise,dB} is the noise power level in decibels (dB), P_{noise} is the noise power in watts and P_{ref} is the reference power in watts (W). The experiments assign P_{ref} = 10^{-12} W, the reference ambient noise level, in order to map voice signal conditions over a spatial regime.
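Equation (5.4) with P_ref = 1e-12 W can be sketched directly; the interfering-sound power passed in below is a hypothetical value.

```python
import math

P_REF = 1e-12   # reference power in watts, as assigned in the experiments

def noise_power_db(p_noise):
    """Noise power level in decibels relative to P_ref, as in (5.4)."""
    return 10.0 * math.log10(p_noise / P_REF)

level = noise_power_db(1e-5)   # hypothetical interfering-sound power in watts
```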
Three experiments were conducted on word classification in order to determine appropriate values of the Wavelet and ANN parameters. The first experiment examined the Wavelet function category and its order, using the set of static parameters shown in table 5.3. The candidates comprised three Wavelet function families, Daubechies, Symlet and Coiflet, with orders from 1 to 16. Table 5.4 shows that several Wavelet functions achieved word classification with correct recognition rates greater than 80% and tolerated noise powers of interfering sounds greater than 50 dB. The Wavelet function was selected on two criteria: the maximum noise power of interfering sound and the maximum correct recognition rate. On this basis, the Daubechies 15 Wavelet function was selected, with a maximum noise power of interfering sound of 65.5 dB and a correct recognition rate of 96.22%.
However, the cost functions of the proposed voice recognition were clearly influenced by the Wavelet function, the Wavelet level and the ANN network topology. Hence, the second experiment was designed to optimize the Wavelet level and the ANN network topology, using the set of static parameters shown in table 5.5. Table 5.6 shows
Table 5.3 The First Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Discrete Wavelet Transform (DWT)
Wavelet Level 6
Wavelet Function variable parameter
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 5.4 The First Experimental Results
Order   Daubechies (db)          Symlet (sym)             Coiflet (coif)
        Pnoise,dB   Rate (%)     Pnoise,dB   Rate (%)     Pnoise,dB   Rate (%)
1 24.50 90.44 none none 64.50 94.89
2 60.38 93.11 0.00 88.44 62.50 93.33
3 63.63 94.00 63.75 94.22 63.50 96.00
4 64.25 92.44 55.50 94.00 42.50 92.00
5 55.13 94.67 63.25 94.67 none none
6 67.75 94.89 65.25 96.00 none none
7 56.00 93.78 61.00 94.67 none none
8 57.50 95.56 35.50 91.11 none none
9 34.50 92.44 67.50 94.22 none none
10 36.25 93.78 64.25 94.44 none none
11 59.00 94.00 0.00 84.00 none none
12 61.50 95.56 61.25 95.11 none none
13 67.50 95.11 67.75 95.78 none none
14 58.00 95.33 65.25 94.67 none none
15 65.50 96.22 27.75 91.33 none none
16 61.50 96.88 65.75 94.44 none none
Table 5.5 The Second Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Discrete Wavelet Transform (DWT)
Wavelet Level variable parameter
Wavelet Function Symlet 7 (sym7)
Network Topology variable parameter
Node Size in Each Layer variable parameter
Table 5.6 The Second Experimental Results
Wavelet Level Network Topology Node Size in Each Layer Pnoise,dB Recognition Rate (%)
1 H1 {1000, 18} 0.00 90.89
2 H1 {1000, 18} 33.00 93.56
1 H2 {1000, 1000, 18} 0.00 90.44
2 H2 {1000, 1000, 18} 29.75 93.33
3 H2 {1000, 1000, 18} 39.75 94.00
4 H2 {1000, 1000, 18} 48.50 94.67
1 H3 {1000, 4000, 1000, 18} 0.00 90.22
2 H3 {1000, 4000, 1000, 18} 28.25 92.22
3 H3 {1000, 4000, 1000, 18} 38.25 94.89
4 H3 {1000, 4000, 1000, 18} 52.25 94.44
5 H3 {1000, 4000, 1000, 18} 54.00 94.00
6 H3 {1000, 4000, 1000, 18} 61.00 94.67
7 H3 {1000, 4000, 1000, 18} 60.00 95.33
8 H3 {1000, 4000, 1000, 18} 62.25 95.78
Table 5.7 The Third Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method variable parameter
Wavelet Level 6
Wavelet Function Symlet 7 (sym7)
STFT Windows Hamming
STFT Time Slot 1 millisecond
STFT Frequency Separation 8
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 5.8 The Third Experimental Results
Feature Extraction Method Pnoise,dB Recognition Rate (%)
Discrete Wavelet Transform (DWT) 61.00 94.67
Short-time Fourier Transform (STFT) 0.00 88.67
that the H3 model with Wavelet levels 4 to 8 achieved word classification with correct recognition rates greater than 94% and noise powers of interfering sounds approaching or exceeding 60 dB at the higher levels. The H3 model with Wavelet level 6 was selected on two criteria: minimizing computation while covering the region of interest in human speech, from 130 Hz to 4 kHz. This configuration gives a correct recognition rate of 94.67% and a noise power of interfering sound of 61 dB.
Finally, the last experiment was designed to verify the hypothesis that Wavelet Transform feature extraction suits the voice recognition application better than the STFT, as shown in table 5.7 and table 5.8. The correct recognition rate and tolerated noise power of interfering sounds with the DWT clearly exceed those of the STFT, because the DWT employs multiresolution analysis and therefore captures the main characteristics or identity of the voice at the low frequency boundary, which depends on the Wavelet function and the length of the input signal.
5.5 Discussions
The optimized parameters for both word and gender classification are summarized in table 5.9. With the optimized parameters, the proposed voice recognition achieved a correct recognition rate of 96.22% and a tolerated noise power of 65.5 dB, which is sufficient for word classification. Moreover, for gender classification it achieved a correct recognition rate of 99.8% and a tolerated noise power of 72.25 dB, which is acceptable.
Table 5.9 The parameter optimization results
Parameter Name               Word Classification      Gender Classification
Feature Extraction Method DWT DWT
Wavelet Level 6 6
Wavelet Function db15 db15
Network Topology H3 H3
Node Size in Each Layer {1000, 4000, 1000, 18} {1000, 4000, 1000, 2}
Pnoise,dB 65.50 72.25
Recognition Rate (%) 96.22 99.80
The performance of the proposed voice recognition was established in terms of the boundaries of the hit rate, false alarm rate and miss rate for gender classification, in order to compare it with other models, i.e., the simple sound database named Pitch-Cluster-Maps (PCMs). The performance of the PCMs model was established in terms of Detection Error Tradeoff (DET) curves for gender classification; in other words, it can be described by upper and lower boundaries of both the false alarm rate and the miss rate. The best hit-rate performance requires the predicted data to approach a true positive rate of 100%. In contrast, the best false alarm and miss rate performance requires the false positive rate and the false negative rate to approach 0%. Therefore, the lower boundary of the hit rate and the upper boundaries of the false alarm rate and the miss rate are the important quantities for the performance evaluation. The proposed voice recognition performance is shown in table 5.10.
Table 5.10 Performance evaluation
Gender   Hit Rate (%), Lower Boundary   False Alarm Rate (%), Upper Boundary   Miss Rate (%), Upper Boundary
Male 99.63 2.78 0.37
Female 97.22 0.37 2.78
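The three rates in table 5.10 follow from confusion-matrix counts; the counts below are hypothetical, chosen only to show the arithmetic.

```python
def rates(tp, fn, fp, tn):
    """Hit, false alarm and miss rates from confusion-matrix counts:
    hit = TP/(TP+FN), false alarm = FP/(FP+TN), miss = FN/(TP+FN)."""
    hit = tp / (tp + fn)
    false_alarm = fp / (fp + tn)
    miss = fn / (tp + fn)
    return hit, false_alarm, miss

# hypothetical counts for one gender class
hit, fa, miss = rates(tp=95, fn=5, fp=2, tn=98)
```

Note that the hit rate and miss rate are complementary by construction, which is why the table reports a lower boundary for one and an upper boundary for the other.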
5.6 Conclusions
This chapter presented an alternative voice recognition method that combines an Artificial Neural Network with a multilevel Discrete Wavelet Transform. The experimental results showed that Wavelet Transform feature extraction achieves recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 65 dB, as found in normal conversation background noise. The performance evaluation was demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and the hit, false alarm and miss rates. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
Chapter 6
Reinforced Voice Recognition
using Distributed Artificial
Neural Network with
Time-Scale Wavelet Transform
Feature Extraction
This chapter presents a new voice recognition method, named the Signal Clustering Neural Network, based on a Distributed Artificial Neural Network model with a single-channel microphone and an enhanced Wavelet Transform feature extraction. It achieves recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 70 dB, as found in normal conversation background noise. The performance evaluation is demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
6.1 Introduction
Voice recognition is extensively used for the classification of sound types. Over the past 65 years, voice recognition techniques have been developed to improve robustness, statistical pattern recognition, signal processing and recognition rates, as shown in [1]. A number of algorithms have repeatedly been proposed as potential solutions for recognizing human voice patterns, i.e., simple probability-distribution fitting methods such as Maximum Likelihood Linear Regression (MLLR), Structural Maximum A Posteriori (SMAP), Parallel Model Composition (PMC), the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM).
In 2009, Yoko et al. proposed the Pitch-Cluster-Maps (PCMs) model based on a binarized Short-time Fourier Transform. In particular, PCMs employed a simple Vector Quantization approach, making them better suited to real-time computation than GMM. Nonetheless, PCMs offered a voice recognition rate of only 50 to 60% in a six-sound-source environment with low frequency resolution.
In the previous chapter, an alternative voice recognition method using an Artificial Neural Network and a multilevel Discrete Wavelet Transform was proposed. On the one hand, that method improved recognition rates to as high as 95% within the human speech frequency range: the Discrete Wavelet Transform resolved the low frequency prediction issue, and robustness to normal conversation background noise was achieved. On the other hand, the length of the discrete signal is halved at each level of the Discrete Wavelet Transform Filter Bank, leading to unbalanced recognition across the frequency band approximations, and the previous method requires a large amount of computational resources because its network topology uses an additional learning rule, the Autoassociator learning rule.
This chapter proposes a new voice recognition method, named the Sound Clustering Neural Network (SCNN), which resolves two main issues. SCNN is implemented with a Distributed Artificial Neural Network, which enables intelligent management of computational resources. SCNN resolves the unbalanced frequency prediction issue by using the Time-scale Wavelet Transform, and it handles normal conversation background noise as well as the previously proposed method. Last, SCNN improves recognition rates to as high as 95% compared with the previously proposed method and other models.
6.2 Proposed Voice Recognition
The proposed voice recognition consists of a new feature extraction, feature normalization, machine learning with a Distributed Artificial Neural Network, and a decision model, as already summarized in figure 5.1 of the previous chapter. The new feature extraction and the Distributed Artificial Neural Network are explained in the following sections.
6.2.1 New Feature Extraction
The proposed voice recognition uses feature extraction as a pre-processing step that transforms the voice signal into a time-frequency representation. Three pre-processing methods were implemented for voice feature extraction: the Short-time Fourier Transform (STFT), the Discrete Wavelet Transform (DWT) and the Time-scaled Discrete Wavelet Transform (TSDWT).

The DWT can be constructed by convolving the signal with the impulse response of a filter, as expressed in (5.1), where ϕ(x) is the dilation reference equation mapping discrete signals from input to output states, S is a scaling factor set to 2, x is the time index, and wk consists of the scaling and wavelet functions obtained from each Mother Wavelet, known as the Quadrature Mirror Filter. Moreover, the DWT can be defined as a Filter Bank, as illustrated in figure 5.2.
In Filter Bank analysis, the length of the discrete signal is halved at each level. In order to preserve the information identity of the discrete signal, a time scale modification is proposed, named the Time-scaled Discrete Wavelet Transform (TSDWT), expressed as

\tilde{\phi}_{out}(x) = \begin{cases} \phi_{out}(x/2) & \text{if } x \text{ is even,} \\ \phi_{out}\!\left((x+1)/2\right) & \text{otherwise,} \end{cases} \qquad (6.1)

where \phi_{out}(x) is the dilation equation after evaluation of the discrete signal in (5.1) and \tilde{\phi}_{out}(x) is the time-scaled discrete signal. Compared with the STFT, the TSDWT offers superior temporal resolution of the low and high frequency components, as does the DWT. Likewise, the TSDWT can be defined as a Filter Bank, as shown in figure 6.1. The proposed model preserves the low frequency components, which usually carry the main characteristics or identity of a voice signal, as shown in figure 6.2. The graphs show the Wavelet coefficients as signal amplitudes in both time and frequency, using the CWT for the left-hand graph and the TSDWT for the right-hand graph; the CWT scales and the TSDWT levels both represent the frequency band.
6.2.2 Distributed Artificial Neural Network Model
The proposed voice recognition uses an ANN with multiple hidden layers between the input and output layers, built from the three previous network classes and two new ones, as shown in table 6.1.
The new network topology, named the nth-order Double-features-connecting topology and denoted Pn, is illustrated in figure 6.3, where xf is the input vector in each frequency band, calculated by the feature extraction model, and y is the class probability vector calculated by the ANN. On the one hand, the Hn model
[Figure: DWT filter bank with a multi-order time-scale stage at each level output.]
Fig. 6.1 The Proposed TSDWT Filter Bank representation.
[Figure: two time-scale plots over 0–0.4 s; CWT scales 1–120 on the left, TSDWT levels 1–7 on the right.]
Fig. 6.2 The comparison of CWT (left) and TSDWT (right).
Table 6.1 The Proposed Neuron Network Architectures
Class Name Control Input Control Output Transfer Function
A-class Single Single Log-sigmoid
B-class Multiple Single Log-sigmoid
C-class Single Single Softmax
D-class Double Single Log-sigmoid
E-class Double Single Softmax
utilizes A, B and C-class networks to construct a simple network topology from four conditions. First, the number of layers is defined by the order of Hn, where n > 0. Second, the number of input networks is related to the number of input time indices, i.e., the size of the scales vector for the CWT or the number of levels for the TSDWT. Third, each input network must connect its single output to the first block of the network series. Last, the input, middle and output networks are A, B and C-class networks, respectively, as shown in figure 5.4. On the other hand, the Pn model utilizes
A, D and E-class networks to construct a binary hierarchical network topology, which enables parallel computing and minimizes computation, from four conditions. First, the number of layers is defined by the order of Pn, where n > 0; moreover, the number of input networks is 2^n. Second, the number of input networks is related to the number of input time indices. Third, the layer networks must be connected in a binary tree structure. Last, the input, middle and output networks are defined as A, D and E-class networks, respectively.
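The four Pn conditions above can be sketched as a per-layer layout; the 2^n-leaf binary arrangement is an assumption read from figure 6.3, and per-subnetwork node counts are omitted.

```python
def pn_structure(n):
    """Sketch of the Pn binary-tree layout: 2**n A-class input networks
    feed a binary tree of D-class networks, with a single E-class softmax
    network at the root (an assumption based on figure 6.3)."""
    if n <= 0:
        raise ValueError("Pn requires n > 0")
    layers = [["A"] * (2 ** n)]            # input layer of A-class networks
    for k in range(n - 1, 0, -1):
        layers.append(["D"] * (2 ** k))    # middle layers halve at each step
    layers.append(["E"])                   # single E-class output network
    return layers

topology = pn_structure(3)                 # the P3 model used in the experiments
```

Because each D-class network only merges its two children, the subtrees can be evaluated independently, which is the source of the parallel-computing ability claimed above.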
[Figure: binary tree of eight A-class input networks feeding D-class networks across three layers, with a single E-class output network at the root.]
Fig. 6.3 Double-features-connecting topology.
6.3 Experiment Setup
The proposed voice recognition uses the same development tools as the previous experiments. Likewise, the samples come from five Japanese speakers: two young male speakers, two young female speakers and one middle-aged male speaker. Each speaker pronounces the reference words from the International Phonetic Alphabet (IPA) dataset described in table 5.2. Each voice input is recorded at an 8 kHz sampling frequency with 16-bit resolution and 8000 sample points. The feature set contains 450 elements, obtained from the reference words in the dataset repeated 5 times. For the performance evaluation, 20% of the set is used for testing and 80% for training.
6.4 Experimental Results
The performance evaluation was established in terms of the correct recognition rate, calculated from the sum of the true positive and true negative rates in each class. Moreover, the maximum noise power of interfering sounds was measured on the logarithmic scale defined in (5.4).
Table 6.2 The First Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method Time-scale Discrete Wavelet Transform (TSDWT)
Wavelet Level variable parameter
Wavelet Function Daubechies 16 (db16)
Network Topology variable parameter
Node Size in Each Layer variable parameter
Two word classification experiments were conducted in order to determine appropriate values of the proposed voice recognition parameters. The cost functions of the proposed voice recognition were clearly influenced by the Wavelet function, the Wavelet level and the ANN network topology. Hence, the first experiment was designed
Table 6.3 The First Experimental Results
Wavelet Level Network Topology Node Size in Each Layer Pnoise,dB Recognition Rate (%)
1 H1 {1000, 18} 0.00 90.89
2 H1 {1000, 18} 33.00 93.56
1 H2 {1000, 1000, 18} 0.00 90.44
2 H2 {1000, 1000, 18} 29.75 93.33
3 H2 {1000, 1000, 18} 39.75 94.00
4 H2 {1000, 1000, 18} 48.50 94.67
1 H3 {1000, 4000, 1000, 18} 0.00 90.22
2 H3 {1000, 4000, 1000, 18} 28.25 92.22
3 H3 {1000, 4000, 1000, 18} 38.25 94.89
4 H3 {1000, 4000, 1000, 18} 52.25 94.44
5 H3 {1000, 4000, 1000, 18} 54.00 94.00
6 H3 {1000, 4000, 1000, 18} 61.00 94.67
7 H3 {1000, 4000, 1000, 18} 60.00 95.33
8 H3 {1000, 4000, 1000, 18} 62.25 95.78
2 P1 {1000, 18} 36.75 93.78
4 P2 {1000, 500, 18} 52.75 94.67
8 P3 {1000, 1000, 500, 18} 65.50 94.00
to optimize the Wavelet level and the ANN network topology, using the set of static parameters shown in table 6.2. The candidates comprised two network topologies, the nth-order All-features and Double-features connecting topologies, with Wavelet levels from 1 to 8. Table 6.3 shows that the H3 model with Wavelet levels 4 to 8 and the P3 model achieved word classification with correct recognition rates greater than 94% and noise powers of interfering sounds greater than 50 dB. Hence, the H3 model with Wavelet level 6 and the P3 model were selected on three criteria: minimizing computation, enabling parallel computing, and covering the
region of interest in the male and female human speech frequencies, from 130 Hz to 3.5 kHz and from 250 Hz to 4 kHz, respectively. The H3 model with Wavelet level 6 gives the maximum values, with a correct recognition rate of 95.56% and a noise power of interfering sound of 67.5 dB.
The second experiment was designed to verify that Wavelet Transform feature extraction suits the voice recognition application better than the STFT, as shown in table 6.4 and table 6.5. The correct recognition rates and tolerated noise powers of interfering sounds of both the DWT and the TSDWT clearly exceed those of the STFT, as hypothesized. The TSDWT and the DWT employ multiresolution analysis, which offers higher accuracy in the low frequency band, depending on the Wavelet function and the length of the input signal. In addition, the TSDWT normalizes the priority of the frequency ranges by applying the time scale modification (6.1) to the original DWT (5.1).
Table 6.4 The Second Experimental Configuration
Parameter Name Value
Subject Word Classification
Feature Extraction Method variable parameter
Wavelet Level 6
Wavelet Function Symlet 7 (sym7)
STFT Windows Hamming
STFT Time Slot 1 millisecond
STFT Frequency Separation 8
Network Topology 3rd-order All-features-connecting topology (H3)
Node Size in Each Layer {1000, 4000, 1000, 18}
Table 6.5 The Second Experimental Results
Feature Extraction Method Pnoise,dB Recognition Rate (%)
Time-scale Discrete Wavelet Transform (TSDWT) 68.60 96.22
Discrete Wavelet Transform (DWT) 61.00 94.67
Short-time Fourier Transform (STFT) 0.00 88.67
6.5 Discussions
The optimized parameters for the two classification subjects are summarized in table 6.6. With the optimized parameters, the proposed voice recognition achieved a correct recognition rate of 95.56% and a tolerated noise power of 71.25 dB, which is sufficient for word classification. For gender classification, it achieved a correct recognition rate of 99.33% and a tolerated noise power of 62.5 dB.
In comparison with other models, i.e., the simple sound database named Pitch-Cluster-Maps (PCMs) [17] based on a Vector Quantization approach, the performance of the models was established in terms of Receiver Operating Characteristic (ROC) and Detection Error Tradeoff (DET) curves for gender classification. The best ROC performance requires the predicted data to approach the point (0, 1) on the false positive rate and true positive rate axes. The best DET performance requires the predicted data to approach the point (0, 0) on the false alarm probability and miss probability axes. The ROC graph of the proposed voice recognition shows a performance sufficient for gender classification, as shown in figure 6.4 (a). A comparison of the DET curves of the proposed voice recognition, named the Signal Clustering Neural Network (SCNN), with those of PCMs in figure 6.4 (b) shows that the accuracy of voice recognition is improved by SCNN.
On the one hand, PCMs offered a miss probability range from 2 to 20% and a false alarm probability range from 1 to 20% for female classification, which is appropriate for identifying the female speaker from several word utterances. Likewise, for male classification, PCMs offered a miss probability range from 2 to 12% and a false alarm probability range from 1 to 10%. On the other hand, SCNN offered a miss probability range from 0.5 to 15% and a false alarm probability range from 0.4 to 1% for female classification; the reduced false alarm rate leads to high accuracy in non-female classification. For the male subjects, SCNN offered a miss probability range from 0.3 to 1% and a false alarm probability range from 0.5 to 50%; the reduced miss rate leads to high accuracy in male classification. In contrast, the accuracy of non-male classification decreased because speech phase shifts occurred, and greater variation in the feature sets is required to train the proposed voice recognition effectively.
Table 6.6 The parameter optimization results.
Parameter Name               Word Classification        Gender Classification
Feature Extraction Method    TSDWT                      TSDWT
Wavelet Level                6                          6
Wavelet Function             db15                       db15
Network Topology             H3                         H3
Node Size in Each Layer      {1000, 4000, 1000, 18}     {1000, 4000, 1000, 2}
Pnoise,dB                    71.25                      62.50
Recognition Rate (%)         95.56                      99.33
[Figure: (a) ROC curves (hit rate vs. false alarm rate) and (b) DET curves (miss probability vs. false alarm probability) for the male and female subjects.]
(a) Receiver Operating Characteristic. (b) Detection Error Tradeoff curves.
Fig. 6.4 Performance Evaluation.
6.6 Conclusions
This chapter presented a new voice recognition method, named the Signal Clustering Neural Network, based on a simple Artificial Neural Network model with a single-channel microphone and Wavelet Transform feature extraction. The experimental results showed that the Wavelet Transform achieves high recognition rates of up to 95%, compared with Short-time Fourier Transform feature extraction, at noise levels of up to 70 dB, as found in normal conversation background noise. The performance evaluation was demonstrated in terms of the correct recognition rate, the maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
Chapter 7
Conclusions
This dissertation consisted of an introduction to human speech and speaker recognition, Wavelet theory, Neural Network theory, a paper on the implementation of an artificial neural network with a multilevel discrete wavelet transform for voice recognition, and a paper on reinforced voice recognition using a distributed artificial neural network with time-scale wavelet transform feature extraction, presented in chapters 2, 3, 4, 5 and 6, respectively.
First, chapter 2 presented a brief history of research in automatic speech and speaker recognition over the past 65 years, in order to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. Many techniques have been developed that are sufficient to exhibit robust recognition. However, many challenges have yet to be overcome before we can achieve the ultimate goal of creating machines that communicate naturally with humans; satisfactory performance under a broad range of operating conditions is required to reach that goal. This research focuses on improving feature extraction by using the Wavelet transform instead of the Fourier transform, for the reasons described in chapter 3. Speaker adaptation and speech understanding are improved by the well-known deep learning model, the Artificial Neural Network, described in chapter 4. The behavior of the proposed system can be characterized by the new spectrum of the input signal. This research aims to develop the ability to classify command input signals, improve the accuracy of voice command recognition and support the voice recognition of economical robots, as shown in chapters 5 and 6.
In chapter 3, the spectrogram, a visual representation of the spectrum of frequencies in a sound or other signal, was described, and the principal tools for analyzing the frequency components of a signal, the Fourier transform and the Short-time Fourier transform, were discussed. The Short-time Fourier transform uses a sliding window to calculate the spectrogram, which gives information in both time and frequency. Nonetheless, the length of the window limits the resolution in frequency, leading to uncertain information in each frequency band, as expressed by Heisenberg's uncertainty principle. To address this resolution limit, the Wavelet transform offers a solution. Wavelet transforms are based on small wavelets of limited duration: the translated wavelets locate the region of interest, whereas the scaled wavelets allow us to analyze the signal at different scales within the bounds of Heisenberg's uncertainty principle.
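The multilevel decomposition described above can be sketched in a few lines. A minimal illustration using the Haar wavelet for brevity (the thesis itself uses db15 at level 6), showing how each level halves the number of coefficients in time while narrowing the frequency band:

```python
def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: pairwise
    low-pass (approximation) and high-pass (detail) coefficients."""
    approx = [(signal[i] + signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def multilevel_dwt(signal, levels):
    """Multilevel decomposition: re-apply the transform to the
    approximation, halving the time resolution at each level."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details

# 16-sample toy signal, 3 decomposition levels
sig = [1.0, 2.0, 3.0, 4.0] * 4
a, ds = multilevel_dwt(sig, 3)
print([len(d) for d in ds], len(a))  # coefficient counts: [8, 4, 2] 2
```

Because the Haar transform is orthonormal, the signal energy is preserved across the levels; the detail coefficients at each level capture one octave of the frequency axis, which is the multiresolution behavior the chapter exploits for feature extraction.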
In chapter 4, an introduction to the fundamentals of neural networks, learning rules, network architectures, the mathematical analysis of these networks and their application to practical engineering problems was presented, motivated by the recognition that the brain computes in an entirely different way from the conventional digital computer. These models offer a potential alternative solution for problems such as nonlinear regression, pattern recognition, signal processing, data mining, control systems and real-world problems, including human voice recognition.
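As a minimal illustration of the kind of feedforward model that chapter discusses (a toy sketch with hypothetical layer sizes, not the SCNN architecture itself):

```python
import math
import random

def forward(x, layers):
    """Forward pass through fully connected layers with sigmoid
    activations; each layer is a (weights, biases) pair."""
    for weights, biases in layers:
        x = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

def make_layer(n_in, n_out, rng):
    """Small random initial weights, as used before gradient training."""
    return ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

rng = random.Random(0)
# toy topology: 4 inputs -> 8 hidden -> 3 outputs
net = [make_layer(4, 8, rng), make_layer(8, 3, rng)]
out = forward([0.1, 0.5, -0.2, 0.9], net)
print(len(out))  # 3 sigmoid outputs, each in (0, 1)
```

Training such a network adjusts the weights by a gradient rule (e.g. the scaled conjugate gradient method used in this work); the sketch above shows only the inference step.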
Chapter 5 presented an alternative voice recognition method using a combination of an Artificial Neural Network and a multilevel Discrete Wavelet Transform. The experimental results showed that Wavelet Transform feature extraction achieved high recognition rates, up to 95%, where Short-time Fourier Transform feature extraction did not, at noise levels up to 65 dB, comparable to normal conversation background noise. The performance evaluation was demonstrated in terms of correct recognition rate, maximum noise power of interfering sounds, hit rate, false alarm rate and miss rate. The proposed method offers a potential alternative for intelligent voice recognition systems in speech analysis-synthesis and recognition applications.
Finally, chapter 6 presented a new voice recognition method, named Signal Clustering Neural Network, built from a simple Artificial Neural Network model with a single-channel microphone and Wavelet Transform feature extraction. The experimental results showed that Wavelet Transform feature extraction achieved high recognition rates, up to 95%, where Short-time Fourier Transform feature extraction did not, at noise levels up to 70 dB, comparable to normal conversation background noise. The performance evaluation was demonstrated in terms of correct recognition rates, maximum noise power of interfering sounds, and Receiver Operating Characteristic and Detection Error Tradeoff curves. The proposed method offers a potential alternative for intelligent voice recognition systems in computational linguistics and speech-controlled robot applications.
Acknowledgement
Firstly, the author is most grateful to his advisor, Prof. Dr. Masahiro Fukumoto, for his valuable supervision, support and encouragement throughout the study, and for mentoring me over the course of my graduate studies. His insight led to the original proposal to examine the possibility of the Wavelet transform for feature extraction in voice recognition, and ultimately led to publication in the honorable Springer book series "Computer and Information Science", "Studies in Computational Intelligence 656", in 2016. He has helped me through extremely difficult times over the course of the analysis and the revising of the dissertation, and for that I sincerely thank him for his confidence in me. I would additionally like to thank Assoc. Prof. Shinichi Yoshida for his support in both the research and, especially, life in Japan. His knowledge and understanding of the machine learning field has allowed me to fully express the concepts behind this research. Grateful acknowledgements are also made to Prof. Dr. Toru Kurihara and the members of the dissertation committee for their valuable suggestions and comments.
The author wishes to acknowledge Prof. Lawrie Hunter and Prof. Paul Daniels for their valuable guidance in research writing, and Ms. Sonoko Fukudome and Ms. Kubo Mariko, members of the International Relations Center, for their administrative support and Japanese language instruction. The author also wishes to acknowledge Kochi University of Technology, the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and the Japan Student Services Organization (JASSO) for the great opportunity of financial support.
This research would not have been possible without the assistance of the laboratory members who constructed the experimental apparatus and built the foundations for the data analysis. Sincere appreciation is also extended to all Japanese friends in Kochi and to colleagues in the Signal Processing & New Generation Network Laboratory for their useful technical experience sharing and kind technical assistance. The author sincerely appreciates all of his Thai friends in Kochi for their friendship and goodwill.
Finally, I would like to thank my best friend, who is just like a brother; he provided me with the tools that I needed to choose the right direction and successfully complete my dissertation and research papers. Thank you to Mr. Chiramathe Nami, who shared this journey with me; without him, I could never have completed so many research papers. Also, I would like to thank Thomas J. Bergersen and Nick Phoenix of "Two Steps from Hell" for their immeasurably brave music during these two years of study at university. Thank you for making me braver when I have fallen in the dark. I would like to extend my deepest gratitude to my mother and my cute little brother, without whose love, support and understanding I could never have completed this master's degree. Thank you, Mom and brother, always and forever.