Isolated Word Speech Recognition
Using Fuzzy Neural Techniques
by
Hui Ping
A Thesis Submitted to the College of Graduate Studies and Research through the
Faculty of Engineering - Electrical and Computer Engineering in Partial Fulfillment of the Requirements for
the Degree of Master of Applied Science at the University of Windsor
Windsor, Ontario, Canada
1999
© 1999 Hui Ping
National Library of Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
Automatic speech recognition by machine is one of the most efficient methods for
man-machine communication. Because the speech waveform is nonlinear and variant, speech
recognition requires a lot of intelligence and fault tolerance in the pattern recognition
algorithms. Fuzzy neural techniques allow effective decisions in the presence of
uncertainty. Consequently, the objective of this thesis is to study fuzzy neural techniques
for application in speech recognition. Two methods are proposed for isolated word
recognition, using a fuzzy pattern matching technique and a fuzzy c-means clustering technique.
The algorithms are tested on two LPC-based speech features: line spectrum
frequencies and cepstral coefficients. It is shown that the fuzzy algorithm is an efficient
approach and can provide reliable and accurate recognition results.
Dedicated to my family
for their love and support
Acknowledgements
I would like to express my sincere gratitude to my thesis advisor, Dr. H. K. Kwan, for his
suggestions, guidance, support and encouragement throughout the course of this research
work. It has indeed been a privilege to work with him.

I wish to thank my department reader, Professor P. H. Alexander, and my external reader, Dr.
Liwu Li, for their valuable advice toward the fulfillment of the thesis work.

I would also like to thank all my friends in the iSPLab who have given me support during the
study and research: Tracy Li, Halima El-Khatib, Wayne Chiang, Walter Jin and Jie Zhang.
Table of Contents
Abstract
Dedication
Acknowledgements
Chapter 1 Introduction
1.1 Background
1.2 Applications of Speech Recognition Technology
1.3 Motivation for the Research
1.4 Organization of the Thesis
Chapter 2 Literature Survey on Speech Recognition
2.1 Introduction to Speech Sounds
2.1.1 Speech Production
2.1.2 Speech Perception
2.1.3 Speech Features
2.1.4 Representation of Speech Signal
2.2 Fundamental Speech Recognition Techniques
2.2.1 Classification of Speech Recognition
2.2.2 Difficulties in Speech Recognition
2.2.3 Speech Recognition Approaches
Chapter 3 Speech Feature Extraction
3.1 Linear Predictive Analysis
3.1.1 The LPC Model
3.1.2 LPC Processor for Speech Recognition
3.2 Line Spectrum Frequency
3.3 Cepstral Coefficients
Chapter 4 Fuzzy Neural Network for Speech Recognition
4.1 Fuzzy Logic
4.1.1 Background
4.1.2 Fuzzy Sets and Fuzzy Logic
4.1.3 Fuzzy System
4.2 Fuzzy Neural Networks
4.2.1 Neural Networks for Speech Recognition
4.2.2 Self Organizing Networks
4.2.3 Fuzzy Neural System
4.3 Fuzzy C-Means Clustering
4.3.1 Algorithm of FCM
4.3.2 An Example
4.3.3 Summary
Chapter 5 Fuzzy Speech Recognizer
5.1 Issues on Implementing a Fuzzy Speech Recognizer
5.2.1 Time Normalization
5.2.2 Template Training
5.2.3 Recognition Network
5.2 Speech Database
5.3 Simulations and Results
Chapter 6 Conclusions and Suggestion for Future Work
6.1 Conclusions
6.2 Suggestion for Future Work
Vita Auctoris
List of Abbreviations
ANN   Artificial Neural Network
ASR   Automatic Speech Recognition
FCM   Fuzzy C-Means
DTW   Dynamic Time Warping
FL    Fuzzy Logic
FLVQ  Fuzzy Learning Vector Quantization
FNN   Fuzzy Neural Network
HMM   Hidden Markov Model
LPC   Linear Predictive Coding
LSF   Line Spectrum Frequency
LVQ   Learning Vector Quantization
SOM   Self-Organizing Map
Figure 5.8: Speaker-dependent recognition rate using LSF with network 1 and network 2
Figure 5.9: Speaker-independent recognition rate with FCM and hard means
Figure 5.10: Speaker-independent recognition rate using LSF with network 1 and network 2
List of Tables
Table 2.1: Formant frequencies for eight vowels of Mid-West American English
Table 5.1: Recognition rate for speaker-dependent recognition
Table 5.2: Recognition rate for speaker-independent recognition
Chapter 1
Introduction
1.1 Background
Automatic speech recognition by machine has been a part of science fiction for many
years. The early attempts were made in the 1950s by various researchers. In 1952,
Davis, Biddulph and Balashek [27] designed the first isolated digit recognizer for a single
speaker at Bell Laboratories. This system used a simple pattern matching method
with templates for each of the digits. Matching was performed with two parameters: a
frequency cut based on separating the spectrum of the spoken digit into two bands, and a
fundamental frequency estimated by zero-crossing counting.

In 1961, Suzuki and Nakata [28] in Tokyo built a hardware vowel recognizer based on a
filter bank spectrum analyzer. In 1962, Sakai and Doshita of Kyoto University designed
a hardware phoneme recognizer. A hardware speech segmenter was used along with a
zero-crossing analysis of different segments of the input speech to provide the
recognition result.

Most of the above systems were implemented as electronic devices. However, speech
recognition did not attract much attention until the rise of digital computers.
The first computer-based speech recognition system was built in the early 1960s.
Denes and Matthews [29] introduced the concept of time normalization in speech pattern
matching. In 1968, the Russian researcher Vintsyuk [30] proposed the idea of dynamic
programming methods of time alignment for speech patterns with different lengths. The
essence of this idea, which is called DTW (dynamic time warping), is still widely used
in current commercial products.
The 1970s and 1980s were very active periods for speech recognition, with a series of
important milestones:
• Pattern recognition algorithms were applied to template-based isolated word
  recognition methods.
• Continuous speech from large vocabularies was understood based on the use of
  high-level knowledge to compensate for the errors in phonetic approaches.
• Speech analysis methods based on Linear Predictive Coding (LPC) were used instead
  of conventional methods such as the FFT and filter banks.
• Statistical modeling such as HMMs (Hidden Markov Models) was developed for
  continuous speech recognition.
• Neural networks (back propagation, learning vector quantization) with efficient
  learning algorithms were proposed for speech pattern matching.

In recent years, speech recognition technology has begun to enter the real world and
our daily life. More and more advanced algorithms have been adopted in this area. Fuzzy neural
techniques have also been applied to speech recognition, and this field is growing and
developing very fast.
1.2 Applications of Speech Recognition Technology
Currently, speech recognition systems are being developed for commercial applications.
One of the successful speech recognition systems is the Voice Recognition Call
Processing (VRCP) system from AT&T. VRCP has a five-word vocabulary and
automates operator-assisted calls. AT&T also has a system known as Voice Interactive
Phone (VIP), with seven spoken commands replacing the touch-tone codes. In this
system, 94% of users were comfortable with talking to the machine, and 84% of users
preferred the VIP system to the present system.

With computers becoming ever present in business, education, and government, there is a
tremendous market for faster, more efficient man-machine interfaces. In the future, we
will be intensively using voice as input along with the keyboard and mouse. Most
Windows or other GUI operating system-based applications will use speech recognition
to accept voice commands and convert voice into text.
A summary of speech technology application areas is listed below:
• Computer engineering: building a natural language interface to the computer
  operating system or application software.
• Program developers: use pre-recorded voice macros while developing a computer
  program.
• Telephone commerce (to replace touch-tone): telephone banking using voice
  commands; order placement using voice to record incoming order data for the
  customer service representatives.
• Telephony: hands-free dialing; connecting callers through a company switchboard
  without human intervention; placing calls through a 'virtual' operator.
• Physicians: record patient data; make records while doing observations or
  performing operations.
• Attorneys: use instead of secretaries; conduct online research.
1.3 Motivation of the Research
With so much convenience that speech recognition could bring to our life, there are
convincing reasons for researching and improving speech recognition technology.
However, achieving recognition is quite a difficult task. The complexity is due to the
number of speakers involved, the variability of utterances, the complexity of
languages, and the environmental conditions under which the speech recognition system
must operate.

The two main concerns in speech recognition are to improve the recognition accuracy
and the processing speed. Therefore, the motivation of this research is to provide a
reliable and efficient recognition method.

Before creating a general system to perform continuous recognition, this thesis deals with
isolated word recognition through the use of digital processing algorithms and the
application of fuzzy neural techniques. Because of the uncertainty of speech waveforms,
fuzzy neural techniques are recognized as an efficient way to handle this problem. The
objective of this thesis is to utilize fuzzy neural techniques in designing a speech
recognition system.
1.4 Organization of the Thesis
Chapter 2 presents a literature survey on speech recognition. It gives an
introduction to speech production and perception, speech signal features and fundamental
speech recognition methods.

Chapter 3 describes the algorithm for speech feature extraction, which is the first step in
the whole process of speech recognition. In this chapter, LPC analysis is discussed
and two different LPC-based parameters, line spectrum frequencies and cepstral
coefficients, are presented for use in speech recognition.

Chapter 4 presents the fuzzy logic and neural network theories for speech recognition.
The fuzzy c-means algorithm is introduced for clustering the word templates.

In Chapter 5, a template-based fuzzy speech recognizer is described. The chapter also includes the
recognition results and analysis.

Chapter 6 gives the conclusions and suggestions for future research.
Chapter 2
Literature Survey on Speech Recognition
2.1 Introduction to Speech Sounds
2.1.1 Speech Production
Speech sound is produced by a set of well-controlled movements of the various parts of the speech
apparatus. Figure 2.1 shows a schematic cross-section through the vocal tract of the
apparatus.

The vocal tract is the primary acoustic tube, which is the region of the mouth cavity
bounded by the vocal cords and the lips. As air is expelled from the lungs, the vocal
cords are tensed and then caused to vibrate by the airflow. The frequency of oscillation is
called the fundamental frequency, and it depends on the length, tension and mass of the
vocal cords. During this process, the shape of the vocal tube is changed by different
positions of the velum, tongue, jaw and lips [2]. The average length of the vocal tract for
an adult male is about 17 cm, and its cross-sectional area can vary in its outer section from 0
to about 20 cm². Therefore, the vocal tract, as an acoustic resonator, will determine
variable resonant frequencies by adjusting the shape and size of the vocal tract. The
resonant frequency is called the formant frequency or simply formant. The nasal tract is
an auxiliary acoustic tube that can be acoustically coupled with the vocal tract to produce
nasal sounds.

Figure 2.1: Schematic view of the human speech apparatus

Various speech sounds are produced not only by adjusting the shape of the vocal tract,
but also by changing the type of excitation. Besides the airflow from the lungs, the excitation could
come from some other sources: fricative excitation, plosive excitation and whispered
excitation [3].
2.1.2 Speech Perception
Just as the vocal system can produce speech sounds, the auditory system is capable of
detecting the change in air pressure of audible sounds [2]. Figure 2.2 shows a
cross-section diagram of the human ear. The ear consists of three parts: the outer ear, the middle
ear, and the inner ear [26]. The outer ear collects the sound waves and passes the air
pressure variations to the eardrum. The middle ear is an air-filled cavity, which serves as
a mechanical amplifier and transforms vibrations of the eardrum into oscillations of the
fluid-filled inner ear. The inner ear then converts the mechanical vibrations into electrical
potentials that go to the auditory nerve and the cortex.

The human ear is most sensitive to frequencies in the range from 1000 to JOOOHz. Most
speech information is covered within these frequencies. Experiments show that
human ears are largely phase insensitive. The basilar membrane is only deformed when
the stapes pushes on the oval window [1], thus very little information is available for the
brain to determine the waveform's phase. This fact could be applied to speech
recognition to reduce the amount of data in the encoded waveform.

Figure 2.2: Cross-section of the human ear (outer ear, middle ear, inner ear)
2.1.3 Speech Features
Speech recognition can be divided into two processes: feature extraction and pattern
recognition. Feature extraction is responsible for searching for the speech characteristics and
storing them for the second process, pattern recognition. In order to identify the speech
characteristics accurately and efficiently, it is necessary to investigate the features and
classifications of speech sounds.

Any natural language, including English, is based on a set of distinguishable and
mutually exclusive primary units, which are called phonemes. All the phonemes are
related to different articulatory gestures of a language.

There are several ways to classify speech sounds [1, 2]. According to the type of
excitation source of phonemes, speech sounds can be classified into the following
categories:
• Voiced sounds (/a/, /ci/) occur when air pressure pushes the vocal cords open and
  causes them to vibrate. The vibrating cords modulate the air stream from the lungs at
  a rate that could be as low as 60 times per second for some males to 500 times per
  second for children. The peak amplitude of a voiced sound is much higher than that of
  an unvoiced sound.
• Nasal sounds such as /m/, /n/ are also voiced. However, the nasal cavity is involved
  together with the vocal cavity during the utterance. Part of the airflow is diverted into
  the nasal tract by opening the velum.
• Fricatives are generated by exciting the vocal tract with turbulent flow created by
  airflow through a narrow constriction. For example, the sounds /f/, /s/ and /sh/ are
  fricatives.
• Voiced fricatives occur when the vocal tract is excited simultaneously by both
  turbulent flow and vocal vibration. The sounds /z/, /zh/ and /v/ belong to this
  category.
• Plosives are produced by exciting the vocal tract with a rapid release of pressure by
  the constrictions of lips or teeth. The plosives /t/, Ad are voiceless, while /W, /d/ are
  voiced.
• Affricate sounds are produced by gradually releasing a completely closed and
  pressurized vocal tract.
• Whispered sounds are excited by airflow rushing through a small triangular opening
  between the arytenoid cartilages at the rear of the nearly closed vocal folds.
For vowel sounds, because the vocal tract remains relatively stable, three or four
resonance frequencies (formants) can usually be detected from 0 to 3 kHz. Therefore, the
vowel sounds can be characterized by the first two formants, whereas the third and fourth
formants are less discriminative. Table 2.1 shows the first three mean formant
frequencies for eight vowels of Mid-West American English.

Table 2.1: Formant frequencies for eight vowels of Mid-West American English
(After Ladefoged, 1985 [31])
2.1.4 Representation of Speech Signal
A speech signal can be broken into several small components: phonemes, diphones,
syllables or words, where a phoneme is a minimal unit of speech sound. However, it is
practically difficult to identify an individual phoneme due to the overlapping of
phonemes. In automatic speech recognition, the isolated word is used as the minimum unit
because it is relatively easier to separate it within a sentence or phrase.

Speech is a slowly time-varying activity which can be simply graphically displayed by its
waveform. The waveform is created by air pressure controlled by the lungs, vocal tract,
tongue and mouth. However, the time domain representation is much less popular than
the frequency representations. This is because the human ears perform some type of
frequency analysis rather than time domain analysis during the auditory process, and it is
found that the human ear is much more sensitive to the magnitude spectrum than the
phase information of the speech signal.

Figure 2.3: Waveform of the sentence "please log in" (horizontal axis: time in seconds)
The most popular representation of a speech signal is the spectrogram, which is a
three-dimensional representation on the time-frequency domain. The introduction of the
spectrogram provided a way to display the time-varying spectral
characteristics of speech. An example of a spectrogram is shown in Figure 2.4. The
vertical axis represents frequency while the horizontal axis corresponds to time. The
darkness shows the signal energy at a certain time and frequency, and the location of dark
areas changes while the pronunciation moves from one vowel to another inside the
utterance. Therefore, the formant frequencies of the vocal tract show up as dark bands in
the diagram. For example, the first two dark bands in "please" are located around 300 Hz
and 1100 Hz, while they are 600 Hz and 1000 Hz in the word "log".
Figure 2.4: Spectrogram of the sentence "please log in" (horizontal axis: time in seconds)
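
The time-frequency picture described above can be sketched with a short-time Fourier analysis. The following is only an illustrative sketch, not the procedure used to produce the figures in this thesis: the function name `spectrogram`, the frame length, the hop size, the Hamming window and the synthetic two-tone test signal (standing in for the dark bands near 300 Hz and 1100 Hz, then 600 Hz and 1000 Hz) are all assumptions made for the example.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Short-time power spectrum in dB: rows are frames (time), columns are frequency bins."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    # Small constant avoids log(0) in silent bins.
    power_db = 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return times, freqs, power_db

# Synthetic stand-in for an utterance (hypothetical data, not real speech):
# two "formant-like" tones that change halfway through the signal.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate          # one second of samples
first = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 1100 * t)
second = np.sin(2 * np.pi * 600 * t) + np.sin(2 * np.pi * 1000 * t)
signal = np.where(t < 0.5, first, second)

times, freqs, power_db = spectrogram(signal, sample_rate)
peak_start = freqs[np.argmax(power_db[0])]        # strongest band in the first frame
peak_end = freqs[np.argmax(power_db[-1])]         # strongest band in the last frame
```

Plotting `power_db` with time on the horizontal axis and frequency on the vertical axis would give a display of the kind described in the text, with the high-energy bins appearing as dark bands.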
Generally, voiced regions are featured by a striated appearance due to the periodicity of
the waveform, while unvoiced regions are more evenly filled in. This phenomenon is
shown in Figure 2.5, which gives the waveform and spectrogram of the sentence "starting
to download". It is obvious that there are dark bands for voiced regions and that a lighter color
is distributed over unvoiced regions. This is consistent with the fact that only the
voiced sounds have formant frequencies.

Figure 2.5: Waveform and spectrogram of the sentence "starting to download" (horizontal axis: time in seconds)
2.2 Fundamental Speech Recognition Techniques
2.2.1 Classification of Speech Recognition
Automatic speech recognition can be classified into a number of different categories
depending on different issues:
1. The manner in which a user speaks. Usually there are three recognition modes based
   on the speaking manner:
   • Isolated word recognition: The user speaks individual words or phrases from a
     specified vocabulary. Isolated word recognition is suitable for command
     recognition.
   • Connected word recognition: The user speaks a fluent sequence of words with
     small spaces between words, in which each word is from a specified
     vocabulary (e.g., zip codes, phone numbers).
   • Continuous speech recognition: The speaker can speak fluently with a large
     vocabulary.
2. The number of users:
   • Speaker dependent: The users of a recognition system only consist of a single
     speaker or a set of known speakers.
   • Speaker independent: Arbitrary users will use the ASR system in this case.
   • Speaker adaptive: The system will customize its response to each individual
     speaker while it is in use by the speaker.
3. The size of the recognition vocabulary:
   • A small vocabulary system only provides recognition capability for a small
   • A large vocabulary system is capable of recognizing words among a
     vocabulary containing up to 1000 words.
4. The degree of dialogue between the human and the machine, including:
   • One-way communication in which each user spoken unit is acted upon.
   • System-driven dialog systems in which the system is the only initiator of a
     dialog, requesting information from the user via verbal input.
   • Natural dialogue systems in which the machine conducts a conversation with
     the speaker, solicits inputs, acts in response to user inputs, or even tries to
     clarify ambiguity in the conversation.
2.2.2 Difficulties in Speech Recognition
Because the speech waveform is nonlinear and dynamic, speech recognition is an inherently
difficult task. There are several main sources of variability in the speech signal, including
within-speaker variability, across-speaker variability, transducer and transmission variability,
language complexity, and the environmental conditions under which a speaker is talking.

Within-speaker variability is caused by inconsistent pronunciation, speaking speed and
different emotions when the words or phrases are spoken by the same speaker.

Across-speaker variability is due to physiological differences, regional accents,
foreign languages, etc. The physiological correlates are associated with the size and
configuration of the components of the vocal tract of each individual. The variations in
the vocal tract can cause different resonance frequencies (formants) and pitch frequencies
for the same words.

Transducer and transmission variability arises because the words are spoken over different
microphones/handsets and the speech signal may be transmitted by all kinds of
communication systems (telecommunication networks, cellular phones, etc.), in which
unexpected noises are introduced into the signal.

Language complexity makes speech recognition an extremely difficult job. So far, the
task of speech recognizers has been simplified by limiting the number of possible utterances through
the imposition of semantic constraints. On the other hand, we should respect the
multidisciplinary nature of the speech signal and adapt to the language complexity, because
speech is a completely natural activity of human beings.

Environmental condition is also a main concern of speech recognizers, since real
applications are usually conducted in adverse conditions which may drastically degrade
the system performance. Therefore, it is necessary to present robust recognition methods
for dealing with reasonable noise or distortions of the speech signal.
2.2.3 Speech Recognition Approaches
2.2.3.1 Acoustic-Phonetic Approach
The earliest approaches to speech recognition were based on the theory of acoustic
phonetics, to find speech sounds and provide phonetic labels for these
sounds. The finite, distinctive phonetic units that exist in spoken language can be
characterized by a set of acoustic properties which are manifest in the speech signal over
time. The first step in the acoustic-phonetic approach is to segment the speech signal into
stable acoustic regions and label them, followed by adding one or more phonetic labels.
The second step is to determine a valid word from the phonetic label sequences produced by
the first step. Because of the difficulty of getting a reliable phoneme lattice in step one, the
acoustic-phonetic approach has not been widely used for most commercial applications.
2.2.3.2 Pattern Matching Approach
The pattern matching approach is based on pattern recognition algorithms that require
pattern templates before recognition [2]. It has two steps: pattern training and pattern
comparison (Figure 2.6). Pattern training is responsible for establishing a consistent
speech pattern representation for a set of known training samples. There are several
methods for training, such as statistical models (e.g., the hidden Markov model) and
clustering training (learning vector quantization, fuzzy c-means clustering). The second
step, pattern comparison, compares the unknown speech with each template and
determines its identity by the matching algorithms.
Figure 2.6: Block diagram of the pattern recognition recognizer (blocks: speech analysis, pattern matching, decision)
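
The two steps of the pattern matching approach can be illustrated with a deliberately simple sketch. It assumes that each utterance has already been reduced to a fixed-length feature vector, that a template is just the mean of the training vectors for a word, and that comparison uses plain Euclidean distance. These are simplifications chosen for the example and are not the fuzzy methods developed later in this thesis; the function names `train_templates` and `recognize` and the toy feature values are hypothetical.

```python
import numpy as np

def train_templates(training_set):
    """Pattern training: average the feature vectors of each word into one template."""
    return {word: np.mean(np.asarray(samples, dtype=float), axis=0)
            for word, samples in training_set.items()}

def recognize(features, templates):
    """Pattern comparison: return the word whose template is closest to the input."""
    features = np.asarray(features, dtype=float)
    distances = {word: float(np.linalg.norm(features - template))
                 for word, template in templates.items()}
    return min(distances, key=distances.get)

# Toy feature vectors (hypothetical values, not taken from any speech database).
training_set = {
    "yes": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.0]],
    "no": [[0.1, 0.8, 0.9], [0.2, 1.0, 0.8]],
}
templates = train_templates(training_set)
result = recognize([0.8, 0.25, 0.1], templates)
```

In a real recognizer the feature vectors would come from the speech analysis block, and the distance measure and training rule would be replaced by the chosen matching algorithm.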
2.2.3.3 Computational Intelligence Approach
The computational intelligence approach is a hybrid of the acoustic-phonetic
approach and the pattern matching approach. Generally a neural network is applied to
integrate the knowledge of speech for segmentation and labeling, and intelligent tools are
used for learning the relationship among phonetic events. This method has proved
to be a very promising area for speech recognition and has been widely used in commercial
applications.
Chapter 3
Speech Feature Extraction
3.1 Linear Predictive Analysis
Linear predictive analysis has been one of the most powerful speech analysis techniques
since it was introduced in the early 1970s. Primarily it is a time-domain coding method
for low bit rate speech storage and transmission, but it can also provide
frequency-domain parameters (such as formant frequency and bandwidth) on the time basis
of the speech signal. In speech recognition, these parameters can serve as a
representation of the speech characteristics.
For speech recognition, linear predictive coding (LPC) has several advantages over other
techniques, including:
• LPC provides accurate estimates of the speech spectrum envelope.
• It can be used to separate the excitation source properties of pitch and amplitude
  from the vocal tract filter, which controls the phoneme articulation and is directly
  related to the produced speech sounds.
• LPC is easy to implement in either software or hardware because it is
  mathematically precise, simple and straightforward.
• The LPC algorithm is computationally efficient. LPC requires much less
  computation than other techniques such as the fast Fourier transform or the
  filter bank model.
3.1.1 The LPC Model
The LPC is a model based on the human vocal tract [4]. The basic idea of the LPC
model is that a speech sample x(n) can be predicted by a linear combination of several
past speech samples:
$$x'(n) = a_1 x(n-1) + a_2 x(n-2) + \cdots + a_p x(n-p) \tag{3.1}$$
where $a_1, a_2, \ldots, a_p$ are called the linear predictive coefficients. They are chosen
to minimize the prediction error between the actual signal and the predicted value of the
sample, which is:
$$e(n) = x(n) - x'(n) \tag{3.2}$$
Although the speech signal is nonlinear and highly variable, the speech waveform over a
short period of time (around 10 to 30 msec) remains roughly invariant. Therefore,
the LPC coefficients can be recalculated to minimize the mean squared prediction error
over each short frame of the speech waveform, with each frame segmented to a length of
around 10 to 30 msec.
Transforming the prediction error in equation 3.2 from the time domain into the z-domain gives:
$$E(z) = X(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] \tag{3.3}$$
Therefore, the transfer function between the speech sample and the prediction error can
be written as:
$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{3.4}$$
When the LPC model is applied to a speech signal, the prediction error E(z) can be
identified as the impulsive excitation of the vocal tract, while the all-pole system H(z)
represents the vocal tract model. This is how linear prediction separates the
excitation properties of the source from the vocal tract filter: the source parameters are
derived from the prediction error, and the vocal tract filter is characterized by the linear
predictive coefficients. Based on analysis experiments, the excitation source is
essentially a quasi-periodic pulse train for voiced speech signals and a random noise
signal for unvoiced sounds.
A speech synthesis model based on the LPC model is shown in Figure 3.1. The normalized
excitation signal u(n) is set to be either a quasi-periodic impulse train or a random noise
signal (depending on the voiced/unvoiced determination). The appropriate gain G of the
source is estimated from the signal, and the scaled source is fed as input to a digital
filter that represents the vocal tract model.
Figure 3.1 Block diagram of LPC-based speech synthesis model (blocks: Impulse Train Generator
driven by the pitch period, Random Noise Generator, Voiced/Unvoiced Switch, gain G,
Time-Varying Digital Filter driven by the LPC coefficients)
There are three basic algorithms for computing the LPC coefficients that minimize
the prediction error over a speech frame [3]:
• The autocorrelation method
• The covariance method
• The lattice method
Among these three methods, the autocorrelation method is the most commonly used
for linear predictive analysis. The autocorrelation coefficients of the N speech samples
in a frame are given by:
$$r(m) = \sum_{n=0}^{N-1-m} x(n)\, x(n+m), \quad m = 0, 1, \ldots, p \tag{3.5}$$
Then the linear prediction coefficients can be computed using the Durbin-Levinson
recursive algorithm as shown below [2]:
$$\begin{aligned}
E^{(0)} &= r(0) \\
k_i &= \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p \\
a_i^{(i)} &= k_i \\
a_j^{(i)} &= a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1 \\
E^{(i)} &= (1 - k_i^2)\, E^{(i-1)}
\end{aligned} \tag{3.6}$$
where the final LPC coefficients are given by $a_m = a_m^{(p)}$, for $1 \le m \le p$.
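As a minimal sketch of the autocorrelation method, the following Python code computes the autocorrelation coefficients of one frame and then runs the Durbin-Levinson recursion. It assumes NumPy, the predictor sign convention of equation 3.1, and a frame with nonzero energy r(0). The function names `autocorrelation` and `levinson_durbin` are illustrative and do not come from the thesis.

```python
import numpy as np

def autocorrelation(frame, p):
    """r(m) = sum_n x(n) x(n+m) for m = 0..p over one frame."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - m], x[m:]) for m in range(p + 1)])

def levinson_durbin(r, p):
    """Return (a_1..a_p, final prediction error) for autocorrelation r(0..p)."""
    a = np.zeros(p + 1)      # a[0] is unused so that indices match the text
    err = r[0]               # E^(0) = r(0)
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        # a_j^(i) = a_j^(i-1) - k_i * a_(i-j)^(i-1) for j = 1..i-1
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k * k   # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], err
```

For the autocorrelation sequence r = [1, 0.5, 0.25] of a first-order process, this sketch yields a_1 = 0.5 and a_2 = 0, as expected for a single-pole model.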
In speech recognition, the LPC model is used as a characteristic model for a speech
signal. Figures 3.2 and 3.3 compare the original speech power spectrum with the
magnitude spectrum of the LPC model. LPC clearly provides a good approximation to the
vocal tract spectral envelope, which carries the formant frequency and magnitude
information used for speech recognition.
Figure 3.2 (a) Original power spectrum (b) Magnitude of LPC model for phoneme /i/ (horizontal axis: frequency in Hz)
Figure 3.3 (a) Original power spectrum (b) Magnitude of LPC model for phoneme /o/
3.1.2 LPC Processor for Speech Recognition
The LPC technique is used to build a front-end processor that processes the speech
signal s(n) for a speech recognition system, as shown in Figure 3.4.
Figure 3.4 Block diagram of LPC processor
The LPC processor includes the following basic steps:
1. Preemphasis: A low-order system is applied to the speech signal to spectrally
   flatten it and to make it less susceptible to finite precision effects in later
   signal processing. The most widely used preemphasis filter is a first-order system:
   $$H(z) = 1 - a z^{-1}, \quad 0.85 \le a \le 1 \tag{3.7}$$
   where the parameter a is usually set to 0.95. After applying this filter, the output
   s'(n) and the input s(n) have the following relationship in the time domain:
   $$s'(n) = s(n) - a\, s(n-1) \tag{3.8}$$
2. Frame Blocking: The preemphasized speech signal s'(n) is segmented into small
   frames of N samples each. Adjacent frames overlap by M samples to prevent
   spectral discontinuity after blocking.
3. Windowing: After blocking, a window is applied to each frame to minimize the
   spectral discontinuities at the beginning and the end of the frame:
   $$\tilde{x}_l(n) = w(n)\, x_l(n), \quad 0 \le n \le N-1 \tag{3.9}$$
   A typical window is the Hamming window:
   $$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3.10}$$
4. LPC analysis: For each frame, the LPC coefficients are calculated according to the
   recursive equation 3.6.
5. LPC parameter conversion: In general, direct quantization and application of LPC
   coefficients is inefficient and unreliable because the LPC coefficients have too
   large a dynamic range, and a small quantization error could make the entire filter
   unstable and inaccurate. Because of this weakness, other related coefficients
   are considered, such as the reflection coefficients, cepstral coefficients and line
   spectral frequencies (LSFs). In this thesis, line spectral frequencies and cepstral
   coefficients are used as the extracted speech features. These two parameters are
   described in the following sections.
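The first three steps of the front end (preemphasis, frame blocking, windowing) can be sketched in Python as follows. This is an illustration under assumptions, not the thesis implementation: it uses NumPy, passes the first sample through the preemphasis filter unchanged, drops any trailing samples that do not fill a whole frame, and uses hypothetical function names.

```python
import numpy as np

def preemphasize(s, a=0.95):
    """s'(n) = s(n) - a*s(n-1); the first sample is passed through unchanged."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - a * s[:-1])

def block_frames(s, frame_len, overlap):
    """Split s into frames of frame_len samples; adjacent frames share `overlap` samples."""
    s = np.asarray(s, dtype=float)
    step = frame_len - overlap
    count = 1 + (len(s) - frame_len) // step   # trailing partial frame is dropped
    return np.stack([s[i * step:i * step + frame_len] for i in range(count)])

def window_frames(frames):
    """Multiply each frame by a Hamming window of the same length."""
    return frames * np.hamming(frames.shape[1])
```

Each row of the windowed result is one frame, ready for the autocorrelation analysis of step 4.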
3.2 Line Spectrum Frequency
The line spectrum frequency (LSF) was first proposed by Itakura in 1975 [6] as an alternative
parametric representation of the LPC model. In the context of speech coding, LSF has
been shown to have better quantization and interpolation properties than other
representations of the LPC model such as the reflection coefficients and log area ratios. Also,
a number of researchers have shown that a speech recognition system can benefit from
these advantages of LSF [7, 8, 9, 10].
Algorithm and Properties of LSF
In the LPC analysis of speech, a speech frame is modeled by an all-pole filter
H(z) = 1/A(z) of order p, where A(z) is the inverse filter given by:
$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{3.11}$$
The LSFs are obtained by mapping the p zeros of A(z) onto the unit circle through a
pair of (p+1)-order polynomials P(z) and Q(z):
$$P(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{3.12}$$
$$Q(z) = A(z) + z^{-(p+1)} A(z^{-1}) \tag{3.13}$$
These polynomials can be shown to have some interesting properties. First, all
the zeros of P(z) and Q(z) lie on the unit circle and are interlaced with each other.
Second, the frequencies tend to cluster near the formant frequencies: when a P(z)
frequency and a Q(z) frequency are close, it is likely that the original A(z) zero was
close to the unit circle, and a formant frequency is likely to be located between the
corresponding frequency pair. Also, the closer the pair, the sharper the formant. Thus, the
LSFs can be used as frequency features for speech recognition systems. Figures 3.5 and 3.6
show the spectrum and LSFs of two phonemes, which demonstrate the above two
properties. Figures 3.7 to 3.10 give the LSF plots of four isolated words.
LSFs have attracted much interest because they are good representations of LP systems,
and typically result in quantizers that either give a better representation or use fewer bits
for an equivalent representation than reflection coefficient quantizers.
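A small Python sketch of the LSF computation is given below. It assumes NumPy, the sign convention A(z) = 1 - sum a_k z^-k, and the definitions P(z) = A(z) - z^-(p+1) A(z^-1) and Q(z) = A(z) + z^-(p+1) A(z^-1). It finds the zeros with `numpy.roots`, which is a simple choice for illustration and not necessarily the method used in the thesis. The function name `lpc_to_lsf` is hypothetical.

```python
import numpy as np

def lpc_to_lsf(a):
    """Return the p line spectral frequencies (radians, ascending) for a_1..a_p."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))  # coefficients of A(z)
    A_ext = np.append(A, 0.0)          # A(z) padded to order p+1
    A_rev = np.append(0.0, A[::-1])    # z^-(p+1) * A(z^-1)
    P = A_ext - A_rev
    Q = A_ext + A_rev
    angles = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    eps = 1e-6
    # keep one zero of each conjugate pair; drop the trivial zeros at z = 1 and z = -1
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])
```

For a stable model the returned frequencies alternate between zeros of Q(z) and P(z), which is the interlacing property described above. Multiplying by the sampling rate divided by 2π converts them to Hz.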
Figure 3.5 LSF and LP poles in the z-plane of phoneme /o/ (upper panel: spectrum against frequency in Hz; lower panel: LP poles and the zeros of P(z) and Q(z) in the z-plane)
Figure 3.6 LSF and LP poles in the z-plane of phoneme /i/
Figure 3.7 LSF of word "no" (axes: frame, order of LSF)
Figure 3.8 LSF of word "call"
Figure 3.9 LSF of word "hangup"
Figure 3.10 LSF of word "Halima"
3.3 Cepstral Coefficients
Cepstral coefficients have proved to be another efficient and robust feature set for
speech recognition. Originally, the cepstrum of a speech signal x(n) is defined as the
Fourier transform of the logarithm of the magnitude of the spectrum $X(e^{j\omega})$:
$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right| e^{j\omega n}\, d\omega \tag{3.14}$$
Based on the LPC model, by using the smoothed LPC magnitude as the magnitude
spectrum, the cepstral coefficients can be derived directly from the LPC coefficient set with
the recursive formula:
$$c_m = \begin{cases} a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, & 1 \le m \le p \\ \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, & m > p \end{cases} \tag{3.15}$$
Properties of Cepstral Coefficients
To make proper use of the cepstral coefficients for speech recognition, it is necessary to
know their properties:
• Most of the information in the speech signal is represented by the lower-numbered
  cepstral coefficients, and the first p coefficients uniquely determine the all-pole
  filter of the LPC model.
• The cepstrum is a decaying sequence. Under regular conditions, the variances of the
  coefficients (except $c_0$) are essentially inversely proportional to the square of the
  coefficient index (Figures 3.11 to 3.14).
Because the cepstrum has infinitely many coefficients, only the first 10 to 30 coefficients are
taken to represent the speech features, based on the above properties.
Weighted Cepstral Coefficients
The variance of the cepstral coefficients is inversely proportional to the square of the
coefficient index as follows [2]:
$$E\{c_m^2\} \sim \frac{1}{m^2} \tag{3.16}$$
The cepstral coefficients can be normalized with the index m to balance the contribution
from each cepstral coefficient. The weighted coefficients then become:
$$\hat{c}_m = m\, c_m, \quad 1 \le m \le L \tag{3.17}$$
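The LPC-to-cepstrum recursion and the index weighting of equation 3.17 can be sketched in Python as follows. The sketch assumes NumPy and the predictor sign convention of equation 3.1, uses the standard recursion for an all-pole model, and omits the gain term c_0. The function names are illustrative only.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1/A(z), with A(z) = 1 - sum a_k z^-k."""
    p = len(a)
    c = np.zeros(n_ceps + 1)     # c[0] (gain term) is not computed here
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        # only terms with 1 <= m-k <= p contribute
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

def weight_cepstrum(c):
    """Index weighting: the m-th coefficient is multiplied by m."""
    return c * np.arange(1, len(c) + 1)
```

For a single-pole model with a_1 = 0.5 the cepstrum is 0.5^m / m, so the weighted coefficients reduce to 0.5^m, which shows how the weighting compensates the 1/m decay.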
A more complicated weighting function can also be applied to de-emphasize the
coefficients around m = 1 and m = L.
Figure 3.11 Cepstral coefficients of word "no" (axes: frame, order of cepstrum)
Figure 3.12 Cepstral coefficients of word "call"
Figure 3.13 Cepstral coefficients of word "hangup"
Figure 3.14 Cepstral coefficients of word "halima"
Chapter 4
Fuzzy Neural Network for Speech Recognition
4.1 Fuzzy Logic
4.1.1 Background
Fuzzy sets were introduced by Zadeh [32] in 1965 as a new way to represent and
manipulate data with uncertainty and fuzziness. In the old paradigm, fuzziness was
considered undesirable because of the expectation of scientific precision and accuracy.
However, a fuzzy interpretation of data is a natural and intuitively plausible way to
formulate and solve many problems in everyday life. For example, expressions
with uncertainty like "hot coffee", "heavy objects", and "warm weather" are fuzzy
interpretations.
Although both fuzzy sets and statistical theory can deal with uncertainty, fuzzy sets
differ from statistical models in important ways. Probabilities represent the
likelihood of a certain event with a distribution among all the events, while a fuzzy set
represents the applicability of an element to the set. In other words, fuzziness
captures the kind of uncertainty found in the meanings of many words in
human thinking.
Today, we have witnessed a rapid growth in the variety of applications of fuzzy logic. The
applications range from consumer products such as washing machines, cameras,
camcorders, and microwave ovens to industrial process control, medical instrumentation,
pattern recognition, decision-support systems, and portfolio selection.
Communication by speech is a natural human activity and involves a great deal of
uncertainty in both the speech production and recognition processes. The application
of fuzzy logic to speech recognition in effect simulates the way people understand
each other every day. The reasons why fuzzy logic can be applied to speech recognition
are as follows:
• Fuzzy logic is conceptually easy to understand. The mathematical concepts
  behind fuzzy reasoning are very simple. What makes fuzzy logic attractive is the
  "naturalness" of its approach and not its far-reaching complexity.
• Fuzzy logic is flexible and tolerant of imprecise data. Everything is imprecise
  if you look closely enough, but more than that, most things are imprecise even
  under careful inspection.
• Fuzzy logic can model nonlinear functions of arbitrary complexity. A fuzzy
  system can be created to match any set of input-output data. This process is made
  particularly easy by adaptive techniques like ANFIS (Adaptive Neuro-Fuzzy
  Inference Systems).
• Fuzzy logic is based on natural language. The basis for fuzzy logic is the basis for
  human communication. This observation underpins many of the other statements
  about fuzzy logic. Natural language, which is used by ordinary people on a daily
  basis, has been shaped by thousands of years of human history to be convenient
  and efficient. Sentences written in ordinary language represent a triumph of
  efficient communication.
Fuzzy sets are a superset of classical sets. In a fuzzy set, each element is associated with
a real value in the closed unit interval [0, 1] that represents the degree of membership
of the element. In classical crisp sets, however, every element can only be classified as "0"
or "1". When all elements in a set have either complete membership or complete
non-membership, the fuzzy set reduces to a crisp set.
Suppose a fuzzy set A is a subset of the space X that admits partial membership. It is
defined as the set of ordered pairs $A = \{(x, m_A(x))\}$, where $x \in X$ and $0 \le m_A(x) \le 1$. Every fuzzy
set consists of three parts: a horizontal axis x specifying the population of the set; a
vertical membership axis $m_A(x)$ specifying the membership degree of each element;
and the surface itself, which provides a one-to-one connection between the elements and their
corresponding membership degrees.
For example, let a fuzzy set represent the concept of "tall" for women over 20. Women
5 feet or shorter have no membership in the set "tall", while women over 6 feet
have total membership. To determine the membership for a specific height, the height is
first found on the horizontal axis; then, following the membership function, the
membership value is read from the vertical axis. Figure 4.1 illustrates this
example for the fuzzy set "tall", in which membership for heights between 5 feet and
6 feet is proportionally distributed.
Figure 4.1 Membership function of fuzzy set for the concept "tall" (horizontal axis: x in feet, with ticks at 5 and 6)
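The membership function of this example can be written out as a short Python sketch. It assumes a linear rise between 5 and 6 feet, which is one reading of "proportionally distributed"; the function name `tall_membership` is illustrative.

```python
def tall_membership(height_ft):
    """Degree of membership in "tall": 0 up to 5 ft, 1 from 6 ft, linear in between."""
    if height_ft <= 5.0:
        return 0.0
    if height_ft >= 6.0:
        return 1.0
    return height_ft - 5.0
```

Under this assumption a height of 5.5 feet has a membership degree of 0.5 in the set "tall".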
The idea of fuzzy sets representing a concept can be further expanded by linguistic
variables. A linguistic variable is assigned to a fuzzy region consisting of a set of fuzzy
sets. Figure 4.2 shows an example, expanded from the example in Figure 4.1, for the
concept "height". The variable consists of three fuzzy sets: short, medium and tall. The
horizontal axis specifies the base variable of height, and the degree of membership in
each fuzzy set is given by the vertical axis.
Figure 4.2 A linguistic variable of "height" (horizontal axis: x in feet)
Fuzzy systems use fuzzy set theory to deal with fuzzy or non-fuzzy information.
Generally, a fuzzy system consists of a fuzzification subsystem, a fuzzy inference engine,
a fuzzy rule base and a defuzzifier, as shown in Figure 4.3. The fuzzy rule base and fuzzy
inference engine are the core of the fuzzy-rule-based system. A fuzzy rule base can be
expressed as a set of fuzzy inference rules of the form "IF x is A THEN y is B" [19],
[20]. The inference engine then implements a fuzzy inference algorithm to determine the
fuzzy output from the inference rules and the inputs.
Note that a given input may simultaneously be a member of more than one set within a
single fuzzy region. The inference engine interacts with the rule base and uses the inputs
to determine which rules are applicable. The outputs are a set of fuzzy sets defined on
the universe of possible outputs, which are then defuzzified to generate crisp outputs.
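The flow from fuzzification through inference to defuzzification can be illustrated with a deliberately simplified Python sketch. Everything specific in it is an assumption made for illustration: triangular input membership functions, rules whose consequents are single output values instead of full fuzzy sets, and weighted-average defuzzification. The names `triangle` and `infer` are hypothetical.

```python
def triangle(x, left, peak, right):
    """Triangular membership function with support (left, right) and maximum at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def infer(x, rules):
    """rules: list of ((left, peak, right), output_value) pairs, one per IF-THEN rule.

    Each rule fires to the degree that x belongs to its input set; the crisp
    output is the firing-strength-weighted average of the rule outputs.
    """
    fired = [(triangle(x, *in_set), out) for in_set, out in rules]
    total = sum(strength for strength, _ in fired)
    if total == 0:
        return None          # no rule is applicable to this input
    return sum(strength * out for strength, out in fired) / total
```

With two overlapping input sets, an input that lies between their peaks fires both rules at once, and the output is a blend of the two rule consequents.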
Figure 4.3 A typical fuzzy rule based system (blocks: Fuzzification Subsystem, Inference Engine, Fuzzy Rule Base, Defuzzification System)
4.2 Fuzzy Neural Networks
4.2.1 Neural Network for Speech Recognition
Traditional methods for speech recognition include Hidden Markov Models (HMM) and
Dynamic Time Warping (DTW). HMM is a stochastic approach that represents the
system with a number of states and calculates the probability of moving from one state to
another depending on the input to the system. DTW uses dynamic programming to adjust
the test pattern so that it conforms more closely to a number of templates. Recently, Artificial
Neural Networks (ANNs) have become more and more popular for speech recognition.
Artificial neural networks were first proposed in the 1940s. However, interest in this field
increased in the early 1980s. The advantages of neural networks include massively
parallel processing at high speed, robustness in complicated environments, learning
ability, fault tolerance and the ability to process incomplete data. All of these make
neural networks a very powerful approach for processing speech information. Neural
network methods are also referred to as parallel distributed processing or connectionist
approaches.
The discipline of neural networks has grown rapidly in recent years. Many researchers
have successfully applied neural networks in fields such as speech
recognition, image pattern recognition, sonar and radar signal processing and adaptive
control systems [5].
The use of neural network models is motivated by models of the neural systems of living
organisms, which are composed of a large number of neurons and act in a very complicated
way. The basic processing unit of a neural system is called a neuron (Figure 4.4). A
neuron consists of three parts:
• Dendrite: receives impulses from other neurons
• Cell body (soma): receives a series of impulses, which increases the probability
  that an impulse will be triggered by the cell
• Axon: carries the impulses from the cell body to the next neuron
Figure 4.5 gives an example of a typical processing unit for an artificial neural network.
Each neuron has a number of inputs and an output. Similarly to a neuron of a living
organism, the processing unit receives multiple inputs and, after applying a certain
function f, sends out the calculated result as its output, like a natural neuron being
triggered by input impulses. The output of a neuron may be passed to other neurons or
recorded as one of the outputs of the system.
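A single processing unit of this kind can be sketched in a few lines of Python. The text only says that the unit applies a function f to its inputs; the weighted sum, the bias term and the logistic form of f used here are common choices assumed for illustration.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a logistic activation f."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))
```

The output lies between 0 and 1 and can be fed to other units as one of their inputs.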
Figure 4.4 A biological neuron (labelled parts: dendrite, soma)
Figure 4.5 A model of artificial neuron
4.2.2 Self-Organizing Networks
Kohonen describes a speaker-adaptive system using an unsupervised learning algorithm
[14]. Kohonen's self-organizing map (SOM) networks are designed to learn relations in
an unsupervised manner. After training, the network is able to group similar inputs
together in the output layer.
As the SOM is unsupervised, its performance may be improved using a supervised
training method called learning vector quantization (LVQ) [15]. The main difference is
that LVQ is concerned with finding good category boundaries, while the SOM
focuses on finding reference vectors that are centroids of the input vectors. There are
three types of LVQ: LVQ1, LVQ2 and LVQ3 [14]. In LVQ, the input data must be
labeled and the outputs are divided into different classes. The learning rule is based on
moving the winning weight vector toward the corresponding input vector. Eventually,
the weight vectors become close representations of the input vectors.
These weight vectors form a trained weight matrix called the codebook.
The architecture of an LVQ neural network is shown in Figure 4.6. Since the aim
of the LVQ algorithm is to find the output unit that is closest to the input vector, the
vectors in the codebook are adjusted according to the input vector. If the input vector x and the
winning reference vector w_j belong to the same class, the weights are moved toward the
input vector; if x and w_j belong to different classes, the weights are moved away from
the input vector. The algorithm is summarized as:
(1) Initialize the codebook vectors and the learning rate α(0).
(2) For each training input vector x, find the winner w_j such that ||x - w_j|| is minimum.
(3) Update w_j as follows:
    if x and w_j belong to the same class, then
        w_j(new) = w_j(old) + α [x - w_j(old)];
    if x and w_j belong to different classes, then
        w_j(new) = w_j(old) - α [x - w_j(old)].
(4) Reduce the learning rate.
(5) If the stopping condition is not satisfied, repeat steps 2 to 4; otherwise stop.
In step 1, the codebook vectors can be initialized either by taking the first m training
vectors or by using vectors with random values.
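The five steps above can be sketched in Python as follows. The sketch assumes NumPy, a geometric decay of the learning rate, and a fixed number of epochs as the stopping condition; these choices, the default values and the function names are illustrative and are not taken from the thesis.

```python
import numpy as np

def train_lvq1(X, y, codebook, labels, alpha=0.1, decay=0.9, epochs=20):
    """LVQ1: pull the winner toward same-class inputs, push it away otherwise."""
    W = np.array(codebook, dtype=float)          # step 1: initial codebook
    for _ in range(epochs):                      # step 5: fixed epoch count (assumed)
        for x, cls in zip(X, y):
            # step 2: the winner is the codebook vector closest to x
            j = np.argmin(np.linalg.norm(W - x, axis=1))
            # step 3: move toward x if the classes match, away otherwise
            sign = 1.0 if labels[j] == cls else -1.0
            W[j] += sign * alpha * (x - W[j])
        alpha *= decay                           # step 4: reduce the learning rate
    return W

def classify(x, W, labels):
    """Label of the codebook vector nearest to x."""
    return labels[np.argmin(np.linalg.norm(W - x, axis=1))]
```

After training, an unknown vector is assigned the class label of its nearest codebook vector.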
LVQ2 and LVQ3 are two improved algorithms based on LVQ1. In LVQ1, only the
winning reference vector is updated during training. The direction of the move is determined
by whether the winning vector belongs to the same class as the input vector. In the
improved LVQ algorithms, two vectors (the winner and the runner-up) are updated if
several conditions are satisfied.
Figure 4.6 Learning vector quantization neural network
4.2.3 Fuzzy Neural System
The theories of fuzzy sets and neural networks are two complementary ways of modeling
the human brain. Neural networks model the physical structure of the human neural
network, while fuzzy logic simulates the way humans think. Therefore, the
combination of fuzzy sets and neural networks, called fuzzy neural networks,
is becoming very promising for exploring the human brain.
Considering the roles and interaction of fuzzy logic and neural networks, researchers are
studying various ways of combining them for applications such as fuzzy
reasoning and pattern recognition. Currently, fuzzy systems are beginning to make
use of neural networks in various aspects of reasoning. A successful example of the
combination is fuzzy learning vector quantization (FLVQ) [21].
FLVQ has a structure similar to that of LVQ. It extends the LVQ algorithm with fuzzy
concepts. In LVQ, the updating principle is basically "winner takes all". In other
words, the winner obtains a complete membership of 1, while all the others get 0. Even in
LVQ2 and LVQ3, membership is given only to the winner and the runner-up. As a result,
learning updates only one or two reference vectors. In contrast, all
the reference vectors are updated in FLVQ. For a specific training vector, FLVQ assigns
various membership degrees to all the reference vectors, which provides more detailed
learning information.
Assuming c is the number of classes (i.e., the dimension of the second layer), the FLVQ
algorithm is described as follows [21]:
(1) Generate an initial set of reference vectors W = {w_1, w_2, ..., w_c}; select m_i and
    m_f as the initial and final values of the fuzziness parameter m; set the iteration
    number p = 0 and N as the maximum number of iterations.
(2) Set m = m_i + p[(m_f - m_i)/N]; calculate the membership degree between the ith
    training vector and the jth weight vector.
(3) Update the reference vectors using the learning rate α_i.
(4) If the stopping condition is not satisfied, repeat steps 2 to 3; otherwise stop.
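The following Python sketch shows one common formulation of FLVQ from the literature; whether it matches the exact membership and learning-rate expressions used in the thesis is an assumption. Here the membership of training vector i in reference vector j is the inverse-distance ratio with exponent 1/(m-1), the learning rate is the membership raised to the power m, and each reference vector is moved by the learning-rate-weighted mean of the differences. NumPy, the default parameter values and the name `train_flvq` are also assumptions.

```python
import numpy as np

def train_flvq(X, W0, m_init=3.0, m_final=1.1, n_iter=30, tol=1e-6):
    """Batch FLVQ with the fuzziness parameter m moved linearly from m_init toward m_final."""
    X = np.asarray(X, dtype=float)
    W = np.array(W0, dtype=float)
    for p in range(n_iter):
        m = m_init + p * (m_final - m_init) / n_iter        # step 2: schedule for m
        diff = X[:, None, :] - W[None, :, :]                # shape (n, c, dim)
        d2 = np.maximum((diff ** 2).sum(axis=2), 1e-12)     # squared distances, floored
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)            # memberships sum to 1 per vector
        alpha = u ** m                                      # assumed learning rates
        # step 3: every reference vector is updated by all training vectors
        W_new = W + (alpha[:, :, None] * diff).sum(axis=0) / alpha.sum(axis=0)[:, None]
        converged = np.abs(W_new - W).max() < tol           # step 4: stopping condition (assumed)
        W = W_new
        if converged:
            break
    return W
```

Unlike LVQ1, no class labels are used in this update and all reference vectors move on every pass; as m approaches 1 the memberships become nearly crisp and the update approaches a winner-takes-all rule.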
Fuzzy C-Means (FCM) is a data clustering technique in which each data point belongs to a
cluster to a degree specified by a membership degree. The technique was originally introduced
by Jim Bezdek [21] in 1981 as an improvement on earlier clustering methods [21]. In the
following sections, the algorithm and the application of fuzzy c-means clustering to speech
recognition are described.
Assuming there are n vectors $x_j$ with j = 1, 2, ..., n, fuzzy C-means clustering
partitions the feature vectors $x_j$ into c fuzzy groups and finds a cluster center for each
group so as to minimize an objective function of dissimilarity. All cluster centers are
represented by a prototype matrix $V = (v_1, v_2, \ldots, v_c)$. To accommodate
fuzzy clustering, the membership matrix $U = \{u_{ij}\}$ is generated with the value of each
element between 0 and 1. The sum of all membership degrees of
each data vector is guaranteed to equal unity by the normalization
property:
$$\sum_{i=1}^{c} u_{ij} = 1, \quad j = 1, \ldots, n \tag{4.4}$$
The objective function for FCM is defined as
$$O(U, v_1, \ldots, v_c) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m}\, d_{ij}^{2} \tag{4.5}$$
where $u_{ij}$ is the element of the membership matrix U, with a value between 0 and
1; $v_i$ is the cluster center (or prototype) of fuzzy group i; $d_{ij} = \|v_i - x_j\|$ is the
Euclidean distance between the ith cluster center and the jth input vector; and m is a weighting
parameter that indicates the degree of fuzziness. The parameter m is usually set to a
real value greater than 1.
The necessary conditions for minimizing the objective function O in equation 4.5 can be
found by forming a new objective function O':
$$O'(U, v_1, \ldots, v_c, \lambda_1, \ldots, \lambda_n) = O(U, v_1, \ldots, v_c) + \sum_{j=1}^{n} \lambda_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right) \tag{4.6}$$
where $\lambda_j$ (j = 1 to n) are the Lagrange multipliers for the n constraints. Differentiating
O' with respect to each of its input arguments gives the necessary conditions for minimizing the
objective function:
$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{4.7}$$
and
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}} \tag{4.8}$$
Based on the above analysis, the FCM algorithm is simply an iterative procedure that satisfies
the above two necessary conditions for minimizing the objective function. Initially, the
cluster centers are very inaccurately placed, and every data point has a membership grade
for each cluster. By iteratively updating the cluster centers and the membership grades
of each data point, the cluster centers are moved to the right locations so as to
minimize the objective function, which represents the distance from any given data point to
a cluster center weighted by its membership grade. After these batch procedures, the cluster
centers and the membership matrix are determined. The FCM algorithm can be
summarized as follows [22]:
(1) Select c, m, and e as a tolerance value for the objective function; set a fixed number
    N as the maximum number of epochs and the iteration counter q = 0.
(2) Initialize the cluster centers $V_0 = \{v_{1,0}, v_{2,0}, \ldots, v_{c,0}\}$ for the first iteration.
(3) Set q = q + 1, update the membership degrees and the cluster centers, and compute
    the convergence measure $E_q$.
(4) If q < N and $E_q$ > e, then go to step 3.
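A compact Python sketch of this iteration is given below. It assumes NumPy, the standard FCM updates in which the centers are membership-weighted means and the memberships are inverse-distance ratios with exponent 2/(m-1), and the change in the objective value between iterations as the convergence measure; that last choice and the name `fuzzy_c_means` are assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, V0, m=2.0, tol=1e-6, max_iter=100):
    """Return (cluster centers V, membership matrix U of shape (c, n))."""
    X = np.asarray(X, dtype=float)
    V = np.array(V0, dtype=float)                  # step 2: initial cluster centers
    prev_obj = np.inf
    for _ in range(max_iter):                      # at most max_iter epochs
        # squared distances d_ij^2 between center i and vector j, floored to avoid 0
        d2 = np.maximum(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)   # memberships; each column sums to 1
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)   # membership-weighted means
        obj = float((Um * d2).sum())               # objective for this membership update
        if abs(prev_obj - obj) <= tol:             # assumed convergence measure
            break
        prev_obj = obj
    return V, U
```

Each column of U holds the membership grades of one data point in all clusters, so a point lying close to one center gets a grade near 1 for that cluster and near 0 for the others.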
4.3.2 An Eka~nple
To illustrate how fùzzy c-means clustering worh. let's have a simple esampie with the
nvo-dimensional data that belong to two classes. Figure 4.7 (a) plots out al1 the 16 t\vo-
dimensional data.
O: Class 1
X : Class 2
Figure 4.7 (a) Two-dimension data before clustering
(b) The cluster centers found by FCM
/solared Il'ord Speech Recognition Using FG,Y Nerrral Techniques Page 58
Chaprer 4: FZLZZ Neural knvork for Speech Recoenirion
After applying the fuzzy c-means clustering algorithm, two centers were located; they are
marked with the bigger symbols in Figure 4.7 (b). Each data point has a membership grade for
each of the two cluster centers. For instance, the bottom-right point has a membership grade
of 0.07 for cluster 1 and 0.93 for cluster 2.
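To make the meaning of a membership grade concrete, the short sketch below computes the grades of a single point from its distances to the cluster centers. It assumes the standard FCM membership formula with fuzziness exponent m = 2; the function name `membership_grades` and the distances used are hypothetical and are not taken from Figure 4.7.

```python
def membership_grades(distances, m=2.0):
    """Membership grade of one point for each cluster, given its (nonzero) distance to each center."""
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in distances) for d_i in distances]


# Hypothetical point that is twice as far from center 1 as from center 2
grades = membership_grades([2.0, 1.0])  # about [0.2, 0.8]
```

Under the same assumption, grades of 0.07 and 0.93 would correspond to a point that is roughly 3.6 times farther from center 1 than from center 2.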
FCM can be applied to various clustering applications. In this thesis, FCM is used for
clustering the speech features of a certain number of isolated words during the training
process. For training each word, a number of samples from different speakers are chosen
to form the template. Many factors cause variability between different samples
of even the same word. Therefore, FCM is used to
search for the cluster center of each word.
Chapter 5
Fuzzy Speech Recognizer
5.1 Issues on Implementing a Fuzzy Speech Recognizer
5.1.1 Time Normalization
When implementing a speech recognition system, a speech pattern is usually represented
by a spectral sequence on a short-time basis. In most pattern recognition techniques,
these spectral sequences are compared in order to decide the matching score.
However, if a word is spoken twice by the same speaker under the same environment, it
is still very likely that the two samples will have different lengths. The main reason
is that different renditions of the same utterance are seldom pronounced at exactly the
same speed and in the same manner across the whole utterance. To deal with the speaking rate
fluctuation, the speech signals must be normalized before patterns can be
compared and a decision made.
In the traditional algorithms, one of the waveforms is warped onto the time axis of the
other one. Consider two speech patterns X and Y which are represented by (x_1, x_2, ..., x_{T_x})
and (y_1, y_2, ..., y_{T_y}), where x_i and y_i stand for the short-time feature vectors and T_x, T_y
denote the durations of patterns X and Y respectively. In real applications, the durations T_x
and T_y usually have different values. The dissimilarity between X and Y should be
measured after normalizing the two sequences to the same
length.
In this thesis, the linear time normalization method is used for pattern recognition. The
dissimilarity between patterns X and Y is defined as:
(5.1)
where i_x and i_y are integers which denote the time indices of X and Y, and d(x_{i_x}, y_{i_y})
is a function for dissimilarity measurement between two vectors. Also, i_x and i_y should
satisfy the following constraints:
By rounding the length of pattern Y to the same length as pattern X, the summation of
the distances for each vector in equation (5.1) is defined as the dissimilarity of X and Y.
Depending on the direction of the time normalization, the summation can be taken from
i_y = 1 to T_y as well. Figure 5.1 illustrates how linear time normalization works for the
index conversion.
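The index conversion described above can be sketched as follows. This is a hedged illustration, not the thesis code: the mapping i_y = round(i_x * T_y / T_x), clamped to the range 1..T_y, and the Euclidean frame distance are assumptions, since equation (5.1) and its constraints are not legible in this text. The function name `linear_time_dissimilarity` is hypothetical.

```python
import math


def linear_time_dissimilarity(x, y):
    """Sum of frame distances after mapping each time index of x linearly onto y."""
    tx, ty = len(x), len(y)
    total = 0.0
    for ix in range(1, tx + 1):                    # 1-based indices, as in the text
        # Assumed mapping: scale ix by ty/tx, round to nearest, clamp into 1..ty
        iy = min(ty, max(1, math.floor(ix * ty / tx + 0.5)))
        total += math.dist(x[ix - 1], y[iy - 1])   # assumed Euclidean d(x, y)
    return total
```

Swapping the arguments corresponds to taking the summation over the indices of Y instead of X, the other direction of normalization mentioned above.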
Figure 5.1 Linear time normalization for two sequences with different lengths
(index conversion: 1, 2, ..., T_y -> 1, 2, ..., T_x)
5.2 Template Training
The template-based method is used to implement the recognition system in this thesis.
As shown in Figure 5.2, the feature vectors of an unknown word are fed into the
recognition network as the input. By computing the dissimilarity between the input
features and each speech template, the network can decide the identity of the
unknown word with the decision algorithms.
Figure 5.2 Template-based word recognition system
(blocks: Feature Vectors; Template (1) ... Template (c); Decision Rule; Recognized Word)
Before applying the pattern comparison technique according to Figure 5.2, the
templates must first be trained and saved into a group of buffers which act like a memory
storing the related "dictionary". Assuming there are c words in total in the recognition
library, c templates need to be trained, where each word template is
represented by the time-frequency feature.
Because the decision results rely heavily on the templates, it is critical to obtain
high quality templates that represent the word features accurately. As described in
Chapter 2, the difficulties of speech recognition are mainly caused by all kinds of
variability of speech signals. Therefore, the ideal templates should be able to model and
include the time-frequency information of the speech signal with all the possible
fluctuations during training such as:
Speaker fluctuation
Different speaking rate
Different manner of utterance
Environment noise
However, it is an extremely difficult task to take care of all the fluctuations in a real
implementation. Because the most important variations are the speaker
fluctuation and the speaking rate fluctuation, the clustering method concentrates on
dealing with these two problems. Therefore, the training sets should contain speech
signals taken from several speakers with different speaking rates.
The classical methods for template training include hard c-means clustering, the
self-organizing map, LVQ, etc. In this thesis, the FCM algorithm is used for clustering
the training samples and locating the template centers because it offers the advantage of
modeling the speech fluctuations efficiently.
In the experiments, recognition is performed using the fuzzy neural techniques for pattern
matching. The membership functions are trained and used as the network weights. Two
networks are developed, based on measuring the similarity and the dissimilarity respectively.
More details of these two methods are introduced in the following sections.
The basic idea of the fuzzy networks is to use the membership function for classifying the
word patterns that consist of the time-frequency feature. To illustrate the theory, let's
start from a simple example based on the typical parameter of vowels: formant
frequencies. The formants are defined as the resonant frequencies of the vocal tract, and
it is known that the first three formant frequencies can decide the characteristics of a
vowel. Therefore, the membership function should have three peaks, with each peak
corresponding to one formant. To generalize the membership function, the peak values of the
membership function are normalized by 1/3 (Figure 5.3). If all formants of an unknown
pattern match the peaks of a membership function exactly, then the membership
degree should be one. On the other hand, if the unknown pattern does not match the
membership function or is shifted from the center, it should get a low membership degree.
The degree D is denoted by:
D = \int \mu(f) \, y(f) \, df
where y(f) indicates the location of the formants f_1, f_2, f_3 as:
y(f) = \delta(f - f_1) + \delta(f - f_2) + \delta(f - f_3)
Figure 5.3 (a) Formant frequencies of a vowel
(b) Membership function of a vowel
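The formant example can be sketched numerically. Because y(f) is a sum of three unit impulses, the integral for D reduces to the sum of the membership function evaluated at the three formants. In the sketch below the Gaussian peak shape, the peak width `width_hz`, the formant values, and the function names are all assumptions for illustration; only the three peaks of height 1/3 come from the text.

```python
import math


def make_membership(template_formants, width_hz=100.0):
    """Membership function with one peak of height 1/3 at each template formant."""
    def mu(f):
        # Take the nearest peak so that an exact match scores exactly 1/3 per formant
        return max(math.exp(-0.5 * ((f - c) / width_hz) ** 2)
                   for c in template_formants) / 3.0
    return mu


def membership_degree(mu, formants):
    """D = integral of mu(f) * y(f) df, which equals the sum of mu at the formants."""
    return sum(mu(f) for f in formants)


# Hypothetical vowel template with formants at 700, 1200 and 2600 Hz
mu = make_membership([700.0, 1200.0, 2600.0])
exact = membership_degree(mu, [700.0, 1200.0, 2600.0])    # all peaks matched
shifted = membership_degree(mu, [800.0, 1300.0, 2700.0])  # every formant shifted by 100 Hz
```

With these assumed values the exact match gives a degree of one, and the shifted pattern gives a lower degree, in line with the behavior described above.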
Because line spectrum frequencies can provide the formant information, LSFs are used to
form the feature vectors and membership functions in the recognition network.
Assuming the order of the LPC model is 10, there are 10 line spectrum frequencies F =
{f_1, f_2, ..., f_10}. The membership function can be constructed with a rectangular or
Gaussian-shaped function as shown in Figure 5.4. The Gaussian function in terms of f_1
and f_2 is given by:
Figure 5.4 (a) Rectangular-shaped membership function
(b) Gaussian-shaped membership function
The input vector x(f) of a speech frame is also constructed from LSFs in rectangular shape.
In the recognition network shown in Figure 5.5, similarities between the unknown
feature x and the template patterns are first calculated; the unknown pattern is then
classified into the category which gets the largest similarity score.
Figure 5.5 Fuzzy neural network for isolated word recognizer
based on similarity measurement (Network 1)
(input: sequence of feature vectors (LSF), X)
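The maximum-similarity decision of network 1 can be sketched as follows. The exact similarity measure is not legible in this text, so the sketch assumes one plausible form: each input LSF is scored by a Gaussian membership centred on the corresponding template LSF, the scores are averaged per frame, and the frame scores are summed. The names `frame_similarity` and `recognize_by_similarity`, the width `sigma`, and the pairing of LSFs by order are all assumptions.

```python
import math


def frame_similarity(lsf_in, lsf_template, sigma=0.05):
    """Mean Gaussian membership of the input LSFs with respect to the template LSFs."""
    return sum(math.exp(-0.5 * ((a - b) / sigma) ** 2)
               for a, b in zip(lsf_in, lsf_template)) / len(lsf_template)


def recognize_by_similarity(x, templates):
    """Return the word whose template gives the largest summed frame similarity.

    x and every template are assumed to be normalized to the same number of frames.
    """
    def score(word):
        return sum(frame_similarity(fx, ft) for fx, ft in zip(x, templates[word]))
    return max(templates, key=score)
```

Here `templates` is a dictionary mapping each word to its sequence of template frames, for example the cluster centers found during training.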
Network 2 (Figure 5.6) has a similar structure to that of network 1, but the two are based on
different decision rules. More specifically, network 1 measures the similarity between
the unknown and the template patterns, then recognizes the word with the maximum
similarity, while network 2 measures the dissimilarity or distance and takes the minimum
as the winner.
Because network 1 is based on matching the formant frequency information between
the unknown and the templates, only the line spectrum frequencies are appropriate
as its time-varying feature. In network 2, more coefficients can be adopted
as speech characteristics, such as cepstrum, log area ratio, reflection coefficients, etc.
When an unknown feature matrix X is applied to the network, the recognition process is
summarized as follows:
(1) Normalize the length of the unknown pattern to the same length as each
template weight;
(2) Calculate the value of dissimilarity between the unknown and all templates frame
by frame;
(3) Recognize the unknown word as the pattern which gets the smallest dissimilarity.
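The three steps above can be sketched as a minimum-dissimilarity classifier. This is a minimal illustration under stated assumptions: the length normalization uses a rounded linear index mapping, the frame distance is Euclidean, and the names `dissimilarity` and `recognize` are hypothetical rather than taken from the thesis.

```python
import math


def dissimilarity(x, template):
    """Frame-by-frame distance after mapping the unknown pattern onto the template length."""
    tx, tt = len(x), len(template)
    total = 0.0
    for it in range(1, tt + 1):
        # Step 1: linear length normalization (assumed rounding and clamping)
        ix = min(tx, max(1, math.floor(it * tx / tt + 0.5)))
        # Step 2: per-frame dissimilarity (assumed Euclidean distance)
        total += math.dist(x[ix - 1], template[it - 1])
    return total


def recognize(x, templates):
    """Step 3: return the word whose template gives the smallest dissimilarity."""
    return min(templates, key=lambda word: dissimilarity(x, templates[word]))
```

Because the decision only needs a distance between frames, the same sketch applies whether the frames hold LSFs, cepstral coefficients, or another feature set.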
Figure 5.6 Fuzzy neural network for isolated word recognizer
based on dissimilarity measurement
(input: sequence of feature vectors (LSF, Cepstrum))
5.2 Speech Database
The speech database used for the recognition experiment consists of 10 isolated English
words. All ten words are recorded at an 8 kHz sampling rate with 16-bit quantization
precision in a laboratory environment. Each word is recorded ten times by each of ten speakers
(6 male and 4 female). Consequently, the speech database has a total of 1000
utterances, 100 for each speaker.
5.3 Simulations and Results
In this thesis, the line spectrum frequencies and LPC cepstral coefficients are used as the
speech feature sets. Both speaker-dependent and speaker-independent recognition are
tested in the experiment.
Before processing, endpoint detection is performed for each utterance. Then the speech
signals are pre-emphasized and blocked into small frames with 10 ms overlap
between adjacent frames. The pre-emphasis factor is set to 0.95. For each frame, the
Hamming window is applied with a 30 ms window length, and then the speech feature sets are
extracted based on the LSF and LPC cepstrum algorithms.
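The front-end settings above translate into sample counts as sketched below: at 8 kHz a 30 ms window is 240 samples, and a 10 ms overlap leaves a hop of 160 samples between frames. The sketch covers pre-emphasis, framing and windowing only; endpoint detection and feature extraction are omitted, the first-order pre-emphasis filter form is an assumption, and all function names are hypothetical.

```python
import math

FS = 8000                        # sampling rate from the text (8 kHz)
FRAME = 30 * FS // 1000          # 30 ms window -> 240 samples
HOP = FRAME - 10 * FS // 1000    # 10 ms overlap -> 160-sample hop


def preemphasize(signal, alpha=0.95):
    """Assumed first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1] (non-empty input)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def hamming(n_points):
    """Hamming window coefficients 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1)) for n in range(n_points)]


def frames(signal):
    """Pre-emphasize the signal, then cut it into overlapping Hamming-windowed frames."""
    window = hamming(FRAME)
    emphasized = preemphasize(signal)
    return [[s * w for s, w in zip(emphasized[start:start + FRAME], window)]
            for start in range(0, len(emphasized) - FRAME + 1, HOP)]
```

Samples after the last full frame are dropped in this sketch; padding the final frame instead would be an equally reasonable choice.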
In speaker-dependent recognition, the training data consist of 600 utterances from 6
speakers, and the remaining 100 utterances from these 6 speakers are used for testing.
Table 5.1 shows the speaker-dependent recognition rate with the techniques described
above.
Table 5.1 Recognition rate for speaker-dependent recognition
        Network 1        Network 2 (LSF)   Network 2 (Cepstrum)   Network 2 (Weighted Cepstrum)
John    18/30   21/30    28/30   29/30     30/30   19/30          27/30   18/30
Hari    23/30   25/30    29/30   29/30     29/30   29/30          29/30   29/30
For comparison, Figure 5.7 gives the overall recognition rate with FCM and crisp mean
for all the methods. It shows that FCM performs better than taking the crisp
mean value as templates.
Figure 5.7 Comparison of the speaker-dependent recognition rate
with FCM and crisp means (legend: Crisp, FCM)
Figure 5.8 shows that network 2 yields better recognition rates than network 1
because the dissimilarity is utilized for decision making, which should be more accurate
for distinguishing confusable words than the similarity measurement.
Figure 5.8 Speaker dependent recognition rate using LSF
with network 1 and network 2
Speaker-independent recognition uses 600 utterances from 6 speakers (3 female and 3
male) for training. The remaining 400 utterances from the other 4 speakers are used as test
data. Table 5.2 shows the recognition accuracy for all the words.
Table 5.2 Recognition rate for speaker-independent recognition
Network 1        Network 2 (LSF)   Network 2 (Cepstrum)   Network 2 (Weighted Cepstrum)
Crisp   FCM      Crisp   FCM       Crisp   FCM            Crisp   FCM
39/40   35/40   38/40
In speaker-independent recognition, FCM is also shown to yield better results than
the crisp mean for template training (Figure 5.9). Figure 5.10 gives the comparison of
network 1 and network 2 using LSF as speech features.
Wayne   33/40
John
34/40   38/40   38/40   37/40
19/40   25/40   W 4 0   31/40   28/40   33/40   29/40   31/40
36/40
36/40
34/40
Tracy   34/40   35/40   32/40
40/40   36/40
35/40
3 O O 3&0
Halima  35/40   30/40
31/40
No      26/40   27/40   36/40   3 / 4 0   35/40
36/40   3 / 4 0
32/40
34/40
Yes     16/40   18/40
35/40
28/40   34/40
4 / 4 0   32/40
35/40
Figure 5.9 Speaker-independent recognition rate with FCM and crisp means
Figure 5.10 Speaker-independent recognition rate using LSF
with network 1 and network 2
Chapter 6
Conclusions and Suggestion for Future Work
6.1 Conclusions
This thesis explored the issues involved in designing an isolated word recognition
system, especially the application of fuzzy neural algorithms for speech pattern
recognition. The LPC speech analysis method is described, and different representation
parameters are compared. It is shown that the cepstral coefficients and line spectrum
frequencies play important roles as speech features in recent research and applications of
speech processing.
Naturally, fuzzy logic is similar to the way humans think. Fuzzy sets are
successfully applied to speech recognition due to their ability to deal with uncertainty.
However, there is always a balance between "fuzzy" and "too fuzzy". The idea of "fuzzy"
is good for modeling the uncertainty and variance of speech signals. But if it is "too
fuzzy", it is highly probable that it will cause a lot of confusion between patterns
which are similar to each other but actually different. For instance, the words "bad" and
"bed" are very similar to each other, and fuzzy logic may not be discriminative enough
for this case. Therefore, neural networks are introduced and combined with fuzzy
logic to overcome this problem. As we know, neural networks simulate the "hardware"
of the human brain (human nerves) and are known as a technique with great
advantages in fault tolerance and robustness.
In this thesis, two fuzzy networks have been proposed and applied to isolated word
recognition. The membership functions are constructed by means of superposition of
speech features for each enrolled word, and the templates are learned based on the
membership functions. The templates include the fluctuation information of frequency
and time. By using these templates, the recognition system is able to recognize spoken
words independent of the speaker.
The analysis and results of the recognition technique reveal that the use of fuzzy logic
and neural networks can consistently improve the performance of the system. From the
results, FCM has been shown to be a better template-training algorithm than hard
clustering.
6.2 Suggestions for Future Work
The results in the thesis have shown the potential of fuzzy theory and neural networks for
speech recognition. Based on the proposed methods, it is still possible to improve the
system and get a higher recognition rate.
An important issue in the fuzzy neural network area is to find efficient combinations of
ANNs inspired by the structure of the human cortex, because it forms the most intelligent
speech recognizer so far. Also, it is certainly a promising direction to simulate the
natural model of speech perception and production, for both the feature extraction and
pattern recognition parts.
Since some other techniques have already been successfully used for speech recognition,
more efficient and integrated systems could be constructed by combining fuzzy neural
techniques with other formalisms, such as HMMs and DTW.
References
[1] Jean-Claude Junqua, Jean-Paul Haton, Robustness in Automatic Speech Recognition:
Fundamentals and Applications, Kluwer Academic Publishers, 1996.
[2] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition,
Prentice Hall, Englewood Cliffs, 1993.
[3] Joseph P. Campbell, Jr., "Speaker recognition: A tutorial", Proceedings of the IEEE,
Vol. 85, No. 9, September 1997.
[4] Lawrence Rabiner, Ronald W. Schafer, Digital Processing of Speech Signals,
Prentice Hall, Inc., Englewood Cliffs, NJ, 1978.
[5] C. H. Chen, Fuzzy Logic and Neural Network Handbook, McGraw-Hill Inc., 1996.
[6] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech
signals", J. Acoust. Soc. Amer., Vol. 57, pp. S35(A), 1975.
[7] K. K. Paliwal, "A study of line spectrum pair frequencies for speech recognition",
ICASSP, IEEE International Conference on Acoustics, Speech and Signal
Processing, 1988, Vol. 1, pp. 485-488.
[8] Samir Saoudi, Jean Marc Boucher, "A new efficient algorithm to compute the LSP
parameters for speech coding", Signal Processing, pp. 201-212, 1992.
[9] Seung Ho Choi, Hong Kook Kim, Hwang Soo Lee and R. M. Gray, "Speech
recognition method using quantised LSP parameters in CELP-type coders",
Electronics Letters, 22 October 1997.
[10] K. K. Paliwal, "A study of line spectrum pair frequencies for vowel recognition",
Speech Communications, 1999, pp. 27-33.
[11] Chi-Shi Liu, Chao-Shih Huang, Min-Tau Lin and Hsiao-Chuan Wang, "Automatic
speaker recognition based upon various distances of LSP frequencies", IEEE
International Carnahan Conference on Security Technology, Oct. 1991, pp. 104-109.
[12] Frank K. Soong, Biing-Hwang Juang, "Optimal Quantization of LSP parameters",
IEEE Transactions on Speech and Audio Processing, Vol. 1, No. 1, January 1993.
[13] N. Naja, J. M. Boucher and S. Saoudi, "Fast LSP vector quantization algorithms
comparison", MELECON, Proceedings of the 7th Mediterranean Electrotechnical
Conference - MELECON, Part 3, Apr. 1994, pp. 1127-1130.
[14] Teuvo Kohonen, "The self-organizing map", Proceedings of the IEEE, Vol. 78,
No. 9, September 1990, pp. 1464-1477.
[15] Erik McDermott and Shigeru Katagiri, "LVQ-based shift-tolerant phoneme
recognition", IEEE Transactions on Signal Processing, Vol. 39, No. 6, June 1991,
pp. 1398-1410.
[16] Ravi P. Ramachandran, Mihailo S. Zilovic, and Richard J. Mammone, "A
comparative study of robust linear predictive analysis methods with applications to
speaker identification", IEEE Transactions on Speech and Audio Processing, Vol.
3, No. 2, March 1995, pp. 117-125.
[17] Akio Amano et al., "On the use of neural networks and fuzzy logic in speech
recognition", IJCNN Int. Jt. Conference on Neural Networks, Jun 18-22, 1989, pp.
301-305.
[18] Christopher Hale, CamQuynh Nguyen, "Voice command recognition using fuzzy
logic", Wescon Conference Record, Proceedings of the 1995 Wescon Conference,
Nov 7-9 1995, San Francisco, CA, USA, pp. 608-613.
[19] Lynn Yaling Cai, Hon Keung Kwan, "Fuzzy classifications using fuzzy inference
networks", IEEE Transactions on Systems, Man, and Cybernetics -- Part B:
Cybernetics, Vol. 28, No. 3, June 1998, pp. 334-347.
[20] Hon Keung Kwan, Yaling Cai, Bin Zhang, "Membership function learning in fuzzy
classification", Int. J. Electronics, 1993, Vol. 74, No. 6, pp. 845-850.
[21] Nicolaos B. Karayiannis, James C. Bezdek, "An integrated approach to fuzzy
learning vector quantization and fuzzy c-means clustering", IEEE Transactions on
Fuzzy Systems, Vol. 5, No. 4, November 1997, pp. 622-628.
[22] Jyh-Shing Roger Jang and Jiuann-Jyn Chen, "Neuro-fuzzy and soft computing for
speaker recognition", IEEE International Conference on Fuzzy Systems,
Proceedings of the 1997 6th IEEE International Conference on Fuzzy Systems,
FUZZ-IEEE'97, Part 2 (of 3), July 1997, Vol. 2, Barcelona, Spain, pp. 663-668.
[23] Jun-ichiroh Fujimoto, Tomofumi Nakatani and Masahide Yoneyama, "Speaker-
independent word recognition using fuzzy pattern matching", Fuzzy Sets and
Systems 32, 1989, pp. 181-191.
[24] Liusheng Liu, Zhijian Li and Bingxue Shi, "Speech recognition based on fuzzy
vector quantization and fuzzy logic", IEEE International Conference on Neural
Networks, Vol. 5, 1995, Perth, Aust., IEEE, Piscataway, NJ, USA, pp. 2858-2862.
[25] Liusheng Liu, Zhijian Li and Bingxue Shi, "Segment matrix vector quantization
and fuzzy logic for isolated-word speech recognition", Proceedings of The
International Symposium on Multiple-Valued Logic, 1995, pp. 152-156.
[26] James W. Pitton, Kuansan Wang, and Biing-Hwang Juang, "Time-frequency
analysis and auditory modeling for automatic recognition of speech", Proceedings
of the IEEE, Vol. 84, No. 9, September 1996, pp. 1199-1215.
[27] K. Davis, R. Biddulph, and S. Balashek, "Automatic recognition of spoken digits",
J. Acoust. Soc. Am., 1952, 23: pp. 3-50.
[28] J. Suzuki and K. Nakata, "Recognition of Japanese vowels - Preliminary to the
recognition of speech", J. Radio Res. Lab., 1961, pp. 193-212.
[29] P. Denes, "The design and operation of the mechanical speech recognizer", Journal
of the British Institute of Radio Engineers, 1959, pp. 311-229.
[30] T. Vintsyuk, "Speech discrimination by dynamic programming", Kibernetika,
Cybernetics, pp. 81-88.
[31] P. Ladefoged, "The phonetic basis for computer speech processing", Computer
Speech Processing, 1985, pp. 3-27.
[32] L. A. Zadeh, "Fuzzy sets", Inform. Control, 1965, pp. 338-352.
Vita Auctoris
Name: Hui Ping
Place of Birth: Jiangsu, China
Year of Birth: 1973
Education: B. Eng.
Department of Electronic Engineering
Nanjing University of Aeronautics and Astronautics
Nanjing, China
1990 - 1994
M. A. Sc
Electrical and Computer Engineering
University of Windsor
Windsor, Ontario, Canada
1997 - 1999