
Isolated Word Speech Recognition

Using Fuzzy Neural Techniques

by

Hui Ping

A Thesis Submitted to the College of Graduate Studies and Research through the

Faculty of Engineering - Electrical and Computer Engineering in Partial Fulfillment of the Requirements for

the Degree of Master of Applied Science at the University of Windsor

Windsor, Ontario, Canada

1999

© 1999 Hui Ping


The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

Automatic speech recognition by machine is one of the most efficient methods for man-machine communication. Because the speech waveform is nonlinear and variant, speech recognition requires a lot of intelligence and fault tolerance in the pattern recognition algorithms. Fuzzy neural techniques allow effective decisions in the presence of uncertainty. Consequently, the objective of this thesis is to study fuzzy neural techniques for application in speech recognition. Two methods are proposed for isolated word recognition, using a fuzzy pattern matching technique and a fuzzy c-means clustering technique. The algorithms are tested on two LPC-based speech features: line spectrum frequencies and cepstral coefficients. It is shown that the fuzzy algorithm is an efficient approach and can provide reliable and accurate recognition results.


Dedicated to my family

for their love and support

Acknowledgements

I would like to express my sincere gratitude to my thesis advisor, Dr. H. K. Kwan, for his suggestions, guidance, support and encouragement throughout the course of this research work. It has indeed been a privilege to work with him.

I wish to thank my department reader, Professor P. H. Alexander, and my external reader, Dr. Liwu Li, for their valuable advice toward the fulfillment of the thesis work.

I would also like to thank all my friends in the ISPLab who have given me support during the study and research: Tracy Li, Halima El-Khatib, Wayne Chiang, Walter Jin and Jie Zhang.

Table of Contents

Abstract
Dedication
Acknowledgements

Chapter 1 Introduction
    1.1 Background
    1.2 Applications of Speech Recognition Technology
    1.3 Motivation for the Research
    1.4 Organization of the Thesis

Chapter 2 Literature Survey on Speech Recognition
    2.1 Introduction to Speech Sounds
        2.1.1 Speech Production
        2.1.2 Speech Perception
        2.1.3 Speech Features
        2.1.4 Representation of Speech Signal
    2.2 Fundamental Speech Recognition Techniques
        2.2.1 Classification of Speech Recognition
        2.2.2 Difficulties in Speech Recognition
        2.2.3 Speech Recognition Approaches

Chapter 3 Speech Feature Extraction
    3.1 Linear Predictive Analysis
        3.1.1 The LPC Model
        3.1.2 LPC Processor for Speech Recognition
    3.2 Line Spectrum Frequency
    3.3 Cepstral Coefficients

Chapter 4 Fuzzy Neural Network for Speech Recognition
    4.1 Fuzzy Logic
        4.1.1 Background
        4.1.2 Fuzzy Sets and Fuzzy Logic
        4.1.3 Fuzzy System
    4.2 Fuzzy Neural Networks
        4.2.1 Neural Networks for Speech Recognition
        4.2.2 Self Organizing Networks
        4.2.3 Fuzzy Neural System
    4.3 Fuzzy C-Means Clustering
        4.3.1 Algorithm of FCM
        4.3.2 An Example
        4.3.3 Summary

Chapter 5 Fuzzy Speech Recognizer
    5.1 Issues on Implementing a Fuzzy Speech Recognizer
        5.2.1 Time Normalization
        5.2.2 Template Training
        5.2.3 Recognition Network
    5.2 Speech Database
    5.3 Simulations and Results

Chapter 6 Conclusions and Suggestion for Future Work
    6.1 Conclusions
    6.2 Suggestion for Future Work

Vita Auctoris

List of Abbreviations

ANN     Artificial Neural Network

ASR     Automatic Speech Recognition

FCM     Fuzzy C-Means

DTW     Dynamic Time Warping

FL      Fuzzy Logic

FLVQ    Fuzzy Learning Vector Quantization

FNN     Fuzzy Neural Network

HMM     Hidden Markov Model

LPC     Linear Predictive Coding

LSF     Line Spectrum Frequency

LVQ     Learning Vector Quantization

SOM     Self-Organizing Map

Figure 5.8: Speaker-dependent recognition rate using LSF with network 1 and network 2

Figure 5.9: Speaker-independent recognition rate with FCM and hard means

Figure 5.10: Speaker-independent recognition rate using LSF with network 1 and network 2

List of Tables

Table 2.1: Formant frequencies for eight vowels of Mid-West American English

Table 5.1: Recognition rate for speaker-dependent recognition

Table 5.2: Recognition rate for speaker-independent recognition

Chapter 1

Introduction

1.1 Background

Automatic speech recognition by machine has been a part of science fiction for many years. The early attempts were made in the 1950s by various researchers. In 1953, Davis, Biddulph and Balashek [27] designed the first isolated digit recognizer for a single speaker at Bell Laboratories. This system used a simple pattern matching method with templates for each of the digits. Matching was performed with two parameters: a frequency cut based on separating the spectrum of the spoken digit into two bands, and a fundamental frequency estimated by zero-crossing counting.

In 1961, Suzuki and Nakata [28] in Tokyo built a hardware vowel recognizer based on a filter bank spectrum analyzer. In 1962, Sakai and Doshita of Tokyo University designed a hardware phoneme recognizer. A hardware speech segmenter was used along with a zero-crossing analysis of different segments of the input speech to provide the recognition result.

Most of the above systems were implemented as electronic devices. However, speech recognition did not attract much attention until the rise of digital computers.


The first computer-based speech recognition system was developed in the early 1960s. Denes and Matthews [29] introduced the concept of time normalization in speech pattern matching. In 1968, the Russian researcher Vintsyuk [30] proposed the idea of using dynamic programming methods for the time alignment of speech patterns with different lengths. The essence of this idea, which is called DTW (dynamic time warping), is still widely used in current commercial products.

The 1970s and 1980s were very active periods for speech recognition, with a series of important milestones:

• Pattern recognition algorithms were applied to template-based isolated word recognition methods.

• Continuous speech from large vocabularies was understood based on the use of high-level knowledge to compensate for the errors in phonetic approaches.

• Speech analysis methods based on Linear Predictive Coding (LPC) were used instead of conventional methods such as the FFT and filter banks.

• Statistical modeling methods such as HMMs (Hidden Markov Models) were developed for continuous speech recognition.

• Neural networks (back propagation, learning vector quantization) with efficient learning algorithms were proposed for speech pattern matching.

In recent years, speech recognition technology has begun to enter the real world and our daily life. More and more advanced algorithms have been adopted in this area. Fuzzy neural techniques have also been applied to speech recognition, and this field is growing and developing very fast.

1.2 Applications of Speech Recognition Technology

Currently, speech recognition systems are being developed for commercial applications. One of the successful speech recognition systems is the Voice Recognition Call Processing (VRCP) system from AT&T. VRCP has a five-word vocabulary and automates operator-assisted calls. AT&T also has a system known as Voice Interactive Phone (VIP), with seven spoken commands replacing the touch-tone codes. With this system, 94% of users were comfortable talking to the machine, and 84% of users preferred the VIP system to the present system.

With computers becoming ever present in business, education, and government, there is a tremendous market for faster, more efficient man-machine interfaces. In the future, we will be intensively using voice as input along with the keyboard and mouse. Most applications based on Windows or other GUI operating systems will use speech recognition to accept voice commands and convert voice into text.

A summary of speech technology application areas is listed below:

• Computer engineering: building a natural language interface to the computer operating system or application software.

• Program developers: using pre-recorded voice macros while developing a computer program.

• Telephone commerce (to replace touch-tone): telephone banking using voice commands; order placement using voice to record incoming order data for the customer service representatives.

• Telephony: hands-free dialing; connecting callers through a company switchboard without human intervention; placing calls through a 'virtual' operator.

• Physicians: recording patient data; making records while doing observations or performing operations.

• Attorneys: use instead of secretaries; conducting online research.

1.3 Motivation of the Research

With so much convenience that speech recognition could bring to our life, there are convincing reasons for researching and improving speech recognition technology. However, achieving recognition is quite a difficult task. The complexity is due to the number of speakers involved, the variability of utterances, the complexity of languages, and the environmental conditions under which the speech recognition system must operate.


The two main concerns in speech recognition are to improve the recognition accuracy and the processing speed. Therefore, the motivation of this research is to provide a reliable and efficient recognition method.

Before creating a general system to perform continuous recognition, this thesis deals with isolated word recognition through the use of digital processing algorithms and the application of fuzzy neural techniques. Because of the uncertainty of speech waveforms, fuzzy neural techniques are recognized as an efficient way to handle this problem. The objective of this thesis is to utilize fuzzy neural techniques in designing a speech recognition system.

1.4 Organization of the Thesis

Chapter 2 presents a literature survey on speech recognition. It gives an introduction to speech production and perception, speech signal features and fundamental speech recognition methods.

Chapter 3 describes the algorithm for speech feature extraction, which is the first step in the whole process of speech recognition. In this chapter, LPC analysis is discussed and two different LPC-based parameters, line spectrum frequencies and cepstral coefficients, are presented for use in speech recognition.

Chapter 4 presents the fuzzy logic and neural network theories for speech recognition. The fuzzy c-means algorithm is introduced for clustering the word templates.

In Chapter 5, a template-based fuzzy speech recognizer is described. The chapter also includes the recognition results and analysis.

Chapter 6 gives the conclusions and suggestions for future research.


Chapter 2

Literature Survey on Speech Recognition

2.1 Introduction to Speech Sounds

2.1.1 Speech Production

Speech sound is produced by a set of well-controlled movements of the various parts of the speech apparatus. Figure 2.1 shows a schematic cross-section through the vocal tract of the apparatus.

The vocal tract is the primary acoustic tube, which is the region of the mouth cavity bounded by the vocal cords and the lips. As air is expelled from the lungs, the vocal cords are tensed and then caused to vibrate by the airflow. The frequency of oscillation is called the fundamental frequency, and it depends on the length, tension and mass of the vocal cords. During this process, the shape of the vocal tube is changed by different positions of the velum, tongue, jaw and lips [2]. The average length of the vocal tract for an adult male is about 17 cm, and its cross-sectional area can vary in its outer section from 0 to about 20 cm². Therefore, the vocal tract, as an acoustic resonator, determines variable resonant frequencies as its shape and size are adjusted. A resonant frequency is called a formant frequency or simply a formant. The nasal tract is an auxiliary acoustic tube that can be acoustically coupled with the vocal tract to produce nasal sounds.
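As a rough illustration of how tube length sets the resonances, the sketch below treats the vocal tract as a uniform lossless tube closed at the vocal cords and open at the lips, for which the resonances are F_n = (2n - 1)c / 4L. This idealized model, the speed of sound used, and the function name `tube_resonances` are assumptions made for illustration; they do not come from the thesis, and a real vocal tract has a varying cross-section.

```python
# Hypothetical illustration: resonances of a uniform lossless tube that is
# closed at one end (vocal cords) and open at the other (lips).
SPEED_OF_SOUND_CM_PER_S = 34300.0  # assumed value for air at roughly 20 degrees C


def tube_resonances(length_cm, count=3):
    """Return the first `count` resonant frequencies (Hz) of the tube."""
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    # Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L), n = 1, 2, ...
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


# 17 cm is the average adult male vocal tract length quoted in the text.
formants = tube_resonances(17.0)
print([round(f) for f in formants])
```

For the 17 cm tube this gives resonances near 500 Hz, 1500 Hz and 2500 Hz; changing the tube shape, as the articulators do, moves these values.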

Figure 2.1: Schematic view of the human speech apparatus

Various speech sounds are produced not only by adjusting the shape of the vocal tract, but also by changing the type of excitation. Besides the airflow from the lungs, the excitation can come from other sources: fricative excitation, plosive excitation and whispered excitation [3].

2.1.2 Speech Perception

Just as the vocal system can produce speech sounds, the auditory system is capable of detecting the changes in air pressure of audible sounds [2]. Figure 2.2 shows a cross-section diagram of the human ear. The ear consists of three parts: the outer ear, the middle ear, and the inner ear [26]. The outer ear collects the sound waves and passes the air pressure variations to the eardrum. The middle ear is an air-filled cavity, which serves as a mechanical amplifier and transforms vibrations of the eardrum into oscillations of the fluid-filled inner ear. The inner ear then converts the mechanical vibrations into electrical potentials that go to the auditory nerve and the cortex.

The human ear is most sensitive to frequencies in the range from 1000 to 4000 Hz. Most speech information is covered within these frequencies. Experiments show that human ears are largely phase insensitive. The basilar membrane is only deformed when the stapes pushes on the oval window [1], thus very little information is available for the brain to determine the waveform's phase. This fact can be applied in speech recognition to reduce the amount of data in the encoded waveform.

Figure 2.2: Cross-section of the human ear (labelled regions: outer ear, middle ear, inner ear)

2.1.3 Speech Features

Speech recognition can be divided into two processes: feature extraction and pattern recognition. Feature extraction is responsible for finding the speech characteristics and storing them for the second process, pattern recognition. In order to identify the speech characteristics accurately and efficiently, it is necessary to investigate the features and classifications of speech sounds.

Any natural language, including English, is based on a set of distinguishable and mutually exclusive primary units, which are called phonemes. All the phonemes are related to different articulatory gestures of a language.

There are several ways to classify speech sounds [1, 2]. According to the type of excitation source of phonemes, speech sounds can be classified into the following categories:

• Voiced sounds (/a/, /ci/) occur when air pressure pushes the vocal cords open and causes them to vibrate. The vibrating cords modulate the air stream from the lungs at a rate that can be as low as 60 times per second for some males and as high as 500 times per second for children. The peak amplitude of a voiced sound is much higher than that of an unvoiced sound.

• Nasal sounds such as /m/ and /n/ are also voiced. However, the nasal cavity is involved together with the vocal cavity during the utterance. Part of the airflow is diverted into the nasal tract by opening the velum.

• Fricatives are generated by exciting the vocal tract with turbulent flow created by airflow through a narrow constriction. For example, the sounds /f/, /s/ and /sh/ are fricatives.

• Voiced fricatives occur when the vocal tract is excited simultaneously by both turbulent flow and vocal cord vibration. The sounds /z/, /zh/ and /v/ belong to this category.

• Plosives are produced by exciting the vocal tract with a rapid release of the pressure built up by a constriction at the lips or teeth. The plosives /t/ and /p/ are voiceless, while /b/ and /d/ are voiced.

• Affricate sounds are produced by gradually releasing a completely closed and pressurized vocal tract.

• Whispered sounds are excited by airflow rushing through a small triangular opening between the arytenoid cartilages at the rear of the nearly closed vocal folds.

For vowel sounds, because the vocal tract remains relatively stable, three or four resonance frequencies (formants) can usually be detected from 0 to 3 kHz. The vowel sounds can be characterized by the first two formants, whereas the third and fourth formants are less discriminative. Table 2.1 shows the first three mean formant frequencies for eight vowels of Mid-West American English.

Table 2.1: Formant frequencies for eight vowels of Mid-West American English (after Ladefoged, 1985 [31])

2.1.4 Representation of Speech Signal

A speech signal can be broken into several small components: phonemes, diphones, syllables or words, where a phoneme is a minimal unit of speech sound. However, it is practically difficult to identify an individual phoneme due to the overlapping of phonemes. In automatic speech recognition, the isolated word is used as the minimum unit because it is relatively easy to separate within a sentence or phrase.

Speech is a slowly time-varying activity which can be displayed graphically by its waveform. The waveform is created by air pressure controlled by the lungs, vocal tract, tongue and mouth. However, the time domain representation is much less popular than frequency representations. This is because the human ear performs some type of frequency analysis rather than time domain analysis during the auditory process, and it has been found that the human ear is much more sensitive to the magnitude spectrum than to the phase information of the speech signal.

Figure 2.3: Waveform of the sentence "please log in" (horizontal axis: time in seconds)

The most popular representation of a speech signal is the spectrogram, which is a three-dimensional representation in the time-frequency domain. The introduction of the spectrogram provided a way to display the time-varying spectral characteristics of speech. An example of a spectrogram is shown in Figure 2.4. The vertical axis represents frequency, while the horizontal axis corresponds to time. The darkness shows the signal energy at a certain time and frequency, and the location of the dark areas changes as the pronunciation moves from one vowel to another inside the utterance. Therefore, the formant frequencies of the vocal tract show up as dark bands in the diagram. For example, the first two dark bands in "please" are located around 300 Hz and 1100 Hz, while they are at 600 Hz and 1000 Hz in the word "log".
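The time-frequency picture described above can be sketched as a short-time Fourier analysis: the signal is cut into overlapping windowed frames, and the squared magnitude of each frame's spectrum becomes one column of the spectrogram. The frame length, hop size, Hamming window and the synthetic two-tone test signal below are assumptions chosen for illustration, not settings taken from the thesis. A plain DFT is used so the example has no dependencies; an FFT routine gives the same values much faster.

```python
import cmath
import math


def spectrogram(signal, sample_rate, frame_len=128, hop=64):
    """Return (times, freqs, power) from a short-time Fourier analysis.

    power[i][k] is the squared spectral magnitude of bin k in frame i.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]  # Hamming window
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis once and reuse it for every frame.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len)
              for n in range(frame_len)] for k in range(n_bins)]
    n_frames = 1 + (len(signal) - frame_len) // hop
    power = []
    for i in range(n_frames):
        frame = [signal[i * hop + n] * window[n] for n in range(frame_len)]
        # Only the magnitude is kept; the phase is discarded.
        power.append([abs(sum(x * b for x, b in zip(frame, row))) ** 2
                      for row in basis])
    times = [(i * hop + frame_len / 2) / sample_rate for i in range(n_frames)]
    freqs = [k * sample_rate / frame_len for k in range(n_bins)]
    return times, freqs, power


# Synthetic stand-in for speech: a 300 Hz tone followed by a 600 Hz tone.
fs = 4000
tone = [math.sin(2 * math.pi * (300 if n < fs // 2 else 600) * n / fs)
        for n in range(fs)]
times, freqs, power = spectrogram(tone, fs)
# Frequency of the strongest ("darkest") band in each frame.
peak_hz = [freqs[max(range(len(row)), key=row.__getitem__)] for row in power]
print(peak_hz[0], peak_hz[-1])
```

The strongest band sits near 300 Hz in the early frames and near 600 Hz in the late ones, to within one frequency bin. This is the same effect as the dark bands moving when the utterance passes from one vowel to another.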

Figure 2.4: Spectrogram of the sentence "please log in" (horizontal axis: time in seconds)

Generally, voiced regions are characterized by a striated appearance due to the periodicity of the waveform, while unvoiced regions are more evenly filled in. This phenomenon is shown in Figure 2.5, which gives the waveform and spectrogram of the sentence "starting to download". Dark bands are clearly visible in the voiced regions, while a lighter shade is distributed over the unvoiced regions. This agrees with the fact that only the voiced sounds have formant frequencies.

Figure 2.5: Waveform and spectrogram of the sentence "starting to download" (horizontal axis: time in seconds)

2.2 Fundamental Speech Recognition Techniques

2.2.1 Classification of Speech Recognition

Automatic speech recognition can be classified into a number of different categories depending on different issues:

1. The manner in which a user speaks. Usually there are three recognition modes based on the speaking manner:

   • Isolated word recognition: The user speaks individual words or phrases from a specified vocabulary. Isolated word recognition is suitable for command recognition.

   • Connected word recognition: The user speaks a fluent sequence of words with small spaces between words, in which each word is from a specified vocabulary (e.g., zip codes, phone numbers).

   • Continuous speech recognition: The speaker can speak fluently with a large vocabulary.

2. The number of users:

   • Speaker dependent: The users of the recognition system consist of only a single speaker or a set of known speakers.

   • Speaker independent: Arbitrary users will use the ASR system in this case.

   • Speaker adaptive: The system customizes its response to each individual speaker while it is in use by that speaker.

3. The size of the recognition vocabulary:

   • A small vocabulary system only provides recognition capability for a small number of words.

   • A large vocabulary system is capable of recognizing words among a vocabulary containing up to 1000 words.

4. The degree of dialogue between the human and the machine, including:

   • One-way communication, in which each user spoken unit is acted upon.

   • System-driven dialog systems, in which the system is the only initiator of a dialog, requesting information from the user via verbal input.

   • Natural dialogue systems, in which the machine conducts a conversation with the speaker, solicits inputs, acts in response to user inputs, or even tries to clarify ambiguity in the conversation.

2.2.2 Difficulties in Speech Recognition

Because the speech waveform is nonlinear and dynamic, speech recognition is an inherently difficult task. There are several main sources of variability in the speech signal, including within-speaker variability, across-speaker variability, transducer and transmission variability, language complexity, and the environmental conditions under which a speaker is talking.

Within-speaker variability is caused by inconsistent pronunciation, speaking speed and different emotions when the words or phrases are spoken by the same speaker.

Across-speaker variability is due to physiological differences, regional accents, foreign languages, etc. The physiological correlates are associated with the size and configuration of the components of the vocal tract of each individual. Variations in the vocal tract can cause different resonance frequencies (formants) and pitch frequencies for the same words.

Transducer and transmission variability arises because the words are spoken over different microphones/handsets and the speech signal can be transmitted by all kinds of communication systems (telecommunication networks, cellular phones, etc.), in which unexpected noises are introduced into the signal.

Language complexity makes speech recognition an extremely difficult job. So far, the task of speech recognizers has been simplified by limiting the number of possible utterances through the imposition of semantic constraints. On the other hand, we should respect the multidisciplinary nature of the speech signal and adapt to language complexity, because speech is a completely natural activity of human beings.

Environmental condition is also a main concern for speech recognizers, since real applications are usually conducted in adverse conditions which may drastically degrade the system performance. Therefore, it is necessary to develop robust recognition methods for dealing with reasonable noise or distortions of the speech signal.


2.2.3 Speech Recognition Approaches

2.2.3.1 Acoustic-Phonetic Approach

The earliest approaches to speech recognition were based on the theory of acoustic phonetics, finding speech sounds and providing phonetic labels for these sounds. The finite, distinctive phonetic units that exist in spoken language can be characterized by a set of acoustic properties that are manifest in the speech signal over time. The first step in the acoustic-phonetic approach is to segment the speech signal into stable acoustic regions and label them, followed by attaching one or more phonetic labels to each region. The second step is to determine a valid word from the phonetic label sequences produced by the first step. Because of the difficulty of getting a reliable phoneme lattice in step one, the acoustic-phonetic approach has not been widely used in commercial applications.

2.2.3.2 Pattern Matching Approach

The pattern matching approach is based on pattern recognition algorithms that require pattern templates before recognition [2]. It has two steps: pattern training and pattern comparison (Figure 2.6). Pattern training is responsible for establishing a consistent speech pattern representation from a set of known training samples. There are several methods for training, such as statistical models (e.g., hidden Markov models) and clustering training (learning vector quantization, fuzzy c-means clustering). The second step, pattern comparison, compares the unknown speech with each template and determines its identity with the matching algorithm.

Figure 2.6: Block diagram of the pattern recognition recognizer (blocks: speech analysis, pattern matching, decision)
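The two steps of the pattern matching approach can be sketched in a few lines. In this minimal sketch, training simply averages the known feature vectors of each word, and comparison picks the template at the smallest Euclidean distance. Both choices are simple stand-ins assumed for illustration, not the fuzzy methods developed later in the thesis. The function names and feature values are hypothetical, and the sketch assumes every utterance has already been reduced to a feature vector of the same length.

```python
import math


def train_templates(samples):
    """Pattern training: average the known feature vectors of each word."""
    templates = {}
    for word, vectors in samples.items():
        dim = len(vectors[0])
        templates[word] = [sum(v[d] for v in vectors) / len(vectors)
                           for d in range(dim)]
    return templates


def recognize(features, templates):
    """Pattern comparison: return the word whose template is closest."""
    return min(templates,
               key=lambda word: math.dist(features, templates[word]))


# Made-up two-dimensional feature vectors, for illustration only.
training_samples = {
    "yes": [[1.0, 0.1], [0.8, 0.3]],
    "no": [[-0.9, 0.8], [-1.1, 1.0]],
}
templates = train_templates(training_samples)
print(recognize([0.7, 0.0], templates))
```

With these made-up vectors the unknown input lies closest to the "yes" template. Swapping the distance function or the training rule changes the recognizer without changing this two-step structure.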

2.2.3.3 Computational Intelligence Approach

The computational intelligence approach is a hybrid of the acoustic-phonetic approach and the pattern matching approach. Generally, a neural network is applied to integrate the knowledge of speech for segmentation and labeling, and intelligent tools are used for learning the relationships among phonetic events. This method has proved to be a very promising approach to speech recognition and has been widely used in commercial applications.


Chapter 3

Speech Feature Extraction

3.1 Linear Predictive Analysis

Linear predictive analysis has been one of the most powerful speech analysis techniques
since it was introduced in the early 1970s. It is primarily a time-domain coding method
for low bit rate speech storage and transmission, but it can also provide
frequency-domain parameters (such as formant frequency and bandwidth) of the speech
signal as functions of time. In speech recognition, these parameters can serve
as the representation of the speech characteristics.

For speech recognition, linear predictive coding (LPC) has several advantages over other
techniques, including:

• LPC is capable of providing accurate estimates of the speech spectrum envelope.
  It can be used to separate the excitation source properties of pitch and amplitude
  from the vocal tract filter, which controls the phoneme articulation and is directly
  related to the produced speech sounds.

• LPC is easy to implement in either software or hardware because it is
  mathematically precise, simple and straightforward.

• The LPC algorithm is computationally efficient. The amount of computation
  required by LPC is much less than that of other techniques such as the fast
  Fourier transform or the filter bank model.

3.1.1 The LPC Model

LPC is a model based on the human vocal tract [4]. The basic idea of the LPC
model is that a speech sample x(n) can be predicted by a linear combination of several
past sample values of speech:

$$ x'(n) = a_1 x(n-1) + a_2 x(n-2) + \cdots + a_p x(n-p) \tag{3.1} $$

where $a_1, a_2, \ldots, a_p$ are called the linear predictive coefficients. They should be optimized
to minimize the prediction error between the actual signal and the predicted value of the
sample, which is:

$$ e(n) = x(n) - x'(n) \tag{3.2} $$
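As a small illustration of equations 3.1 and 3.2, the sketch below evaluates the predicted sample and the prediction error for a given coefficient set. The function names and the test signal are hypothetical; the signal is constructed to obey x(n) = 0.5 x(n-1), so a first-order predictor with a_1 = 0.5 should leave no error.

```python
def predict(x, a, n):
    """x'(n) = sum over k = 1..p of a_k * x(n-k), with a[k-1] holding a_k."""
    return sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1))

def prediction_error(x, a):
    """e(n) = x(n) - x'(n) for every n that has p past samples available."""
    p = len(a)
    return [x[n] - predict(x, a, n) for n in range(p, len(x))]

# Each sample is half of the previous one, so a_1 = 0.5 predicts it exactly.
errors = prediction_error([8.0, 4.0, 2.0, 1.0, 0.5], [0.5])
print(errors)
```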

Although the speech signal is nonlinear and highly variable, the speech waveform over a
short period of time (around 10 to 30 ms) remains roughly invariant. Therefore,
the LPC coefficients can be recalculated to minimize the mean squared prediction error
over each short frame of the speech waveform, with each frame segmented to a length of
around 10 to 30 ms.

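Transforming the prediction error in equation 3.2 from the time domain into the z-domain gives:

$$ E(z) = X(z) - \sum_{k=1}^{p} a_k z^{-k} X(z) = X(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] \tag{3.3} $$

Therefore, the transfer function between the speech sample and the prediction error can
be written as:

$$ H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{3.4} $$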

When the LPC model is applied to a speech signal, the prediction error E(z) can be
identified as the impulsive excitation of the vocal tract, while the all-pole system H(z)
represents the vocal tract model. This is how linear prediction separates the
excitation properties of the source from the vocal tract filter: the source parameters are
derived from the prediction error, and the vocal tract filter is characterized by the linear
predictive coefficients. Analysis experiments show that the excitation source is
essentially a quasi-periodic pulse train for voiced speech signals and a random noise
signal for unvoiced sounds.

A speech synthesis model based on the LPC model is shown in Figure 3.1. The normalized
excitation signal u(n) is set to be either a quasi-periodic impulse train or a random noise
signal (depending on the voiced/unvoiced determination). The appropriate gain of the
source, G, is estimated from the signal, and the scaled source is fed as input to a digital
filter that represents the vocal tract model.

Figure 3.1 Block diagram of LPC-based speech synthesis model (blocks: impulse train generator driven by the pitch period, random noise generator, voiced/unvoiced switch, gain G, time-varying digital filter controlled by the LPC coefficients)

There are three basic algorithms for computing the LPC coefficients that minimize
the prediction error over a speech frame [3]:

• The autocorrelation method

• The covariance method

• The lattice method


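Among these three methods, the autocorrelation method is the most commonly used
for linear predictive analysis. The autocorrelation coefficients of the speech
samples in a frame of N samples are given by:

$$ r(m) = \sum_{n=0}^{N-1-m} x(n)\,x(n+m), \quad m = 0, 1, \ldots, p \tag{3.5} $$

Then the linear prediction coefficients can be computed using the Levinson-Durbin
recursive algorithm as shown below [2]:

$$ \begin{aligned}
E^{(0)} &= r(0) \\
k_i &= \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p \\
a_i^{(i)} &= k_i \\
a_j^{(i)} &= a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1 \\
E^{(i)} &= (1 - k_i^2)\, E^{(i-1)}
\end{aligned} \tag{3.6} $$

where the final solution for the LPC coefficients is given by $a_m = a_m^{(p)}$, for $1 \le m \le p$.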
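The autocorrelation method can be sketched in a few lines. This is a minimal illustration in the standard textbook form of the Levinson-Durbin recursion, with the sign convention of equation 3.1; the function names are hypothetical, and a silent frame (r(0) = 0) is not handled.

```python
def autocorrelation(x, p):
    """r(m) = sum over n of x(n) * x(n+m), for m = 0..p, over one frame."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(p + 1)]

def levinson_durbin(r, p):
    """Return (a_1..a_p, final prediction error) for x'(n) = sum_k a_k x(n-k)."""
    a = [0.0] * (p + 1)              # a[0] is unused so that a[j] holds a_j
    err = r[0]                       # assumed non-zero
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

# Autocorrelation of an ideal first-order process, r(m) = 0.5 ** m.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, err)
```

For this input the second-order solution should reduce to a first-order predictor, because the sequence is fully described by a single coefficient.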

In speech recognition, the LPC model is used as a characteristic model of the speech
signal. Figures 3.2 and 3.3 compare the original speech power
spectrum with the magnitude spectrum of the LPC model. The LPC model clearly provides
a good approximation to the vocal tract spectral envelope, which carries the
formant frequency and magnitude information used for speech recognition.

Figure 3.2 (a) Original power spectrum (b) Magnitude of LPC model for phoneme /i/ (horizontal axis: frequency in Hz)

Figure 3.3 (a) Original power spectrum (b) Magnitude of LPC model for phoneme /o/

3.1.2 LPC Processor for Speech Recognition

The LPC technique is used to build a front-end processor for a speech recognition system
to process a speech signal, s(n), as shown in Figure 3.4.

Figure 3.4 Block diagram of LPC processor (blocks: preemphasis, frame blocking, windowing, LPC analysis, LPC parameter conversion)

The LPC processor includes the following basic steps:

1. Preemphasis: A low-order system is applied to the speech signal in order to
   spectrally flatten the signal and to make it less susceptible to finite precision effects
   in later signal processing. The most widely used preemphasis filter is a first-order
   system:

   $$ H(z) = 1 - a z^{-1}, \quad 0.85 \le a \le 1 \tag{3.7} $$

   where the parameter a is usually set to 0.95. After applying this filter, the output
   s'(n) and input s(n) have the following relationship in the time domain:

   $$ s'(n) = s(n) - a\,s(n-1) \tag{3.8} $$

2. Frame Blocking: The preemphasized speech signal s'(n) is segmented into short
   frames of N samples each. Adjacent frames overlap by M
   samples to prevent spectral discontinuities after blocking.

3. Windowing: After blocking, a window is applied to each frame to
   minimize the spectral discontinuities at the beginning and the end of the
   frame:

   $$ x_l'(n) = w(n)\,x_l(n), \quad 0 \le n \le N-1 \tag{3.9} $$

   A typical window is the Hamming window:

   $$ w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3.10} $$

4. LPC analysis: For each frame, the LPC coefficients are calculated according to the
   recursion in equation 3.6.

5. LPC parameter conversion: In general, direct quantization and application of LPC
   coefficients is inefficient and unreliable because the LPC coefficients vary too widely,
   and a small quantization error can make the entire filter unstable and
   inaccurate. Because of this weakness, related parameter sets
   are considered instead, such as the reflection coefficients, cepstral coefficients and line
   spectral frequencies (LSFs). In this thesis, line spectral frequencies and cepstral
   coefficients are used as the extracted speech features. These two parameter sets are
   described in the following sections.
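Steps 1 to 3 of the processor can be sketched as follows. This is only an illustration: the function names are hypothetical, the frame advance `shift` is an assumed parameter (it equals the frame length minus the overlap), the first sample is passed through unchanged because it has no predecessor, and the window uses the usual Hamming definition.

```python
import math

def preemphasize(s, a=0.95):
    """s'(n) = s(n) - a * s(n-1); the first sample is kept as it is."""
    return [s[0]] + [s[n] - a * s[n - 1] for n in range(1, len(s))]

def frame_blocks(s, N, shift):
    """Cut s into frames of N samples whose start points are `shift` samples apart."""
    return [s[i:i + N] for i in range(0, len(s) - N + 1, shift)]

def hamming(N):
    """w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frames(frames):
    """Multiply every frame sample by the matching window sample."""
    w = hamming(len(frames[0]))
    return [[wn * xn for wn, xn in zip(w, frame)] for frame in frames]

signal = [1.0] * 10                          # made-up constant signal
emphasized = preemphasize(signal)
frames = frame_blocks(emphasized, N=4, shift=2)
windowed = window_frames(frames)
print(len(frames), windowed[0])
```

Each windowed frame would then be passed to the LPC analysis of step 4.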

3.2 Line Spectrum Frequency

The line spectrum frequency (LSF) was first proposed by Itakura in 1975 [6] as an alternative
parametric representation of the LPC model. In the context of speech coding, LSF has
been shown to have better quantization and interpolation properties than other
representations of the LPC model such as the reflection coefficients and log area ratios. In addition,
a number of researchers have shown that a speech recognition system can benefit from
these advantages of LSF [7, 8, 9, 10].

Algorithm and Properties of LSF

In the LPC analysis of speech, a speech frame is assumed to be modeled by an all-pole filter
H(z) = 1/A(z) of order p, where A(z) is the inverse filter given by:

$$ A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{3.11} $$

The LSF representation is obtained by mapping the p zeros of A(z) onto the unit circle through a
pair of (p+1)-order polynomials P(z) and Q(z):

$$ P(z) = A(z) - z^{-(p+1)} A(z^{-1}) \tag{3.12} $$

$$ Q(z) = A(z) + z^{-(p+1)} A(z^{-1}) \tag{3.13} $$

These polynomials can be shown to have some interesting properties. First, all
the zeros of P(z) and Q(z) lie on the unit circle and are interlaced with each other.
Second, the frequencies tend to cluster near the formant frequencies: when a P(z)
frequency and a Q(z) frequency are close, it is likely that the original A(z) zero was close to the unit
circle, and a formant frequency is likely to lie between the corresponding
frequency pair. Also, the closer the pair, the sharper the formant. Thus, the LSF
can be used as frequency features for speech recognition systems. Figure 3.5
shows the spectrum and LSF of a speech segment, which demonstrates the above two
properties. Figures 3.6 and 3.7 give the LSF plots of two isolated words.

LSFs have attracted much interest because they are good representations of LP systems,
and typically result in quantizers that either give a better representation or use fewer bits
for an equivalent representation than reflection coefficient quantizers.
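The mapping from LPC coefficients to line spectral frequencies can be sketched as below. The sketch assumes the inverse filter A(z) = 1 - a_1 z^-1 - ... - a_p z^-p (the sign convention of equation 3.1) and forms one antisymmetric and one symmetric polynomial from it. Rather than calling a polynomial root finder, it uses the fact that each polynomial reduces to a real function of frequency on the unit circle, then looks for sign changes on a grid and refines them by bisection. The function name, the grid size and the two-pole test filter are all assumptions made for illustration.

```python
import math

def lsf_from_lpc(a, grid=4096):
    """Line spectral frequencies in radians (ascending) for predictor coefficients a."""
    p = len(a)
    A = [1.0] + [-ak for ak in a] + [0.0]    # coefficients of z^0 .. z^-(p+1)
    rev = A[::-1]                            # coefficients of z^-(p+1) * A(1/z)
    P = [x - y for x, y in zip(A, rev)]      # antisymmetric polynomial
    Q = [x + y for x, y in zip(A, rev)]      # symmetric polynomial
    half = (p + 1) / 2.0

    def f_p(w):                              # proportional to P on the unit circle
        return sum(c * math.sin(w * (half - k)) for k, c in enumerate(P))

    def f_q(w):                              # proportional to Q on the unit circle
        return sum(c * math.cos(w * (half - k)) for k, c in enumerate(Q))

    def zeros(f):
        found = []
        ws = [math.pi * i / grid for i in range(1, grid)]   # open interval (0, pi)
        for lo, hi in zip(ws, ws[1:]):
            if f(lo) * f(hi) < 0:            # a sign change brackets one zero
                for _ in range(60):          # bisection refinement
                    mid = 0.5 * (lo + hi)
                    if f(lo) * f(mid) <= 0:
                        hi = mid
                    else:
                        lo = mid
                found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros(f_p) + zeros(f_q))

# Two-pole filter with poles at radius 0.9 and angle pi/4 (a single "formant").
r, theta = 0.9, math.pi / 4
a_demo = [2 * r * math.cos(theta), -r * r]
lsf = lsf_from_lpc(a_demo)
print(lsf)
```

In this example the two frequencies should bracket the pole angle, in line with the clustering property described above. A grid search can miss zeros that lie closer together than the grid spacing, so a production implementation would normally use a dedicated root-finding method.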

Figure 3.5 LSF and LP poles in the z-plane of phoneme /o/ (panels: spectrum versus frequency in Hz; z-plane plot marking LP poles, P(z) zeros and Q(z) zeros)

Figure 3.6 LSF and LP poles in the z-plane of phoneme /i/ (panels: spectrum versus frequency in Hz; z-plane plot marking LP poles, P(z) zeros and Q(z) zeros)

Figure 3.7 LSF of word "no" (axes: frame, order of LSF)

Figure 3.8 LSF of word "call" (axes: frame, order of LSF)

Figure 3.9 LSF of word "hangup"

Figure 3.10 LSF of word "Halima" (axes: frame, order of LSF)

Cepstral Coefficients

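Cepstral coefficients have proved to be another efficient and robust feature set for
speech recognition. Originally, the cepstrum of a speech signal x(n) is defined as the
inverse Fourier transform of the logarithm of the magnitude of the spectrum $X(e^{j\omega})$:

$$ c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right| e^{j\omega n}\, d\omega \tag{3.14} $$

Based on the LPC model, by taking the smoothed LPC magnitude as the magnitude
spectrum, the cepstral coefficients can be derived directly from the LPC coefficient set with
the recursive formula:

$$ c_m = \begin{cases}
a_m + \displaystyle\sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, & 1 \le m \le p \\[2ex]
\displaystyle\sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, & m > p
\end{cases} \tag{3.15} $$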

Properties of Cepstral Coefficients

To make proper use of the cepstral coefficients for speech recognition, it is necessary to
know their properties:

• Most of the information in the speech signal is carried by the lower-numbered cepstral
  coefficients, and the first p coefficients uniquely determine the all-pole filter of the
  LPC model.

• The cepstrum is a decaying sequence. Under regular conditions, the variances of the
  coefficients (except c_0) are essentially inversely proportional to the square of the
  coefficient index (Figures 3.11 to 3.14).

Because the cepstrum has an infinite number of coefficients, only the first 10 to 30 coefficients are
taken to represent the speech feature, based on the above properties of the cepstral
coefficients.
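The recursion from LPC coefficients to a truncated set of cepstral coefficients can be sketched as follows. This is a minimal illustration in the standard textbook form for the predictor convention x'(n) = a_1 x(n-1) + ... + a_p x(n-p); the function name is hypothetical, and the gain-dependent term c_0 is not computed.

```python
def lpc_to_cepstrum(a, L):
    """Return c_1..c_L computed recursively from predictor coefficients a_1..a_p."""
    p = len(a)
    c = [0.0] * (L + 1)                  # c[0] is left unused
    for m in range(1, L + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:               # a_{m-k} is zero beyond order p
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

# One-pole model 1 / (1 - 0.5 z^-1), whose cepstrum is known to be 0.5**m / m.
cep = lpc_to_cepstrum([0.5], 4)
print(cep)
```

The one-pole case is convenient for checking because its cepstrum has a closed form, and it also shows the decaying behavior noted above.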

Weighted Cepstral Coefficients

The variance of the cepstral coefficients is inversely proportional to the square of the
coefficient index, as follows [2]:

$$ \operatorname{Var}(c_m) \propto \frac{1}{m^2} \tag{3.16} $$

The cepstral coefficients can therefore be weighted by the index m to balance the contribution
from each cepstral coefficient. The weighted coefficients become:

$$ \hat{c}_m = m\,c_m, \quad 1 \le m \le L \tag{3.17} $$
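The index weighting of equation 3.17 amounts to one multiplication per coefficient. In the sketch below the function name is hypothetical, and the input values are the first cepstral coefficients of a one-pole model (0.5**m / m), chosen only because the weighted result is easy to verify.

```python
def weight_cepstrum(c):
    """Multiply c_m by its index m, for m = 1..L (c[0] holds c_1)."""
    return [m * cm for m, cm in enumerate(c, start=1)]

weighted = weight_cepstrum([0.5, 0.125])
print(weighted)
```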

A more complicated weighting function can be applied to de-emphasize the
coefficients around m = 1 and m = L:

Figure 3.11 Cepstral coefficients of word "no" (axes: frame, order of cepstrum)

Figure 3.12 Cepstral coefficients of word "call" (axes: frame, order of cepstrum)

Figure 3.13 Cepstral coefficients of word "hangup" (axes: frame, order of cepstral coefficients)

Figure 3.14 Cepstral coefficients of word "halima"

Chapter 4

Fuzzy Neural Network for Speech Recognition

4.1 Fuzzy Logic

4.1.1 Background

Fuzzy sets were introduced by Zadeh [32] in 1965 as a new way to represent and
manipulate data with uncertainty and fuzziness. In the old paradigm, fuzziness was
considered unfavorable because of the expectation of scientific precision and accuracy.
However, a fuzzy interpretation of data is a natural and intuitively plausible way to
formulate and solve many problems in everyday life. For example, expressions
with uncertainty like "hot coffee", "heavy objects", and "warm weather" are fuzzy
interpretations.

Although both fuzzy sets and statistical theory can deal with uncertainty, fuzzy sets
differ from statistical models in some ways. Probabilities represent the
likelihood of a certain event through a distribution among all the events, while a fuzzy set
represents the applicability of an element to the set. In other words, fuzziness
captures the kind of uncertainty found in the meanings of many words used in
human thinking.


Today, we have witnessed rapid growth in the variety of applications of fuzzy logic. The
applications range from consumer products such as washing machines, cameras,
camcorders, and microwave ovens to industrial process control, medical instrumentation,
pattern recognition, decision-support systems, and portfolio selection.
Communication by speech is a natural activity of human beings and contains a lot of
uncertainty during both the speech production and recognition processes. The application
of fuzzy logic to speech recognition in effect simulates the way that people understand
each other every day. The reasons why fuzzy logic can be applied to speech recognition
are described as follows:

• Fuzzy logic is conceptually easy to understand. The mathematical concepts
  behind fuzzy reasoning are very simple. What makes fuzzy logic attractive is the
  "naturalness" of its approach and not its far-reaching complexity.

• Fuzzy logic is flexible and tolerant of imprecise data. Everything is imprecise
  if you look closely enough, but more than that, most things are imprecise even
  under careful inspection.

• Fuzzy logic can model nonlinear functions of arbitrary complexity. A fuzzy
  system can be created to match any set of input-output data. This process is made
  particularly easy by adaptive techniques like ANFIS (Adaptive Neuro-Fuzzy
  Inference Systems).

• Fuzzy logic is based on natural language. The basis for fuzzy logic is the basis for
  human communication. This observation underpins many of the other statements
  about fuzzy logic. Natural language, which is used by ordinary people on a daily
  basis, has been shaped by thousands of years of human history to be convenient
  and efficient. Sentences written in ordinary language represent a triumph of
  efficient communication.

Fuzzy sets are a superset of classical sets. In a fuzzy set, each element is associated with
a real value in the closed unit interval [0, 1] that represents the degree of membership
of the element. In classical crisp sets, however, each element can only be classified as "0"
or "1". When all elements in a set have either complete membership or complete
non-membership, the fuzzy set reduces to a crisp set.

Suppose a fuzzy set A is a subset of the space X that admits partial membership. It is
defined as the set of ordered pairs $A = \{(x, m_A(x))\}$, where $x \in X$ and $0 \le m_A(x) \le 1$. Every fuzzy
set consists of three parts: a horizontal axis x specifying the population of the set; a
vertical membership axis $m_A(x)$, which specifies the membership degree of each element;
and the surface itself, which provides a one-to-one connection between the elements and their
corresponding membership degrees.

For example, let the fuzzy set A represent the concept of "tall" for women over 20. Women
of 5 feet or less have no membership in the set "tall", while women over 6 feet
have total membership. To determine the membership for a specific height, the height is
first found on the horizontal axis; then, following the membership function, the
value of membership is read from the vertical axis. Figure 4.1 illustrates this
example for the fuzzy set "tall", in which heights between 5 feet and 6 feet are proportionally
distributed.
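The membership function for "tall" described above can be written down directly. The sketch assumes a straight-line rise between 5 and 6 feet, which is one reading of "proportionally distributed"; the function name is hypothetical.

```python
def tall_membership(height_ft):
    """Degree of membership in the fuzzy set "tall" for a height in feet."""
    if height_ft <= 5.0:
        return 0.0                   # no membership at 5 feet or below
    if height_ft >= 6.0:
        return 1.0                   # total membership from 6 feet upward
    return height_ft - 5.0           # linear rise across the 1-foot interval

for h in (4.8, 5.5, 6.2):
    print(h, tall_membership(h))
```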

Figure 4.1 Membership function of fuzzy set for the concept "tall" (horizontal axis: x in feet, with breakpoints at 5 and 6)

The idea of fuzzy sets representing a concept can be further expanded by linguistic
variables. A linguistic variable is assigned to a fuzzy region consisting of a set of fuzzy
sets. Figure 4.2 shows an example, expanded from the example in Figure 4.1, for the
concept "height". The variable consists of three fuzzy sets: short, medium and tall. The
horizontal axis specifies the base variable of height, and the degree of membership in
each fuzzy set is given by the vertical axis.

Figure 4.2 A linguistic variable of "height" (horizontal axis: x in feet, with breakpoints at 3, 5 and 6)

Fuzzy systems use fuzzy set theory to deal with fuzzy or non-fuzzy information.
Generally, a fuzzy system consists of a fuzzification subsystem, a fuzzy inference engine,
a fuzzy rule base and a defuzzifier, as shown in Figure 4.3. The fuzzy rule base and the fuzzy
inference engine are the core of the fuzzy rule based system. A fuzzy rule base can be
expressed as a set of fuzzy inference rules of the form "IF x is A THEN y is B" [19],
[20]. The inference engine then implements a fuzzy inference algorithm to determine the
fuzzy output from the inference rules and the inputs.

Note that a given input may simultaneously be a member of more than one set within a
single fuzzy region. The inference engine interacts with the rule base and uses the inputs
to determine which rules are applicable. The outputs are a set of fuzzy sets defined on
the universe of possible outputs, which are defuzzified to generate crisp outputs.

Figure 4.3 A typical fuzzy rule based system (blocks: fuzzification subsystem, inference engine, fuzzy rule base, defuzzification system)

4.2 Fuzzy Neural Networks

4.2.1 Neural Network for Speech Recognition

Traditional methods for speech recognition include Hidden Markov Models (HMM) and
Dynamic Time Warping (DTW). HMM is a stochastic approach that represents the
system with a number of states and calculates the probability of moving from one state to
another depending on the input to the system. DTW uses dynamic programming to adjust
the test pattern so that it conforms more closely to each of a number of templates. Recently, Artificial
Neural Networks (ANNs) have become more and more popular for speech recognition.
Artificial neural networks were first proposed in the 1940s. However, interest in this field
increased in the early 1980s. The advantages of neural networks include massively
parallel, high-speed processing, robustness in complicated environments, learning
ability, fault tolerance and the ability to process incomplete data. All of these make
neural networks a very powerful approach for processing speech information. Neural
network methods are also referred to as parallel distributed processing or connectionist
approaches.

The discipline of neural networks has grown rapidly in recent years. Many researchers
have successfully applied neural networks in fields such as speech
recognition, image pattern recognition, sonar and radar signal processing and adaptive
control systems [5].

The use of neural network models is motivated by the neural systems of living
organisms, which are composed of a large number of neurons and act in a very complicated
way. The basic processing unit of a neural system is called a neuron (Figure 4.4). A
neuron consists of three parts:

• Dendrite: receives impulses from other neurons

• Cell body (soma): receives a series of impulses, which increases the probability
  that an impulse will be triggered by the cell

• Axon: carries the impulses from the cell body to the next neuron

Figure 4.5 gives an example of a typical processing unit for an artificial neural network.
Each neuron has a number of inputs and one output. Similarly to a neuron of a living
organism, the processing unit receives multiple inputs and, after applying a certain
function f, sends out the calculated result as its output, like a natural neuron being
triggered by input impulses. The output of a neuron may be passed to other neurons or
recorded as one of the outputs of the system.
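A processing unit of this kind can be sketched in a few lines. The text leaves the function f unspecified; the sketch assumes the common choice of a weighted sum of the inputs followed by a logistic (sigmoid) function. The names `neuron_output`, `weights` and `bias` are hypothetical.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a logistic activation."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# The two inputs cancel here, so the net input is 0 and the output is 0.5.
out = neuron_output([1.0, 1.0], [0.5, -0.5])
print(out)
```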

Figure 4.4 A biological neuron (labeled parts: dendrite, soma)

Figure 4.5 A model of artificial neuron

4.2.2 Self Organizing Networks

Kohonen describes a speaker adaptive system using an unsupervised learning algorithm
[14]. Kohonen's self-organizing map (SOM) networks are designed to learn relations in
an unsupervised manner. After training, the network is able to group similar inputs
together in the output layer.

Because the SOM is unsupervised, its performance may be improved using a supervised
training method called learning vector quantization (LVQ) [15]. The main difference is
that LVQ is concerned with searching for good category boundaries, while the SOM
focuses on finding reference vectors that are centroids of the input vectors. There are
three types of LVQ: LVQ1, LVQ2 and LVQ3 [14]. In LVQ, the input data must be
labeled and the outputs are divided into different classes. The learning rule is based on
moving the winning weight vector toward the corresponding input vector. Eventually,
after training, the weight vectors become close representations of the input vectors.
These weight vectors form a trained weight matrix called the codebook.

The architecture of an LVQ neural network is shown in Figure 4.6. Since the aim
of the LVQ algorithm is to find the output unit that is closest to the input vector, the
vectors in the codebook are adjusted according to the input vector. If the input vector x and the
winning reference vector w_j belong to the same class, then the weights are moved toward the
input vector; if x and w_j belong to different classes, then the weights are moved away from
the input vector. The algorithm is summarized as:

(1) Initialize the codebook vectors and the learning rate α(0)

(2) For each training input vector x, find the winner w_j so that ||x - w_j|| is minimum

(3) Update w_j as follows:

    if x and w_j belong to the same class, then

        w_j(new) = w_j(old) + α [x - w_j(old)];

    if x and w_j belong to different classes, then

        w_j(new) = w_j(old) - α [x - w_j(old)]

(4) Reduce the learning rate

(5) If the stopping condition is not satisfied, then repeat steps 2 to 4; otherwise stop.

In step 1, the codebook vectors can be initialized either with the first m training
vectors or with vectors of random values.
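The five steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the multiplicative learning-rate decay and the fixed number of epochs used as the stopping condition are assumptions, and the six training points are made up.

```python
import math

def train_lvq1(samples, labels, codebook, codebook_labels,
               alpha=0.3, decay=0.9, epochs=20):
    """LVQ1: pull the winner toward same-class inputs, push it away otherwise."""
    codebook = [list(w) for w in codebook]           # work on a copy
    for _ in range(epochs):                          # step 5: fixed epoch count
        for x, label in zip(samples, labels):
            # step 2: the winner is the nearest codebook vector
            j = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
            # step 3: the sign depends on whether the classes agree
            sign = 1.0 if codebook_labels[j] == label else -1.0
            codebook[j] = [w + sign * alpha * (xi - w)
                           for w, xi in zip(codebook[j], x)]
        alpha *= decay                               # step 4: reduce learning rate
    return codebook

def classify(x, codebook, codebook_labels):
    """Label of the codebook vector nearest to x."""
    j = min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))
    return codebook_labels[j]

samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
# Step 1: take one training vector per class as the initial codebook.
codebook = train_lvq1(samples, labels, [samples[0], samples[3]], [0, 1])
print(classify((0.5, 0.5), codebook, [0, 1]), classify((5.5, 5.2), codebook, [0, 1]))
```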

LVQ2 and LVQ3 are two improved algorithms based on LVQ1. In LVQ1, only the
winning reference vector is updated during training. The direction of the move is determined
by whether the winning vector belongs to the same class as the input vector. In the
improved LVQ algorithms, two vectors (the winner and the runner-up) are updated if
several conditions are satisfied.


Figure 4.6 Learning vector quantization neural network

4.2.3 Fuzzy Neural System

The theories of fuzzy sets and neural networks are two complementary ways of modeling
the human brain. Neural networks model the physical structure of the human neural
network, while fuzzy logic simulates the way humans think. Therefore, the
combination of fuzzy sets and neural networks, which is called fuzzy neural networks,
is becoming very promising for exploring the human brain.

Considering the roles and interaction of fuzzy logic and neural networks, researchers are
studying various issues in combining them for various applications, such as fuzzy
reasoning and pattern recognition. Currently, fuzzy systems are beginning to make
use of neural networks in various aspects of reasoning. A successful example of the
combination is fuzzy learning vector quantization (FLVQ) [21].

FLVQ has a structure similar to that of LVQ. It extends the LVQ algorithm with fuzzy
concepts. In LVQ, the principle of updating is basically "winner takes all". In other
words, the winner obtains a complete membership of 1, while all the others get 0. Even in
LVQ2 and LVQ3, membership is given only to the winner and the runner-up. As a
result, learning updates only one or two reference vectors. In contrast, all
the reference vectors are updated in FLVQ. For a specific training vector, FLVQ assigns
various membership degrees to all the reference vectors, which provides more detailed
learning information.

Assuming c is the number of classes (i.e., the dimension of the second layer), the FLVQ
algorithm is described as follows [21]:

(1) Generate an initial set of reference vectors W = {w_1, w_2, ..., w_c}; select m_i and
    m_f as the initial and final values for the fuzziness parameter m; set the iteration
    number p = 0 and N as the maximum number of iterations;

(2) Set m = m_i + p [(m_f - m_i) / N]; calculate the membership degrees between the ith
    training vector and the jth weight vector:

(3) Update the reference vectors:

    where the learning rate α_i is

(4) If the stopping condition is not satisfied, then repeat steps 2 to 3; otherwise stop.


Fuzzy C-Means (FCM) is a data clustering technique in which each data point belongs to a cluster
to a degree specified by a membership degree. The technique was originally introduced
by Jim Bezdek [21] in 1981 as an improvement on earlier clustering methods [21]. In the
following sections, the algorithm and the application of fuzzy c-means clustering to speech
recognition are described.

Assuming there are n vectors x_j with j = 1, 2, ..., n, fuzzy c-means clustering
partitions the feature vectors x_j into c fuzzy groups and finds a cluster center for each
group so as to minimize an objective function of dissimilarity. All cluster centers are
represented by a prototype matrix V = (v_1, v_2, ..., v_c). To accommodate the introduction
of fuzzy clustering, the membership matrix U = {u_ij} is generated with the value of each
element set between 0 and 1. The membership degrees of each data vector across all
cluster centers are guaranteed to sum to unity because of the normalization
property:

$$ \sum_{i=1}^{c} u_{ij} = 1, \quad j = 1, 2, \ldots, n \tag{4.4} $$

The objective function for FCM is defined as

$$ O(U, v_1, \ldots, v_c) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m}\, d_{ij}^{2} \tag{4.5} $$

where $u_{ij}$ is the element of the membership matrix U, which has a value between 0 and
1, $v_i$ stands for the cluster center (or prototype) of fuzzy group i, $d_{ij} = \| v_i - x_j \|$ is the
Euclidean distance between the ith cluster center and the jth input vector, and m is a weighting
parameter that indicates the degree of fuzziness. The parameter m is usually set to a
real value greater than 1.

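The necessary conditions for minimizing the objective function O in equation 4.5 can be
found by forming a new objective function O' as:

$$ O'(U, v_1, \ldots, v_c, \lambda_1, \ldots, \lambda_n) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m}\, d_{ij}^{2} + \sum_{j=1}^{n} \lambda_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right) \tag{4.6} $$

where $\lambda_j$ (j = 1 to n) are the Lagrange multipliers for the n constraints. By differentiating
O' with respect to each of its input arguments, the necessary conditions for minimizing the
objective function are:

$$ v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{4.7} $$

and

$$ u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}} \tag{4.8} $$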

Based on the above analysis, the FCM algorithm is simply an iterative procedure that satisfies
the above two necessary conditions for minimizing the objective function. Initially, the
cluster centers are placed very inaccurately, and every data point has a membership grade
for each cluster. By iteratively updating the cluster centers and the membership grades
of each data point, the cluster centers are moved toward the right location so as to
minimize the objective function, which represents the distance from any given data point to
a cluster center weighted by that point's membership grade. After these batch procedures, the cluster
centers and the membership matrix are eventually determined. The FCM algorithm can be
summarized as follows [22]:

(1) Select $c$, $m$, and $\varepsilon$ as a tolerance value for the objective function; set a fixed number $N$ as the maximum epoch and the iteration counter $q = 0$.

(2) Initialize the cluster centers $V_0 = \{v_{1,0}, v_{2,0}, \ldots, v_{c,0}\}$ for the first iteration.

(3) Set $q = q + 1$, and update the membership degree, the cluster center and the convergent variance as follows:


(4) If $q < N$ and $E_q > \varepsilon$, then go to step 3.
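The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the thesis implementation: the function name `fcm`, the `init_centers` argument, the random choice of initial centers, and the use of the absolute change of the objective as the convergence measure are all assumptions made for the example.

```python
import math
import random


def fcm(data, c, m=2.0, eps=1e-5, max_epochs=100, init_centers=None, seed=0):
    """Fuzzy c-means sketch on a list of equal-length float vectors.

    Returns (centers, u) where u[i][j] is the membership of point j in cluster i.
    """
    n, dim = len(data), len(data[0])
    # Step 2: initial centers (assumed: caller-supplied or c random data points).
    if init_centers is not None:
        centers = [list(v) for v in init_centers]
    else:
        centers = [list(p) for p in random.Random(seed).sample(data, c)]

    prev_obj = float("inf")
    u = [[0.0] * n for _ in range(c)]
    for _ in range(max_epochs):          # step 4: stop after at most N epochs
        # Step 3a: membership update, u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)).
        u = [[0.0] * n for _ in range(c)]
        for j, x in enumerate(data):
            d = [math.dist(v, x) for v in centers]
            hits = [i for i, dij in enumerate(d) if dij == 0.0]
            if hits:
                # The point sits exactly on one or more centers.
                for i in hits:
                    u[i][j] = 1.0 / len(hits)
            else:
                for i in range(c):
                    u[i][j] = 1.0 / sum(
                        (d[i] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c)
                    )
        # Step 3b: center update, v_i = sum_j u_ij^m x_j / sum_j u_ij^m.
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            total = sum(w)
            centers[i] = [
                sum(w[j] * data[j][k] for j in range(n)) / total for k in range(dim)
            ]
        # Step 3c: objective value and its change since the previous epoch.
        obj = sum(
            u[i][j] ** m * math.dist(centers[i], data[j]) ** 2
            for i in range(c)
            for j in range(n)
        )
        change = abs(prev_obj - obj)
        prev_obj = obj
        if change <= eps:                # step 4: tolerance reached
            break
    return centers, u


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0], [10.0, 10.0]]
    centers, u = fcm(points, c=2, init_centers=[[0.0, 0.0], [10.0, 10.0]])
    print(centers)
```

Because every point keeps a graded membership in every cluster, the centers are pulled mostly by nearby points but never entirely ignore distant ones; the parameter `m` controls how strong that residual pull is.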


4.3.2 An Example

To illustrate how fuzzy c-means clustering works, let's take a simple example with two-dimensional data that belong to two classes. Figure 4.7 (a) plots all 16 two-dimensional data points.

O: Class 1

X : Class 2

Figure 4.7 (a) Two-dimension data before clustering

(b) The cluster centers found by FCM


After applying the fuzzy c-means clustering algorithm, two centers were located, marked with the bigger symbols in Figure 4.7 (b). Each data point has a membership grade for each of the two cluster centers. For instance, the bottom-right point has a membership grade of 0.07 for cluster 1 and 0.93 for cluster 2.

FCM can be applied to various clustering applications. In this thesis, FCM is used for clustering the speech features of a certain number of isolated words during the training process. For training each word, a number of samples from different speakers are chosen to form the template. As we know, many factors cause variability between different samples of even the same word. Therefore, FCM can be used to search for the cluster center for each word.


Chapter 5

Fuzzy Speech Recognizer

5.1 Issues on Implementing a Fuzzy Speech Recognizer

5.1.1 Time Normalization

When implementing a speech recognition system, a speech pattern is usually represented by a spectral sequence on a short-time basis. In most pattern recognition techniques, these spectral sequences are compared in order to decide the matching score. However, if a word is spoken twice by the same speaker in the same environment, it is still very likely that the two samples will have different lengths. The main reason is that different renditions of the same utterance are seldom pronounced at exactly the same speed and in the same manner across the whole utterance. To deal with this speaking rate fluctuation, the speech signals must be normalized in time before patterns can be compared and a decision made.

In the traditional algorithms, one of the waveforms is warped onto the time axis of the other one. Consider two speech patterns X and Y which are represented by $(x_1, x_2, \ldots, x_{T_x})$ and $(y_1, y_2, \ldots, y_{T_y})$, where $x_i$ and $y_i$ stand for the short-time feature vectors and $T_x$, $T_y$ denote the durations of patterns X and Y respectively. In real applications, the durations $T_x$


and $T_y$ usually have different values. The dissimilarity between X and Y should therefore be measured after normalizing the two sequences to the same length.

In this thesis, the linear time normalization method is used for pattern recognition. The dissimilarity between patterns X and Y is defined as:

(5.1)

where $i_x$ and $i_y$ are integers which denote the time indices of X and Y, and $d(x_{i_x}, y_{i_y})$ is a function for dissimilarity measurement between two vectors. Also, $i_x$ and $i_y$ should satisfy the following constraints:

By rounding the length of pattern Y to the same length as pattern X, the summation of the distances for each vector in equation (5.1) is defined as the dissimilarity of X and Y. Depending on the direction of the time normalization, the summation can be taken from $i_y = 1$ to $T_y$ as well. Figure 5.1 illustrates how linear time normalization works for the index conversion.
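Linear time normalization can be illustrated with a short sketch. The index mapping used here, $i_y = \mathrm{round}(i_x T_y / T_x)$ clamped to the range 1 to $T_y$, and the Euclidean frame distance are assumptions chosen for the example; the function name `linear_time_dissimilarity` is hypothetical.

```python
import math


def linear_time_dissimilarity(X, Y):
    """Sum of frame distances after linearly mapping X's time axis onto Y's.

    X and Y are lists of feature vectors (lists of floats) of possibly
    different lengths Tx and Ty.
    """
    Tx, Ty = len(X), len(Y)
    total = 0.0
    for ix in range(1, Tx + 1):
        # Assumed linear index conversion with round-half-up, kept inside 1..Ty.
        iy = int(ix * Ty / Tx + 0.5)
        iy = min(Ty, max(1, iy))
        total += math.dist(X[ix - 1], Y[iy - 1])
    return total


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]   # four frames
    Y = [[0.0], [3.0]]                 # two frames
    print(linear_time_dissimilarity(X, Y))
```

Summing over the indices of Y instead of X only requires swapping the two arguments, which corresponds to the other direction of normalization mentioned above.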


Figure 5.1 Linear time normalization for two sequences with different lengths (index conversion $1, 2, \ldots, T_y \rightarrow 1, 2, \ldots, T_x$)


5.2 Template Training

The template-based method is used to implement the recognition system in this thesis. As shown in Figure 5.2, the feature vectors of an unknown word are fed into the recognition network as the input. By computing the dissimilarity between the input features and each speech template, the network decides the identity of the unknown word with the decision algorithm.

Figure 5.2 Template-based word recognition system (block diagram: feature vectors, Template (1) to Template (c), decision rule, recognized word)


Before applying the pattern comparison technique of Figure 5.2, the templates must first be trained and saved into a group of buffers which act as a memory storing the related "dictionary". Assuming there are c words in the recognition library, c templates need to be trained, where each word template is represented by its time-frequency features.

Because the decision results rely heavily on the templates, it is critical to obtain high-quality templates that represent the word features accurately. As described in Chapter 2, the difficulties of speech recognition are mainly caused by the many kinds of variability in speech signals. Therefore, the ideal templates should be able to model and include the time-frequency information of the speech signal with all the possible fluctuations during training, such as:

Speaker fluctuation

Different speaking rate

Different manner of utterance

Environment noise

However, it is an extremely difficult task to take care of all the fluctuations in a real implementation. Because the most important variations are the speaker fluctuation and the speaking rate fluctuation, the clustering method concentrates on dealing with these two problems. Therefore, the training sets should contain speech signals taken from several speakers with different speaking rates.


The classical methods for template training include hard c-means clustering, the self-organizing map, LVQ, etc. In this thesis, the FCM algorithm is used for clustering the training samples and locating the template centers because it offers the advantage of modeling the speech fluctuations efficiently.

In the experiments, recognition is performed using fuzzy neural techniques for pattern matching. The membership functions are trained and used as the network weights. Two networks are developed, based on measuring the similarity and the dissimilarity respectively. More details of these two methods are introduced in the following sections.

The basic idea of the fuzzy networks is to use the membership function for classifying the word patterns that consist of time-frequency features. To illustrate the theory, let's start from a simple example based on the typical parameters of vowels: the formant frequencies. The formants are defined as the resonant frequencies of the vocal tract, and it is known that the first three formant frequencies largely determine the characteristics of a vowel. Therefore, the membership function should have three peaks, with each peak corresponding to one formant. To generalize the membership function, the peak values of the


membership function are normalized to 1/3 (Figure 5.3). If all formants of an unknown pattern match the peaks of a membership function exactly, then the membership degree should be one. On the other hand, if the unknown pattern doesn't match the membership function or is shifted from the center, it should get a low membership degree. The degree D is denoted by:

$$D = \int \mu(f) \, x(f) \, df$$

where $\mu(f)$ is the membership function and $x(f)$ indicates the location of the formants $f_1$, $f_2$, $f_3$ as:

$$x(f) = \delta(f - f_1) + \delta(f - f_2) + \delta(f - f_3)$$
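Because the formant locations are modelled as impulses, the degree D reduces to the sum of the membership values at the three formant frequencies. The sketch below illustrates this with Gaussian-shaped peaks of height 1/3; the peak width `sigma`, the formant values and the function names are hypothetical choices for the example, not values taken from the experiments.

```python
import math


def make_membership(peak_freqs, sigma=100.0):
    """Membership function with one Gaussian-shaped peak per template formant.

    Each peak has height 1/len(peak_freqs), so a perfect match of all
    formants gives a total degree of one. sigma (Hz) is an assumed width.
    """
    height = 1.0 / len(peak_freqs)

    def mu(f):
        # Take the strongest peak response at frequency f.
        return max(
            height * math.exp(-((f - p) ** 2) / (2.0 * sigma ** 2))
            for p in peak_freqs
        )

    return mu


def membership_degree(mu, formants):
    """Integral of mu(f) times a sum of impulses = sum of mu at the formants."""
    return sum(mu(f) for f in formants)


if __name__ == "__main__":
    template = [730.0, 1090.0, 2440.0]     # illustrative formant values in Hz
    mu = make_membership(template)
    print(membership_degree(mu, template))                  # exact match
    print(membership_degree(mu, [830.0, 1190.0, 2540.0]))   # shifted formants
```

A pattern whose formants coincide with the peaks reaches the maximum degree, while shifting every formant away from its peak lowers the degree smoothly rather than abruptly.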

Figure 5.3 (a) Formant frequencies of a vowel

(b) Membership function of a vowel


Because line spectrum frequencies can provide the formant information, LSFs are used to form the feature vectors and membership functions in the recognition network. Assuming the order of the LPC model is 10, there are 10 line spectrum frequencies $F = \{f_1, f_2, \ldots, f_{10}\}$. The membership function can be constructed with rectangular or Gaussian-shaped functions as shown in Figure 5.4. The Gaussian function in terms of $f_1$ and $f_2$ is given by:

Figure 5.4 (a) Rectangular shape membership function

(b) Gaussian-shaped membership function


The input vector x(f) of a speech frame is also constructed from LSFs in rectangular shape. In the recognition network shown in Figure 5.5, the similarities between the unknown feature X and the template patterns are calculated first; the unknown pattern is then classified into the category which gets the largest similarity score.

Figure 5.5 Fuzzy neural network for isolated word recognizer based on similarity measurement (Network 1); input: sequence of feature vectors (LSF), X


Network 2 (Figure 5.6) has a similar structure to that of network 1, but the two are based on different decision rules. More specifically, network 1 measures the similarity between the unknown and the template patterns and recognizes the word with the maximum similarity, while network 2 measures the dissimilarity or distance and takes the minimum as the winner.

Because network 1 is based on matching the formant frequency information between the unknown and the templates, only the line spectrum frequencies are appropriate as its time-varying feature. In network 2, more coefficients can be adopted as speech characteristics, such as cepstrum, log area ratio, reflection coefficients, etc. When an unknown feature matrix X is applied to the network, the recognition process is summarized as follows:

When an unknown feature matrix X is applied to the network. the recognition process is

sumnlarized as follows:

(1) Normalize the length of the unknown pattern to the same length as each template weight;

(2) Calculate the dissimilarity between the unknown and all templates frame by frame;

(3) Recognize the unknown word as the pattern which gets the smallest dissimilarity.
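The three steps of the dissimilarity-based decision can be sketched as follows. This is only an illustration under stated assumptions: templates are stored as a dictionary from word label to a list of frame vectors, the frame distance is Euclidean, and the names `resample` and `recognize` are hypothetical.

```python
import math


def resample(frames, length):
    """Step 1: linearly re-index a frame sequence to the requested length."""
    T = len(frames)
    return [frames[min(T - 1, int(i * T / length))] for i in range(length)]


def recognize(unknown, templates):
    """Return (label, scores) for the template with the smallest dissimilarity.

    unknown   : list of feature vectors of the unknown word
    templates : dict mapping a word label to its list of template frames
    """
    scores = {}
    for label, template in templates.items():
        x = resample(unknown, len(template))
        # Step 2: frame-by-frame dissimilarity, summed over the template length.
        scores[label] = sum(math.dist(a, b) for a, b in zip(x, template))
    # Step 3: the smallest dissimilarity wins.
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    templates = {
        "yes": [[0.0], [1.0], [2.0]],
        "no": [[5.0], [5.0]],
    }
    unknown = [[0.1], [0.9], [1.1], [2.1]]
    print(recognize(unknown, templates))
```

A similarity-based network would keep the same structure but replace the distance with a similarity score and `min` with `max`.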


Figure 5.6 Fuzzy neural network for isolated word recognizer based on dissimilarity measurement; input: sequence of feature vectors (LSF, Cepstrum)


5.2 Speech Database

The speech database used for the recognition experiment consists of 10 isolated English words. All ten words are recorded at an 8 kHz sampling rate with 16-bit quantization precision in a laboratory environment. Each word is recorded ten times by ten speakers (6 male and 4 female). Consequently, the speech database has a total of 1000 utterances, with 100 utterances for each speaker.

5.3 Simulations and Results

In this thesis, the line spectrum frequencies and LPC cepstral coefficients are used as the speech feature sets. Both speaker-dependent and speaker-independent recognition are tested in the experiment.

Before processing, endpoint detection is performed for each utterance. Then the speech signals are pre-emphasized and blocked into small frames with a 10 ms overlap between adjacent frames. The pre-emphasis factor is set to 0.95. For each frame, a Hamming window with a 30 ms window length is applied, and then the speech feature sets are extracted based on the LSF and LPC cepstrum algorithms.
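The front-end settings above translate into concrete sample counts: at 8 kHz, a 30 ms window is 240 samples and a 10 ms overlap is 80 samples, so consecutive frames start 160 samples apart. The sketch below shows one way to apply them; the first-order pre-emphasis form y[n] = x[n] - 0.95 x[n-1] and the function names are assumptions, and endpoint detection and feature extraction are left out.

```python
import math


def preemphasize(signal, alpha=0.95):
    """First-order pre-emphasis, y[n] = x[n] - alpha * x[n-1] (assumed form).

    The first sample is passed through unchanged; signal must be non-empty.
    """
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_signal(signal, fs=8000, win_ms=30, overlap_ms=10):
    """Split a signal into overlapping Hamming-windowed frames."""
    win = fs * win_ms // 1000                # 240 samples at 8 kHz
    hop = win - fs * overlap_ms // 1000      # 240 - 80 = 160 samples
    hamming = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (win - 1)) for n in range(win)
    ]
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - win + 1, hop):
        frames.append([signal[start + n] * hamming[n] for n in range(win)])
    return frames


if __name__ == "__main__":
    one_second = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
    frames = frame_signal(preemphasize(one_second))
    print(len(frames), len(frames[0]))
```

Each windowed frame would then be passed to the LPC analysis from which the LSFs and cepstral coefficients are derived.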


In speaker-dependent recognition, the training data consist of 600 utterances from 6 speakers, and the remaining 100 utterances from these 6 speakers are used for testing. Table 5.1 shows the speaker-dependent recognition rate with the techniques described above.

Table 5.1 Recognition rate for speaker-dependent recognition

Columns: Network 1 | Network 2 (LSF) | Network 2 (Cepstrum) | Network 2 (Weighted Cepstrum)

John | 18/30 | 21/30 | 28/30 | 29/30 | 30/30 | 19/30 | 27/30 | 18/30

Hari | 23/30 | 25/30 | 29/30 | 29/30 | 29/30 | 29/30 | 29/30 | 29/30


For comparison, Figure 5.7 gives the overall recognition rate with FCM and with crisp means for all the methods. It shows that FCM performs better than taking the crisp mean value as the template.

Figure 5.7 Comparison of the speaker-dependent recognition rate with FCM and crisp means (legend: Crisp, FCM)

Figure 5.8 shows that network 2 yields better recognition rates than network 1, because dissimilarity is used for decision making, which should be more accurate for distinguishing confusable words than a similarity measurement.


Figure 5.8 Speaker dependent recognition rate using LSF

with network 1 and network 2


Speaker-independent recognition uses 600 utterances from 6 speakers (3 female and 3 male) for training. The remaining 400 utterances from the other 4 speakers are used as test data. Table 5.2 shows the recognition accuracy for all the words.

Table 5.2 Recognition rate for speaker-independent recognition

Columns: Network 1 | Network 2 (LSF) | Network 2 (Cepstrum) | Network 2 (Weighted Cepstrum), each with Crisp and FCM sub-columns

39/40 | 35/40 | 38/40

In speaker-independent recognition, it is also shown that FCM yields better results than the crisp mean for template training (Figure 5.9). Figure 5.10 gives the comparison of network 1 and network 2 using LSF as speech features.

Wayne | 33/40

John

34/40 | 38/40 | 38/40 | 37/40

19/40 | 25/40 | [illegible]/40 | 31/40 | 28/40 | 33/40 | 29/40 | 31/40

36/40 | 36/40 | 34/40

Tracy | 34/40 | 35/40 | 32/40 | 40/40 | 36/40 | 35/40 | [illegible] | [illegible]

Halima | 35/40 | 30/40 | 31/40

No | 26/40 | 27/40 | 36/40 | [illegible]/40 | 35/40 | 36/40 | [illegible]/40 | 32/40 | 34/40

Yes | 16/40 | 18/40 | 35/40 | 28/40 | 34/40 | [illegible]/40 | 32/40 | 35/40

Figure 5.9 Speaker-independent recognition rate with FCM and crisp means


Figure 5.10 Speaker-independent recognition rate using LSF with network 1 and network 2


Chapter 6

Conclusions and Suggestion for Future Work

6.1 Conclusions

This thesis explored the issues involved in designing an isolated word recognition system, especially the application of fuzzy neural algorithms to speech pattern recognition. The LPC speech analysis method is described, and different representation parameters are compared. It is shown that the cepstral coefficients and line spectrum frequencies play important roles as speech features in recent research and applications of speech processing.

Naturally, fuzzy logic is similar to the way humans think. Fuzzy sets are successfully applied to speech recognition due to their ability to deal with uncertainty. However, there is always a balance between "fuzzy" and "too fuzzy". The idea of "fuzzy" is good for modeling the uncertainty and variance of speech signals. But if it is "too fuzzy", it is highly probable that it will cause a lot of confusion between patterns which are similar to each other but actually different. For instance, the words "bad" and "bed" are very similar to each other, and fuzzy logic may not be discriminating enough for this case. Therefore, neural networks are combined with fuzzy logic to overcome this problem. As we know, neural networks simulate the "hardware"


of the human brain (human nerves) and are known as a technique with great advantages in fault tolerance and robustness.

In this thesis, two fuzzy networks have been proposed and applied to isolated word recognition. The membership functions are constructed by means of superposition of the speech features for each enrolled word, and the templates are learned based on the membership functions. The templates include the fluctuation information of frequency and time. By using these templates, the recognition system is able to recognize spoken words independent of the speaker.

The analysis and results of the recognition technique reveal that the use of fuzzy logic and neural networks can consistently improve the performance of the system. From the results, FCM has been shown to be a better template-training algorithm than hard clustering.


6.2 Suggestions for Future Work

The results in the thesis have shown the potential of fuzzy theory and neural networks for speech recognition. Based on the proposed methods, it is still possible to improve the system and get a higher recognition rate.

An important issue in the fuzzy neural network area is to find efficient combinations of ANNs inspired by the structure of the human cortex, because it forms the most intelligent speech recognizer so far. Also, it is certainly a promising direction to simulate the natural model of speech perception and production, for both the feature extraction and the pattern recognition parts.

Since some other techniques have already been successfully used for speech recognition, more efficient and integrated systems could be constructed by combining fuzzy neural techniques with other formalisms, such as HMMs and DTW.


References

[1] Jean-Claude Junqua, Jean-Paul Haton, Robustness in Automatic Speech Recognition: Fundamentals and Applications, Kluwer Academic Publishers, 1996.

[2] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, 1993.

[3] Joseph P. Campbell, Jr., "Speaker recognition: A tutorial", Proceedings of the IEEE, Vol. 85, No. 9, September 1997.

[4] Lawrence Rabiner, Ronald W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Inc., Englewood Cliffs, NJ, 1978.

[5] C. H. Chen, Fuzzy Logic and Neural Network Handbook, McGraw-Hill Inc., 1996.

[6] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals", J. Acoust. Soc. Amer., Vol. 57, pp. 535(a), 1975.

[7] K. K. Paliwal, "A study of line spectrum pair frequencies for speech recognition", ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 1988, Vol. 1, pp. 485-488.

[8] Samir Saoudi, Jean Marc Boucher, "A new efficient algorithm to compute the LSP parameters for speech coding", Signal Processing, pp. 201-212, 1992.

[9] Seung Ho Choi, Hong Kook Kim, Hwang Soo Lee and R. M. Gray, "Speech recognition method using quantised LSP parameters in CELP-type coders", Electronics Letters, 22 October 1997.


[10] K. K. Paliwal, "A study of line spectrum pair frequencies for vowel recognition", Speech Communication, 1999, pp. 27-33.

[11] Chi-Shi Liu, Chao-Shih Huang, Min-Tau Lin and Hsiao-Chuan Wang, "Automatic speaker recognition based upon various distances of LSP frequencies", IEEE International Carnahan Conference on Security Technology, Oct. 1991, pp. 104-109.

[12] Frank K. Soong, Biing-Hwang Juang, "Optimal quantization of LSP parameters", IEEE Transactions on Speech and Audio Processing, Vol. 1, No. 1, January 1993.

[13] N. Naja, J. M. Boucher and S. Saoudi, "Fast LSP vector quantization algorithms comparison", MELECON, Proceedings of the 7th Mediterranean Electrotechnical Conference, Part 3, Apr. 1994, pp. 1127-1130.

[14] Teuvo Kohonen, "The self-organizing map", Proceedings of the IEEE, Vol. 78, No. 9, September 1990, pp. 1464-1477.

[15] Erik McDermott and Shigeru Katagiri, "LVQ-based shift-tolerant phoneme recognition", IEEE Transactions on Signal Processing, Vol. 39, No. 6, June 1991, pp. 1398-1410.

[16] Ravi P. Ramachandran, Mihailo S. Zilovic, and Richard J. Mammone, "A comparative study of robust linear predictive analysis methods with applications to speaker identification", IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 2, March 1995, pp. 117-125.

[17] Akio Amano et al., "On the use of neural networks and fuzzy logic in speech recognition", IJCNN Int. Jt. Conference on Neural Networks, Jun 18-22, 1989, pp. 301-305.


[18] Christopher Hale, CamQuynh Nguyen, "Voice command recognition using fuzzy logic", Wescon Conference Record, Proceedings of the 1995 Wescon Conference, Nov 7-9 1995, San Francisco, CA, USA, pp. 608-613.

[19] Lynn Yaling Cai, Hon Keung Kwan, "Fuzzy classifications using fuzzy inference networks", IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, Vol. 28, No. 3, June 1998, pp. 334-347.

[20] Hon Keung Kwan, Yaling Cai, Bin Zhang, "Membership function learning in fuzzy classification", Int. J. Electronics, 1993, Vol. 74, No. 6, pp. 845-850.

[21] Nicolaos B. Karayiannis, James C. Bezdek, "An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering", IEEE Transactions on Fuzzy Systems, Vol. 5, No. 4, November 1997, pp. 622-628.

[22] Jyh-Shing Roger Jang and Jiuann-Jyn Chen, "Neuro-fuzzy and soft computing for speaker recognition", IEEE International Conference on Fuzzy Systems, Proceedings of the 1997 6th IEEE International Conference on Fuzzy Systems, FUZZ-IEEE'97, Part 2 (of 3), July 1997, v 2, Barcelona, Spain, pp. 663-668.

[23] Jun-ichiroh Fujimoto, Tomofumi Nakatani and Masahide Yoneyama, "Speaker-independent word recognition using fuzzy pattern matching", Fuzzy Sets and Systems 32, 1989, pp. 181-191.

[24] Liusheng Liu, Zhijian Li and Bingxue Shi, "Speech recognition based in fuzzy vector quantization and fuzzy logic", IEEE International Conference on Neural Networks, v 5, 1995, Perth, Aust, IEEE, Piscataway, NJ, USA, pp. 2858-2862.

[25] Liusheng Liu, Zhijian Li and Bingxue Shi, "Segment matrix vector quantization and fuzzy logic for isolated-word speech recognition", Proceedings of The International Symposium on Multiple-Valued Logic, 1995, pp. 152-156.


[26] James W. Pitton, Kuansan Wang, and Biing-Hwang Juang, "Time-frequency analysis and auditory modeling for automatic recognition of speech", Proceedings of the IEEE, Vol. 84, No. 9, September 1996, pp. 1199-1215.

[27] K. Davis, R. Biddulph, and S. Balashek, "Automatic recognition of spoken digits", J. Acoust. Soc. Am., 1952, 23: pp. 3-50.

[28] J. Suzuki and K. Nakata, "Recognition of Japanese vowels - Preliminary to the recognition of speech", J. Radio Res. Lab., 1961, pp. 193-212.

[29] P. Denes, "The design and operation of the mechanical speech recognizer", Journal of the British Institute of Radio Engineers, 1959, pp. 311-229.

[30] T. Vintsyuk, "Speech discrimination by dynamic programming", Kibernetika (Cybernetics), pp. 81-88.

[31] P. Ladefoged, "The phonetic basis for computer speech processing", Computer Speech Processing, 1985, pp. 3-27.

[32] L. A. Zadeh, "Fuzzy sets", Inform. Control, 1965, pp. 338-352.


Vita Auctoris

Name: Hui Ping

Place of Birth: Jiangsu. China

Year of Birth: 1973

Education: B. Eng.

Department of Electronic Engineering

Nanjing University of Aeronautics and Astronautics

Nanjing, China

1990 - 1994

M. A. Sc

Electrical and Computer Engineering

University of Windsor

Windsor, Ontario, Canada

1997 - 1999
