

Applying Various DSP-Related Techniques for Robust Recognition of Adult and Child Speakers

R. Atachiants, C. Bendermacher, J. Claessen, E. Lesser, S. Karami

January 21, 2009

Faculty of Humanities and Sciences, Maastricht University

Abstract

This paper approaches speaker recognition in a new way. A speaker recognition system has been realized that works on adult and child speakers, both male and female. Furthermore, the system employs text-dependent and text-independent algorithms, which makes robust speaker recognition possible in many applications. Single-speaker classification is achieved by age/sex pre-classification and is implemented using classic text-dependent techniques, as well as a novel technology for text-independent recognition. This new research uses Evolutionary Stable Strategies to model human speech and allows speaker recognition by analyzing just one vowel.

1 Introduction

In the past few years privacy has become of greater importance to people all over the world. A major factor in this is the rise of the Internet: private elements of a person's life have become easier to access and easier to copy. Since money is essential to daily life, it is frequently stolen over the internet; copying cards and the information belonging to them is easier than ever and still occurs very often. A proper system for voice recognition in combination with a password might increase the security of our money.

The importance of handling these problems lies in modeling human speech. If an algorithm recognizes speech on its own, no person is needed to check the sounds for human speech. Several questions arise, namely: 'How can an algorithm know that there is speech?', 'How does an algorithm estimate the noise?', 'How does an algorithm achieve the classification of speech, and when should it do so?', and 'How can an algorithm notice that there are multiple speakers?'. These questions lead to an overall problem definition, namely: 'How to identify one or more speakers?'.

To handle this problem the paper starts with detecting speech. This is the subject of the first section, section 2. Here speech is detected by using an end-point detection algorithm, which recognizes speech, and noise reduction that uses three ways of filtering, namely (a) Finite Impulse Response (FIR), (b) Wavelets, and (c) Spectral Subtraction. Combining these three techniques, the program retrieves a signal that satisfies the properties required by the classification algorithms: classifying the speaker alone or in a conversation. First the speaker has to be recognized when he is talking alone; this is discussed in section 3. Speaker recognition is done using (a) discrete word selection, (b) Mel-Frequency Cepstral Coefficients and Vector Quantization, (c) Age/Sex Classification, (d) the Voice Model and (e) the contradiction checks that lead to the conclusion whether a known person is speaking or not. In the last part multiple speakers are identified and classified using the methods Framed Multi-Speaker Classification and Harmonic Matching Classifier. After these sections there is a short discussion about the subjects, and then the conclusions are presented.

2 Speech detection

The very first step in identifying the speaker is detecting speech. This means that the part of the signal that contains speech has to be separated from the noise part.

There are two algorithms that can be used to detect speech. The first one is endpoint detection, which will be described in subsection 2.2, and the second algorithm is noise reduction. More information about the second algorithm can be found in subsection 2.3.

2.1 Architectural overview

If a signal contains little noise, the end point detection algorithm can effectively determine whether the signal contains speech. However, if there is much noise, noise reduction has to be applied to the signal first. To estimate the noise level of a signal the Spectral Subtraction algorithm is used. This estimation is then compared to the whole signal, resulting in the signal-to-noise ratio (SNR). If needed, one of three noise reduction techniques (FIR, Spectral Subtraction and Wavelets) is selected, based on the weighted SNR of each denoised signal: FIR is preferred over Wavelets and Spectral Subtraction, while the use of Wavelets is preferred over Spectral Subtraction. When end point detection is used on the selected denoised signal, it is safe to say that speech can be detected accurately. See figure 1 for a schematic overview and the selection sketch below.


Figure 1: Architectural overview of speech detection
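As an illustration of this selection step, the sketch below picks the denoised signal with the best weighted SNR. The weights are invented placeholders; the paper only states the preference order (FIR over Wavelets over Spectral Subtraction), not the actual weighting.

```python
# Preference weights are illustrative assumptions only; the paper gives the
# preference order (FIR > Wavelets > Spectral Subtraction) but not the weights.
WEIGHTS = {"FIR": 1.2, "Wavelets": 1.1, "SpectralSubtraction": 1.0}

def select_denoised(snrs):
    """snrs: dict mapping technique name -> SNR of its denoised signal."""
    return max(snrs, key=lambda name: WEIGHTS[name] * snrs[name])
```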

2.2 Endpoint detection

The endpoint detection algorithm filters the noise from the beginning and end of the signal and detects the beginning and end of speech. If these two points are the same, the signal contains no speech and consists only of noise.

It is assumed that the first 100 ms of the signal contain no speech. From this part of the signal the energy and the zero crossing rate (ZCR) of the noise can be calculated. Next, the lower threshold (ITL) and upper threshold (ITU) can be calculated as follows:

I1 = 0.03 · (maxEnergy − avgEnergy) + avgEnergy

I2 = 4 · avgEnergy

ITL = min(I1, I2)

ITU = 5 · ITL

To determine the starting point (N1) and the end point (N2) of the speech, the ITL and ITU are considered. When the energy of the signal crosses the ITL for the first time, this point is saved. If the energy then goes below the ITL again, it was a false alarm. However, when it also crosses the ITU, speech was found and the saved point is considered N1, see figure 2(a). For N2 a similar procedure is followed, just the other way around.

Finally, N1 and N2 can be determined more precisely by looking at the ZCR. To be more exact, a closer look is taken at the ZCRs of the 250 ms before N1. If a high ZCR is found in that interval, that is an indication that there is speech and N1 needs to be reconsidered, see figure 2(b). Similarly, N2 can be determined more accurately.

Figure 2: (a) Determining N1 and N2. (b) Redetermining N1 and N2.
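A minimal sketch of the N1 search described above, assuming the per-frame energy curve and the two thresholds have already been computed; N2 is found the same way on the reversed signal. This is an illustration, not the authors' Matlab code.

```python
def find_start_point(energy, itl, itu):
    """Locate N1: the first ITL crossing whose energy later also crosses ITU."""
    candidate = None
    for i, e in enumerate(energy):
        if candidate is None and e > itl:
            candidate = i            # possible start of speech
        elif candidate is not None and e < itl:
            candidate = None         # false alarm, reset
        elif candidate is not None and e > itu:
            return candidate         # ITU crossed: speech confirmed, N1 found
    return None                      # no speech detected
```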

2.3 Noise reduction

Noise reduction is the other algorithm used for speech detection. It can be done with the help of FIR filtering, spectral subtraction and wavelets; more information about these topics can be found in subsections 2.3.1, 2.3.2 and 2.3.3 respectively.

2.3.1 FIR filtering

In signal processing two types of filtering are used. As the name suggests, the impulse response of the FIR filter is finite. The other type's impulse response is normally not finite because of its feedback structure. The FIR filter is extensively used in this project in order to remove the white Gaussian noise (WGN) from the signals. The frequencies of the WGN lie mainly in the low frequency band of the spectrum. A high pass first order (FIR) filter has been applied to strengthen the amplitude of the high frequencies. This is done by decreasing the amplitude of the low frequencies by up to 20 dB, so the speech becomes stronger and the noise is reduced.

For filtering, the transfer function in the z-domain is used:

H(z) = (z − α) / z    (1)

The standard form of a transfer function is H(z) = Y(z)/X(z), so it is clear that Y(z) = z − α and X(z) = z. The working of this formula is shown in figure 3.

A transfer function with all poles inside the unit circle of the z-plane is always stable. Therefore α lies between −1 and 1. To get a decrease as large as possible with a first-order formula, α is set to 0.95.


Figure 3: FIR filter

In figure 4 the poles are shown, following from the fact that if X(z) = 0, then z = 0 and therefore Y(z) = −0.95. So the FIR filter is stable because of the position of α in the z-plane.

Figure 4: z-plane

To determine the frequency response of a discrete-time (FIR) filter, the transfer function is evaluated at z = e^(jωT). From all this, the transfer function used in this paper for FIR filtering looks as in formula 2:

H(e^(jωT)) = 1 − 0.95 · e^(−jωT)    (2)

This is one way of filtering WGN; a sketch is given below. In the part about wavelets, subsection 2.3.3, another approach to filter WGN is explained.
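A minimal sketch of this first-order high-pass filter, here written with SciPy's lfilter; the coefficient vectors follow directly from H(z) = 1 − 0.95·z⁻¹.

```python
from scipy.signal import lfilter

def fir_highpass(signal, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1.
    # Attenuates the low band, where the paper locates the WGN energy.
    return lfilter([1.0, -alpha], [1.0], signal)
```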

2.3.2 Spectral subtraction

Spectral subtraction is an advanced form of noise reduction. It is used for signals that contain non-Gaussian (artificial) noise. After framing and Hamming windowing (DSP), endpoint detection is used on every frame to separate the noise frames from the frames with speech. From the noise frames a noise estimation of the signal is made. After applying the Discrete Fourier Transform (DFT) to the windowed signal, the noise estimation is simply subtracted from the signal to obtain the denoised frames. Moreover, the noise estimation is used to calculate the SNR later on (see section 2.1).

Finally, the inverse DFT is taken and the frames can be reassembled to get the denoised signal. A schematic overview of the whole process is given in figure 5.

Figure 5: Spectral subtraction
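A sketch of the per-frame subtraction step, assuming the frames are already Hamming-windowed and a noise magnitude spectrum has been estimated from the noise-only frames; the noisy phase is reused, and negative magnitudes are clipped to zero.

```python
import numpy as np

def spectral_subtract(frames, noise_mag):
    """Subtract a noise magnitude estimate from each windowed frame.

    frames: 2-D array (n_frames x frame_len) of Hamming-windowed samples.
    noise_mag: average magnitude spectrum of the noise-only frames,
               with the same length as np.fft.rfft of one frame.
    """
    out = []
    for frame in frames:
        spec = np.fft.rfft(frame)
        mag = np.abs(spec) - noise_mag        # subtract the noise estimate
        mag = np.maximum(mag, 0.0)            # no negative magnitudes
        # Rebuild the frame with the denoised magnitude and the noisy phase.
        out.append(np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                n=len(frame)))
    return np.array(out)
```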

2.3.3 Wavelets

As already suggested in the section about FIR filtering, wavelets are used to filter white Gaussian noise from the signal. This way of filtering starts with the original signal and the mother wavelet. The mother wavelet can be one of the many mother wavelets that are available. In this paper one of the available Daubechies wavelets is used, the Daubechies 3, which is often used in Matlab. This mother wavelet is recommended by Matlab, the program used to create the wavelet filter.

The next step in filtering is the decomposition of the original signal. By fitting the mother wavelet to the signal at the smallest scale, the filter produces what is called the first wavelet detail and a remainder, which is called the first approximation. Then the timescale of the mother wavelet is doubled and again fit to the first approximation. This results in a second wavelet detail and a second remainder, the second approximation. Doubling the timescale of the mother wavelet is also known as dilation. Dilation and splitting the remainders into a new detail and approximation part (figure 6) is continued until the mother wavelet has been dilated to such an extent that it covers the entire range of the signal. [9]

Figure 6: Signal decomposition

There are two ways of thresholding: soft- and hard-thresholding. With hard thresholding, the part of the signal below a certain threshold is set to zero. Soft thresholding is more complicated: it subtracts the value of the threshold from the values of the signal that are above that threshold, while the values below the threshold are again set to zero. [10] In Matlab this is integrated in the functions ddencmp and wdencmp. The function ddencmp determines a threshold and the way of thresholding from the sound sample; the function wdencmp then uses this threshold value and soft/hard-thresholding to create a de-noised signal. So using these two functions, Matlab generates a denoised signal by itself.
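For comparison, roughly the same decompose-threshold-reconstruct pipeline can be sketched outside Matlab with PyWavelets; the universal threshold used here is a common heuristic and not necessarily the value ddencmp would pick.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db3", mode="soft"):
    # Decompose into an approximation and several detail levels.
    coeffs = pywt.wavedec(signal, wavelet)
    # Estimate the noise level from the finest detail coefficients
    # (median absolute deviation) and form the universal threshold.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    # Threshold every detail level; keep the approximation intact.
    coeffs[1:] = [pywt.threshold(c, thr, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```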

3 Speaker classification

The speaker classification algorithms described in this paper work best on discrete words or small signals. First, the Discrete Word Selection (DWS) algorithm is applied to cut out the part of the signal containing the most vowel components. Next, the Age/Sex Classification (ASC) algorithm tries to classify the signal in order to reduce computation by limiting the database samples that have to be processed. Then text-dependent (T-D) speaker detection techniques such as Dynamic Time Warping (DTW) and Vector Quantization (VQ), and a text-independent (T-I) technique, the Voice Model algorithm, are processed. The contradictions are checked and, if detected, the ASC bias is discarded and the T-D and T-I algorithms are computed again. If a speaker is detected, the system proceeds to the classification of multiple speakers, using two different techniques in parallel: Framed Multi-Speaker Classification and the Harmonic Matching Classifier. The results of both are combined to achieve the best result. See figure 7 for a schematic overview.


Figure 7: Architectural overview speaker recognition

3.1 Discrete word selection

Discrete word selection is used for two reasons. First of all, the techniques used in the system are mainly valid for discrete speech processing and not so much for the processing of continuous speech. This means that the best results will be achieved when working with only one isolated group of words. Working with discrete speech will also optimize the performance of the system. The second reason for using discrete word selection is as a help for the 'Age/Sex Classification' (ASC) block. The ASC block uses physical properties of the human vocal tract to classify speech.

The algorithm for discrete word selection is based on the V/C/P (Vowel/Consonant/Pause) classification algorithm. This algorithm is text-independent and composed of four blocks, see figure 8.

In the first block the main features are extracted; in the second block the signal is framed and classified for the first time. Next, the noise level is estimated and the frames are classified again with an updated noise level parameter.

In order to distinguish a consonant, the V/C/P algorithm proposes the use of the zero crossing rate feature and a threshold (ZCR_dyna). In the case where the ZCR is larger than the threshold, the frame can be classified as a consonant. If the frame can not be classified this way, the energy of that frame is checked: if the energy is smaller than the overall noise level, the frame is classified as a pause; if the energy is larger, the frame is classified as a vowel. The results of V/C/P classification on an example speech clip are shown in figure 9.

Figure 8: V/C/P classification algorithm blocks.

Figure 9: V/C/P classification of an example speech clip (o: consonant, +: pause, *: vowel). Image from Microsoft Research Asia. [12]

The complete discrete word selection algorithm is implemented as follows:

1. Audio input is segmented into non-overlapping frames of 10 ms, from which energy and ZCR features are extracted.

2. The energy curve is smoothed using an FIR filter.

3. The Mean_Energy and Std_Energy of the energy curve are calculated to estimate the background noise energy level and the ZCR threshold (ZCR_dyna) as:

NoiseLevel = Mean_Energy − 0.75 · Std_Energy

ZCR_dyna = Mean_ZCR + 0.5 · Std_ZCR


4. Frames are coarsely classified as V/C/P by using the following rules, where FrameType denotes the type of each frame:

If ZCR > ZCR_dyna then FrameType = Consonant
Elseif Energy < NoiseLevel then FrameType = Pause
Else FrameType = Vowel

5. Update the NoiseLevel as the weighted average energy of the frames at each vowel boundary and the background segments.

6. Re-classify the frames using the rules of step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. A vowel will be split at its energy valley if its duration is too long.

7. After classification has terminated, select the word with the highest number of V-frames. (A sketch of the coarse classification in steps 1-4 follows this list.)
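A compact sketch of the coarse classification (steps 1-4), assuming a mono signal array; taking the frame energy as the sum of squares is an assumption here, since the paper does not state its exact energy definition.

```python
import numpy as np

def vcp_classify(signal, fs, frame_ms=10):
    """Coarse V/C/P labeling of non-overlapping 10 ms frames."""
    n = int(fs * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    # Thresholds from step 3 above.
    noise_level = energy.mean() - 0.75 * energy.std()
    zcr_dyna = zcr.mean() + 0.5 * zcr.std()
    labels = []
    for e, z in zip(energy, zcr):
        if z > zcr_dyna:
            labels.append("C")        # consonant
        elif e < noise_level:
            labels.append("P")        # pause
        else:
            labels.append("V")        # vowel
    return labels
```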

3.2 MFCC and vector quantization

Mel-frequency cepstral coefficients (MFCCs) and vector quantization (VQ) are used to construct a set of highly representative feature vectors from a speech fragment. These vectors are used to achieve speaker classification.

Frequencies below 1 kHz contain the most relevant information for speech; hence human hearing emphasizes these frequencies. To imitate this, frequencies can be mapped to the Mel frequency scale (Mel scale). The Mel scale is linear up to 1 kHz, while for higher frequencies it is logarithmic, thus emphasizing lower frequencies. After converting to the Mel scale, the MFCCs can be found using the Discrete Cosine Transform. In this paper 13 MFCCs are obtained from each frame of the speech signal.
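For illustration, the 13 coefficients per frame can be obtained with an off-the-shelf MFCC routine; using librosa is an assumption here, as the paper's own implementation is in Matlab.

```python
import librosa

def extract_mfccs(path, n_mfcc=13):
    # Load the clip at its native sample rate and compute 13 MFCCs per frame.
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x 13
```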

Since a speech fragment is generally divided into many frames, this results in a large set of data. Therefore VQ, implemented as proposed in [7], is used to compress these data points to a small set of feature vectors (codevectors). In the case of speech fragments the set of codevectors is a representation of the speaker; such a representation is called a codebook. Here VQ is used to compress each set of MFCCs to 4 points. In the training phase a codebook is generated for every known speaker. These codebooks are saved in the database.

When identifying a speaker from a new speech fragment, VQ compares the MFCCs of the fragment to each codebook in the database, as can be seen in figure 10. The distance between an MFCC and the closest codevector is called its distortion. The codebook with the smallest total distortion over all MFCCs identifies the speaker.
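A minimal sketch of this matching rule: the total distortion of a fragment's MFCCs against each stored codebook, where the smallest total wins. The array shapes (frames x 13 MFCCs, 4 codevectors per codebook) follow the description above.

```python
import numpy as np

def total_distortion(mfccs, codebook):
    # Distance from every MFCC vector to its nearest codevector, summed.
    d = np.linalg.norm(mfccs[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify_speaker(mfccs, codebooks):
    # codebooks: dict mapping speaker name -> (4 x 13) codevector array.
    return min(codebooks, key=lambda s: total_distortion(mfccs, codebooks[s]))
```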

3.3 Dynamic time warping

Dynamic Time Warping (DTW) is a generic algorithm used to compare two signals.

Figure 10: Matching MFCCs to a codebook

In order to find the similarity between two such sequences, or as a preprocessing step before averaging them, we must "warp" the time axis of one (or both) sequences to achieve a better alignment, see figure 11.

Figure 11: Two sequences of data that have a similar overall shape but are not aligned in time. [11]

In order to compare two speech signals, the system applies DTW to the 13 Mel-frequency cepstral coefficients (MFCCs) of the input and compares them to the database samples.

To find a warping path between two sequences of MFCC data, a few steps are required:

1. Calculate the distance cost matrix (in this paper the Euclidean distance was used to compute the cost).

2. Compute the path, starting from a corner of the cost matrix and processing adjacent cells. This path can be found very efficiently using dynamic programming [11]:


W = w1, w2, ..., wk, ..., wK,  with max(m, n) ≤ K < m + n − 1

3. Select only the path which minimizes the warping cost:

DTW(Q, C) = min( √( Σ_{k=1}^{K} wk ) / K )

4. Repeat the path calculation for each MFCC feature and compute a difference from each path. (A minimal sketch of the dynamic program in step 2 follows this list.)
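A minimal sketch of step 2's dynamic program for two 1-D sequences (a single MFCC dimension); per step 4, the paper repeats this for every MFCC feature.

```python
import numpy as np

def dtw_cost(q, c):
    """Minimal DTW between two 1-D sequences with absolute-difference cost."""
    m, n = len(q), len(c)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # Extend the cheapest of the three adjacent predecessor cells.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```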

3.4 Age/sex classification

The ASC block is based on physical properties of speech and the vocal tract and pre-classifies the input into one of the following 4 categories: male adult, female adult, male child, female child. This pre-classification helps the classification algorithms of the system to classify the speaker more accurately.

The total length L of the vocal tract can be calculated from the first harmonic of a sound exiting a closed tube:

L = c / (4F)    (3)

where c is the speed of sound and F the fundamental frequency. Once the length of the vocal tract has been calculated, it is very straightforward to classify the length according to age and sex. The general assumptions are that an adult has a longer vocal tract than a child and that a male has a longer vocal tract than a female [1]. For easier implementation of the classifier, it was chosen to work with the vocal tract length instead of directly with the fundamental frequencies.

Based on [2] the ASC algorithm has been developed and implemented, which uses LPC to extract the first formant from the signal. Classification is then based on heuristic methods, where the length intervals for adult female and child male are divided into sub-bands, making it possible to distinguish between these categories.
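A sketch of the length computation of equation (3) together with a toy two-threshold classifier; the threshold lengths below are invented placeholders, since the paper's actual sub-bands are heuristic and not listed.

```python
C = 350.0  # assumed speed of sound in the warm, humid vocal tract, in m/s

def vocal_tract_length(f_hz):
    # Quarter-wavelength resonance of a closed tube: L = c / (4F).
    return C / (4.0 * f_hz)

def classify_age_sex(f_hz, adult_len=0.15, male_len=0.165):
    # The two threshold lengths (in metres) are illustrative only; the
    # paper derives its length sub-bands heuristically from [1] and [2].
    L = vocal_tract_length(f_hz)
    age = "adult" if L >= adult_len else "child"
    sex = "male" if L >= male_len else "female"
    return age, sex
```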

Implementation-wise it is important to note that ASC is only carried out if the number of samples in the system's database is larger than the number of speaker classes. This is done to prevent the pre-classification block (ASC) from acting as a classification algorithm itself and hence disabling the classification blocks.

3.5 Voice model

Human speech is produced by expelling air from the lungs into the vocal tract, where the 'air signal' is 'modeled' into the desired utterance by, amongst others, the glottis, the tongue and the lips. Thus, a speech signal can be seen as a signal evolving over time which is formed by certain invasions. In this research, it is proposed to use Evolutionary Stable Strategies (ESS), originating from the field of game theory, to model human speech and to accurately recognize speakers on a text-independent basis.

In Appendix A a detailed overview is given of how this theory is developed. Here the general implementation of the algorithm will be discussed.

A solution is attempted for the following two research problems:

1. Find an algorithm that, given an utterance of human speech, determines a fitness matrix, appropriate strategies and invasions so that the speech utterance is correctly defined by the resulting evolution of the population of the game.

2. Employ the result of goal 1 to achieve speaker recognition, text-independent if possible.

Since the filtering effects of the separate speech organs can hardly be distinguished, a lossless concatenated tube model (n-tube model [3][4]) is assumed for modeling the vocal tract instead. The n-tube model also allows sequential modeling of the speech utterance and thus solves the problem of parallel effects that occur in the vocal tract.

In essence, the algorithm we need will proceed as follows:

1. Determine the number of tubes in the model and their respective equations.

2. Start filling out the fitness matrix:

(a) Initially it contains the value 2 in position (1,1).

(b) Determine the equation of the signal after applying the first filter.

(c) Determine the elements of the next column of the fitness matrix.

(d) Determine the correct invasion parameters so that the current signal will become the desired signal as determined in (b).

(e) Repeat steps (b) to (d) until the desired utterance is modeled (until all tubes have been passed).

3. Store the values from step 2 in a database format that includes the elements of the fitness matrix as well as strategy information and invasions.

In order to analyze the feasibility of this algorithm it is required to delve a bit deeper into steps (c) and (d). It is obvious that (c) and (d) are mutually dependent, since the outcome of an invasion will depend on the offspring parameters. Furthermore, it has to be determined what strategy to play generally and when to invade. Finally, an ESS that will simplify the entire process has to be incorporated.

Let's assume that at every iteration it is decided to carry out a pure invasion; that is, at time step x+e the type of column x will invade the existing population, or more concretely, at that point in time the game will be played with strategy (0,1), where 1 is for the type of column x. In that case, the elements of column x have to be such that filling them out in equations (A.4) and (A.5) will yield the correct population graph.

Using an ESS will help determine at what exact time steps to carry out pure invasions, since the evolution of the population is then predetermined and thus known. It is desirable that playing (1,0), where 1 is for the first element of the first column, is an ESS. Therefore, all other elements in the fitness matrix must be smaller than 2.

To tackle the second research goal, it is important to know that the equations of the filters will partially depend on the physical model of the speaker. The question is thus how to extract these parameters from the speech utterance so that the equations for the filters can be established.

3.6 Contradictions

Since three algorithms are employed in the single-speaker classification stage, their respective outcomes have to be checked for consistency. A list of contradictions allows the system to detect inconsistencies as well as indications of multiple speakers.

In the table above, T-D denotes the text-dependent algorithms, while T-I denotes the text-independent algorithm. The system contains two text-dependent algorithms and one text-independent algorithm. The binary value for T-D is defined by the logical AND of T-D1 and T-D2.
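Since the contradiction table itself is not reproduced here, the sketch below only illustrates the stated rule: the T-D value is the AND of both text-dependent results, and a disagreement between T-D and T-I counts as a contradiction.

```python
def combine_results(td1, td2, ti):
    """Combine two text-dependent labels and one text-independent label.

    td1, td2, ti are speaker labels (or None).  The combined T-D value
    is the logical AND: both text-dependent algorithms must agree.
    """
    td = td1 if td1 == td2 else None
    if td is not None and ti is not None and td != ti:
        return None               # contradiction: discard the ASC bias, re-run
    return td if td is not None else ti
```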

4 Multiple speaker detection

In order to successfully classify multiple speakers in a speech clip, two use-cases should be analyzed. There are two main types:

1. Non-overlapping speech, where two or more speakers speak in different time frames (for example, a dialogue).

2. Overlapping speech, where two or more speakers speak in both separate and shared time frames (for example, a debate).

In this section we discuss a technique for each of those use-cases: Framed Multi-Speaker Classification for non-overlapping speech and the Harmonic Matching Classifier for overlapping speech. Those two techniques are executed in parallel in the system and both results are combined in order to detect as many speakers as possible.

4.1 Framed multi-speaker classification

The Framed Multi-Speaker (FMS) classification algorithm is used in the system in order to detect and classify multiple speakers in a speech signal. In order to do this, the whole signal is processed. The algorithm is used on dialogues or other non-overlapping speech clips. It uses the single-speaker classification techniques in order to detect each speaker.

Figure 12: FMS classification stages.

The algorithm works in 3 stages, as shown in figure 12:

1. FMS starts with erasing the pauses in the signal and uses this to frame the signal;

2. FMS loops over each frame and classifies it using the classification techniques discussed in the previous section. The text-dependent speaker classification as well as the text-independent classification algorithms are used. A check for contradictions is also done to classify the single speaker, as shown in figure 13;

3. Finally, FMS checks the results to extract only the distinct speakers (see the sketch below).
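A minimal sketch of stages 2 and 3, assuming a classify_single routine that runs the T-D and T-I algorithms plus the contradiction check and returns a speaker label or None.

```python
def framed_multi_speaker(frames, classify_single):
    """Classify every pause-free frame and keep only distinct speakers."""
    speakers = []
    for frame in frames:
        result = classify_single(frame)   # stage 2: per-frame classification
        if result is not None and result not in speakers:
            speakers.append(result)       # stage 3: keep distinct speakers
    return speakers
```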

4.2 Harmonic matching classifier

In order to enable the system to recognize speakers in multi-speaker speech fragments with overlapping speech, the Harmonic Matching Classifier (HMC) is used. The HMC was introduced by Radfar et al. in [5] and separates Unvoiced-Voiced (U-V) frames from Voiced-Voiced (V-V) frames in mixed speech.


Figure 13: FMS classification, per-frame classification block.

The table above indicates what kind of speech is uttered by the respective speaker in each frame category. U-V frames are useful in speaker recognition of mixed speech, since in such a frame the features of the voiced speaker dominate. Hence, it is possible to recognize the speaker for every such frame. However, before being able to separate U-V frames from V-V frames, first the U-U frames have to be removed from the signal. To achieve this, an algorithm proposed by Bachu et al. [6] is employed, which uses energy and ZCR calculations to distinguish unvoiced frames from voiced frames. Unvoiced/voiced classification is based on heuristic methods. HMC recognizes U-V frames by fitting a harmonic model, given by the first equation below, to a mixed analysis frame and then evaluating the introduced error (second equation) against a threshold σ (third equation). This process is repeated for all frames of the mixed signal.

1. H_model = Σ_{l=1}^{L(ωi)} A_l²(ωi) · W²(ω − l·ωi)

2. e_t = min_{ωi} | |X_t^mix(ω)|² − H_model |

3. σ = mean( {e_t}_{t=1}^{T} )

where ωi is the fundamental frequency and W(ω) is a window applied to the spectrum. The X component of the second equation denotes the spectrum of the t-th mixed signal frame.

After the U-V frames have been extracted from the mixed speech signal, they are passed to the Vector Quantization (VQ) block of the system, where every frame is matched against the relevant database and the two speakers are finally recognized. Our system is currently limited to recognizing at most 2 speakers from a mixed signal, which is an obvious consequence of the limitations of the methods used, especially the harmonic model fitting.
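A sketch of the U-V selection rule, under the assumption that a frame whose best-fitting single harmonic model leaves an error below the mean threshold σ is taken as U-V; fit_harmonic_model is a hypothetical helper returning the error e_t of the second equation above.

```python
import numpy as np

def select_uv_frames(mixed_frames, fit_harmonic_model):
    """Flag frames that a single harmonic model fits well (assumed U-V)."""
    # fit_harmonic_model(frame) is assumed to return e_t, i.e. the model
    # error minimised over the candidate fundamental frequencies.
    errors = np.array([fit_harmonic_model(f) for f in mixed_frames])
    sigma = errors.mean()            # threshold sigma from the mean error
    return errors < sigma            # boolean mask of U-V frames
```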

5 Test and Results

Below is the output of the program when comparing the exact same speech file with an existing one. Everything is classified perfectly.

START PHASE 1
Starting Endpoint Detection...
ITL: 1.2448
ITU: 6.2242
IZCT: 220.5
EnergyTotal: 384 elements
RatesTotal: 384 elements
BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Trying to classify a different sound file (same person, same text). Once again, everything is classified and there are no contradictions.

START PHASE 1
Starting Endpoint Detection...
ITL: 0.66714
ITU: 3.3357
IZCT: 120
EnergyTotal: 384 elements
RatesTotal: 384 elements
BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Saving file...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Classifying a poor quality sound file: VM and VQ classify it correctly, but DTW fails. The contradictions are verified and the final result is assigned correctly.

START PHASE 1
Starting Endpoint Detection...
ITL: 0.2542
ITU: 1.271
IZCT: 120
EnergyTotal: 387 elements
RatesTotal: 387 elements
BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Child Female'
Final result: 'Child Female'

Classifying another poor quality sound file: this time DTW and VM classify it correctly, but VQ fails. The contradictions are verified and the final result is assigned correctly.

START PHASE 1
Starting Endpoint Detection...
ITL: 1.3699
ITU: 6.8496
IZCT: 120
EnergyTotal: 421 elements
RatesTotal: 421 elements
BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Skipping DWS and loading existing one...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Adult Male'
Final result: 'Adult Male'

6 Discussion

The developed system incorporates classical as well as novel techniques and is a combination of scientifically proven and heuristic methods. The techniques used for speech detection and noise reduction are well-known and widely used in speech processing applications. The addition of Spectral Subtraction to this stage of the system is a novel touch that improves the accuracy of the further steps.

In the processing and single-speaker classification stage, various DSP-related techniques have been combined with new research. Discrete Word Selection and Age/Sex Classification both rely on existing methods, but are used in an entirely new fashion in our implementation. Digital Signal Processing (which incorporates Windowing and Framing) and Frequency Analysis (MFCC), on the other hand, are classical supporting techniques that are used to prepare the signal for further processing, as is customary in this kind of system.

Working with pre-classification is very useful for larger databases and provides the user of the system with information about the speaker even if the system can find no match. Needless to say, the system relies heavily on the physical model of speech and the vocal tract to accomplish this for adult and child, male and female speakers.


For the actual classification, three algorithms have been selected that fit the requirements of the system best. An originally planned implementation of Extended Dynamic Time Warping (EDTW), however, had to be reduced to the simple Dynamic Time Warping implementation due to a lack of time. Extended Dynamic Time Warping applies dimensionality reduction algorithms like Principal Component Analysis before searching for a cost path, which would have optimized the performance of the system.

The new research that the system incorporates, namely single-speaker, text-independent classification using Evolutionary Stable Strategies, is a very interesting technique that needs further development and testing before its actual use can be proven.

Multi-speaker classification also introduces a novel heuristic method (Framed Multi-Speaker Classification) for the recognition of multiple speakers in non-overlapping speech. The Harmonic Matching Classifier is a combination and adaptation of existing methods and is used for recognition in overlapping speech, which is a novelty in its own right that is not easily achieved.

Between the several stages of the system, a considerable amount of logic has been incorporated to assure accurate processing of intermediate results. The most striking example of this logic and its use is probably the technique employed to detect multiple speakers in a speech signal, implemented via the logical decoding of the results of the multiple classification algorithms. Of course, for this method to be accurate, a reasonable amount of input is necessary: the more classification algorithms the system has, the better the result will be. Hence, incorporating EDTW and possibly other classification algorithms in the system, in addition to the existing algorithms, will prove useful for the switch to multi-speaker recognition, which is currently partially a task for the user to carry out manually.

7 Conclusion

In this paper several techniques to classify and detect single or multiple speakers are discussed. Used in conjunction and properly, those techniques help to identify one or more speakers. Tests and results of such a system have shown that many existing algorithms have different purposes and can only classify a speaker if several conditions are met (for instance, text-dependent algorithms). Thus, to achieve the best results for the speaker classification problem, the algorithms should work together and their outputs should be checked for contradictions.

References

[1] Stevens, K.N., "Acoustic Phonetics", MIT Press, ISBN 0262692503, 1998.

[2] Kamran, M. and Bruce, I.C., "Robust Formant Tracking for Continuous Speech with Speaker Variability", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 2, 2006.

[3] Fant, G., "Acoustic Theory of Speech Production", Mouton, The Hague, 1960.

[4] Flanagan, J.L., "Speech Analysis, Synthesis and Perception", Springer Verlag, Berlin, Heidelberg, 1972.

[5] Radfar, M.H., Sayadiyan, A. and Dansereau, R.M., "A Generalized Approach for Model-Based Speaker-Dependent Single Channel Speech Separation", Iranian Journal of Science & Technology, Transaction B, Engineering, Vol. 31, No. B3, pp. 361-375, The Islamic Republic of Iran, 2007.

[6] Bachu, R.G., Kopparthi, S., Adapa, B. and Barkana, B.D., "Separation of Voiced and Unvoiced using Zero-Crossing Rate and Energy of the Speech Signal", American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.

[7] Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani and Md. Saifur Rahman, "Speaker Identification Using Mel Frequency Cepstral Coefficients", 2004.

[9] Goring, D. (2006). Orthogonal Wavelet Decomposition. Available: http://www.tideman.co.nz/Salalah/OrthWaveDecomp.html. Last accessed 21 January 2009.

[10] Patrick J. van Fleet (2008). Discrete Wavelet Transformations. New Jersey: John Wiley & Sons, pp. 317-350.

[11] Keogh, E.J. and Pazzani, M.J., "Derivative Dynamic Time Warping", 2000.

[12] Dong Wang, Lie Lu and Hong-Jiang Zhang, "Speech Segmentation Without Speech Recognition", Microsoft Research Asia.


Appendix A: Using Evolutionary Stable Strategies to Model Human Speech

Let the 'air signal' be called the signal s; then it can be modeled by an evolutionary game with the following fitness matrix:

This matrix can be extended to contain the effects of the speech modeling, as follows:

where g, t, l are the deformation signals of the glottis, tongue and lips, respectively. The question marks in the matrix represent the amount of deformation one signal evokes in another. This value is obviously dependent on the utterance, which leads us to our first conclusion.

Conclusion 1: Evolutionary games can only be used to model discrete speech utterances. Practically this means that this technique will be used to model isolated vowels and consonants.

Let's clarify the above a bit by considering an evolutionary game consisting of a population of two types, i and j. The game has the following fitness matrix (not a bimatrix, since only player 1 gets offspring):

Now, let's plot the evolution of the population over time for the following strategies (or strategy pairs; player 1 and player 2 use the same strategy in each of the following cases). Note that for this game we assume that all possible relations occur during one generation (one element of the population has multiple inter- and intra-type relationships, where applicable). It is also obvious that no distinction is made between male and female elements; in fact, all elements are genderless.


Applying strategy (1,0) means that the entire population consists of type i exclusively. Since the offspring is equal to 2, the population will never grow beyond its initial size, namely 2. Strategy (0,1) yields a similar case, where the entire population consists of type j exclusively. However, the offspring size here is 4, hence the population will grow over time. The number of relationships that can (and will) occur at a certain point tx in time is:

Σ_{n=1}^{P(t_{x-1})−1} n = 1 + 2 + 3 + ... + (P(t_{x-1}) − 1)

which counts all possible combinations, except the element with itself and reversed combinations. This amount of relationships can be calculated using the form:

1 + 2 + 3 + ... + n = n(n + 1) / 2

which then yields equation (A.2). Finally, the population when using strategy (1/2, 1/2) consists of 50% type i and 50% type j. Equation (A.3) is an extension of equation (A.2) in order to include all possible relationships. The term

−2 · ( (P(t_{x-1})/2 − 1) · (P(t_{x-1})/4) )

can not be simplified, because it originates from the form mentioned above and hence a standard simplification would yield a wrong result. In this specific case equation (A.3) can be reduced to (A.3.1):

P(t_x) = offspring_(i,i) · ( (P(t_{x-1})/2 − 1) · (P(t_{x-1})/4) )
       + offspring_(i,j) · ( (P(t_{x-1}) − 1) · (P(t_{x-1})/2) − 2 · ( (P(t_{x-1})/2 − 1) · (P(t_{x-1})/4) ) )
       + offspring_(j,j) · ( (P(t_{x-1})/2 − 1) · (P(t_{x-1})/4) )

since (1/2) · offspring_(i,j) + (1/2) · offspring_(j,i) = offspring_(i,j) = offspring_(j,i).

Equation (A.3.1) can then further be reduced to (A.3.2):

P(t_x) = offspring_(j,i) · ( (P(t_{x-1}) − 1) · (P(t_{x-1})/2) ),

which equals equation (A.2), since in this case offspring_(i,j) = offspring_(j,i) = (offspring_(i,i) + offspring_(j,j)) / 2.

Let us now consider the effect that an invasion would have on the population graph. As it happens, the pure strategy pair ((0,1),(0,1)) that we have examined previously is an Evolutionary Stable Strategy (ESS), because (a) it is a Nash equilibrium and (b) i scores better against j than against itself. (Note that if we remove dominated actions from this game, only strategy (0,1) remains.)


Consider the same strategies (pairs) again, but now with a pure invasion at some moment in time.

The general population function is given by equations (A.1), (A.2) and (A.3) respectively until t3, and by (A.4) and (A.5), as detailed below, thereafter:

P(t_x) = offspring_(i,i) · [ (F_i(t_{x-1}) · P(t_{x-1}) − 1) · (F_i(t_{x-1}) · P(t_{x-1}) / 2) ]
       + ( offspring_(i,j)/2 + offspring_(j,i)/2 ) · [ (P(t_{x-1}) − 1) · (P(t_{x-1})/2)
         − ( (F_i(t_{x-1}) · P(t_{x-1}) − 1) · (F_i(t_{x-1}) · P(t_{x-1}) / 2)
           + (F_j(t_{x-1}) · P(t_{x-1}) − 1) · (F_j(t_{x-1}) · P(t_{x-1}) / 2) ) ]
       + offspring_(j,j) · [ (F_j(t_{x-1}) · P(t_{x-1}) − 1) · (F_j(t_{x-1}) · P(t_{x-1}) / 2) ]    (A.4)

with

A = Σ_{y=i,j} (F_y(t_{x-1}) · P(t_{x-1}) − 1) · (F_y(t_{x-1}) · P(t_{x-1}) / 2)

B = F_type(t_{x-1}) · P(t_{x-1})

C = offspring_(i,j)/2 + offspring_(j,i)/2

F_type(t_x) = ( offspring_(type,type) · (B − 1) · (B/2) + C · ( (P(t_{x-1}) − 1) · (P(t_{x-1})/2) − A ) / 2 ) / P(t_x),  for x = 1 ... ∞    (A.5)

The general deformation function is defined by:

D(t_x) = 0,  for x < inv
D(t_x) = P_inv(t_x) − P_no-inv(t_x),  for x ≥ inv,  with x = 1 ... ∞

Equation (A.4) consists of three components: the first calculates the number of possible combinations (and, after multiplication with the offspring factor, the offspring) of type i, the second covers the mixed combinations and the third the combinations of type j. Equation (A.5) is a function called from equation (A.4) and calculates the fraction (the ratio) of a certain type at a given moment in time. This is achieved by calculating the sum of the offspring of the respective type and half of the mixed offspring, and dividing this sum by the population number.

As can be seen from the 'Type Ratios' graphs, only in the case of an ESS does the evolution of the population restore itself and stabilize over time.
