
LOCUS 2014

A REPORT ON LOCUS PROJECT

SPEAKER RECOGNITION

By: Ayush Shrestha, Bikram Karki, Bidhan Sthapit, Kumar Shrestha

June 20, 2014


ACKNOWLEDGEMENT

We would like to express our sincere gratitude towards the LOCUS committee. Our special thanks go to our senior, Mr. Keshav Basyal, for providing us with valuable materials and suggestions regarding our project. We are indebted to all our friends for their important suggestions, advice, and encouragement throughout the project.

Ayush Shrestha, Bikram Karki, Bidhan Sthapit, Kumar Shrestha


ABSTRACT

Speaker recognition, the process of automatically recognizing who is speaking on the basis of individual information included in speech waves, is an important branch of speech processing. Speaker recognition is a frequently overlooked form of biometric security. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers. Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.

Speech features are classified as either low-level or high-level characteristics. High-level speech features are associated with syntax, dialect, and the overall meaning of a spoken message. In contrast, low-level features such as pitch and phonemic spectra are associated much more with the physiology of the human vocal tract. These low-level features are also the easiest and least computationally intensive characteristics of speech to extract. In the system, for automatic speaker recognition, these features are extracted using Mel-Frequency Cepstral Coefficients (MFCCs). Once extracted, the features are fitted to a statistical classification model, the Gaussian Mixture Model (GMM). The system is implemented in MATLAB.


TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
1. INTRODUCTION
1.1. Background
1.2. Problem statement
1.3. Objectives
1.4. Significance
1.5. Scope
2. THEORETICAL BACKGROUND
2.1. Speech Signal Excitation
2.2. Characteristics of the Speech Signal
2.3. A Simple Model of Speech Production
2.4. Speaker Recognition
2.5. Silence Removal
2.6. Pre-emphasis
2.7. Framing
2.8. Windowing
2.9. Feature Extraction
2.9.1. Mel Frequency Cepstral Coefficients
2.9.2. Calculation of MFCCs
2.10. Classification and Pattern/Feature Matching
2.10.1. Gaussian Mixture Model (GMM) for Feature Matching
3. Software Used
4. System Implementation on MATLAB
4.1. Voice Capturing and Storage
4.2. Silence Removal
4.3. Framing and Windowing
4.4. Discrete Fourier Transform using FFT and Spectrum
4.5. Mel Scaling
4.6. Mel Frequency Cepstrum by Inverse Discrete Fourier Transform
4.7. Feature Matching Using GMM
4.8. Speaker Identification/Verification
5. RESULT
6. APPLICATION
7. LIMITATIONS
8. CONCLUSION
REFERENCES


1. INTRODUCTION

1.1. Background

An audio signal is a representation of sound, typically as an electrical voltage. Audio signals range in frequency from roughly 20 Hz to 20,000 Hz, the limits of human hearing.

The frequency spectrum of a time-domain signal is a representation of that signal in the frequency domain. The frequency spectrum can be generated via a Fourier transform of the signal, and the resulting values are usually presented as amplitude and phase, both plotted versus frequency.

Biometrics refers to the quantifiable data (or metrics) related to human characteristics and traits. Biometric identification (or biometric authentication) is used in computer science as a form of identification and access control. It is also used to identify individuals in groups that are under surveillance.

Speech is the most natural way of human communication. The speech signal conveys various kinds of information, which are classified as either low-level or high-level characteristics. High-level speech features are associated with syntax, dialect, style, and the overall meaning of a spoken message. In contrast, low-level features such as pitch and phonemic spectra are associated much more with the physiology of the human vocal tract and convey information about the identity of the talker. These low-level features are also the easiest and least computationally intensive characteristics of speech to extract. Speech processing is a diverse field with many applications, and speaker recognition is an important branch of it: the process of automatically recognizing who is speaking by using speaker-specific information extracted from the speech waveform. Speaker recognition is becoming more ubiquitous as reliance on biometrics for security and convenience increases.

Speaker identification is concerned with identifying a speaker from a pool of possible speakers; such implementations are generally text-independent. MATLAB has been used to develop the system. Mel Frequency Cepstral Coefficients are used for feature extraction, and a Gaussian Mixture Model is used to model each speaker.


1.2. Problem statement

In today's society, highly accurate personal identification systems are required. Passwords or PIN numbers can be forgotten or forged and are no longer considered to offer a high level of security. The use of biological features, biometrics, is becoming widely accepted as the next level of security. Fingerprints, face recognition, iris recognition, and voice recognition can all be used, and multifactor authentication systems combining them are more robust and highly secure. However, not all biometrics can be mixed easily with other security systems. Furthermore, voice is the only biometric that can be used remotely, and remote access to services is the demand of this generation. Much information is exchanged between two parties in telephone conversations, including between criminals, and in recent years there has been increasing interest in integrating automatic speaker recognition to supplement auditory and semi-automatic analysis methods. Biometric-based speaker identification is a method of identifying persons from their voice. Speaker-specific characteristics exist in speech signals because different speakers have different resonances of the vocal tract.

1.3. Objectives

The main goal of this project is to develop an automatic text-independent speaker recognition system. The specific objectives can be summarized as:

1.3.1. To extract the characteristic features of a voice signal to represent a speaker.
1.3.2. To match the voice features of an unknown speaker against the list of registered speakers.
1.3.3. To implement and analyze speaker recognition in MATLAB.


1.4. Significance

A text-independent speaker recognition system verifies the identity of the speaker solely on the basis of the speaker's voice characteristics, without depending on what is spoken or which language is being spoken. It can be used as an important biometric system for security purposes, as voice is the only biometric that allows users to authenticate remotely. An important application of speaker recognition technology is forensics, where there is no control over the speakers accessing the system. The speaker's voice can be used to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

1.5. Scope

Speaker recognition based biometric systems have many fields of research and application. This system, "Speaker Recognition", is concerned with verification and identification of the speaker, which is accomplished in two phases. In the training phase, the system is trained with speech from different speakers; later, in the testing phase, the system either identifies the speaker from a number of stored speakers by matching the features, or verifies whether the speaker is who he/she claims to be. The system is designed and implemented in MATLAB: it takes a voice input and, with the help of digital signal processing and pattern recognition performed using different tools in MATLAB, speaker recognition is carried out.

2. THEORETICAL BACKGROUND

A speaker recognition system comprises many subsystems. Signal processing and feature extraction blocks are responsible for conditioning the signal and extracting low-level features (e.g., the spectra of the sound of someone's voice, pitch, emotional state, gender, etc.). Low-level features extracted from short-term spectral analysis are the most widely used kind in text-independent speaker recognition systems. The use of extracted low-level features has proven to be reliable and effective, in addition to being computationally inexpensive. After channel-dependent signal conditioning is performed, the front-end analysis extracts spectral features.


2.1. Speech Signal Excitation

Depending upon the articulation, a speech signal can be excited in three possible ways:

2.1.1. Voiced Excitation

The glottis is closed. The air pressure forces the glottis to open and close periodically, thus generating a periodic, triangle-shaped pulse train. This 'fundamental frequency' usually lies in the range from 80 Hz to 350 Hz.

2.1.2. Unvoiced Excitation

The glottis is open and the air passes through a narrow passage in the throat or mouth. This results in a turbulence which generates a noise signal. The spectral shape of the noise is determined by the location of the narrowness.

2.1.3. Transient Excitation

A closure in the throat or mouth raises the air pressure. By suddenly opening the closure, the air pressure drops immediately ('plosive burst').

2.2. Characteristics of the Speech Signal

The bandwidth of the speech signal is said to be 20 kHz (20 Hz-20 kHz). However, within a bandwidth of 4 kHz, the speech signal contains all the information necessary to understand a human voice. Voiced excitation results in a pulse train at the so-called fundamental frequency; it is used when articulating vowels and some of the consonants. In the case of unvoiced excitation, no fundamental frequency can be detected. After passing the glottis, the vocal tract gives a characteristic spectral shape to the speech signal. If one simplifies the vocal tract to a straight pipe (about 17 cm long), one can see that the pipe shows resonances at certain frequencies, the so-called formant frequencies. Depending on the shape of the vocal tract, the frequencies of the formants change and therefore characterize the vowel being articulated. The envelope of the power spectrum of the speech signal decreases with increasing frequency: the pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by -12 dB per octave, while the emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. This results in an overall decrease of 6 dB per octave.

2.3. A Simple Model of Speech Production

The production of speech can be separated into two parts: producing the excitation signal and forming the spectral shape. A simplified model of speech production is shown in Figure 2.4.

[Figure 2.4: A simple model of speech production]


Voiced excitation is modeled by a pulse generator which generates a pulse train with spectrum P(f). Unvoiced excitation is modeled by a white-noise generator with spectrum N(f). To mix voiced and unvoiced excitation, one can adjust the signal amplitudes of the impulse generator (v) and the noise generator (u). The outputs of both generators are then added and fed into the box modeling the vocal tract, which performs the spectral shaping with transmission function H(f). The emission characteristic of the lips is modeled by R(f). The spectrum S(f) of the speech signal is then given as:

S(f) = (v·P(f) + u·N(f))·H(f)·R(f) = X(f)·H(f)·R(f)

2.4. Speaker Recognition

Speaker recognition is concerned with extracting clues to the identity of the person who is the source of an utterance. Speaker recognition encompasses verification and identification. It has been applied most often as a means of biometric authentication.

2.4.1. Speaker Recognition System as Biometric System

Biometric systems automatically recognize a person by using distinguishing traits (a narrow definition). Speaker recognition is a performance biometric, i.e. the user performs a task to be recognized. The voice, like other biometrics, cannot be forgotten or misplaced, unlike knowledge-based (e.g., password) or possession-based (e.g., key) access-control methods. The underlying premise for voice authentication is that each person's voice differs in pitch, tone, and volume enough to make it uniquely distinguishable. Several factors contribute to this uniqueness: the size and shape of the mouth, throat, nose, and teeth (the articulators) and the size, shape, and tension of the vocal cords. The chance that all of these are exactly the same in any two people is very low.

2.4.2. Speech Parameters used in Speaker Recognition System

Direct computation of the power spectrum from the speech signal results in a spectrum containing 'ripples' caused by the excitation spectrum X(f). So, special transformations are used to separate the excitation spectrum X(f) from the spectral shaping of the vocal tract H(f). Thus, a smooth spectral shape (without the ripples), which represents H(f), can be estimated from the speech signal. Mel-Frequency Cepstral Coefficients (MFCCs) are used in the system to represent the characteristics of a speaker.

2.4.3. Preprocessing of Speech for Speaker Recognition

Pre-processing of the speech signal is crucial in applications where silence or background noise is undesirable. It is the processing done on the speech signal before features/parameters are extracted. Applications like speaker recognition need efficient feature extraction from the speech signal, since most of the voiced part contains speaker-specific attributes. Pre-processing includes processes such as silence removal, pre-emphasis, framing, and windowing.


2.5. Silence Removal

The speech signal is classified in a three-state representation in which the states are (i) silence (S), where no speech is produced; (ii) unvoiced (U), in which the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature; and (iii) voiced (V), in which the vocal cords are tensed and therefore vibrate periodically when air flows from the lungs, so the resulting waveform is quasi-periodic. The segmentation of the waveform into well-defined regions of silence, unvoiced, and voiced signals is not exact; it is often difficult to distinguish a weak, unvoiced sound (like /f/ or /th/) from silence, or a weak voiced sound (like /v/ or /m/) from unvoiced sounds or even silence. However, it is usually not critical to segment the signal to a precision of less than several milliseconds; hence, small errors in boundary locations usually have no consequence for most applications. Since in most practical cases the unvoiced part has low energy content, silence (background noise) and the unvoiced part are classified together as silence/unvoiced and distinguished from the voiced part. In the system, the detection of the silence/unvoiced part of a speech sample uses the probability density function (PDF) of the background noise and a linear pattern classifier (the one-dimensional Mahalanobis distance function) to separate the voiced part of the speech from silence. It is assumed that the background noise present in the utterances is Gaussian in nature. The normal or Gaussian probability density function is defined as:

p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

The normal density is a 'bell-shaped curve', completely determined by the numerical values of two parameters, the mean μ and the variance σ². The distribution is symmetrical about the mean, the peak occurring at x = μ, and the width of the bell is proportional to the standard deviation σ. Numerically, the probabilities obey

P[|x − μ| ≤ σ] = 0.68,  P[|x − μ| ≤ 2σ] = 0.95,  P[|x − μ| ≤ 3σ] = 0.997.

A natural measure of the distance from x to the mean is the distance |x − μ| measured in units of the standard deviation, which can be expressed analytically as:

r = |x − μ| / σ

Here, r is defined as the 'Mahalanobis distance' from x to μ. A standardized normal random variable r = (x − μ)/σ has zero mean and unit standard deviation.


2.6. Pre-emphasis

After silence removal, the speech signal is pre-emphasized. The pre-emphasis filter emphasizes the high frequencies, as they contain speaker-dependent information, and eliminates the -6 dB per octave decay of the spectral energy. In the time domain, assume the input speech signal is s[n]. Then the pre-emphasized signal s'[n] is obtained as,

s'[n] = s[n] − a·s[n−1]   (2.6)

where a is the slope of the filter, usually 0.9 ≤ a ≤ 1.0.

[Figure: Power spectral density before and after pre-emphasis]
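As an illustration, a minimal MATLAB sketch of this step; the value a = 0.97 and the variable name speech are assumptions made for the example, not taken from the report's code.

% Pre-emphasis sketch: s'[n] = s[n] - a*s[n-1], realized as an FIR filter
speech = randn(22050, 1);        % placeholder signal; in the system this is the recorded voice
a = 0.97;                        % filter slope, chosen within 0.9 <= a <= 1.0
preEmphasized = filter([1 -a], 1, speech);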

2.7 Framing

The analysis of a discrete-time speech signal is based on short-term spectral analysis. This means that the speech signal is blocked into short segments (frames of N samples, with adjacent frames separated by M samples, M < N) in such a way that each one is short enough to be considered pseudo-stationary. The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N − M samples, and so on. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 ms of framing and facilitates the fast radix-2 FFT) and M = 100. After framing, these short-length "sub-signals" are considered as independent signals. For each frame, a fixed-length feature vector is computed, which describes the acoustic behavior of that particular frame.

2.8 Windowing

Before frequency analysis, a window function is applied to each individual frame to minimize the signal discontinuities at the beginning and end of each frame. The idea is to minimize spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If the window is defined as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal

x[n] = s'[n]·w[n − m],  for n = m, m+1, ..., m+N−1
x[n] = 0,               otherwise   (2.7)


The simplest windowing function is the rectangular window, i.e. "no window at all". Usually, however, smoother functions are used, and the most common in speech processing is the Hamming window. Smoother functions are better than the rectangular window because the latter has abrupt discontinuities at its endpoints, which is undesirable for frequency analysis. In the system, the Hamming window is used, which has the form:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1

[Figure: Hamming window]

2.9. Feature Extraction

Feature extraction is the process of extracting parameters from the speech signal. In this step, unnecessary information is stripped from the speech data, and the properties of the signal that matter for the pattern recognition task are converted to a format that simplifies the distinction of the classes. Usually, it reduces the dimension of the data and produces feature vectors. Commonly used feature extraction methods for speech/speaker recognition are LPC (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction). LPC is based on the assumption that a speech sample can be approximated by a linearly weighted summation of a fixed number of preceding samples. PLP coefficients are calculated in a similar way to LPC coefficients, but transformations are first carried out on the spectrum of each window to incorporate knowledge about human hearing behavior. In the system, MFCCs are chosen because they are based on the perceptual characteristics of the human auditory system. In addition, since the speech waveform may vary from time to time depending on the physical condition of the speaker's vocal cords, it helps that MFCCs are less susceptible to such variations.

2.9.1. Mel Frequency Cepstral Coefficients

Mel Frequency Cepstral Coefficients (MFCCs) are widely used for feature extraction from speech signals, since they mimic human hearing behavior by emphasizing lower frequencies and de-emphasizing higher frequencies.


In sound processing, the Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. MFCCs are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip. The difference between the cepstrum and the Mel-frequency cepstrum is that in the MFC the frequency bands are spaced on the Mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. The human ear shows a linear response below 1000 Hz and a logarithmic response above it.

2.9.2. Calculation of MFCCs

MFCCs are commonly calculated by first taking the Fourier transform of a windowed excerpt of a signal and mapping the powers of the resulting spectrum onto the mel scale, using triangular overlapping windows. Next, the logs of the powers at each of the mel frequencies are taken, and the Discrete Cosine Transform is applied to them. The MFCCs are the amplitudes of the resulting spectrum. The procedure for computing MFCCs is as follows:

1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the Mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the Mel frequencies.
4. Take the discrete cosine transform of the list of Mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

2.9.2.1. Fast Fourier Transform

The Fast Fourier Transform is an optimized computational algorithm to implement the Discrete Fourier Transform, which represents the frequency components of the signal. The FFT is calculated from the windowed signal to convert it from the time domain to the frequency domain. The motivation for performing a Fourier transform is that the convolution of the glottal pulse and the vocal tract impulse response in the time domain becomes a multiplication in the frequency domain. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of a sequence of length N into successively smaller DFTs. Usually, the sequence is decomposed into two equal subsequences; these subsequences are again decomposed into two equal parts, and so on. This division of a sequence into two equal and smaller subsequences is the characteristic of the FFT algorithm, also called the radix-2 FFT algorithm. This algorithm requires that N be a power of 2. If N is not a power of 2, the sequence can be zero-padded to make its length a power of 2. The amount of computation is proportional to N log N for FFT algorithms.

The DFT of a windowed frame x[n] of length N is given by:

X_k = Σ_{n=0..N−1} x[n]·e^(−j2πkn/N),  k = 0, 1, ..., N − 1   (2.9)

The resulting sequence {X_k} is interpreted as follows: positive frequencies 0 ≤ f < Fs/2 correspond to values 0 ≤ n ≤ N/2 − 1, while negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here, Fs denotes the sampling frequency. The result after this step is referred to as the spectrum or periodogram.

2.9.2.2. Mel-Scaled Filter Bank

The Mel scale is a unit of special measure or scale of the perceived pitch of a tone. It does not correspond linearly to the normal frequency; rather, it behaves linearly below 1 kHz and logarithmically above 1 kHz. This is based on studies of human perception of the frequency content of sound. The relationship between the frequency x (in hertz) and the mel-scaled frequency is

mel(x) = 2595 · log10(1 + x/700)

where x is the linear frequency. In order to perform mel scaling, a number of triangular filters, or a filterbank, is used. To implement such a filterbank, the magnitude coefficients of each Fourier-transformed speech segment are binned by correlating them with each triangular filter in the filterbank.

[Figure: Mel filter bank weights]
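For illustration, a small MATLAB sketch of the mel/linear conversion, assuming the common 2595·log10(1 + f/700) form of the relation given above; the handle names are illustrative.

hz2mel = @(f) 2595 * log10(1 + f/700);     % linear frequency (Hz) to mel
mel2hz = @(m) 700 * (10.^(m/2595) - 1);    % mel back to linear frequency (Hz)
hz2mel(1000)                               % approximately 1000 mel: near-linear below 1 kHz
hz2mel(8000)                               % higher frequencies are compressed logarithmically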


2.9.2.3. Cepstrum

The log mel spectrum has to be converted back to the time domain, producing the Mel Frequency Cepstral Coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel log powers are real, they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCC values may be calculated using the following equation:

c(n) = Σ_{k=1..M} (log S_k) · cos[ n·(k − 1/2)·π/M ]   (2.11)

where n is the index of the cepstral coefficient and S_k is the output of the kth channel of an M-channel filterbank. The number of mel cepstrum coefficients is typically chosen between 10 and 15. The set of coefficients calculated for each frame is called a feature vector. These acoustic vectors can be used to represent and recognize the voice characteristics of the speaker; each input utterance is therefore transformed into a sequence of acoustic vectors.

2.10. Classification and Pattern/Feature Matching

The problem of speaker recognition belongs to a much broader topic in science and engineering, so-called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. For the speaker recognition system, the sequences of acoustic vectors extracted from an input speech are called patterns, and the classes refer to individual speakers. Since the classification procedure is applied to extracted features, it can also be referred to as feature matching. Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly the case for the speaker recognition system, since during the training session we label each input speech sample with the ID of the speaker. These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. For text-independent speaker recognition, speaker-specific vector quantization (VQ) or the more advanced Gaussian Mixture Model (GMM) is used most often.

2.10.1. Gaussian Mixture Model (GMM) for Feature Matching

2.10.1.1. Univariate Gaussian

The Gaussian distribution, also known as the normal distribution, is the bell-curve function. A Gaussian distribution is a function parameterized by a mean μ and a variance σ². The Gaussian function takes the following form:

N(x; μ, σ²) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
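A one-line MATLAB sketch of this density, useful for checking values; the handle name gauss is illustrative.

gauss = @(x, mu, sigma) (1./(sigma*sqrt(2*pi))) .* exp(-(x - mu).^2 ./ (2*sigma.^2));
gauss(0, 0, 1)    % peak of the standard normal, about 0.3989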

2.10.1.2 Mixture Model

In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, its density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:


p(x) = w1·p1(x) + w2·p2(x) + ... + wn·pn(x)   (2.13)

The individual pi(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w1, w2, ..., wn associated with each component are called the mixture weights or mixture coefficients.

2.10.1.3. Gaussian Mixture Model (GMM)

A Gaussian mixture model (GMM) is a stochastic model which has become the de facto reference method in speaker recognition, by which the low-level features of human voices can be modeled. Gaussian mixture models do not require a speaker recognition system to have prior knowledge of text or other low-level parameters of a given speaker. As such, Gaussian mixture models have become the most widely used text-independent speaker models, trumping vector quantization (VQ), artificial neural networks, hidden Markov models (HMM), and, for the time being, support vector machines (SVM). The underlying assumption made by speaker recognition systems is that audio samples from the same speaker that are sufficiently long and phonetically diverse will have highly similar MFCC distributions, while audio samples from distinct speakers will have largely different MFCC distributions. MFCC distributions, the feature vector sets, are histograms of each MFCC's values across an entire audio sample. In speaker recognition systems, Gaussian mixture models provide the statistical tool by which single MFCC feature vector sets can be quantitatively gauged as belonging to a specific speaker's model. A feature vector is not assigned to the nearest cluster as in VQ; rather, it has a nonzero probability of originating from each cluster. A GMM is composed of a finite mixture of multivariate Gaussian components. A GMM, denoted by λ and consisting of three sets of parameters, the means (μ), variances (σ), and weights (wm), is characterized by its probability density function:

p(x | λ) = Σ_{k=1..K} Pk · N(x; μk, Σk)

where K is the number of Gaussian components, Pk is the prior probability (mixing weight) of the kth Gaussian component, and

N(x; μk, Σk) = (1 / ((2π)^(d/2)·|Σk|^(1/2))) · exp( −(1/2)·(x − μk)ᵀ·Σk⁻¹·(x − μk) )

is the d-variate Gaussian density function with mean vector μk and covariance matrix Σk. The prior probabilities Pk ≥ 0 are constrained as

Σ_{k=1..K} Pk = 1

For numerical and computational reasons, the covariance matrices of the GMM are usually diagonal (i.e. variance vectors), which restricts the principal axes of the Gaussian ellipses to the directions of the coordinate axes. Estimating the parameters of a full-covariance GMM requires, in general, much more training data and is computationally expensive. Once trained, the model λ is a statistical representation of a speaker's voice and is used to determine whether a set of feature vectors from an audio sample was produced by the same speaker.
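The following MATLAB sketch evaluates such a diagonal-covariance GMM density for a single d-dimensional feature vector; the function and variable names are illustrative and not part of the report's code (save as gmmDensity.m to run).

function p = gmmDensity(x, weights, means, vars)
% x       : 1-by-d feature vector
% weights : 1-by-K mixing weights Pk (summing to one)
% means   : K-by-d component means
% vars    : K-by-d diagonal variances
[K, d] = size(means);
p = 0;
for k = 1:K
    diff  = x - means(k, :);
    expo  = -0.5 * sum((diff.^2) ./ vars(k, :));
    coeff = 1 / ((2*pi)^(d/2) * sqrt(prod(vars(k, :))));
    p = p + weights(k) * coeff * exp(expo);     % weighted sum of component densities
end
end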


3. Software Used

3.1. MATLAB

MATLAB (Matrix Laboratory) is a numerical computing environment and fourth-generation programming language developed by MathWorks. It allows matrix manipulations, plotting of functions and data, implementation of various algorithms, creation of user interfaces, and interfacing with programs written in other languages including C, C++, Java, and FORTRAN. MATLAB is used to implement and analyze the system. The speech signal is recorded from a microphone and processed in MATLAB, and all procedures for feature extraction and feature matching are carried out by functions written in MATLAB.

4. System Implementation on MATLAB

The system for speaker recognition is implemented and analyzed in MATLAB. The task of implementation is divided into two phases. In the first phase, i.e. the enrolment or training phase, different speakers are registered and each speaker provides samples of their speech so that the system can build or train a reference model for that particular speaker. The second phase is the operational or testing phase, in which a speaker provides a sample of his/her speech; the input speech is matched against the stored reference models and a recognition decision is finally made. The result can be either the display of the name of the recognized speaker or the decision on whether he/she is the one he/she claims to be. The processes involved in the software implementation of the system are discussed below.

4.1. Voice Capturing and Storage

Initially, the acoustic sound pressure wave is transformed into a digital signal suitable for voice processing. A microphone is used to convert the acoustic wave into an analog signal, which is then digitized. The input speech signal is processed and the output is saved, using the name entered by the speaker as the username, in a specific folder on the computer for storing the recorded voices of different speakers. The sound is recorded at 22050 Hz, 16-bit PCM, mono channel.
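A hedged MATLAB sketch of this capture step; the 15-second duration, prompt text, and 'speakers' folder name are assumptions made for the example, while the 22050 Hz / 16-bit / mono format follows the report.

fs  = 22050;
rec = audiorecorder(fs, 16, 1);                    % 16-bit PCM, mono channel
name = input('Enter username: ', 's');
disp('Speak now...');
recordblocking(rec, 15);                           % record for about 15 seconds
speech = getaudiodata(rec);
if ~exist('speakers', 'dir'), mkdir('speakers'); end
audiowrite(fullfile('speakers', [name '.wav']), speech, fs);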

4.2. Silence Removal

The silence present must be removed before further processing, since the captured audio signal may contain silence at different positions, and if silent frames are included, modeling resources are spent on parts of the signal which do not contribute to the identification. In the system built in MATLAB, the algorithm developed by G. Saha, Sandipan Chakroborty, and Suman Senapati in their paper "A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications" is used. The algorithm is divided into two parts: the first part assigns labels to the samples by using the statistical properties of the background noise, while the second part smoothens the labeling using physiological aspects of the speech production process. The algorithm makes two passes over the speech samples. In Pass I (Steps 1 to 3), the statistical properties of the background noise are used to mark each sample as voiced or silence/unvoiced, and in Pass II (Steps 4 and 5) the physiological aspects of speech production are used for smoothening and reducing probabilistic errors in the statistical marking of Pass I.

4.2.1 The Algorithm

Step 1: Calculate the mean μ and the standard deviation σ of the first 1600 samples of the given utterance. The background noise is characterized by this μ and σ.

Step 2: Go from the first sample to the last sample of the speech recording. For each sample x, check the one-dimensional Mahalanobis distance |x − μ|/σ. If it is greater than 3, the sample is treated as a voiced sample; otherwise it is silence/unvoiced.

Step 3: Mark each voiced sample as 1 and each silence/unvoiced sample as 0, and divide the resulting label array into 10 ms non-overlapping windows containing only zeros and ones.

Step 4: If a window contains more zeros than ones, convert each of its ones to zeros, and vice versa. This method is adopted keeping in mind that the speech production system, consisting of the vocal cords, tongue, vocal tract, etc., cannot change abruptly within the short time window taken here as 10 ms.


Step 5: Collect only the voiced part, according to the samples labelled '1' in the windowed array, and dump it into a new array. This retrieves the voiced part of the original speech signal.
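A compact MATLAB sketch of the two-pass procedure above; the 3σ decision threshold and the majority-vote smoothing follow the description given here, but exact details of the published algorithm may differ, so this is illustrative only (save as removeSilence.m).

function voiced = removeSilence(speech, fs)
% Pass I: label each sample using the background-noise statistics
mu    = mean(speech(1:1600));
sigma = std(speech(1:1600));
labels = abs(speech - mu) / sigma > 3;      % 1 = voiced, 0 = silence/unvoiced

% Pass II: smooth the labels over 10 ms non-overlapping windows (majority vote)
win = round(0.01 * fs);
for start = 1:win:numel(labels) - win + 1
    idx = start:start + win - 1;
    labels(idx) = sum(labels(idx)) > win/2;
end

voiced = speech(labels);                    % keep only the samples labelled voiced
end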

4.3. Framing and Windowing

The statistical properties of the voice are not constant across time. Therefore, the signal is first divided into fixed-length short frames of 23.22 ms (512 samples per frame) with 50% overlap. The pre-emphasized samples s'[n] are then multiplied with a windowing function w[n] to cut out a short segment or frame of the speech signal x[n], starting from n = m and ending with n = m + N − 1, where N is the length of the segment in samples. The frame length can be calculated as:

N = fs·T

where T is the duration of the frame and fs is the sampling frequency. The rectangular window (i.e., no window) can cause problems when Fourier analysis is done, since it abruptly cuts off the signal at its boundaries. Therefore, the Hamming window is used in the system for the windowing purpose, since its transfer function has low side-lobe levels; it shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities.
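As a sketch, the 512-sample frames with 50% overlap and Hamming windowing can be formed in MATLAB as follows; hamming() is from the Signal Processing Toolbox, and preEmphasized is assumed to be the pre-emphasized, silence-removed signal from the previous steps.

N       = 512;                                 % frame length (~23.22 ms at 22050 Hz)
hop     = N / 2;                               % 50% overlap
w       = hamming(N);                          % Hamming window
nFrames = floor((length(preEmphasized) - N) / hop) + 1;
frames  = zeros(N, nFrames);
for i = 1:nFrames
    seg = preEmphasized((i-1)*hop + (1:N));    % cut out one frame
    frames(:, i) = seg(:) .* w;                % taper the frame edges toward zero
end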


4.4. Discrete Fourier Transform using FFT and Spectrum

Each frame of N samples is converted from the time domain into the frequency domain by the Discrete Fourier Transform, and the FFT is used to implement the DFT. The Xk's obtained using equation (2.9) are complex numbers, and therefore only their absolute values (frequency magnitudes) are considered. The power spectrum of the signal is computed by squaring the magnitude of the DFT, i.e. |X[k]|².
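A minimal MATLAB sketch of this step, operating on the matrix of windowed frames from the previous section (one frame per column); variable names continue those used in the earlier sketches.

nfft = 512;                             % frame length is already a power of two
X    = fft(frames, nfft);               % DFT of every frame via the radix-2 FFT
P    = abs(X(1:nfft/2 + 1, :)).^2;      % power spectrum over the non-negative frequencies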


4.5. Mel Scaling

The information carried by the low-frequency components of the speech signal is more important than that carried by the high-frequency components. In order to place more emphasis on the low-frequency components, mel scaling is performed. The transformation from the linear frequency scale to the Mel frequency scale is done using the relation given above. The Mel frequency warping is done by utilizing a filter bank, as shown in figure 5.1.6.2. The Mel filter is formed by using a certain number of triangle-shaped windows in the spectral domain to build a weighted sum over those power spectrum coefficients |X[k]|² which lie within each window, reflecting the frequency resolution of the human ear in the spectral domain.

The width of the triangular filters varies according to the Mel scale, so that the log total energy in a critical band around the center frequency is included. The centers of the filters are uniformly spaced on the Mel scale, and the filter outputs give information about the distribution of energy in each Mel-scale band. A vector of 13 filter outputs is obtained for each frame in the system.
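A hedged MATLAB sketch of constructing and applying the triangular mel filterbank to the power spectrum P from the previous step; the bin mapping and the choice of 13 filters follow the description above, while details such as normalization are illustrative assumptions.

fs = 22050; nfft = 512; nFilters = 13;
hz2mel = @(f) 2595 * log10(1 + f/700);
mel2hz = @(m) 700 * (10.^(m/2595) - 1);

melPts = linspace(hz2mel(0), hz2mel(fs/2), nFilters + 2);   % filter edges equally spaced in mel
bins   = floor((nfft + 1) * mel2hz(melPts) / fs) + 1;       % corresponding FFT bin indices

H = zeros(nFilters, nfft/2 + 1);                            % triangular filter weights
for m = 1:nFilters
    for k = bins(m):bins(m+1)                               % rising edge of filter m
        H(m, k) = (k - bins(m)) / (bins(m+1) - bins(m));
    end
    for k = bins(m+1):bins(m+2)                             % falling edge of filter m
        H(m, k) = (bins(m+2) - k) / (bins(m+2) - bins(m+1));
    end
end
melEnergies = H * P;                                        % one weighted sum per filter per frame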


4.6. Mel Frequency Cepstrum by Inverse Discrete Fourier Transform

A cepstrum transform is applied to the filter outputs in order to obtain the MFCC features of each frame. The triangular filter outputs S_k are compressed using a logarithm, and the discrete cosine transform (DCT) is applied. The resulting vector c[q], obtained using the relation given in equation (2.11), is called the Mel-frequency cepstrum (MFC), and its individual components are the Mel-frequency cepstral coefficients (MFCCs). From each speech frame, 12 features are extracted.
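A short MATLAB sketch of this compression and DCT step, continuing from melEnergies above; dct() is from the Signal Processing Toolbox, and keeping coefficients 2-13 (dropping the 0th) to obtain 12 features per frame is an assumption consistent with the report's figure of 12 features.

logEnergies = log(melEnergies + eps);   % log compression; eps guards against log(0)
cepstra     = dct(logEnergies);         % DCT along each column (one column per frame)
mfcc        = cepstra(2:13, :);         % 12 MFCCs per frame, discarding the 0th coefficient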

The speech signals for both training and testing purposes are captured through the microphone and processed identically for feature extraction, i.e. Mel Frequency Cepstral Coefficients.

4.7. Feature Matching Using GMM

After feature extraction, data from different speakers are first collected and the system is trained. The next task is to take an input utterance, match the extracted features against the models already stored in the system, and make a decision on the basis of the feature-matching result. For feature matching, the data obtained from feature extraction, i.e. the MFCC coefficients, are modeled using a Gaussian Mixture Model. Mathematically, a GMM is the weighted sum of M Gaussian component densities, as given by the equation in section 2.10.1.3.


4.7.1 GMM Training

The goal of speaker model training is to estimate the parameters of the GMM that best match the distribution of the training feature vectors, and hence develop a robust model for the speaker. In the training session, the GMM components of a speaker are calculated and stored; the parameters of the model, i.e. the mean, covariance, and weight of each cluster, are saved. Out of several techniques available for estimating the parameters of a GMM, the most popular method, Maximum Likelihood (ML) estimation via Expectation-Maximization (EM), is used, since it is a well-established maximum likelihood algorithm for fitting a mixture model to a set of training data. EM requires an a priori selection of the model order (the number M of components to be incorporated into the model) and an initial estimate of the training parameters before iterating through the training. The aim of the ML estimation method is to maximize the likelihood of the GMM given the training data.
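As a compact illustration of the training phase, the sketch below fits and stores a diagonal-covariance GMM using fitgmdist from the Statistics and Machine Learning Toolbox; the report describes its own EM procedure and initialization (sections 4.7.1.1 and 4.7.1.2), so this is only a stand-in, and the model order K = 16, the mfcc and name variables, and the file naming are assumptions.

K        = 16;                                       % illustrative model order
features = mfcc';                                    % one feature vector (frame) per row
model    = fitgmdist(features, K, ...
                     'CovarianceType', 'diagonal', ...
                     'RegularizationValue', 1e-3, ...  % guards against singular covariances
                     'Options', statset('MaxIter', 200));
save(fullfile('speakers', [name '_gmm.mat']), 'model');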


For a sequence of T training vectors X = {x1, ..., xT}, assuming independence between the vectors, the GMM likelihood can be written as p(X | λ) = Π_{t=1..T} p(xt | λ). The above computation is done in the log domain to avoid underflow, so that instead of multiplying many very small probabilities we can simply add their logarithms.

The average log-likelihood value (the log-likelihood divided by T) is used so as to normalize out duration effects from the log-likelihood value. Also, since the incorrect assumption of independence underestimates the actual likelihood value when dependencies are present, scaling by T can be considered a rough compensation factor. Direct maximization of this likelihood function is not possible, as it is a non-linear function of the parameters. So, the likelihood function is maximized using the Expectation-Maximization algorithm. The basic idea of the EM algorithm is to begin with an initial model λ and estimate a new model λ̄ such that

p(X | λ̄) ≥ p(X | λ)   (6.10)

The new model λ̄ then becomes the initial model for the next iteration, and the process is repeated until some convergence threshold ε is reached, i.e.

p(X | λ̄) − p(X | λ) < ε   (6.11)

4.7.1.1. The Expectation-Maximization Algorithm
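The EM re-estimation itself can be sketched in MATLAB as below, following the standard diagonal-covariance GMM update formulas; the function and variable names and the variance floor value are illustrative, not taken from the report (save as emStep.m).

function [w, mu, v] = emStep(X, w, mu, v)
% One EM iteration for a diagonal-covariance GMM.
% X : T-by-d training vectors, w : 1-by-K weights, mu, v : K-by-d means/variances
[T, d] = size(X);
K = numel(w);
resp = zeros(T, K);

% E-step: responsibility of each component for each training vector
for k = 1:K
    diff = bsxfun(@minus, X, mu(k, :));
    expo = -0.5 * sum(bsxfun(@rdivide, diff.^2, v(k, :)), 2);
    coef = 1 / ((2*pi)^(d/2) * sqrt(prod(v(k, :))));
    resp(:, k) = w(k) * coef * exp(expo);
end
resp = bsxfun(@rdivide, resp, sum(resp, 2));

% M-step: re-estimate weights, means and diagonal variances
nk = sum(resp, 1);
w  = nk / T;
for k = 1:K
    mu(k, :) = (resp(:, k)' * X) / nk(k);
    v(k, :)  = (resp(:, k)' * X.^2) / nk(k) - mu(k, :).^2;
    v(k, :)  = max(v(k, :), 1e-3);    % minimum-variance threshold (cf. equation (6.19))
end
end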


4.7.1.2. Estimation of initial parameters for training

The GMM parameters are initialized as follows:

Mixture weights: 1 / (number of mixture components), i.e. equal weight for each component
Means: random feature vectors drawn from the training data

Since it is important to initialize the covariance matrices with rather large variances, to reduce the risk that the EM training gets stuck in a local maximum, larger initial values are required. K-means is used in the system to obtain a good initial estimate of the covariance matrices. To set reasonable values for the covariance matrices, an estimate of the covariance of the whole training set, C_data, is needed.

For the minimum covariance (threshold) value used to avoid NaN (Not a Number) errors during the EM iterations, the relation in equation (6.19) is used.


4.8. Speaker Identification/Verification
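As a hedged sketch of this final step: each stored speaker model is scored by the average log-likelihood of the test utterance's MFCC vectors; identification selects the best-scoring speaker, while verification compares the claimed speaker's score against a threshold. The variable testFeatures (one frame per row), the folder layout from the earlier sketches, claimedIdx, and the threshold value are all illustrative assumptions.

files  = dir(fullfile('speakers', '*_gmm.mat'));
scores = zeros(numel(files), 1);
for i = 1:numel(files)
    s = load(fullfile('speakers', files(i).name));    % loads the stored 'model'
    frameLik  = pdf(s.model, testFeatures);           % per-frame GMM likelihoods
    scores(i) = mean(log(frameLik + eps));            % average log-likelihood
end

[bestScore, bestIdx] = max(scores);
fprintf('Identified speaker: %s\n', files(bestIdx).name);

% Verification: score only the claimed speaker's model against a threshold
% (claimedIdx is assumed known from the identity claim; threshold tuned empirically)
threshold = -60;
accepted  = scores(claimedIdx) > threshold;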



5. Result

The implementation of the system in MATLAB was tested by asking participants to speak to the system. During training, participants were asked to speak for various lengths of time, classified as short duration (around 15 seconds) and medium duration (around 30 seconds). During identification, data collected with the short training duration was found to be erroneous, while data collected with the longer training duration was relatively more accurate. Longer speech recorded during identification was also found to provide better results. The system was erroneous in noisy environments.

6. Applications

There is truly no limit to the applications of speaker recognition: if audio is involved, one or more of the speaker recognition branches may be used. However, in terms of deployment, speaker recognition is still in its infancy. This is partly due to the unfamiliarity of the general public with the subject and its existence, and partly because of the limited development in the field. Also, there are some deterrents that feed the skeptics, the most important of which are channel mismatch and quality issues. Some of the major applications of speaker recognition are discussed in the following sections.

6.1 Financial Applications

Financial data is sensitive and should only be accessed by the owners of the accounts; there are usually a number of procedures which are used by financial companies to establish the identity of the individual (on the telephone or in person). At present, most institutions provide fully automated account information, accessible through the telephone. They usually require customers’ account number and a pin number to establish their identity. Then full access is granted to the account which could be detrimental if the wrong person gains access. Pin numbers have also been limited to 4 digits by most financial institutions to be compatible with an international standard. Many of these institutions also disallow the use of 0 or 1 at the beginning and the end of the pin number, considerably reducing the number of permutations. Add to this the fact that most people use easy-to-remember numbers such as birthdays or important dates in their immediate family and you have a recipe for a simple breach of security.

An important security breach which is hardly considered these days is the possibility of sniffing DTMF sequences by tapping into the telephone line of an individual while pin-based authentications are performed. This is quite simple and does not require much skill. The tapping may be done close to the source (close to the user) or in more serious cases close to the institution performing the authentication. Once the DTMF information is recorded, it may readily be catalogued and used by impostors with dire consequences.

Speaker recognition is a great match for this type of access problem. Combined with a speech recognition system, an automated and safe access system can be developed.

6.2 Forensic and Legal Applications

Speech has a unique standing due to its non-intrusive nature. It may be collected without the speaker's knowledge or may even be processed as a biometric after it has been collected for other purposes. This makes it a prime candidate for forensic and legal applications, which deal with passive recognition of speakers or non-cooperative users. There are other biometrics, such as fingerprint and DNA recognition, that also allow some degree of passivity in terms of data collection, which is why they have been successfully used in forensic and legal applications. However, they are not as convenient as speech, which may be collected, intercepted, and transmitted much more effectively with the existing infrastructure. Speech may be used to identify a person against a list of suspects. Also, it may be used to look for anomalies such as abrupt events (gun shots, blasts, screaming, et cetera).

6.3 Access Control (Security) Applications

Access control is another place where speech may be utilized as a very effective biometric. Entering secure locations is only a small part of the scope of speaker biometrics. In that domain they compete head-to-head with most other biometrics and possess pros and cons like any other. Where speaker recognition truly excels with respect to other biometrics is in remote access control in which the user is not physically at the location where access should take place. 6.4 Surveillance Applications

Surveillance applications (lawful intercept) are really very similar to forensic applications discussed above. All surveillance applications, by definition, have to be conducted in a passive manner as discussed in the Forensics section. Unfortunately, they can sometimes be misappropriated by some governments and private organizations due to their relative ease of implementation. There have been many controversies on this type of intercept especially in the last few years. However, if done lawfully, they could be implemented with great efficiency. An obvious case is one where a system would be searching on telephone networks for certain perpetrators which have been identified by the legal process and need to be found at large. Of course, speaker segmentation would be essential in any such application. Also, identification would have to be used to achieve the final goal. Essentially, the subtle difference between forensic and surveillance application is that the former deals with identification while the latter requires the compound branch of speaker recognition, speaker tracking. 6.5 Proctorless Oral Testing

Speaker recognition can be used for performing proctorless oral language proficiency testing. These tests take place on a telephone network. The candidate is usually in a different location from the tester. There is also a set of second tier raters who offer supplementary opinions about the rating of the candidate. In one such application, the candidate is matched by the testing office to a tester for the specific language of interest. Most of the time the tester and the candidate are not even in the same country.

The date of the test is scheduled at the time of matching the tester and the candidate. In addition, the candidate is asked to speak into the Interactive Voice Response (IVR) system which is enabled by speaker recognition technology to be enrolled in the speaker recognition system. The speaker recognition system will then enroll the candidate and save the resulting speaker model for future recognition sessions. Once it is time for the candidate to call in for performing the oral exam, he/she calls the IVR system and enters a test code which acts as the key into the database holding the candidate’s test details. The candidate is first asked to say something so that a verification process may be conducted on his/her voice. The ID of the candidate is known from the test code entered earlier, so verification may be performed on the audio.


If the candidate is verified, he/she is connected to the tester and the oral examination takes place. In the process of taking the oral examination, the speaker recognition system, which is listening in on the conversation between the candidate and the tester, keeps doing further verifications. Since the tester is known, the audio of the candidate may be segmented and isolated by the recognition engine from the conversation, to be verified. This eliminates the need for a proctor to be present with the candidate at the time of the examination, which reduces the cost of the test. The conversation is also recorded.

7. Limitations

The Speaker Recognition system has some limitations regarding its performance. They are listed below:

7.1 The duration of the speech signal limits the performance of the recognition system. Training speech data of less than 15 seconds degrades the performance; similarly, performance is poor for testing speech data of less than 5 seconds.

7.2 Intrusion based on voice imitation cannot be detected by the system. Text-prompted speaker recognition can address this limitation by utilizing both speaker-dependent and text-dependent information.

7.3 The speaker recognition performance increases with the model order (number of MFCCs). However, this also increases the computational complexity of the system. This can be addressed by choosing an optimal model order empirically.

7.4 The silence removal process is not fully efficient, since it also removes the unvoiced part of the speech signal. However, unvoiced speech carries an insignificant amount of speaker-dependent features, so the effect is minimal.

7.5 The system is highly erroneous in noisy environments.

8. Conclusion

A speaker recognition system has been developed using MATLAB. The system has been implemented using Mel Frequency Cepstral Coefficients for feature extraction and a Gaussian Mixture Model to model the speakers. Various signal processing and machine learning algorithms were studied and implemented successfully. The designed system is trained with limited data, and the performance of the software implementation is satisfactory. The performance of the system can be improved by utilizing noise reduction algorithms and training with a larger dataset.


REFERENCES

1. Homayoon Beigi, Fundamentals of Speaker Recognition.
2. Proakis and Manolakis, Digital Signal Processing: Principles, Algorithms and Applications.
3. Kahrs, Applications of Digital Signal Processing to Audio and Acoustics.
4. Smith, S. W., Digital Signal Processing: A Practical Guide for Engineers and Scientists.
5. Rabiner and Juang, Fundamentals of Speech Recognition.
6. Oppenheim, Discrete-Time Signal Processing.
7. S. Ross, Introduction to Probability and Statistics for Engineers and Scientists.