
Page 1: Final Report on Speech Recognition Project

Final Report on Speech Recognition Project

Ceren Burçak Dağ

040100531

Page 2: Introduction

• This project aims to design the pre-processing, clustering, and classifier blocks of a speech recognition system. The computations are implemented in C/C++ by the author, and the visualization materials are generated in MATLAB. The documentation of the code is given in the appendix.

Page 3: Pre-Processing Block

Page 4: Silence trimmed [figure]

Page 5: RMS applied [figure]

Page 6: Hanning windowed [figure]

Page 7: FFT taken [figure]
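The report's own C/C++ code is in its appendix; purely as an illustrative sketch of the windowing and spectrum steps above, one analysis frame could be processed as follows (a naive DFT is used for brevity where an FFT would be used in practice):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Apply a Hanning window to one analysis frame, then compute the
// magnitude spectrum with a naive DFT (illustrative only).
std::vector<double> frame_spectrum(const std::vector<double>& frame) {
    const double PI = 3.14159265358979323846;
    size_t N = frame.size();
    std::vector<double> windowed(N);
    for (size_t n = 0; n < N; ++n)
        windowed[n] = frame[n] * 0.5 * (1.0 - std::cos(2.0 * PI * n / (N - 1)));
    std::vector<double> mag(N / 2 + 1);  // spectrum up to Nyquist
    for (size_t k = 0; k <= N / 2; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (size_t n = 0; n < N; ++n)
            acc += windowed[n] *
                   std::exp(std::complex<double>(0.0, -2.0 * PI * k * n / N));
        mag[k] = std::abs(acc);
    }
    return mag;
}
```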

Page 8: Relation between Mel and Hertz scales [figure]
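The curve relating the mel and Hertz scales is commonly given by m = 2595·log10(1 + f/700); since the slide shows only the plot, these standard constants are an assumption, not necessarily the report's exact values:

```cpp
#include <cmath>

// Standard mel-scale conversion and its inverse (assumed constants).
double hz_to_mel(double f) { return 2595.0 * std::log10(1.0 + f / 700.0); }
double mel_to_hz(double m) { return 700.0 * (std::pow(10.0, m / 2595.0) - 1.0); }
```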

Page 9: Triangular filters [figure]

Page 10: Cepstrum Analysis and Homomorphic Deconvolution

• Cepstrum analysis is a nonlinear signal processing technique.

• It is useful in speech processing and recognition applications.

• Bogert, Healy and Tukey defined the cepstrum and quefrency in 1963.

• Oppenheim (1964) defined homomorphic systems.

Page 11

• "The transformation of a signal into its cepstrum is actually a homomorphic transformation that maps the convolution into addition."

• Let us have a sampled signal x[n] that is composed of the sum of a signal v[n] and an echo (a shifted and scaled copy) of it:

x[n] = v[n] + a·v[n − n0] = v[n] * (δ[n] + a·δ[n − n0])

Page 12

Since convolution in the time domain corresponds to multiplication in the frequency domain,

X(e^(jω)) = V(e^(jω)) · (1 + a·e^(−jωn0))

Take the magnitude of both sides:

|X(e^(jω))| = |V(e^(jω))| · |1 + a·e^(−jωn0)|

Page 13

The nonlinear operation applied in finding the cepstrum is the logarithm. So, take the logarithm of each side:

log|X(e^(jω))| = log(|V(e^(jω))| · |1 + a·e^(−jωn0)|)

Since the logarithm of a product is the sum of the logarithms,

log|X(e^(jω))| = log|V(e^(jω))| + log|1 + a·e^(−jωn0)|

Define the log-spectrum X̂(e^(jω)) = log|X(e^(jω))|, and similarly for V.

Page 14

Now, to go back to the time domain, we use the inverse DTFT (IDTFT). Finally, one can obtain the following quefrency-domain equation for the cepstrum:

c[n] = (1/2π) ∫ from −π to π of log|X(e^(jω))| · e^(jωn) dω
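Numerically, the real cepstrum of a frame can be sketched as follows (naive DFTs for brevity; this is not the author's implementation, which is in the report's appendix):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Real cepstrum of a frame: c[n] = IDFT( log|DFT(x)| ).
// Naive DFTs are used for clarity; an FFT would be used in practice.
std::vector<double> real_cepstrum(const std::vector<double>& x) {
    const double PI = 3.14159265358979323846;
    size_t N = x.size();
    std::vector<double> logmag(N);
    for (size_t k = 0; k < N; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (size_t n = 0; n < N; ++n)
            acc += x[n] *
                   std::exp(std::complex<double>(0.0, -2.0 * PI * k * n / N));
        logmag[k] = std::log(std::abs(acc) + 1e-12);  // guard against log(0)
    }
    std::vector<double> c(N);
    for (size_t n = 0; n < N; ++n) {
        double acc = 0.0;
        for (size_t k = 0; k < N; ++k)
            acc += logmag[k] * std::cos(2.0 * PI * k * n / N);  // real IDFT
        c[n] = acc / N;
    }
    return c;
}
```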

Page 15: Speech Production Model based on Cepstrum Analysis

• Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of air flow caused by the opening and closing of the glottis.

• Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through it, so that turbulence is created, producing a noise-like excitation.

• Plosive sounds are produced by completely closing the vocal tract, building up pressure behind the closure, and then suddenly releasing the pressure.

Page 16

Figure 17: Discrete-time speech production model, picture courtesy of Oppenheim, Discrete-Time Signal Processing, [5].

Page 17: Parameters in the model

• 1. The coefficients of V(z), the mathematical representation of the vocal tract, which is simply a general IIR filter; the locations of its poles and zeros change the sound.

Page 18

• 2. The mode of excitation of the vocal tract system: a periodic impulse train or random noise.

• 3. The amplitude of the excitation signal.

• 4. The pitch period of the excitation for voiced speech, i.e., the reciprocal of the fundamental frequency of the voiced sound.
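As an illustrative sketch (not the report's code), the parameters above can be exercised by driving a simple all-pole stand-in for V(z) with a periodic impulse train:

```cpp
#include <cmath>
#include <vector>

// Source-filter synthesis sketch: a periodic impulse train (voiced
// excitation with the given pitch period) passed through a two-pole
// IIR resonator standing in for the vocal-tract filter V(z).
// V(z) = 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2)
std::vector<double> synthesize(int n_samples, int pitch_period,
                               double pole_radius, double pole_angle) {
    double a1 = -2.0 * pole_radius * std::cos(pole_angle);
    double a2 = pole_radius * pole_radius;
    std::vector<double> y(n_samples, 0.0);
    double y1 = 0.0, y2 = 0.0;  // previous two outputs
    for (int n = 0; n < n_samples; ++n) {
        double x = (n % pitch_period == 0) ? 1.0 : 0.0;  // impulse train p[n]
        double out = x - a1 * y1 - a2 * y2;
        y2 = y1;
        y1 = out;
        y[n] = out;
    }
    return y;
}
```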

Page 19

Let us assume that the model is valid and fixed over a short time period of 10 ms, so we can apply cepstrum analysis to a short segment of length L (= 1024) samples.

A window w[n] is applied so that the segment tapers smoothly to zero at both ends. Therefore, the input to the homomorphic system will be

x[n] = w[n] · (p[n] * v[n])

Page 20

If we further assume that w[n] varies slowly with respect to the variations of v[n], the cepstrum analysis reduces to

c_x[n] ≈ c_v[n] + c_pw[n], where pw[n] = w[n]·p[n] and c_(·) denotes the cepstrum of each component.

If p[n] is a train of impulses with period N0, then c_pw[n] is also a train of impulses located at multiples of N0 (the pitch period).

Page 21

By applying cepstrum analysis, we obtain a quefrency-domain signal in which the slowly varying vocal-tract component concentrates near the origin, while the excitation contributes peaks at multiples of the pitch period, so the two components can be separated.

Page 22: MFCC and delta coefficients calculation

Page 23: Clustering and Classification

• K-Means clustering is applied to each training file to generate the confusion matrix and tables.

• KNN is applied to recognize some test words.
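The report's implementation is documented in its appendix; purely as an illustrative sketch, a minimal k-nearest-neighbour classifier over feature vectors might look like:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Sample {
    std::vector<double> features;  // e.g. MFCC + delta coefficients
    int label;                     // phoneme class index
};

// Minimal KNN: squared Euclidean distance, majority vote among the
// k nearest training samples.
int knn_classify(const std::vector<Sample>& train,
                 const std::vector<double>& x, int k) {
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    for (const Sample& s : train) {
        double d2 = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            double diff = s.features[i] - x[i];
            d2 += diff * diff;
        }
        dist.push_back({d2, s.label});
    }
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    int best_label = -1, best_count = 0;
    for (int i = 0; i < k; ++i) {
        int label = dist[i].second, count = 0;
        for (int j = 0; j < k; ++j)
            if (dist[j].second == label) ++count;
        if (count > best_count) { best_count = count; best_label = label; }
    }
    return best_label;
}
```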

Page 24: Vowels, unequal a-priori probabilities [table]

Page 25: Vowels, equal a-priori probabilities, each has 97 feature vectors [table]

Page 26: Vowels, equal a-priori probabilities, each has 194 feature vectors [table]

Page 27: Consonants, unequal a-priori probabilities [table]

Page 28: Consonants, equal a-priori probabilities, each has 194 feature vectors [table]

Page 29: Confusion table for consonants [table]

Page 30: KNN classification [table]

Page 32: References

• [1] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C++: The Art of Scientific Computing, Cambridge University Press, 2002.

• [2] S. S. Stevens, J. Volkmann and E. B. Newman, A Scale for the Measurement of the Psychological Magnitude Pitch, J. Acoust. Soc. Am., Vol. 8, Issue 3, pp. 185-190, 1937.

• [3] X. Huang, A. Acero and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey, 2001.

• [4] L. Muda, M. Begam and I. Elamvazuthi, Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, Vol. 2, Issue 3, pp. 138-143, 2010.

• [5] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd edition, Pearson International.

• [6] S. B. Davis and P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, Haskins Laboratories, Status Report on Speech Research, 1980.

• [7] J. Ye, Speech Recognition Using Time Domain Features from Phase Space Reconstructions, PhD thesis, Marquette University, Wisconsin, US, 2004.

• [8] B. Plannerer, An Introduction to Speech Recognition, 2005.

• [9] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, 2000.

• [10] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series.

• [11] H. Artuner, The Design and Implementation of a Turkish Speech Phoneme Clustering System, PhD thesis, Hacettepe University, Turkey, 1994.