voice recognition mfcc
CHAPTER 2
LITERATURE SURVEY
2.1 Adaptive Frequency Cepstral Coefficients for Word Mispronunciation
Detection
In 2011, Sudhendu R. Sharma and Mark J.T. Smith observed that systems based on automatic speech
recognition (ASR) technology can provide important functionality in computer-assisted language
learning applications. This is a young but growing area of research motivated by the large
number of students studying foreign languages. They propose a Hidden Markov Model (HMM)
based method to detect mispronunciations. Exploiting the specific dialog scripting employed in
language learning software, HMMs are trained for different pronunciations. New adaptive
features have been developed and obtained through an adaptive warping of the frequency scale
prior to computing the cepstral coefficients. The optimization criterion used for the warping
function is to maximize separation of two major groups of pronunciations (native and non-
native) in terms of classification rate. Experimental results show that the adaptive frequency
scale yields a better coefficient representation leading to higher classification rates in comparison
with conventional HMMs using Mel-frequency cepstral coefficients.
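The idea of warping the frequency scale before computing cepstral coefficients can be illustrated with a small sketch. The survey does not give the authors' actual warping function or optimization procedure, so the one-parameter family below (`warp`, with parameter `a`) is an assumption chosen only because `a = 700` reproduces the conventional mel scale. The `classification_rate` callback is likewise hypothetical: it stands for whatever train-and-evaluate routine returns the native versus non-native classification rate for a given warp.

```python
import math

def warp(freq_hz, a=700.0):
    """Map a frequency in Hz onto a warped scale.
    Assumed family: a = 700 gives the conventional mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / a)

def unwarp(value, a=700.0):
    """Inverse of warp(): map a warped-scale value back to Hz."""
    return a * (10.0 ** (value / 2595.0) - 1.0)

def filter_centres(n_filters, f_max_hz, a=700.0):
    """Centre frequencies (Hz) of filters spaced evenly on the warped scale."""
    step = warp(f_max_hz, a) / (n_filters + 1)
    return [unwarp(step * (i + 1), a) for i in range(n_filters)]

def best_warp(candidates, classification_rate):
    """Return the warp parameter with the highest classification rate.
    classification_rate is a caller-supplied (hypothetical) function."""
    return max(candidates, key=classification_rate)

# Smaller values of a push the filter centres towards low frequencies.
mel_centres = filter_centres(5, 4000.0, a=700.0)
low_centres = filter_centres(5, 4000.0, a=300.0)
```

Changing `a` moves the filter bank, and therefore changes the cepstral coefficients computed from it; an adaptive scheme searches over such parameters instead of fixing the mel scale in advance.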
In this mispronunciation detection project, the talkers are 20 native speakers of English who have
completed a one-year college-level introductory Spanish course and 20 native Spanish-speaking
students. The corpus comprises 10 Spanish words, and each speaker pronounces each word 10 times.
The human scoring jury is composed of 22 adult native speakers of Spanish. Scores range from 1
(poor) to 7 (excellent) based on the level of mispronunciation. When training the correct and
incorrect pronunciation groups, outliers are removed so that the samples within each group are
more homogeneous. Examples of outliers are samples from non-native speakers that are pronounced
well enough to lie closer to the mean score of the native group, or vice versa.
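The outlier removal step can be sketched as follows. The exact rule used by the authors is not stated in this survey; the version below assumes that a sample is an outlier when its jury score is closer to the mean score of the other group than to the mean of its own group. The group labels and the `remove_outliers` function are illustrative names only.

```python
from statistics import mean

def remove_outliers(samples):
    """Keep samples whose score is at least as close to their own group mean
    as to the other group's mean (assumed rule, see the text above).

    samples: list of (group, score), group is "native" or "non_native".
    """
    groups = ("native", "non_native")
    means = {g: mean(s for grp, s in samples if grp == g) for g in groups}
    kept = []
    for group, score in samples:
        other = "non_native" if group == "native" else "native"
        if abs(score - means[group]) <= abs(score - means[other]):
            kept.append((group, score))
    return kept

# Scores on the 1 (poor) to 7 (excellent) scale; the values are made up.
scored = [("native", 7), ("native", 6), ("native", 2),
          ("non_native", 2), ("non_native", 3), ("non_native", 6.5)]
homogeneous = remove_outliers(scored)
```

Here the poorly scored native sample and the well scored non-native sample are dropped, leaving two groups that are better separated for training.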
SRI SATYA SAI INSTITUTE OF SCIENCE & TECHNOLOGY, SEHORE
Isolated Word Speech Recognition System Using Mel Spectrum and Dynamic Time Warping
2.2 Unsupervised Intralingual and Cross-Lingual Speaker Adaptation for
HMM-Based Speech Synthesis Using Two-Pass Decision Tree
Construction
In 2011, Matthew Gibson and William Byrne published a paper on unsupervised intralingual and
cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree
construction. The paper first presents an approach to the unsupervised speaker adaptation task
for HMM-based speech synthesis models which avoids the need for a supplementary set of acoustic
models to estimate the transcription of the adaptation data. This is achieved by defining a
mapping between HMM-based synthesis models and
ASR-style models, via a two-pass decision tree construction process. Second, it is shown that
this mapping also enables unsupervised adaptation of HMM-based speech synthesis models
without the need to perform linguistic analysis of the estimated transcription of the adaptation
data. Third, this paper demonstrates how this technique lends itself to the task of unsupervised
cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of
such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation
methods deliver performance approaching that of supervised adaptation.
2.3 Proposed Work
Although HMM-based speech recognition systems are fairly accurate, the amount of processing
they require is not suitable for mobile devices. Thus, there is a need for a speech recognition
system capable of recognizing speech using fewer resources than an HMM. This means sacrificing
some accuracy, but small mobile devices cannot meet the resource demands of an HMM, and the
operating systems running on such small platforms cannot provide the memory the HMM algorithm
requires. A simpler approach is dynamic programming, which is implemented in this project.
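As an illustration of the dynamic programming approach, the sketch below computes a Dynamic Time Warping (DTW) distance between two sequences of feature vectors and picks the closest stored template. It is a minimal example under stated assumptions, not the implementation used in this project: the Euclidean local distance, the three-way step pattern and the names `dtw_distance` and `recognise` are all illustrative choices.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences of equal-dimension feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognise(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Toy one-dimensional "features"; real input would be per-frame mel features.
templates = {"yes": [[0.0], [1.0], [2.0]], "no": [[5.0], [5.0], [5.0]]}
slow_yes = [[0.0], [0.0], [1.0], [1.0], [2.0]]
word = recognise(slow_yes, templates)
```

The cost table needs memory proportional to the product of the two sequence lengths, and since each row depends only on the previous one it can be reduced to two rows. No model training is involved, only a small set of stored templates.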
The idea for dynamic programming came from the fact that HMMs are employed where strict
speaker-dependent speech recognition is necessary, not where word-to-word recognition
of isolated words is required. A graphical user interface is proposed based on our work on
dynamic programming, and it includes a facility to store ten different voices.