voice recognition mfcc


CHAPTER 2

LITERATURE SURVEY

2.1 Adaptive Frequency Cepstral Coefficients for Word Mispronunciation Detection

In 2011, Sudhendu R. Sharma and Mark J. T. Smith noted that systems based on automatic speech recognition (ASR) technology can provide important functionality in computer-assisted language learning applications. This is a young but growing area of research, motivated by the large number of students studying foreign languages. They propose a Hidden Markov Model (HMM) based method to detect mispronunciations. Exploiting the specific dialog scripting employed in language learning software, HMMs are trained for different pronunciations. New adaptive features are obtained through an adaptive warping of the frequency scale prior to computing the cepstral coefficients. The optimization criterion for the warping function is to maximize the separation of the two major groups of pronunciations (native and non-native) in terms of classification rate. Experimental results show that the adaptive frequency scale yields a better coefficient representation, leading to higher classification rates than conventional HMMs using Mel-frequency cepstral coefficients.
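To make the idea of warping the frequency scale concrete, the following is a minimal sketch of how filter centre frequencies move when the warping curve changes. It assumes a one-parameter family of curves in which `break_hz = 700` gives the widely used mel formula; this family, the parameter name `break_hz`, and the function names are illustrative assumptions and are not the warping function optimized by Sharma and Smith.

```python
import math


def warp(f_hz, break_hz=700.0):
    """Map a frequency in Hz onto a warped scale.

    break_hz=700 reproduces the common mel formula; other values are a
    hypothetical way to obtain a different (adapted) warping.
    """
    return 2595.0 * math.log10(1.0 + f_hz / break_hz)


def unwarp(w, break_hz=700.0):
    """Inverse of warp(): map a warped-scale value back to Hz."""
    return break_hz * (10.0 ** (w / 2595.0) - 1.0)


def filter_centers(n_filters, f_lo, f_hi, break_hz=700.0):
    """Centre frequencies (Hz) of filters equally spaced on the warped scale."""
    lo, hi = warp(f_lo, break_hz), warp(f_hi, break_hz)
    step = (hi - lo) / (n_filters + 1)
    return [unwarp(lo + step * (i + 1), break_hz) for i in range(n_filters)]


# Conventional mel spacing versus a more strongly compressed scale.
mel_centers = filter_centers(20, 0.0, 4000.0)
alt_centers = filter_centers(20, 0.0, 4000.0, break_hz=300.0)
```

A smaller `break_hz` packs more filters into the low frequencies. An adaptive scheme would search over such a parameter (or a richer warping function) and keep the one that best separates the two pronunciation groups; that search is not shown here.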

In this mispronunciation detection project, the talkers are 20 native speakers of English who have completed a one-year college-level introductory Spanish course and 20 native Spanish-speaking students. The corpus comprises 10 Spanish words, and each speaker pronounces each word 10 times. The human scoring jury is composed of 22 adult native speakers of Spanish. Scores range from 1 (poor) to 7 (excellent) based on the level of mispronunciation. When training the correct and incorrect pronunciation groups, outliers are removed so that the samples within each group are more homogeneous. Outliers here are samples from non-native speakers that were pronounced well enough to score closer to the mean of the native group, or vice versa.
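The outlier rule described above can be sketched as follows. This is one plausible reading, assuming each sample carries a single jury score and that a sample is dropped when its score lies closer to the other group's mean than to its own. The function name `remove_outliers` and the scores in the example are made up for illustration and do not come from the paper.

```python
from statistics import mean


def remove_outliers(native_scores, nonnative_scores):
    """Keep only samples whose score is at least as close to their own group mean."""
    native_mean = mean(native_scores)
    nonnative_mean = mean(nonnative_scores)

    def closer_to_own(score, own_mean, other_mean):
        return abs(score - own_mean) <= abs(score - other_mean)

    native_kept = [s for s in native_scores
                   if closer_to_own(s, native_mean, nonnative_mean)]
    nonnative_kept = [s for s in nonnative_scores
                      if closer_to_own(s, nonnative_mean, native_mean)]
    return native_kept, nonnative_kept


# Illustrative scores on the 1 (poor) to 7 (excellent) scale.
native = [6.5, 6.8, 6.1, 3.0]      # the 3.0 sample resembles the non-native group
nonnative = [2.5, 3.1, 6.4, 2.0]   # the 6.4 sample resembles the native group
native_kept, nonnative_kept = remove_outliers(native, nonnative)
```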

SRI SATYA SAI INSTITUTE OF SCIENCE & TECHNOLOGY, SEHORE

Isolated Word Speech Recognition System Using Mel Spectrum and Dynamic Time Warping


2.2 Unsupervised Intralingual and Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis Using Two-Pass Decision Tree Construction

In 2011, Matthew Gibson and William Byrne published this work on unsupervised speaker adaptation for HMM-based speech synthesis. Earlier unsupervised approaches used a supplementary set of acoustic models to estimate the transcription of the adaptation data. The paper first presents an approach to unsupervised speaker adaptation for HMM-based speech synthesis models that avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models via a two-pass decision tree construction process. Second, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Third, the paper demonstrates how this technique lends itself to unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation.

2.3 Proposed Work

Although HMM-based speech recognition systems are fairly accurate, the amount of processing they require is not suitable for mobile devices. Thus, there is a need for a speech recognition system capable of recognizing speech using fewer resources than an HMM. This means sacrificing some accuracy, but small mobile devices cannot meet the resource demands of an HMM, and operating systems running on small platforms cannot provide the memory the HMM algorithm requires. A simpler approach is dynamic programming, which is implemented in this project.

The idea of using dynamic programming came from the fact that HMMs are employed where strict speaker-dependent speech recognition is necessary, not where word-by-word recognition of isolated words is required. A graphical user interface based on our work on dynamic programming is proposed, and there is also a facility to store ten different voices.
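The dynamic programming approach named in the project title is dynamic time warping (DTW). A minimal sketch is given here, assuming each utterance has already been converted to a sequence of feature vectors (for example, mel-spectrum frames) and that one template is stored per word. The function names `dtw_distance` and `recognize` and the toy one-dimensional features are illustrative assumptions, not the project's actual implementation.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best alignment between two feature sequences.

    a and b are sequences of equal-dimension feature vectors (tuples or lists).
    """
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])   # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # frame of a repeated in b
                                 cost[i][j - 1],      # frame of b repeated in a
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


def recognize(features, templates):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Toy example: the input is a slower version of the "yes" template.
templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(2.0,), (1.0,), (0.0,)],
}
spoken = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
best_word = recognize(spoken, templates)
```

The table has one cell per pair of frames, so memory grows with the product of the two sequence lengths. This sketch has no model to train and stores only one template per word.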

