

A Seminar Report

On

SPEECH RECOGNITION

In partial fulfillment of requirements for the degree

Third Year Computer Engineering

By

GAIKWAD SURAJ VITTHAL Exam Seat No. : T-80694222

Roll No. : 22

Under the guidance of

Prof. S. R. LAHANE

DEPARTMENT OF COMPUTER ENGINEERING

University of Pune

Gokhale Education Society’s

R. H. Sapat College of Engineering, Management Studies and Research,

Nashik - 422 005, (M.S.), INDIA

[2012 – 2013]


This is to certify that the seminar report entitled “SPEECH RECOGNITION”, being submitted herewith by GAIKWAD SURAJ VITTHAL (T-80694222), represents his successfully completed seminar work in partial fulfillment of the requirements for the degree of Third Year Computer Engineering of the University of Pune.

Date:

Place: GES COEMSR, NASHIK

Gokhale Education Society’s

R. H. Sapat College of Engineering, Management Studies and Research,

Nashik - 422 005, (M.S.), INDIA

Prof. S. R. LAHANE

Seminar Guide

Prof. N. V. Alone

Head of the Department


ABSTRACT

Language is man's most important means of communication, and speech is its primary medium. Spoken interaction, both between human interlocutors and between humans and machines, is inescapably embedded in the laws and conditions of communication, which comprise the encoding and decoding of meaning as well as the mere transmission of messages over an acoustical channel. Here we deal with this interaction between man and machine through synthesis and recognition applications.

Speech recognition involves capturing and digitizing the sound waves, converting them to basic language units or phonemes, constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike. Speech recognition is the ability of a computer to recognize general, naturally flowing utterances from a wide variety of users; it recognizes the caller's answers to move along the flow of the call.

Emphasis is given to the modeling of speech units and grammar on the basis of Hidden Markov Models & Neural Networks. Speech recognition allows you to provide input to an application with your voice. The applications and limitations of this subject illustrate the impact of speech processing in our modern technical field.

While there is still much room for improvement, current speech recognition systems show remarkable performance, and as we develop this technology we attain further achievements. Rather than asking what is still deficient, we ask instead what should be done to make the technology efficient.


TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Introduction
1.2 Speech Recognition

Chapter 2: Literature Survey
2.1 Speech Recognition Process
2.2 Structure of a Standard Speech Recognition System
2.3 Types of Speech Recognition Systems

Chapter 3: System Analysis
3.1 Speech Recognition Algorithms
3.1.1 Dynamic Time Warping
3.1.2 Hidden Markov Model
3.1.3 Neural Networks

Chapter 4: Discussion
4.1 Speech Recognition Software
4.2 Advantages & Disadvantages
4.2.1 Advantages
4.2.2 Disadvantages
4.3 Applications

Chapter 5: Conclusion & Future Scope
5.1 Conclusion
5.2 Future Scope

Acknowledgment

Bibliography


LIST OF ABBREVIATIONS

HMM: Hidden Markov Model

SR: Speech Recognition

SRS: Speech Recognition System

OOV: Out of Vocabulary

DTW: Dynamic Time Warping

ASR: Automatic Speech Recognition

OS: Operating System

LVCSR: Large Vocabulary Continuous Speech Recognition

IRIS: Intelligent Rival Imitator of SIRI


LIST OF FIGURES

Figure No. Title
1.1 Speech Recognition
2.1 Typical Speech Recognition System
2.2 Signal analysis converts raw speech to speech frames
2.3 Acoustic models: template and state representations
2.4 The alignment path with the best total score identifies the word sequence and segmentation
3.1 Simple HMM with two states & two output symbols
3.2 Unit activations for a neural network
4.1 Julius SR Engine Interface
4.2 Google Now Interface
4.3 Dragon NaturallySpeaking Interface
4.4 Windows Speech Recognition Interface


CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

Have you ever talked to your computer? (And no, yelling at it when your Internet connection goes down, or making polite chit-chat with it as you wait for all 25 MB of that very important file to download, doesn't count.) I mean, have you really, really talked to your computer? Where it actually recognized what you said and then did something as a result? If you have, then you've used a technology known as speech recognition.

Speech recognition allows you to provide input to a system with your voice. Just like clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by talking. In the desktop world, you need a microphone to be able to do this.

1.2 SPEECH RECOGNITION

Speech recognition (sometimes referred to as Automatic Speech Recognition) is the process by which a computer (or other type of machine) identifies spoken words. Basically, it means talking to a computer & having it correctly understand what you are saying. By “understand” we mean that the application reacts appropriately, or converts the input speech to another medium of conversation that is perceivable by another application, which can process it properly & provide the user with the required result.

The days when you had to keep staring at the computer screen and frantically hit the keys or click the mouse for the computer to respond to your commands may soon be a thing of the past. Today you can stretch out, relax, and tell your computer to do your bidding. This has been made possible by ASR (Automatic Speech Recognition) technology.

Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace, or reduce the reliance on, standard keyboard and mouse input. This can especially assist the following:

• People who have little keyboard skill or experience, who are slow typists, or who do not have the time or resources to develop keyboard skills.


• Dyslexic people, or others who have problems with character or word use and manipulation in textual form.

• People with physical disabilities that affect either their data entry or their ability to read (and therefore check) what they have entered.

Figure 1.1 – Speech Recognition


CHAPTER 2

LITERATURE SURVEY

2.1 SPEECH RECOGNITION PROCESS

In humans, speech or acoustic signals are received by the ears & then transmitted to the brain for understanding & extracting the meaning out of the speech, & then for reacting to it appropriately. Speech-recognition-enabled computers and devices work on the same principle. They receive the acoustic signal through a microphone; these signals are in analog form & need to be digitized to be understood by the system. The signals are therefore digitized & sent to the processing unit for extracting the meaning out of the signals & giving the desired output to the user.

Any speech recognition system involves the following five major steps:

1. Signal Processing

The sound is received through the microphone in the form of analog electrical signals. These signals consist of the voice of the user & the noise from the surroundings. The noise is removed & the signals are converted into digital signals. These digital signals are then converted into a sequence of feature vectors.

(Feature Vector - If you have a set of numbers representing certain features of an object you want to describe, it is useful for further processing to construct a vector out of these numbers by assigning each measured value to one component of the vector.)

2. Speech Recognition

This is the most important part of the process; here the actual recognition is done. The sequence of feature vectors is decoded into a sequence of words. This decoding is done on the basis of algorithms such as the Hidden Markov Model, Neural Networks, or Dynamic Time Warping. The program has a large dictionary of common words that exist in the language. Each feature vector is matched against the sound & converted into the appropriate character group. The system checks for and compares words that are similar in sound to the formed character groups, and all these similar words are collected.


3. Semantic Interpretation

Here the system checks whether the language allows a particular syllable to appear after another. After that, there is a grammar check: it tries to find out whether or not the combination of words makes any sense.

4. Dialog Management

Any errors encountered are corrected where possible. Then the meaning of the combined words is extracted & the required task is performed.

5. Response Generation

After the task is performed, the response or the result of that task is generated. The response is either in the form of speech or text. Which words to use, so as to maximize the user's understanding, is decided here. If the response is to be given in the form of speech, then a Text-to-Speech conversion process is used.

2.2 STRUCTURE OF A STANDARD SPEECH RECOGNITION SYSTEM

Figure 2.1 – Typical Speech Recognition System


The structure of a standard speech recognition system is illustrated in Figure 2.1. The elements are as follows:

Raw speech - Speech is typically sampled at a high frequency, e.g., 16 kHz over a microphone or 8 kHz over a telephone. This yields a sequence of amplitude values over time.

Signal analysis - Raw speech should be initially transformed and compressed, in order to simplify subsequent processing. Many signal analysis techniques are available which can extract useful features and compress the data by a factor of ten without losing any important information.

Figure 2.2 - Signal analysis converts raw speech to speech frames.

Speech frames - The result of signal analysis is a sequence of speech frames, typically at 10-millisecond intervals, with about 16 coefficients per frame. These frames may be augmented by their own first and/or second derivatives, providing explicit information about speech dynamics; this typically leads to improved performance. The speech frames are used for acoustic analysis.
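To make this concrete, here is a minimal sketch of the framing step (assuming NumPy; the 10 ms frame length, the crude log-energy band features, and the delta augmentation are illustrative choices standing in for a real front end such as MFCC analysis):

```python
import numpy as np

def frame_features(samples, rate=16000, frame_ms=10, n_coeffs=16):
    """Slice raw speech into frames and compute toy spectral features."""
    frame_len = int(rate * frame_ms / 1000)          # samples per 10 ms frame
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Magnitude spectrum of each frame, pooled into n_coeffs log-energy bands.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    bands = np.array_split(spectrum, n_coeffs, axis=1)
    feats = np.stack([np.log(b.sum(axis=1) + 1e-10) for b in bands], axis=1)
    # Augment with first derivatives (deltas), as described above.
    deltas = np.diff(feats, axis=0, prepend=feats[:1])
    return np.hstack([feats, deltas])                # (n_frames, 2 * n_coeffs)

# Usage: one second of noise yields ~100 frames of 32 features each.
signal = np.random.randn(16000)
print(frame_features(signal).shape)                  # (100, 32)
```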

Acoustic models - In order to analyze the speech frames for their acoustic content, we need a set of acoustic models. There are many kinds of acoustic models, varying in their representation, granularity, context dependence, and other properties. During training, the acoustic models are incrementally modified in order to optimize the overall performance of the system. During testing, the acoustic models are left unchanged.


Figure 2.3 - Acoustic models: template and state representations for the word “cat”.

Acoustic analysis and frame scores - Acoustic analysis is performed by applying each acoustic model over each frame of speech, yielding a matrix of frame scores, as shown in Figure 2.3. Scores are computed according to the type of acoustic model that is being used. For template-based acoustic models, a score is typically the Euclidean distance between a template's frame and an unknown frame. For state-based acoustic models, a score represents an emission probability, i.e., the likelihood of the current state generating the current frame, as determined by the state's parametric or non-parametric function.
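For the template-based case, the frame-score matrix is simply a table of pairwise distances. A minimal sketch (assuming NumPy; `template` and `unknown` are arrays of feature frames, one row per frame, such as those produced by the framing sketch above):

```python
import numpy as np

def frame_scores(template, unknown):
    """Euclidean distance between every template frame and every unknown frame.

    Returns a (len(template), len(unknown)) matrix; lower scores mean a
    better acoustic match. This is the matrix that time alignment searches.
    """
    diff = template[:, None, :] - unknown[None, :, :]   # broadcast pairwise
    return np.sqrt((diff ** 2).sum(axis=2))
```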


Figure 2.4 - The alignment path with the best total score identifies the word sequence and segmentation.

Time alignment - Frame scores are converted to a word sequence by identifying a sequence of acoustic models, representing a valid word sequence, which gives the best total score along an alignment path through the matrix. The process of searching for the best alignment path is called time alignment.

An alignment path must obey certain sequential constraints which reflect the fact that speech always goes forward, never backwards. These constraints are manifested both within and between words. Within a word, sequential constraints are implied by the sequence of frames (for template-based models), or by the sequence of states (for state-based models) that comprise the word, as dictated by the phonetic pronunciations in a dictionary, for example. Between words, sequential constraints are given by a grammar, indicating what words may follow what other words.

Time alignment can be performed efficiently by dynamic programming, a general algorithm which uses only local path constraints, and which has linear time and space requirements.


(This general algorithm has two main variants, known as Dynamic Time Warping (DTW) and Viterbi search, which differ slightly in their local computations and in their optimality criteria.)

In a state-based system, the optimal alignment path induces a segmentation of the word sequence, as it indicates which frames are associated with each state. This segmentation can be used to generate labels for recursively training the acoustic models on the corresponding frames.

Word sequence - The end result of time alignment is a word sequence - the sentence hypothesis for the utterance. In fact, it is common to return several such sequences, namely the ones with the highest scores, using a variation of time alignment called N-best search. This allows a recognition system to make two passes through the unknown utterance: the first pass can use simplified models in order to quickly generate an N-best list, and the second pass can use more complex models in order to carefully rescore each of the N hypotheses and return the single best hypothesis, as sketched below.
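Schematically, the two-pass strategy looks like this (a sketch in Python; `cheap_score` and `careful_score` are hypothetical stand-ins for the simplified and complex models, not part of any real decoder's API):

```python
import heapq

def recognize_two_pass(hypotheses, cheap_score, careful_score, n=10):
    """Two-pass N-best search: fast pruning, then careful rescoring.

    hypotheses: candidate word sequences; cheap_score and careful_score
    map a hypothesis to a score (higher is better).
    """
    # Pass 1: keep only the N best hypotheses under the simplified models.
    n_best = heapq.nlargest(n, hypotheses, key=cheap_score)
    # Pass 2: rescore the survivors with the more complex models.
    return max(n_best, key=careful_score)
```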


2.3 TYPES OF SPEECH RECOGNITION SYSTEMS

Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of SR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they're using.

Isolated Word

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. This doesn't mean that the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses).

Connected Word

Connected word systems (or more correctly, 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run together' with a minimal pause between them.

Continuous Speech

Recognizers with continuous speech capabilities are some of the most difficult to create, because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation.

Spontaneous Speech

At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

Voice Verification/Identification

Some ASR systems have the ability to identify specific users by characteristics of their voices (voice biometrics). If the speaker claims to be of a certain identity and the voice is used to verify this claim, this is called verification or authentication. On the other hand, identification is the task of determining an unknown speaker's identity. In a sense, speaker verification is a 1:1 match, where one speaker's voice is matched to one template (also called a "voice print" or "voice model"), whereas speaker identification is a 1:N match, where the voice is compared against N templates.

There are two types of voice verification/identification systems, which are as follows (a schematic comparison of the two matching modes appears after this list):

Text-Dependent:

If the text must be the same for enrollment and verification, this is called text-dependent recognition. In a text-dependent system, prompts can either be common across all speakers (e.g., a common pass phrase) or unique. In addition, the use of shared secrets (e.g., passwords and PINs) or knowledge-based information can be employed in order to create a multi-factor authentication scenario.

Text-Independent:

Text-independent systems are most often used for speaker identification, as they require very little if any cooperation by the speaker. In this case the text during enrollment and test is different. In fact, the enrollment may happen without the user's knowledge, as in the case of many forensic applications. As text-independent technologies do not compare what was said at enrollment and verification, verification applications tend to also employ speech recognition to determine what the user is saying at the point of authentication.

In text-independent systems both acoustic and speech analysis techniques are used.
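The 1:1 versus 1:N distinction can be sketched as follows (assuming NumPy; real systems model voice prints far more richly, but fixed-length vectors and cosine similarity, used here as a stand-in, illustrate the matching logic):

```python
import numpy as np

def similarity(a, b):
    """Cosine similarity between two voice-print vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(sample, claimed_template, threshold=0.8):
    """1:1 match: does the voice match the claimed identity's template?"""
    return similarity(sample, claimed_template) >= threshold

def identify(sample, templates):
    """1:N match: which enrolled speaker does the voice resemble most?"""
    return max(templates, key=lambda name: similarity(sample, templates[name]))
```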


CHAPTER 3

SYSTEM ANALYSIS

3.1 SPEECH RECOGNITION ALGORITHMS

3.1.1 Dynamic Time Warping

The Dynamic Time Warping algorithm is one of the oldest and most important algorithms in speech recognition. The simplest way to recognize an isolated word sample is to compare it against a number of stored word templates and determine the "best match". This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations. This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the word; in other words, the optimal alignment between a template and the speech sample may be nonlinear. Dynamic Time Warping (DTW) is an efficient method for finding this optimal nonlinear alignment.
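A minimal DTW sketch is shown below (assuming NumPy; the Euclidean frame cost and the basic match/insert/delete step pattern are the simplest common choices, not those of any particular recognizer):

```python
import numpy as np

def dtw_distance(template, unknown):
    """Total cost of the best nonlinear alignment between two sequences.

    template, unknown: arrays of shape (n_frames, n_features).
    Uses the classic step pattern: match, insertion, deletion.
    """
    n, m = len(template), len(unknown)
    cost = np.linalg.norm(template[:, None] - unknown[None, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated alignment cost
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # deletion
                                                 acc[i, j - 1],      # insertion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m]

# Isolated-word recognition: pick the stored template with the lowest cost.
# templates = {"yes": yes_feats, "no": no_feats}
# word = min(templates, key=lambda w: dtw_distance(templates[w], unknown_feats))
```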

3.1.2 Hidden Markov Model

The most flexible and successful approach to speech recognition so far has been Hidden Markov Models (HMMs). A Hidden Markov Model is a collection of states connected by transitions. It begins in a designated initial state. In each discrete time step, a transition is taken to a new state, and then one output symbol is generated in that state. The choice of transition and output symbol are both random, governed by probability distributions.

Figure 3.1 – Simple HMM with two states & two output symbols


Formally, an HMM consists of the following elements:

$\{s\}$ = a set of states.

$\{a_{ij}\}$ = a set of transition probabilities, where $a_{ij}$ is the probability of taking the transition from state $i$ to state $j$.

$\{b_i(u)\}$ = a set of emission probabilities, where $b_i$ is the probability distribution over the acoustic space describing the likelihood of emitting each possible sound $u$ while in state $i$.

Since $a_{ij}$ and $b_i(u)$ are both probabilities, they must satisfy the following properties:

$$a_{ij} \geq 0, \quad b_i(u) \geq 0, \quad \forall\, i, j, u$$

$$\sum_j a_{ij} = 1, \quad \forall\, i$$

$$\sum_u b_i(u) = 1, \quad \forall\, i$$
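To illustrate how these elements are used in decoding, the following sketch runs the Viterbi search mentioned in Chapter 2 on a toy discrete-symbol HMM (all probabilities are invented for illustration; real recognizers use continuous emission densities over feature vectors):

```python
import numpy as np

# Toy two-state HMM with two output symbols (cf. Figure 3.1); the numbers
# here are made up purely for illustration.
a = np.array([[0.6, 0.4],       # a[i, j]: probability of transition i -> j
              [0.0, 1.0]])
b = np.array([[0.9, 0.1],       # b[i, u]: probability of emitting symbol u in state i
              [0.2, 0.8]])
initial = np.array([1.0, 0.0])  # the designated initial state

def viterbi(obs):
    """Most likely state sequence for a sequence of observed symbols."""
    delta = np.zeros((len(obs), len(a)))              # best score per state
    psi = np.zeros((len(obs), len(a)), dtype=int)     # backpointers
    delta[0] = initial * b[:, obs[0]]
    for t in range(1, len(obs)):
        scores = delta[t - 1][:, None] * a            # (prev state, next state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[:, obs[t]]
    # Backtrace from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1]))    # -> [0, 0, 1, 1]: stays, then jumps to state 1
```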

3.1.3 Neural Networks

A neural network consists of many simple processing units (artificial neurons), each of which is connected to many other units. Each unit has a numerical activation level (analogous to the firing rate of real neurons). The only computation that an individual unit can do is to compute a new activation level based on the activations of the units it is connected to. The connections between units are weighted, and the new activation is usually calculated as a function of the sum of the weighted inputs from other units.

Some units in a network are usually designated as input units, which means that their activations are set by the external environment. Other units are output units; their values are set by the activation within the network, and they are read as the result of a computation. Those units which are neither input nor output units are called hidden units.


A given unit is typically updated in two stages: first we compute the unit's net input (or internal activation), and then we compute its output activation as a function of the net input. In the standard case, the net input $x_j$ for unit $j$ is just the weighted sum of its inputs:

$$x_j = \sum_i y_i w_{ij}$$

Here $y_i$ is the output activation of an incoming unit, & $w_{ij}$ is the weight from unit $i$ to unit $j$.

Figure 3.2 – Unit activations for a neural network.
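A minimal sketch of this two-stage update (assuming NumPy; the logistic squashing function is one common choice for the output function, used here purely for illustration):

```python
import numpy as np

def update_unit(y_in, w_j):
    """Two-stage update for one unit j.

    y_in: output activations of the incoming units (y_i).
    w_j:  weights from each incoming unit i to unit j (w_ij).
    """
    net_input = np.dot(y_in, w_j)             # x_j = sum_i y_i * w_ij
    return 1.0 / (1.0 + np.exp(-net_input))   # output activation f(x_j)

# Three incoming units feeding unit j through illustrative weights.
print(update_unit(np.array([0.2, 0.9, 0.5]), np.array([1.5, -0.7, 0.3])))
```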


CHAPTER 4

DISCUSSION

4.1 SPEECH RECOGNITION SOFTWARE

Plenty of speech recognition software is available in the market. It is available for various kinds of platforms, including smartphones, PCs, and tablets, & is designed for different operating systems as well.

Julius

Figure 4.1 – Julius SR Engine Interface

Open-source & freeware speech recognition engine

Developed by – Nagoya Institute of Technology

Developed in the C language

Operating systems – Unix, Windows

Language available in – Japanese


High-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers

Google Now

Figure 4.2 – Google Now Interface

An intelligent personal assistant software

Developed by – Google

Operating System – Android 4.1 & later

Language available in – English

Google Now is implemented as an aspect of the Google Search application. It recognizes repeated actions that a user performs on the device & displays more relevant information to the user in the form of "cards".

SIRI

An intelligent personal assistant and knowledge navigator software.

Developed by – Apple Inc.

Operating Systems – iOS 5 & later.


Platform – iPhone (4S and later), iPod Touch (5th generation), iPad (3rd generation and later)

Languages available in – English, French, German, Japanese, Chinese, Korean, Italian, Spanish

The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

S Voice

An intelligent personal assistant and knowledge navigator software.

Developed by – Samsung

Operating System – Android 4.0 & 4.1

Platform – Samsung Galaxy S III, Samsung Galaxy Note II, Samsung Galaxy Note 10.1, and Samsung Galaxy Stellar

Languages available in – English, Arabic, French, Spanish, Korean, Italian, and German

The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

Iris (Intelligent Rival Imitator of SIRI)

A personal assistant application for Android.

Developed by – Dextra Software Solutions (Narayan Babu & team, Kochi, India)

Operating System – Android

Developed in 8 hours

The application uses natural language processing to answer questions based on the user's voice request.

Iris can talk on topics ranging from philosophy, culture, history, and science to general conversation.


Dragon NaturallySpeaking

Figure 4.3 – Dragon NaturallySpeaking Interface

A speech recognition software package

Developed by - Nuance Communications

Operating System – Windows

The software has three primary areas of functionality: dictation, text-to-speech and command input.

Windows Speech Recognition

Figure 4.4 – Windows Speech Recognition Interface

A speech recognition application

Developed by – Microsoft

Operating System - Windows Vista, Windows 7 and Windows 8

Languages available in - English (U.S. and British), Spanish, German, French, Japanese

and Chinese

Allows the user to control the computer by giving specific voice commands. The program can also be used for the dictation of text, so that the user can enter text using their voice.

Has a fairly high recognition accuracy and provides a set of commands that assists in dictation.


4.2 ADVANTAGES & DISADVANTAGES

4.2.1 Advantages

Increases productivity

By speaking normally into the SRS program, you create documents at the speed you can compose them in your head. People without strong typing skills, or those who don't wish to be slowed down by manual input, can use voice recognition software to dramatically reduce document creation time.

Can help with menial computer tasks, such as browsing and scrolling

People are becoming lazier day by day, and are often not interested in doing even the necessary routine work. Previously there were punch cards to provide input to the system; then came the keyboard, trackball, touch screen, mouse, gesture control, joysticks, etc. All the previously used input methods require motion of the hands or fingers. But with SRS, users can provide input to the system through just their voice, and can complete most of their menial computer tasks easily.

Can help people with disabilities

More recently, students with learning or physical disabilities have been able to use SRS. Those with learning disabilities that affect their ability to write can now complete exams via voice recognition technology, and those with physical disabilities such as upper body paralysis can use SRS to communicate effectively with others.

Cost effective

In a study of traditional transcription services versus voice recognition software, Dr. Robert G. Zick and Dr. Jon Olsen found that using SRS had a slightly lower accuracy rate (98.5% vs. 99.7%), but was more cost effective overall.

Diminishes spelling mistakes

Even the most experienced typists will occasionally have a spelling blunder; the average person is likely to make several mistakes in his or her composition. SRS always provides the correct spelling of a word (assuming it translated it accurately in the first place), thus eliminating the need to spend time running spell checkers.

4.2.2 Disadvantages

• Inaccuracy & Slowness

Most people cannot type as fast as they speak. In theory, this should make voice recognition software faster than typing for entering text on a computer. However, this may not always be the case, because of the proofreading and correction required after dictating a document to the computer. Although voice recognition software may interpret your spoken words correctly the majority of the time, you might still need to make corrections to punctuation. Additionally, the software may not recognize words such as brand names or uncommon surnames until you add them to the program's library of words. SR systems are also unable to reliably distinguish words which are phonetically similar, e.g., "there" & "their".

• Vocal Strain

Using voice recognition software, you may find yourself speaking more loudly than in normal conversation. In 2000, Linda L. Grubbs of PC World magazine reported that this habit could lead to vocal cord injury. Although there is no definite scientific link established between the use of voice recognition software and damage to the voice, talking loudly for extended periods always carries the possibility of causing strain and hoarseness.

• Adaptability

Speech recognition software is not capable of adapting to various changing conditions, which include a different microphone, background noise, a new speaker, a new task domain, or even a new language. Under such changes, the efficiency of the software degrades drastically.

• Out-of-Vocabulary (OOV) Words

Systems have to maintain a huge vocabulary of words from different languages, & sometimes tuned to the user's phonetics as well. They are not capable of adjusting their vocabulary according to a change in users. Systems must have some method of detecting OOV words, and of dealing with them in a sensible way.


• Spontaneous Speech

Systems are unable to recognize speech properly when it contains disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.). Spontaneous speech remains a problem.

• Prosody

Systems are unable to process prosody (the study of speech rhythm, stress, and intonation). Stress, intonation, and rhythm convey important information for word recognition and for the user's intentions (e.g., sarcasm, anger).

• Accent, Dialect and Mixed Language

Most systems are built around the common accent of a particular language, but people's accents vary over a wide range, and dialects also vary by region. Systems are not capable of adjusting to all of these accent & dialect changes. People also sometimes use a mixed-language mode for conversation, while most SR systems work on a single language model at a time.

4.3 APPLICATIONS

Games and Edutainment

Speech recognition offers game and edutainment developers the potential to bring their applications to a new level of play. With games, for example, traditional computer-based characters could evolve into characters that the user can actually talk to.

Data Entry

Applications that require users to keyboard paper-based data into the computer (such as database front-ends and spreadsheets) are good areas for a speech recognition application. Reading data directly to the computer is much easier for most users and can significantly speed up data entry.

While speech recognition technology cannot effectively be used to enter names, it can enter numbers or items selected from a small (fewer than 100 items) list. Some recognizers can even handle spelling fairly well. If an application has fields with mutually exclusive data types (for example, one field allows "male" or "female", another is for age, and a third is for city), the speech recognition engine can process the command and automatically determine which field to fill in.

Document Editing

This is a scenario in which one or both modes of speech recognition could be used to dramatically improve productivity. Dictation would allow users to dictate entire documents without typing. Command and control would allow users to modify formatting or change views without using the mouse or keyboard. For example, a word processor might provide commands like "bold", "italic", "change to Times New Roman font", "use bullet list text style", and "use 18 point type." A paint package might have "select eraser" or "choose a wider brush."

Speaker Identification

Recognizing the speech patterns of various persons can be used to identify them individually. This can serve as a biometric authentication system, in which users authenticate themselves with the help of their speech. The various characteristics of speech, which involve frequency, amplitude & other special features, are captured & compared with a previously stored database.

Automation at Call Centers

Speech recognition can receive calls from a huge number of customers, answer them, or divert them to a particular customer care representative according to the customer's demand. It can be used to provide a faster response to the customer & better service.

Medical Disabilities

This technology is a great boon for the blind & handicapped, as they can utilize speech recognition technology for various tasks. Those who are unable to operate the computer through keyboard & mouse can operate it with just their voice.

Fighter Aircraft

Pilots in fighter aircraft have to keep a check on the various functions going on in the aircraft, and have to respond quickly to sudden changes in the aircraft's maneuvering. They can give commands with their voice. This requires building a pilot voice template beforehand; the actions are confirmed through visual or aural feedback.


CHAPTER 5

CONCLUSION & FUTURE SCOPE

5.1 CONCLUSION

Speech recognition will revolutionize the way people interact with smart devices & will, ultimately, differentiate the upcoming technologies. Almost all the smart devices coming to the market today are capable of recognizing speech. Many areas can benefit from this technology, and speech recognition can be used for the intuitive operation of computer-based systems in daily life.

This technology will spawn revolutionary changes in the modern world and become a pivotal technology. Within five years, speech recognition technology will become so pervasive in our daily lives that service environments lacking this technology will be considered inferior.

5.2 FUTURE SCOPE

Achieving efficient speaker-independent word recognition

SR systems will be speaker independent and will produce the same kind of output for a particular command irrespective of the user. SR systems will be able to process the voice commands of all users with very high accuracy & efficiency.

Ability to distinguish nuances of speech and meanings of words

SR systems will be able to distinguish between nuanced phrases & meaningful commands, & will be able to pick the proper command out of nuanced phrases correctly.

Stand-alone Speech Recognition Systems

Presently no stand-alone SR systems are available; all the SR systems developed so far are based on one or another preexisting hardware and software platform. But in the near future, stand-alone SR systems might be available in the market.


Wearable Speech Recognition Systems

SR systems will be embedded in wearable devices such as a wristwatch, necklace, or bracelet. There will be no need to carry bulky devices, and the technology can be used on the go.

Talk with all devices

All devices, including smartphones, computers, televisions, refrigerators, washing machines, etc., will be controlled with the voice commands of the user. There will be no need to have a remote or to press buttons on a device to interact with it.


ACKNOWLEDGMENT

I would like to avail this opportunity to express deep gratitude to my seminar guide Prof. S. R. Lahane who took keen interest in the topic and provided excellent guidance and motivation for the completion of my seminar. I would also like to thank Prof. N. V. Alone (Head of Department, Computer Engineering), Prof. Dr. P. C. Kulkarni (Principal, GES RHS COEMSR) and all the faculty members of the college for their help and support. I would also like to thank my parents and friends, without their continuous motivation, help and support this would not have been possible.

Suraj Vitthal Gaikwad T.E. Computer

Exam Seat Number: T80694222


BIBLIOGRAPHY

[1] Joe Tebelskis (1995), "Speech Recognition Using Neural Networks", School of Computer Science, Carnegie Mellon University.

[2] Kåre Sjölander (2003), "An HMM-based System for Automatic Segmentation and Alignment of Speech", Department of Philosophy and Linguistics, Umeå University.

[3] Klaus Ries (1999), "HMM and Neural Network Based Speech Act Detection", International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99).

[4] B. Plannerer (2005), "An Introduction to Speech Recognition".

[5] Kimberlee A. Kemble, "An Introduction to Speech Recognition", Voice Systems Middleware Education, IBM.

[6] Laura Schindler (2005), "A Speech Recognition and Synthesis Tool", Department of Mathematics and Computer Science, College of Arts and Science, Stetson University.

[7] Mikael Nilsson, Marcus Egnarsson (2002), "Speech Recognition Using HMM", Blekinge Institute of Technology.