speech recognition by iqbal

IT FOR MANAGERS

REPORT ON

SPEECH RECOGNITION SYSTEM

SUBMITTED TO DR. ROSHAN A. SHEIKH

MARCH, 2009

IQBAL S/O SHAHZAD

REGISTRATION # 9952

MBA(M) - SECTION A

Speech Recognition System IT Project

ABSTRACT

This report has been submitted to Dr. Roshan A. Sheikh of Iqra University Karachi as a requirement for the completion of the course IT for Managers for MBA students. I prepared this brief report on speech recognition systems after two weeks of study and research on the topic. I have done my best in presenting and explaining the concepts and in giving the report its proper form.

This report presents an overview of speech recognition technology, software, development and applications. It begins with an introduction to speech recognition technology, then explains how such systems work and the level of accuracy that can be expected. Applications of speech recognition technology in education and beyond are then explored. A brief comparison of the most common systems is presented, as well as notes on the main centres of speech recognition research in the UK educational sector. The report concludes with potential uses of speech recognition in education, probable main uses of the technology in the future, and a selection of key web-based resources. It also covers software that is being used for this purpose in homes and in business environments.

A video is also presented with this report, which shows an example of how speech recognition can be used in Windows Vista. I prepared this video myself on my personal computer. It is available in the soft copy of the project on the attached CD.

IQBAL P a g e | 1


TABLE OF CONTENTS

1. Introduction
   1.1 Introduction
   1.2 Closer Look

2. Terms and Concepts
   2.1 Utterances
   2.2 Pronunciation
   2.3 Grammar
   2.4 Speaker Dependence
   2.5 Accuracy
   2.6 Training

3. How Speech Recognition Works
   3.1 How Speech Recognition Works
   3.2 Acceptance and Rejection

4. Types of Speech Recognition
   4.1 Isolated Words
   4.2 Connected Words
   4.3 Continuous Speech
   4.4 Spontaneous Speech
   4.5 Voice Verification / Identification

5. Hardware
   5.1 Sound Cards
   5.2 Microphones
   5.3 Computers / Processors

6. Uses / Applications of Speech Recognition
   6.1 Military
      6.1.1 High Performance Fighter Aircraft
      6.1.2 Helicopters
      6.1.3 Training Air Traffic Controllers
   6.2 People with Disabilities
   6.3 Speech Recognition in Telephony Environment
      6.3.1 Communications Management and Personal Assistants
      6.3.2 General Information
      6.3.3 E-Commerce
   6.4 Potential Uses in Education
   6.5 Computer and Video Games
   6.6 Medical Transcription
   6.7 Mobile Devices
   6.8 Voice Security Systems

7. Future Applications
   7.1 Home / Domestic Appliances
   7.2 Wearable Computers
   7.3 Precision Surgery

8. Speech Recognition Software
   8.1 Free Software
   8.2 Commercial Software
      8.2.1 Dragon NaturallySpeaking
      8.2.2 IBM ViaVoice
      8.2.3 Microsoft Speech Recognition System
      8.2.4 MacSpeech Dictate
      8.2.5 Philips Speech Engine
      8.2.6 Other commercial software

9. Conclusion



1. INTRODUCTION

Have you ever talked to your computer? I mean, have you really, really talked to your computer? Where it actually recognized what you said and then did something as a result? If you have, then you've used a technology known as speech recognition.

Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Today speech technologies are commercially available for a limited but interesting range of tasks. These technologies enable machines to respond correctly and reliably to human voices, and provide useful and valuable services. While we are still far from having a machine that converses with humans on any topic like another human, many important scientific and technological advances have taken place, bringing us closer to machines that recognize and understand fluently spoken speech.

Speech recognition is simply the process of converting spoken input to text, and is thus sometimes referred to as speech-to-text. Speech recognition, also referred to as voice recognition, is software technology that lets the user control computer functions and dictate text by voice. For example, a person can move the mouse cursor with a voice command such as “mouse up”, control application functions such as opening a file menu, create documents such as letters or reports, or start a media player by saying “Music”.

1.2 A Closer Look

The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things:

The application can interpret the result of the recognition as a command. In this case, the application is a command and control application. An example of a command and control application is one in which the caller says “check balance”, and the application returns the current balance of the caller’s account.

If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, if you said “check balance,” the application would not interpret the result, but simply return the text “check balance”.
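The difference between the two kinds of application can be sketched in a few lines of Python. This is only an illustration and not how any particular product works: the function names, the command table and the account balance are made up for the example, and the recognized text is assumed to arrive from the engine as a plain string.

```python
# Hypothetical sketch: the same recognized text handled in two ways.
# ACCOUNT_BALANCE and the command table are illustrative assumptions.

ACCOUNT_BALANCE = 1250.00  # made-up balance for the example


def handle_as_command(recognized_text: str) -> str:
    """Command and control: interpret the text and act on it."""
    commands = {
        "check balance": lambda: f"Your balance is {ACCOUNT_BALANCE:.2f}",
    }
    action = commands.get(recognized_text.strip().lower())
    if action is None:
        return "Sorry, I did not understand that command."
    return action()


def handle_as_dictation(recognized_text: str) -> str:
    """Dictation: return the text unchanged, with no interpretation."""
    return recognized_text


if __name__ == "__main__":
    print(handle_as_command("check balance"))
    print(handle_as_dictation("check balance"))
```

The command and control handler looks the text up and performs an action, while the dictation handler simply passes the same text through.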



Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace, or reduce the reliance on, standard keyboard and mouse input. This can especially assist the following:

People who have little keyboard skills or experience, who are slow typists, or do not have the time or resources to develop keyboard skills.

Dyslexic people, or others who have problems with character or word use and manipulation in a textual form.

People with physical disabilities that affect either their data entry, or ability to read (and therefore check) what they have entered.

A speech recognition system consists of the following:

A microphone, for the person to speak into.

Speech recognition software.

A computer to take and interpret the speech.

A good quality soundcard for input and/or output.

Clear and correct pronunciation by the speaker.

However, systems on computers meant for more individual use, such as for personal word processing, usually require a degree of “training” before use. Here, an individual user “trains” the system to understand words or word fragments (see section 2.6); this training is often referred to as “enrolment”.



2. TERMS AND CONCEPTS

Following are a few of the basic terms and concepts that are fundamental to speech recognition. It is important to have a good understanding of these concepts.

2.1 Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence.

Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input (in other words, a lack of silence), the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs.

Utterances are sent to the speech engine to be processed. If the user doesn’t say anything, the engine returns what is known as a silence timeout - an indication that there was no speech detected within the expected timeframe, and the application takes an appropriate action, such as reprompting the user for input.

An utterance can be a single word, or it can contain multiple words (a phrase or a sentence). For example, “Word”, “Microsoft Word,” or “I’d like to run Microsoft Word” are all examples of possible utterances. Whether these words and phrases are valid at a particular point in a dialog is determined by which grammars are active. Note that there are small snippets of silence between the words spoken within a phrase. If the user pauses too long between the words of a phrase, the end of an utterance can be detected too soon, and only a partial phrase will be processed by the engine.
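The way silence marks the start and end of an utterance can be illustrated with a minimal Python sketch. It assumes the audio has already been reduced to one loudness value per short frame; the function name, the threshold of 0.1 and the three-frame end-of-utterance rule are assumptions chosen for the example, not values taken from any real engine.

```python
from typing import List, Tuple


def find_utterances(
    frame_energies: List[float],
    silence_threshold: float = 0.1,
    end_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of utterances; end is exclusive.

    An utterance starts at the first frame louder than the threshold and
    ends once `end_silence_frames` consecutive quiet frames follow it.
    """
    utterances = []
    start = None   # frame index where the current utterance began
    quiet_run = 0  # consecutive quiet frames seen inside an utterance

    for i, energy in enumerate(frame_energies):
        is_speech = energy > silence_threshold
        if start is None:
            if is_speech:
                start = i
                quiet_run = 0
        elif is_speech:
            quiet_run = 0
        else:
            quiet_run += 1
            if quiet_run >= end_silence_frames:
                # End of utterance: drop the trailing quiet frames.
                utterances.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0

    if start is not None:
        # The audio ended while still inside an utterance.
        utterances.append((start, len(frame_energies) - quiet_run))
    return utterances


if __name__ == "__main__":
    energies = [0, 0, 0.5, 0.6, 0.05, 0.7, 0, 0, 0, 0.4, 0.3, 0]
    print(find_utterances(energies))
    print(find_utterances(energies, end_silence_frames=1))
```

With the three-frame rule, the short pause inside the first phrase is tolerated and the phrase is kept whole. With a one-frame rule, the same pause ends the utterance early and the phrase is split, which is the "partial phrase" problem described above. An empty result corresponds to the silence timeout case.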

2.2 Pronunciation

The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like.

Words can have multiple pronunciations associated with them. For example, the word “the” has at least two pronunciations in the U.S. English language: “thee” and “thuh”.



2.3 Grammar

Grammars define the domain, or context, within which the recognition engine works. The engine compares the current utterance against the words and phrases in the active grammars. If the user says something that is not in the grammar, the speech engine will not be able to understand it correctly. For this reason, speech engines usually have very large grammars.

Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the speech recognition system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word. Entries can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand or more!
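As a rough illustration of how an active grammar limits what can be recognized, the sketch below treats the grammar as a small set of whole utterances and checks text against it. The set contents and the function name are hypothetical, and a real engine matches audio against the grammar rather than already-typed text.

```python
from typing import Optional, Set

# Hypothetical active grammar: a small vocabulary of whole utterances.
ACTIVE_GRAMMAR: Set[str] = {
    "word",
    "microsoft word",
    "i'd like to run microsoft word",
    "wake up",
}


def match_in_grammar(utterance: str, grammar: Set[str]) -> Optional[str]:
    """Return the matching grammar entry, or None if the utterance is not covered."""
    # Normalize case and collapse repeated spaces before the lookup.
    normalized = " ".join(utterance.lower().split())
    return normalized if normalized in grammar else None


if __name__ == "__main__":
    print(match_in_grammar("Microsoft  Word", ACTIVE_GRAMMAR))
    print(match_in_grammar("open excel", ACTIVE_GRAMMAR))
```

Anything outside the set returns None, which mirrors the point that the engine cannot correctly understand what is not in the grammar.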

2.4 Speaker Dependence

Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker’s individual voice characteristics to successfully process speech. The speech recognition engine can “learn” how you speak words and phrases; it can be trained to your voice.

Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most of them, such as IBM ViaVoice, are speaker dependent. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice.

Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the VoiceXML world must be speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site. You cannot require that each caller train the system to his or her voice. The speech recognition system in a voice-enabled web application MUST successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.



2.5 Accuracy

The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design. For example, if the user said "yes," the engine returned "yes," and the "YES" action was executed, it is clear that the desired result was achieved. But what happens if the engine returns text that does not exactly match the utterance? For example, what if the user said "nope," the engine returned "no," yet the "NO" action was executed? Should that be considered a successful dialog? The answer to that question is yes because the desired result was achieved.

Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage and represents the number of utterances recognized correctly out of the total number of utterances spoken. It is a useful measurement when validating grammar design. Using the previous example, if the engine returned "no" when the user said "nope," this would be considered a recognition error. Based on the accuracy measurement, you may want to analyze your grammar to determine if there is anything you can do to improve accuracy. For instance, you might need to add "nope" as a valid word to your grammar. You may also want to check your grammar to see if it allows words that are acoustically similar (for example, "repeat/delete," "Austin/Boston," and "Addison/Madison"), and determine if there is any way you can make the allowable words more distinctive to the engine.
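The two measurements discussed above can be computed side by side. The sketch below is a minimal illustration: the table mapping text to actions and the sample data are made up, and each pair holds what the user said followed by what the engine returned.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from text to the action it should trigger.
ACTION_FOR_TEXT: Dict[str, str] = {"yes": "YES", "no": "NO", "nope": "NO"}


def recognition_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Percentage of utterances recognized exactly as spoken."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    correct = sum(1 for spoken, recognized in pairs if spoken == recognized)
    return 100.0 * correct / len(pairs)


def task_success_rate(pairs: List[Tuple[str, str]]) -> float:
    """Percentage of utterances whose recognized text triggers the intended action."""
    if not pairs:
        raise ValueError("at least one utterance is required")
    succeeded = 0
    for spoken, recognized in pairs:
        intended = ACTION_FOR_TEXT.get(spoken)
        executed = ACTION_FOR_TEXT.get(recognized)
        if intended is not None and intended == executed:
            succeeded += 1
    return 100.0 * succeeded / len(pairs)


if __name__ == "__main__":
    # (what the user said, what the engine returned)
    samples = [("yes", "yes"), ("nope", "no"), ("no", "no"), ("yes", "no")]
    print(recognition_accuracy(samples))  # 50.0
    print(task_success_rate(samples))     # 75.0
```

The "nope"/"no" pair counts as a recognition error in the first measure but as a success in the second, because the "NO" action was still the desired result.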

Recognition accuracy is an important measure for all speech recognition applications. It is tied to grammar design and to the environment of the user. Good ASR (Automatic Speech Recognition) systems have an accuracy of 98% or more!

2.6 Training

Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR (Automatic Speech Recognition) system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.



Training can also be used by speakers that have difficulty speaking, or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.



3. HOW SPEECH RECOGNITION WORKS

Now that we've discussed some of the basic terms and concepts involved in speech recognition, let's put them together and take a look at how the speech recognition process works.

As you can probably imagine, the speech recognition engine has a rather complex task to handle, that of taking raw audio input and translating it to recognized text that an application understands. As shown in the diagram below, the major components we want to discuss are:

Audio input

Transformation of the digital audio into a better acoustic representation

Application of a "grammar" so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown English.

Acoustic model

Recognized text

The first thing we want to take a look at is the audio input coming into the recognition engine. It is important to understand that this audio stream is rarely pristine. It contains not only the speech data (what was said) but also background noise. This noise can interfere with the recognition process, and the speech engine must handle (and possibly even adapt to) the environment within which the audio is spoken.

As we've discussed, it is the job of the speech recognition engine to convert spoken input into text. To do this, it employs all sorts of data, statistics, and software algorithms. Its first job is to process the incoming audio signal and convert it into a format best suited for further analysis. Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating. The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string.

Most speech engines try very hard to find a match, and are usually very "forgiving." But it is important to note that the engine is always returning its best guess for what was said.

(This is an example of digital audio.)

3.2 Acceptance and Rejection

When the recognition engine processes an utterance, it returns a result. The result can be either of two states: acceptance or rejection. An accepted utterance is one in which the engine returns recognized text.

Whatever the caller says, the speech recognition engine tries very hard to match the utterance to a word or phrase in the active grammar. Sometimes the match may be poor because the caller said something that the application was not expecting, or the caller spoke indistinctly. In these cases, the speech engine returns the closest match, which might be incorrect. Some engines also return a confidence score along with the text to indicate the likelihood that the returned text is correct.

Not all utterances that are processed by the speech engine are accepted. Acceptance or rejection is flagged by the engine with each processed utterance.
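A simple way to picture acceptance and rejection is an engine that scores each entry of the active grammar and rejects the utterance when even the best score is too low. The sketch below assumes such scores are already available as numbers between 0 and 1; the class, the function name and the 0.5 cut-off are illustrative assumptions, not the behaviour of any specific engine.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecognitionResult:
    accepted: bool
    text: Optional[str]  # recognized text, or None when rejected
    confidence: float    # score of the best candidate


def decide(scores: Dict[str, float], reject_below: float = 0.5) -> RecognitionResult:
    """Pick the best-scoring grammar entry; reject if its score is too low."""
    if not scores:
        return RecognitionResult(False, None, 0.0)
    best_text = max(scores, key=scores.get)
    best_score = scores[best_text]
    if best_score < reject_below:
        return RecognitionResult(False, None, best_score)
    return RecognitionResult(True, best_text, best_score)


if __name__ == "__main__":
    print(decide({"check balance": 0.82, "check bills": 0.40}))
    print(decide({"check balance": 0.30, "check bills": 0.25}))
```

An application can also use the confidence value of an accepted result to decide whether to ask the caller for confirmation before acting on it.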



4. TYPES OF SPEECH RECOGNITION

Speech recognition systems can be separated into several different classes by describing what types of utterances they have the ability to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they're using.

4.1 Isolated Words

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. This does not mean that they accept only single words, but they do require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class.

4.2 Connected Words

Connected word systems (or more correctly 'connected utterances') are similar to isolated words, but allow separate utterances to be 'run together' with a minimal pause between them.

4.3 Continuous Speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation.

4.4 Spontaneous Speech

There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

4.5 Voice Verification/Identification

Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.



5. HARDWARE

5.1 Sound Cards

Because speech requires a relatively low bandwidth, just about any medium-to-high quality 16-bit sound card will get the job done. You must have sound enabled in your kernel, and you must have correct drivers installed. Sound card quality often starts a heated discussion about its impact on accuracy and noise.

Sound cards with the 'cleanest' A/D (analog to digital) conversions are recommended, but most often the clarity of the digital sample is more dependent on the microphone quality and even more dependent on the environmental noise. Electrical "noise" from monitors, PCI slots, hard drives, etc. is usually nothing compared to audible noise from computer fans, squeaking chairs, or heavy breathing.

Some ASR software packages may require a specific sound card. It's usually a good idea to stay away from specific hardware requirements, because it limits many of your possible future options and decisions. You'll have to weigh the benefits and costs if you are considering packages that require specific hardware to function properly.

5.2 Microphones

A quality microphone is key when utilizing ASR. In most cases, a desktop microphone just won't do the job. Desktop microphones tend to pick up more ambient noise, which gives ASR programs a hard time.

Hand held microphones are also not the best choice as they can be cumbersome to pick up all the time. While they do limit the amount of ambient noise, they are most useful in applications that require changing speakers often, or when speaking to the recognizer isn't done frequently (when wearing a headset isn't an option).

The best choice, and by far the most common, is the headset style. It allows the ambient noise to be minimized, while allowing you to have the microphone at the tip of your tongue all the time. Headsets are available without earphones and with earphones (mono or stereo). I recommend the stereo headphones, but it's just a matter of personal taste.

A quick note about levels: Don't forget to turn up your microphone volume. This can be done with a program such as XMixer or OSS Mixer, and care should be used to avoid feedback noise. If the ASR software includes auto-adjustment programs, use them instead, as they are optimized for their particular recognition system.

5.3 Computers/Processors

ASR applications can be heavily dependent on processing speed. This is because a large amount of digital filtering and signal processing can take place in ASR.

As with just about any CPU-intensive software, the faster the better. Also, the more memory the better. It's possible to do some SR with 100 MHz and 16 MB RAM, but for fast processing (large dictionaries, complex recognition schemes, or high sample rates), you should shoot for a minimum of 1 GHz and 1 GB RAM. Because of the processing required, most software packages list their minimum requirements.



6. USES / APPLICATIONS

6.1 Military

6.1.1 High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft, the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.

Some important conclusions from the work were as follows:

1. Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.

2. Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful — with lower recognition rates, pilots would not use the system.

3. More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained.

4. Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands.



6.1.2 Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade in speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included: control of communication radios; setting of navigation systems; and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition and in overall speech technology, in order to consistently achieve performance improvements in operational settings.

6.1.3 Training Air Traffic Controllers

Training for military air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty of the speech recognition task.

The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications. The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using specific vocabulary specifically designed for the ATC task. Research in France has focused on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in application of task-domain grammar constraints.

Another approach to ATC simulation with speech recognition has been created by Supremis. The Supremis system is not constrained by rigid grammars imposed by the underlying limitations of other recognition strategies.

6.2 People with Disabilities

It has been suggested that one of the most promising areas for the application of speech recognition is in helping handicapped people (Leggett and Williams, 1984). Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations, who cannot use a standard keyboard and mouse, can use their voices to navigate the computer and create documents. For example, Braille input/output devices, touch screen systems, and trackballs have all been used successfully in classrooms. The technology is also useful to people with learning disabilities who experience difficulty with spelling and writing. Some individuals with speech impairments may use speech recognition as a therapeutic tool to improve vocal quality. People with overuse or repetitive stress injuries also benefit from using speech recognition to operate their computers hands free. Speech recognition technology has great potential to provide people with disabilities greater access to computers and a world of opportunities.

Mr. Jones is a reporter who must submit his articles in HTML for publishing in an on-line journal. Over his twenty-year career, he has developed repetitive stress injury (RSI) in his hands and arms, and it has become painful for him to type. He uses a combination of speech recognition and an alternative keyboard to prepare his articles, but he doesn't use a mouse. It took him several months to become sufficiently accustomed to using speech recognition to be comfortable working for many hours at a time. There are some things he has not worked out yet, such as a sound card conflict that arises whenever he tries to use speech recognition on Web sites that have streaming audio. (Source : http://www.w3.org/WAI/EO/Drafts/PWD-Use-Web/).



6.3 Speech Recognition in Telephony Environment

William Meisel, who holds a Ph.D. in Electrical Engineering, ran a speech recognition company for ten years. He is president of the speech industry consulting firm TMA Associates and publisher and editor of the Speech Recognition Update newsletter. According to him:

Telephone speech recognition creates a Voice Web. Sites that support speech recognition constitute the Voice Web. Most sites today have individual phone numbers (typically toll-free). Such sites are often called "voice portals". There are, however, likely to be more popular voice portals than Web portals; every wireless and landline telephone service provider will eventually be a voice portal, and there will be independent, corporate, and specialized voice portals. VoiceXML, a new standard, created by the VoiceXML Forum (www.voicexml.org) and the W3C Voice Browser working group (www.w3.org/voice), is a way that companies can provide a voice-interactive application on a Web server without needing speech engines or telephone line interface hardware. The VoiceXML code is downloaded to the voice portal and executed by a VoiceXML interpreter, much as a Web browser on a PC interprets HTML.

(Source : William Meisel’s Guide Book on The Voice Web)
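To make the browser analogy in the passage above concrete, the following is a minimal sketch in Python of how an interpreter might walk a small VoiceXML document. The document text, the function name run_dialog and the simulated caller answers are assumptions made for illustration. A real interpreter drives speech engines and telephony hardware and supports far more of the standard than the few elements used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML document; the prompt wording is made up for illustration.
VXML = """
<vxml version="2.0">
  <form id="weather">
    <block><prompt>Welcome to the weather line.</prompt></block>
    <field name="city">
      <prompt>Which city?</prompt>
    </field>
  </form>
</vxml>
"""

def run_dialog(vxml_text, answers):
    """Walk the first form of the document in order.

    `answers` stands in for the speech recogniser (an assumption): it maps
    a field name to the text the caller is assumed to have said.
    Returns (prompts that would be spoken, field values collected).
    """
    root = ET.fromstring(vxml_text.strip())
    spoken, collected = [], {}
    for item in root.find("form"):
        # Every prompt inside the item would be played to the caller.
        for prompt in item.iter("prompt"):
            spoken.append(prompt.text.strip())
        # A field waits for the caller's answer and stores it by name.
        if item.tag == "field":
            name = item.get("name")
            collected[name] = answers.get(name)
    return spoken, collected

spoken, collected = run_dialog(VXML, {"city": "Karachi"})
print(spoken)
print(collected)
```

The point of the sketch is the division of labour: the application author only writes the document, while the voice portal supplies the interpreter and the speech resources.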

The Voice Web is not just an extension of the Internet, although information on existing Web sites can be used to support interactive voice services. It can run applications totally unlike visual Web applications and totally independent of the HTML-based Web. Some of the applications that the Voice Web is supporting are listed here.



6.3.1 Communications management and personal assistants

Communications management usually includes dialing by name using a personal directory. Personal-assistant functionality includes call screening, taking and accessing voice messages, and one-number access to the subscriber (scanning several subscriber numbers based on subscriber instructions). Other personalized features include maintaining a schedule and delivering reminders. Unified messaging includes features such as reviewing email or fax headers by phone using text-to-speech. Since subscribers will make calls through their personal assistant, the voice portal can potentially get additional revenues from providing bundled local and/or long-distance service.

Enterprise applications, such as voice-activated auto attendants that direct calls by name, can be a corporate voice portal. Corporate voice portals can also provide such services as reservations for a conference, location of a local store outlet, or a connection to customer service.

6.3.2 General information

General information includes weather, sports scores, horoscopes, general news, financial news, stock quotes, traffic conditions, and driving directions. Such information is intended to make a voice-enabled service part of a subscriber’s daily habit. Information can be customized, using, for example, the user’s personal stock portfolio or the user’s current location. As voice portals evolve, the caller will be able to "voicemark" specialized voice-equipped Web sites.

6.3.3 E-commerce

V-commerce supports a variety of transactions that can result in product or service sales. These include transactions similar to ordering from a Web site or a telephone catalog service. They also include finding a business by saying its trade name or its category.

Entertainment is part of e-commerce, and it will be part of the Voice Web. For example, the caller can use speech recognition to choose audio channels to listen to.

(Source : Receiver Magazine, Vodafone - 2001)



6.4 Potential uses in education

Contact with a number of practitioners and researchers in the field of speech recognition led to some interesting speculation regarding the feasible use of this technology in education.

1. Application: Teaching students of foreign languages to pronounce vocabulary correctly.
   Problems and likelihood: Unlikely in the near future on a large scale, due to the software training currently involved.

2. Application: Teaching overseas students to pronounce English correctly.
   Problems and likelihood: As for item 1.

3. Application: Making notes of observations during scientific experiments, so the scientist/researcher can focus on the observation without needing to view the monitor or keyboard. Similar to how a coroner verbally records notes during an autopsy.
   Problems and likelihood: Likely, and is probably already used in individual circumstances. Noise from the experiment, the researcher's need to rapidly record some observations, and the need for a vocabulary that understands the scientific terms present some issues.

4. Application: Enabling students who are physically handicapped and unable to use a keyboard to enter text verbally.
   Problems and likelihood: Used already, and becoming increasingly widespread.

5. Application: Enabling people with textual interpretive problems, e.g. dyslexia, to enter text verbally.
   Problems and likelihood: Used already, and becoming increasingly widespread.

6. Application: Restrictive access on a high security computer, where a keyboard or other input device may be used by hackers.
   Problems and likelihood: Interest from a number of people, though a lack of “proof of concept” research hinders further development. Unlikely to be available in the near future.

7. Application: Narrative-oriented research, where transcripts are automatically generated. This would remove the time to manually generate the transcript, and human error.
   Problems and likelihood: Likely in the near future. Current speech recognition technology forces an unacceptable compromise between accuracy and inhibiting the interviewee. Quicker and easier training systems for the interviewee will help, as will increases in portable computing processing power.

8. Application: Capturing the speech of a lecturer or tutor.
   Problems and likelihood: Unlikely on a large scale, due to vocabulary, training and interpretive issues. In addition, filming of the lecture results in audio and visual content combined, which may be more useful.

9. Application: Using a speech recognition system in an examination.
   Problems and likelihood: Very likely. Technically, this is possible, and within current UK examination guidelines this appears to be acceptable.

(Source: http://www.becta.org.uk/technology/speechrecog/docs/finalreport.pdf, the final report (June 2000) from an experimental project to see how effective speech recognition technologies could be for people with special educational needs.)

6.5 Computer and Video Games

Speech input has been used in a limited number of computer and video games, on a variety of PC and console-based platforms, over the past decade. For example, the game Seaman involved growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone, sold with the game, allowed the player to issue one of a pre-determined list of command words and questions to the fish. The accuracy of interpretation, in use, seemed variable; during gaming sessions colleagues with strong accents had to speak in an exaggerated and slower manner in order for the game to understand their commands.

Microphone-based games are available for two of the three main video game consoles (PlayStation 2 and Xbox). However, these games primarily use speech for online player-to-player conversation, rather than interpreting spoken words electronically. For example, MotoGP for the Xbox allows online players to ride against each other in a motorbike race simulation, and speak (via microphone headset) to the nearest players (bikers) in the race. There is currently interest in, but less development of, video games that interpret speech.

The Microsoft Xbox, Nintendo GameCube, and Sony PlayStation 2 consoles all offer games with speech input/output. Currently, most of these are war-action-shooter games, in which speech recognition provides high-level commands to virtual teammates who respond with a variety of recorded quips. Two examples are the graphically realistic, tactical, squad-based shooter games Ghost Recon 2 and SOCOM II: U.S. Navy Seals. Both games are available on the Sony PlayStation 2. The speech recognition systems for these games are provided by Fonix and ScanSoft, respectively.

In Ghost Recon 2, the user is the leader of a team of three secret Special Forces soldiers who must capture various military targets in North Korea in the year 2007. The team is critical to the user’s survival from enemy gunfire. Saying “Move out!” directs the team to move ahead of you as you make your way through the virtual, hilly terrain toward various objectives. The speech commands (“Move out,” “Covering fire,” “Grenade,” “Take point,” “Hold position,” “Regroup”) are easily recalled, high-level instructions to the team members. The commands that can be obeyed depend on the immediate situation. If you say, “Take point,” and the hostile fire is too great the designated team member may say, “No can do, Captain.” Occasionally, the retort is somewhat less respectful.

In SOCOM II: U.S. Navy Seals, a team of four men including the first person leader attempts to stop an arms smuggling group in rural Albania. The team has to avoid the enemy, meet an informant, blow up weapons caches, and make their escape. The speech commands in this game are spoken in three parts, using a simple grammar. The commands may be addressed to “Fireteam” (all other team members) or individuals like, “Able” (your partner). Then there are approximately 12 action commands including “Fire at will,” “Deploy,” “Move to,” “Get down,” and others. The third part of the command includes nine letters of the military alphabet (“Charlie,” “Delta,” etc.) indicating where the “Move to” and similar commands are intended. They represent the specific locations of game objectives.

(Source: Article from The Speech Technology Magazine Apr 2005, http://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=29432)
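The three-part command grammar described for SOCOM II can be sketched in Python as follows. This is an illustration only, not the game's actual implementation: the function name parse_command is hypothetical, the word lists contain only the words quoted in the text rather than the full vocabulary, and treating the location as optional is an assumption.

```python
# Vocabulary for each slot. Only the words quoted in the text are listed;
# the full game vocabulary is larger (about 12 actions, nine locations).
ADDRESSEES = ("fireteam", "able")
ACTIONS = ("fire at will", "deploy", "move to", "get down")
LOCATIONS = ("charlie", "delta")

def parse_command(text):
    """Split recognised text into (addressee, action, location).

    Returns None when the words do not fit the grammar. The location slot
    is treated as optional (an assumption), since a command such as
    "Get down" does not need one.
    """
    words = text.lower().replace(",", "").strip(" .!").split()
    if not words or words[0] not in ADDRESSEES:
        return None
    addressee, rest = words[0], " ".join(words[1:])
    for action in ACTIONS:
        if rest == action:
            return (addressee, action, None)
        if rest.startswith(action + " "):
            location = rest[len(action) + 1:]
            if location in LOCATIONS:
                return (addressee, action, location)
    return None

print(parse_command("Fireteam, move to Charlie!"))
print(parse_command("Able, get down"))
print(parse_command("Bravo, dance"))
```

Restricting each slot to a short word list is what keeps recognition manageable: at any point the recogniser only has to choose among a handful of alternatives.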

6.6 Medical Transcription

Medical transcription, also known as MT, is an allied health profession that deals with the process of transcription: converting voice-recorded reports, as dictated by physicians and/or other healthcare professionals, into text format.

Every day, doctors scour the market looking for new ways to simplify their office routines and reduce their costs. Medical transcription software saves them time and money. The speech recognition product produces accurate and fully formatted transcriptions from clinicians' dictations. The goal is to minimize editing time by MTs and, as a result, increase MT productivity. It interprets and formats a document, so that it is close to a final product.



Benefits:

- Organized and formatted document sections
- Punctuation inserted even if not spoken
- Numbers interpreted and presented appropriately. This includes dosages, measurements, lists, etc.
- Formatting based on each organization’s preferences and specifications
- Inserts speech-activated ‘normals’
- No explicit training required
- Continually learns and improves from MT edits

Examples:

When a clinician dictates: "Exam…vital signs…two twelve…eighty eight and regular…thirteen…BP one forty one hundred and one thirty five ninety five"

Speech Recognition software can output: PHYSICAL EXAMINATION: VITAL SIGNS: Weight 212, pulse 88 and regular, respiration 13, blood pressure is 140/100, 135/95.

When a provider says: "The following problems were reviewed…hypertension …please enter my hypertension template…use my normal cad"

Speech Recognition software can output: PROBLEMS: The following problems were reviewed:

Hypertension: No headache, visual disturbance, chest pain, palpitation, focal neurologic complaint, dyspnea, edema, claudication, or complaint from current medication.

Coronary artery disease: No chest pain, dyspnea, PND, orthopnea, palpitation, weakness, syncope, or obvious problems related to medications.
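The speech-activated 'normals' in the second example can be pictured as a lookup from a trigger phrase to stored text. The Python sketch below is a simplified illustration, not the way any particular product works: the names NORMALS and expand_normals are hypothetical, and it assumes the recogniser has already split the dictation into phrases. The template wording is taken from the example output above.

```python
# Hypothetical store of one clinician's 'normals': trigger phrase -> template.
NORMALS = {
    "please enter my hypertension template": (
        "Hypertension: No headache, visual disturbance, chest pain, palpitation, "
        "focal neurologic complaint, dyspnea, edema, claudication, or complaint "
        "from current medication."
    ),
    "use my normal cad": (
        "Coronary artery disease: No chest pain, dyspnea, PND, orthopnea, "
        "palpitation, weakness, syncope, or obvious problems related to medications."
    ),
}

def expand_normals(phrases):
    """Replace each recognised trigger phrase with its stored template.

    Phrases that are not triggers are passed through unchanged.
    """
    return [NORMALS.get(p.strip().lower(), p.strip()) for p in phrases]

dictation = ["The following problems were reviewed", "use my normal cad"]
for line in expand_normals(dictation):
    print(line)
```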

6.7 Mobile Devices

The growth of cellular telephony combined with recent advances in speech recognition technology results in sizeable potential opportunities for mobile speech recognition applications. Speech recognition has already been introduced in mobile phones, but there is a lot of work to be done in this particular field. When speech recognition was first introduced in mobile phones, it was used to call a contact by saying the contact's name. The user first needed to record a voice clip of each contact's name and associate it with the respective contact. When the user said a name, the phone compared it with the recorded clips and then called the person whose name was spoken.

New smartphones are introduced every month. These phones don’t require the names to be recorded first. They have their own speech system, which can read names written in English. When the user says a name, the phone uses its speech system to compare the spoken sound with the saved contacts and then calls the contact whose name was spoken.
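The matching step in name dialing can be sketched in Python as below. This is a stand-in rather than how a phone actually works: a real handset compares sounds, whereas this sketch assumes the recogniser has already produced text and uses the standard library's difflib string similarity. The function name find_contact, the contact names and the phone numbers are all hypothetical.

```python
import difflib

def find_contact(recognised_name, contacts, cutoff=0.6):
    """Return the number of the contact whose name best matches, or None.

    `contacts` maps contact names to phone numbers. `cutoff` is the minimum
    similarity (0 to 1) required before the phone would place a call.
    """
    by_lower = {name.lower(): name for name in contacts}
    matches = difflib.get_close_matches(
        recognised_name.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return contacts[by_lower[matches[0]]]

# Hypothetical phone book.
contacts = {"Ayesha Khan": "555-0101", "Bilal Ahmed": "555-0102"}
print(find_contact("aisha khan", contacts))   # slightly misrecognised name
print(find_contact("zzzz", contacts))         # nothing close enough
```

The cutoff reflects a design choice any such system has to make: too low and the phone calls the wrong person, too high and it keeps asking the user to repeat the name.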

Nuance Communications has launched Nuance Mobile Speech Platform that will improve the text-to-speech and speech recognition abilities of mobile devices. Through this platform, end users will be able to perform searches, dictate emails and SMS messages, and have any incoming emails and messages read out to them, which will improve the usability and efficiency of mobile devices.

The Nuance Mobile Speech Platform can be used to speech-enable a mobile application, and specifically offers pre-built components for the following:

Nuance Local Search - search business names and categories, residential listings, weather, dining and entertainment, movies, etc.

Nuance Mobile Navigation - voice destination entry (including street addresses, businesses and points of interest) and spoken turn-by-turn directions.

Nuance Content Search - search catalogs with items in music, video, games and more.

Nuance Mobile Web Search - search the Web from a mobile device.

Nuance Mobile Communications - compose email, SMS, and IM messages by speaking.

(Source: Nuance Communications http://www.nuance.com)

6.8 Voice Security Systems

Voice Security Systems technology uses a person's voice print to uniquely identify individuals using biometric speaker verification technology. Speech is processed through a non-contact method; you do not need to see or to touch the person to be able to recognize them. The popularity of speaker verification is swiftly growing because speech is easy to obtain without the addition of dedicated hardware. Improved, robust speech recognition algorithms and PC hardware have also brought this one-time futuristic idea into the present.



At Voice Security Systems, a decade of research and development has led them to believe that the explosive speech processing market is here to stay. Their Voice Protect® method of biometric voice authentication is ideally suited for low-memory, database-independent applications using smart cards or other physical devices such as cell phones. Due to the value of biometric security for use in fraud prevention, and the added convenience of knowing a person is who they claim to be, they believe speaker verification will be more widely accepted by the consumer market before speech recognition.

Voice Security Systems states that it can deliver biometric security technology to the market at a lower cost than anyone else in the industry, with no recurring maintenance costs such as database management or complicated user training. Once the Voice Protect® technology is built into a product, it will continue to function independently for the life of the product.

Voice security systems can be applied in our daily lives, for example in garage door openers, computers and laptops, automobiles, PDAs and handheld devices, smartcard applications, cell phones, door access systems and ATMs.

(Source: Voice Security Systems Inc. http://www.voice-security.com/)
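The general idea of speaker verification, comparing a fresh voice sample against a stored voice print, can be sketched in Python as below. This is not the Voice Protect® method, whose details are not given here. It assumes, purely for illustration, that each voice print is a short list of numbers and that closeness is measured by cosine similarity against a fixed threshold; the function names and the sample values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_speaker(enrolled_print, sample_print, threshold=0.9):
    """Accept the claimed identity if the sample is close to the stored print.

    Only the single enrolled print is needed, so it could be kept on the
    device itself (for example a smart card) rather than in a database.
    """
    return cosine_similarity(enrolled_print, sample_print) >= threshold

enrolled = [0.2, 0.8, 0.5]        # stored at enrolment (made-up values)
same_person = [0.25, 0.75, 0.5]   # new sample from the same speaker
impostor = [0.9, 0.1, 0.2]        # sample from someone else

print(verify_speaker(enrolled, same_person))
print(verify_speaker(enrolled, impostor))
```

Note the difference from speech recognition: verification answers a yes/no question about who is speaking, not what was said.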



7. FUTURE APPLICATIONS

There are a number of scenarios where speech recognition is either being delivered, developed for, researched or seriously discussed. As with many contemporary technologies, such as the Internet, online payment systems and mobile phone functionality, development is at least partially driven.

IBM intends to have better-than-human Automatic Speech Recognition by 2010. Bill Gates predicted that by 2011 the quality of ASR will catch up to humans. Justin Rattner from Intel said in 2005 that by 2015, computers will have "strong capabilities" in speech-to-text.

At some point in the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words. Although it is a huge leap in terms of computational power and software sophistication, some researchers argue that speech recognition development offers the most direct line from the computers of today to true artificial intelligence. We can talk to our computers today. In 25 years, they may very well talk back.

7.1 Home Appliances

Designers have developed very convenient user interfaces to consumer appliances. What could be easier than pressing buttons on a remote control to select television channels or flipping a switch to turn on a light? These types of direct manipulation user interfaces will continue to be widely used. However, because current buttons and switches are not intelligent, you cannot ask your remote control when "Star Trek" is on, and you must walk to the light switch before turning the light on. Speech enables consumer appliances to act intelligently, responding to speech commands and answering verbal questions. For example, speech enhances consumer appliances by enabling the user to say instructions such as:

1. To the VCR: "Record tonight's 'Star Trek'."
2. To the coffeepot: "Start at 6:30 a.m. tomorrow."
3. To the light switch: "Turn on the lights one half-hour before sunset."

There is, inevitably, interest in the use of speech recognition in domestic appliances such as ovens, refrigerators, dishwashers and washing machines. One school of thought is that, like the use of speech recognition in cars, this can reduce the number of parts and therefore the cost of production of the machine. However, removal of the normal buttons and controls would present problems for people who, for physical or learning reasons, cannot use speech recognition systems.

7.2 Wearable Computers

Perhaps the most futuristic application is in the use and functionality of wearable computers, i.e. unobtrusive devices that you can wear like a watch, or that are even embedded in your clothes. These would allow people to go about their everyday lives, but still store information (thoughts, notes, to-do lists) verbally, or communicate via email, phone or videophone, through wearable devices. Crucially, this would be done without having to interact with the device, or even remember that it is there; the user would just speak, the device would know what to do with the speech, and would carry out the appropriate task.

The rapid miniaturization of computing devices, the rapid rise in processing power, and advances in mobile wireless technologies, are making these devices more feasible. There are still significant problems, such as background noise and the idiosyncrasies of an individual’s language, to overcome. However, it is speculated that reliable versions of such devices will become commercially available during this decade.

The conventional human-computer interface such as the GUI, which assumes a keyboard, mouse, and bit-map display, is insufficient for the wearable environment. Although handwritten character recognizers and keyboards that can be used with one hand have been developed as input devices for computers, speech recognition has recently received more interest. The main reason for this is that it permits both hands and eyes to be kept free and therefore is less restricted in its use and can achieve quicker communication. In addition, speech can convey not only linguistic information but also the emotion and identity of speakers. IBM’s wearable PC has a microphone in its controller and can recognize speech as soon as ViaVoice has been installed.



7.3 Precision Surgery

Developments in keyhole and micro surgery have clearly shown that keeping invasive or non-essential surgery to a minimum improves success rates and shortens patient recovery times. There is occasional speculation in various medical fora regarding the use of speech recognition in precision surgery, where a procedure is partially or totally carried out by automated means.

For example, in removing a tumour or blockage without damaging surrounding tissue, a command could be given to make an incision of a precise and small length e.g. 2 millimetres. However, the legal implications of such technology are a formidable barrier to significant developments in this area. If speech was incorrectly interpreted and e.g. a limb was accidentally sliced off, who would be liable – the surgeon, the surgery system developers, or the speech recognition software developers?



8. SPEECH RECOGNITION SOFTWARE

Modern speech recognition software enables a single computer user to speak text and/or commands to the computer, largely, but not entirely, bypassing the use of the keyboard and mouse interface.

The idea has been portrayed in science fiction for many decades, quite frequently depicting computers that do not even have keyboards or mice. Such computers are also typically depicted as being able to keep up no matter how fast a person speaks, and without regard to who the speaker is, the language spoken, or even how many speakers there are. In other words, they're depicting a computer that hears in like manner as a multilingual person.

Attempts to develop usable speech recognition software began in the mid-1900s, and proved to be far more daunting than anyone had imagined. It also turned out to require so much computing power that only the most modern computers are now able to perform the functions required in real time (i.e., as fast as you can speak).

The first commercially practical products became available around 1990, (e.g. the Voice Navigator, a standalone computer dedicated 100% to speech recognition) and used up all the available computing power of the machine, which would send its output to a second computer. They weren't particularly accurate and could only understand a single person at a time, requiring retraining, not of the operator but of the machine itself, to work for another person. Despite these limitations, they could type so rapidly that even after taking time to make corrections, a person with disabilities could easily accomplish more work with the machine than without it. For persons with physical disabilities, the ability to simply talk to your computer could be a priceless asset. Consider for instance, an author with Parkinson's disease who can barely control his hands, yet is conveniently able to create an article.

8.1 Free Software

Many software packages are available for speech recognition, and many of them are free of cost. Some free packages are:

XVoice (http://www.compapp.dcu.ie/~tdoris/Xvoice/ and http://www.zachary.com/creemer/xvoice.html)



CVoiceControl/kVoiceControl (http://www.kiecza.de/daniel/linux/index.html)

Ears (ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/)

NICO ANN Toolkit (http://www.speech.kth.se/NICO/index.html)

Myers' Hidden Markov Model Software (http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html)

Jialong He's Speech Recognition Research Tool (http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html)

Open Mind Speech (http://freespeech.sourceforge.net)

GVoice (http://www.cse.ogi.edu/~omega/gnome/gvoice/)

ISIP (http://www.isip.msstate.edu/project/speech/)

CMU Sphinx (http://www.speech.cs.cmu.edu/sphinx/Sphinx.html)

8.2 Commercial Software

8.2.1 Dragon NaturallySpeaking

Dragon NaturallySpeaking is almost universally regarded in reviews as the best voice-recognition software, with the potential for 99.8 percent accuracy (reviews say 95 percent is more realistic). NaturallySpeaking integrates easily with Microsoft productivity software. The Preferred version can also be used with a compatible digital-audio recorder, MP3 player/recorder or PDA for recording voice notes or lectures on the go; NaturallySpeaking will later transcribe your recordings. Reviews say Dragon NaturallySpeaking is the most sophisticated product on the market, but that if you have Windows Vista or plan to buy a new computer with it, you should try the voice-recognition capabilities included with Vista, which by most accounts are nearly as robust as Dragon NaturallySpeaking.

(Source: http://www.nuance.com/naturallyspeaking/)
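To see what the gap between 99.8 and 95 percent accuracy means in practice, the Python sketch below converts an accuracy figure into an expected number of wrong words, and shows one common way such a figure can be measured: the word error rate, i.e. the word-level edit distance between a correct transcript and the recognised text, divided by the number of words in the correct transcript. The reviews cited do not say how they measured accuracy, so treating accuracy as one minus the word error rate is an assumption, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

def expected_errors(accuracy, words):
    """Approximate number of wrong words in a document of `words` words."""
    return round((1 - accuracy) * words)

print(word_error_rate("the quick brown fox", "the quack brown"))
print(expected_errors(0.998, 1000), expected_errors(0.95, 1000))
```

Under that assumption, a 1,000-word dictation would need roughly 2 corrections at 99.8 percent accuracy but roughly 50 at 95 percent, which is why a few percentage points matter so much to the user.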

8.2.2 IBM ViaVoice

IBM ViaVoice is a range of language-specific continuous speech recognition software products offered by IBM. The current version is designed primarily for use in embedded devices.

Individual language editions may have different features, specifications, technical support, and microphone support. Some of the products or editions available are:

Advanced Edition, Standard Edition, Personal Edition, ViaVoice for Mac OS X Edition, Pro USB Edition, Simply Dictation for Mac.

Prior to the development of ViaVoice, IBM developed a product named VoiceType. In 1997, ViaVoice was first introduced to the general public. Two years later, in 1999, IBM released a free of charge version of ViaVoice.

I didn't find a single review that recommends ViaVoice over Dragon NaturallySpeaking, but ViaVoice is the only program that will run on older or less powerful computers. Dragon NaturallySpeaking is extremely demanding (you need at the very least 512 MB RAM, a recent processor and 1 GB free hard-drive space). However, reviews say ViaVoice isn't as accurate as Dragon NaturallySpeaking, and mistakes aren't as easy to correct. ViaVoice hasn't been updated in years.

(Source: http://www.ibm.com/software/speech/)



8.2.3 Microsoft Speech Recognition System

In 1993, Microsoft hired Xuedong Huang from CMU to lead its speech efforts. Microsoft has been involved in research on speech recognition and text to speech. The company's research eventually led to the development of the Speech API (SAPI).

Speech recognition technology has been used in some of Microsoft's products, including Microsoft Dictation (a research prototype that ran on Windows 9x). It was also included in Office XP, Office 2003, Microsoft Plus! for Windows XP, Windows XP Tablet PC Edition, and Windows Mobile (as Microsoft Voice Command). However, prior to Windows Vista, speech recognition was not mainstream. In response, Windows Speech Recognition was bundled with Windows Vista and released in 2006, making the operating system the first mainstream version of Microsoft Windows to offer fully-integrated support for speech recognition.

Windows Speech Recognition in Windows Vista empowers users to interact with their computers by voice. It was designed for people who want to significantly limit their use of the mouse and keyboard while maintaining or increasing their overall productivity. You can dictate documents and emails in mainstream applications, use voice commands to start and switch between applications, control the operating system, and even fill out forms on the Web.

Windows Speech Recognition is a new feature in Windows Vista, built using the latest Microsoft speech technologies. Windows Vista Speech Recognition provides excellent recognition accuracy that improves with each use as it adapts to your speaking style and vocabulary. Speech Recognition is available in English (U.S.), English (U.K.), German (Germany), French (France), Spanish (Spain), Japanese, Chinese (Traditional), and Chinese (Simplified).

Early reviews say it rivals Dragon NaturallySpeaking 9 for accuracy. If you buy a new computer, you'll get Vista by default, so you can try out its voice-recognition features before buying other software. You can also upgrade an older computer to Vista, but the system requirements are demanding. Reviewers say Dragon NaturallySpeaking has a slight edge, but cite no compelling reason to buy it if you have or plan to buy Vista.

(Source: http://www.microsoft.com/speech/speech2007/default.mspx)



8.2.4 MacSpeech Dictate

MacSpeech is a company that develops speech recognition software for Apple Macintosh computers. In 2008, its previous flagship product, iListen, was replaced by Dictate, which is now built around Nuance's licensed Dragon NaturallySpeaking engine. MacSpeech was established in 1996 by current CEO Andrew Taylor. MacSpeech is currently the only company that develops voice dictation systems for the Macintosh. Its full product line is devoted to speech recognition and dictation.

Reviews say Dictate, introduced in early 2008, is based on the Dragon NaturallySpeaking engine. In tests, it is as accurate as Dragon NaturallySpeaking, and much better than the previous MacSpeech program, iListen. Dictate comes with a microphone headset. No products directly compete with Dictate.

(Source: http://www.macspeech.com/dictate/)

8.2.5 Philips SpeechMagic

SpeechMagic is an industrial-grade platform for capturing information in a digital format. It has been developed by Philips Speech Recognition Systems of Vienna, Austria. SpeechMagic features large-vocabulary speech recognition as well as a number of services aimed at supporting “accurate, convenient and efficient” information capturing in healthcare IT applications. The technology is mainly used in the healthcare sector; however, applications are also available for the legal market as well as for tax consultants.

SpeechMagic supports 25 recognition languages and provides more than 150 ConTexts (industry-specific vocabularies). More than 8,000 healthcare sites in 45 nations use SpeechMagic to capture information and create professional documents. The world’s largest site that is powered by SpeechMagic is in the United States with more than 60,000 authors, more than 3,000 editors and a throughput of 400 million lines per year.

Growth consulting company Frost & Sullivan recognized SpeechMagic in 2005 with the Market Leadership Award in European Healthcare. In 2007, Frost & Sullivan presented Philips Speech Recognition Systems with the Global Excellence Award in Speech Recognition.



(Source: http://www.myspeech.com/)

8.2.6 Other Commercial Software

Many other commercial software packages are used for speech recognition. Some of them are:

HTK (http://htk.eng.cam.ac.uk/)

CSLU Toolkit (http://cslu.cse.ogi.edu/toolkit/)

Simmortel Voice (http://www.simmortel.com)

Quack.com by AOL (http://www.quack.com)

SpeechWorks (http://www.speechworks.com)

Babel Technologies (http://www.babeltech.com)

Vocalis Speechware (http://www.vocalisspeechware.com)

Entropic (http://htk.eng.cam.ac.uk)

9. CONCLUSION



Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses. VoiceXML ties speech recognition and telephony together and provides the technology with which businesses can develop and deploy voice-enabled Web solutions today. These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access, and, at the same time, leverage a business’ existing Web investments. Speech recognition and VoiceXML clearly represent the next wave of the Web. In the near future, people will operate their home and business computers by speech rather than by keyboard or mouse, and home automation will be based largely on speech recognition systems.
