Text-to-Speech Technology-Based Programming Tool


ABSTRACT

This paper presents an audio programming tool based on text-to-speech

technology for blind and vision impaired people to learn programming. The tool

can help users edit a program and then compile, debug and run it. All of these stages

are voice enabled. The programming language for evaluation is C# and the tool

is developed in Visual Studio .NET. Evaluations have shown that the

programming tool can help blind and vision impaired people implement software

applications and achieve equality of access and opportunity in information

technology education.


Introduction

Blindness is the condition of lacking visual perception due

to physiological or neurological factors.

Various scales have been developed to describe the extent of vision loss and

define blindness.[1] Total blindness is the complete lack of form and visual light

perception and is clinically recorded as NLP, an abbreviation for "no light

perception."[1] Blindness is frequently used to describe severe visual

impairment with residual vision. Those described as having only light perception

have no more sight than the ability to tell light from dark and the general direction

of a light source.

In order to determine which people may need special assistance because of their

visual disabilities, various governmental jurisdictions have formulated more

complex definitions referred to as legal blindness.[2] In North America and most

of Europe, legal blindness is defined as visual acuity (vision) of 20/200 (6/60) or

less in the better eye with best correction possible. This means that a legally

blind individual would have to stand 20 feet (6.1 m) from an object to see it—

with corrective lenses—with the same degree of clarity as a normally sighted

person could from 200 feet (61 m). In many areas, people with average acuity

who nonetheless have a visual field of less than 20 degrees (the norm being 180

degrees) are also classified as being legally blind. Approximately ten percent of

those deemed legally blind, by any measure, have no vision.


The rest have some vision, from light perception alone to relatively good

acuity. Low vision is sometimes used to describe visual acuities from 20/70 to

20/200.[3]

By the 10th Revision of the WHO International Statistical Classification of

Diseases, Injuries and Causes of Death, low vision is defined as visual acuity of

less than 20/60 (6/18), but equal to or better than 20/200 (6/60), or corresponding

visual field loss to less than 20 degrees, in the better eye with best possible

correction. Blindness is defined as visual acuity of less than 20/400 (6/120), or

corresponding visual field loss to less than 10 degrees, in the better eye with best

possible correction.[4][5]

Blind people with undamaged eyes may still register light non-visually for the

purpose of circadian entrainment to the 24-hour light/dark cycle. Light signals for

this purpose travel through the retinohypothalamic tract, so a damaged optic

nerve beyond where the retinohypothalamic tract exits it is no hindrance.

Causes

Serious visual impairment has a variety of causes:


Diseases

According to WHO estimates, the most common causes of blindness around the

world in 2002 were:

1. cataracts (47.9%),
2. glaucoma (12.3%),
3. age-related macular degeneration (8.7%),
4. corneal opacity (5.1%),
5. diabetic retinopathy (4.8%),
6. childhood blindness (3.9%),
7. trachoma (3.6%), and
8. onchocerciasis (0.8%).[13]

In terms of the worldwide prevalence of blindness, the vastly greater number of

people in the developing world and the greater likelihood of their being affected

mean that the causes of blindness in those areas are numerically more

important. Cataract is responsible for more than 22 million cases of blindness

and glaucoma 6 million, while leprosy and onchocerciasis each blind

approximately 1 million individuals worldwide. The number of individuals blind

from trachoma has dropped dramatically in the past 10 years from 6 million to 1.3

million, putting it in seventh place on the list of causes of blindness worldwide.

Xerophthalmia is estimated to affect 5 million children each year; 500,000

develop active corneal involvement, and half of these go blind. Central corneal

ulceration is also a significant cause of monocular blindness worldwide,

accounting for an estimated 850,000 cases of corneal blindness every year in the

Indian subcontinent alone. As a result, corneal scarring from all causes now is

the fourth greatest cause of global blindness (Vaughan & Asbury's General Ophthalmology, 17e).

People in developing countries are significantly more likely to experience visual

impairment as a consequence of treatable or preventable conditions than are

their counterparts in the developed world. While vision impairment is most

common in people over age 60 across all regions, children in poorer communities

are more likely to be affected by blinding diseases than are their more affluent

peers.

The link between poverty and treatable visual impairment is most obvious when

conducting regional comparisons of cause. Most adult visual impairment in North

America and Western Europe is related to age-related macular degeneration and

diabetic retinopathy. While both of these conditions are subject to treatment,

neither can be cured.

In developing countries, where people have shorter life expectancies, cataracts

and water-borne parasites—both of which can be treated effectively—are most

often the culprits (see river blindness, for example). Of the estimated 40 million

blind people located around the world, 70–80% can have some or all of their

sight restored through treatment.

In developed countries where parasitic diseases are less common and cataract

surgery is more available, age-related macular degeneration, glaucoma, and

diabetic retinopathy are usually the leading causes of blindness.[14]

Childhood blindness can be caused by conditions related to pregnancy, such

as congenital rubella syndrome and retinopathy of prematurity.

Abnormalities and injuries

Eye injuries, most often occurring in people under 30, are the leading cause of

monocular blindness (vision loss in one eye) throughout the United States.

Injuries and cataracts affect the eye itself, while abnormalities such as optic

nerve hypoplasia affect the nerve bundle that sends signals from the eye to the

back of the brain, which can lead to decreased visual acuity.

People with injuries to the occipital lobe of the brain can, despite having

undamaged eyes and optic nerves, still be legally or totally blind.

Genetic defects

People with albinism often have vision loss to the extent that many are legally

blind, though few of them have no sight at all. Leber's congenital amaurosis can

cause total blindness or severe sight loss from birth or early childhood.

Recent advances in mapping of the human genome have identified other genetic

causes of low vision or blindness. One such example is Bardet-Biedl syndrome.

Poisoning

Rarely, blindness is caused by the intake of certain chemicals. A well-known

example is methanol, which is only mildly toxic and minimally intoxicating, but

when not competing with ethanol for metabolism, methanol breaks down into the

substances formaldehyde and formic acid which in turn can cause blindness, an

array of other health complications, and death.[15] Methanol is commonly found
in methylated spirits, ethyl alcohol that has been denatured so that it avoids the taxes on
ethanol intended for human consumption. Methylated spirits are sometimes used

by alcoholics as a desperate and cheap substitute for regular ethanol alcoholic

beverages.

Willful actions

Blinding has been used as an act of vengeance and torture in some instances, to

deprive a person of a major sense by which they can navigate or interact within

the world, act fully independently, and be aware of events surrounding them. An

example from the classical realm is Oedipus, who gouges out his own eyes after

realizing that he fulfilled the awful prophecy spoken of him.

In 2003, a Pakistani anti-terrorism court sentenced a man to be blinded after he

carried out an acid attack against his fiancee that resulted in her blinding.[16] The

same sentence was given in 2009 for the man who blinded Ameneh Bahrami.

Comorbidities

Blindness can occur in combination with such conditions as mental

retardation, autism, cerebral palsy, hearing impairments, and epilepsy.[17][18] In a

study of 228 visually impaired children in metropolitan Atlanta between 1991 and

1993, 154 (68%) had an additional disability besides visual impairment.[17] Blindness in combination with hearing loss is known as deafblindness.

Management

A 2008 study published in the New England Journal of Medicine[19] tested the

effect of using gene therapy to help restore the sight of patients with a rare form

of inherited blindness, known as Leber Congenital Amaurosis or LCA. Leber

Congenital Amaurosis damages the light receptors in the retina and usually

begins affecting sight in early childhood, with worsening vision until complete

blindness around the age of 30.

The study used a common cold virus to deliver a normal version of the gene

called RPE65 directly into the eyes of affected patients. Remarkably, all three patients,
aged 19, 22 and 25, responded well to the treatment and reported improved

vision following the procedure. Due to the age of the patients and the

degenerative nature of LCA the improvement of vision in gene therapy patients is

encouraging for researchers. It is hoped that gene therapy may be even more

effective in younger LCA patients who have experienced limited vision loss as

well as in other blind or partially blind individuals.

Two experimental treatments for retinal problems include a cybernetic

replacement and transplant of fetal retinal cells.[20]


Adaptive techniques and aids

Mobility

Folded long cane.

Many people with serious visual impairments can travel independently, using a

wide range of tools and techniques. Orientation and mobility specialists are

professionals who are specifically trained to teach people with visual impairments

how to travel safely, confidently, and independently in the home and the

community. These professionals can also help blind people to practice travelling

on specific routes which they may use often, such as the route from one's house

to a convenience store. Becoming familiar with an environment or route can

make it much easier for a blind person to navigate successfully.

Tools such as the white cane with a red tip - the international symbol of blindness

- may also be used to improve mobility. A long cane is used to extend the user's

range of touch sensation. It is usually swung in a low sweeping motion, across

the intended path of travel, to detect obstacles.

However, techniques for cane travel can vary depending on the user and/or the

situation. Some visually impaired persons do not carry these kinds of canes,

opting instead for the shorter, lighter identification (ID) cane. Still others require a

support cane. The choice depends on the individual's vision, motivation, and

other factors.

A small number of people employ guide dogs to assist in mobility. These dogs

are trained to navigate around various obstacles, and to indicate when it

becomes necessary to go up or down a step. However, the helpfulness of guide

dogs is limited by the inability of dogs to understand complex directions. The

human half of the guide dog team does the directing, based upon skills acquired

through previous mobility training. In this sense, the handler might be likened to

an aircraft's navigator, who must know how to get from one place to another, and

the dog to the pilot, who gets them there safely.

In addition, some blind people use software using GPS technology as a mobility

aid. Such software can assist blind people with orientation and navigation, but it

is not a replacement for traditional mobility tools such as white canes and guide

dogs.

Government actions are sometimes taken to make public places more accessible

to blind people. Public transportation is freely available to the blind in many

cities. Tactile paving and audible traffic signals can make it easier and safer for

visually impaired pedestrians to cross streets. In addition to making rules about

who can and cannot use a cane, some governments mandate the right-of-way be

given to users of white canes or guide dogs.

Reading and magnification

Watch for the blind

Most visually impaired people who are not totally blind read print, either of a

regular size or enlarged by magnification devices. Many also read large-print,

which is easier for them to read without such devices. A variety of magnifying

glasses, some handheld, and some on desktops, can make reading easier for

them.

Others read Braille (or the infrequently used Moon type), or rely on talking

books and readers or reading machines, which convert printed text to speech

or Braille. They use computers with special hardware such

as scanners and refreshable Braille displays as well as software written

specifically for the blind, such as optical character recognition applications

and screen readers.

Some people access these materials through agencies for the blind, such as

the National Library Service for the Blind and Physically Handicapped in the

United States, the National Library for the Blind or the RNIB in the United

Kingdom.

Closed-circuit televisions, equipment that enlarges and contrasts textual items,

are a more high-tech alternative to traditional magnification devices.

There are also over 100 radio reading services throughout the world that provide

people with vision impairments with readings from periodicals over the radio. The

International Association of Audio Information Services provides links to all of

these organizations.

Computers

Access technology such as screen readers, screen magnifiers and refreshable

Braille displays enable the blind to use mainstream computer applications

and mobile phones. The availability of assistive technology is increasing,

accompanied by concerted efforts to ensure the accessibility of information

technology to all potential users, including the blind. Later versions of Microsoft

Windows include an Accessibility Wizard & Magnifier for those with partial vision,

and Microsoft Narrator, a simple screen reader. Linux distributions (as live CDs)

for the blind include Oralux and Adriane Knoppix, the latter developed in part

by Adriane Knopper, who has a visual impairment. Mac OS also comes with a

built-in screen reader, called VoiceOver.

The movement towards greater web accessibility is opening a far wider number

of websites to adaptive technology, making the web a more inviting place for

visually impaired surfers.

Experimental approaches in sensory substitution are beginning to provide access

to arbitrary live views from a camera.

Other aids and techniques


A tactile feature on a Canadian banknote.

Blind people may use talking equipment such as thermometers, watches,

clocks, scales, calculators, and compasses. They may also enlarge or mark dials

on devices such as ovens and thermostats to make them usable. Other

techniques used by blind people to assist them in daily activities include:

Adaptations of coins and banknotes so that the value can be determined by touch.

For example:

In some currencies, such as the euro, the pound sterling and the Indian rupee, the

size of a note increases with its value.

On US coins, pennies and dimes are similar in size, as are nickels and quarters. The

larger denominations (dimes and quarters) have ridges along the sides

(historically used to prevent the "shaving" of precious metals from the coins),

which can now be used for identification.


Epidemiology

The WHO estimates that in 2002 there were 161 million visually impaired people

in the world (about 2.6% of the total population). Of this number 124 million

(about 2%) had low vision and 37 million (about 0.6%) were blind.[22] In order of

frequency the leading causes were cataract, uncorrected refractive errors
(near-sightedness, far-sightedness, or astigmatism), glaucoma, and age-related macular

degeneration.[23] In 1987, it was estimated that 598,000 people in the United

States met the legal definition of blindness.[24] Of this number, 58% were over the

age of 65.[24] In 1994-1995, 1.3 million Americans reported legal blindness.[25]


Speech synthesis

Speech synthesis is the artificial production of human speech. A computer

system used for this purpose is called a speech synthesizer, and can be

implemented in software or hardware. A text-to-speech (TTS) system converts

normal language text into speech; other systems render symbolic linguistic

representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech

that are stored in a database. Systems differ in the size of the stored speech

units; a system that stores phones or diphones provides the largest output

range, but may lack clarity.

For specific usage domains, the storage of entire words or sentences allows for

high-quality output. Alternatively, a synthesizer can incorporate a model of

the vocal tract and other human voice characteristics to create a completely

"synthetic" voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice

and by its ability to be understood. An intelligible text-to-speech program allows

people with visual impairments or reading disabilities to listen to written works

on a home computer. Many computer operating systems have included speech

synthesizers since the early 1980s.
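As a concrete illustration, the short C# sketch below speaks a sentence through whatever synthesizer the operating system provides. It assumes the Windows-only System.Speech.Synthesis library is available; it illustrates the general idea rather than the tool described in this paper.

```csharp
using System;
using System.Speech.Synthesis;   // .NET wrapper around the Windows (SAPI) TTS engines

class HelloTts
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();  // speak through the sound card
            synth.Rate = 0;                         // -10 (slow) .. 10 (fast)
            synth.Speak("Many operating systems have included speech synthesizers since the early 1980s.");
        }
    }
}
```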

Overview of text processing

Overview of a typical TTS system

A text-to-speech system (or "engine") is composed of two parts[3]: a front-end and

a back-end. The front-end has two major tasks. First, it converts raw text

containing symbols like numbers and abbreviations into the equivalent of written-

out words. This process is often called text normalization, pre-processing,

or tokenization. The front-end then assigns phonetic transcriptions to each word,

and divides and marks the text into prosodic units, like phrases, clauses,

and sentences. The process of assigning phonetic transcriptions to words is

called text-to-phoneme or grapheme-to-phoneme conversion.

Phonetic transcriptions and prosody information together make up the symbolic

linguistic representation that is output by the front-end. The back-end—often

referred to as the synthesizer—then converts the symbolic linguistic

representation into sound. In certain systems, this part includes the computation

of the target prosody (pitch contour, phoneme durations[4]), which is then imposed

on the output speech.
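A toy sketch of the front-end stages just described (normalization, tokenization and grapheme-to-phoneme lookup) might look like the following C# fragment; the abbreviation table and pronunciation lexicon are tiny invented examples, not a real engine's data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy front-end: normalize raw text (expand a couple of abbreviations),
// tokenize it, and look each token up in a tiny illustrative pronunciation lexicon.
class ToyFrontEnd
{
    static readonly Dictionary<string, string> Abbreviations = new Dictionary<string, string>
    {
        { "dr.", "doctor" }, { "no.", "number" }
    };

    static readonly Dictionary<string, string[]> Lexicon = new Dictionary<string, string[]>
    {
        { "doctor", new[] { "D", "AA", "K", "T", "ER" } },
        { "smith",  new[] { "S", "M", "IH", "TH" } }
    };

    static IEnumerable<string> Normalize(string raw)
    {
        return raw.ToLowerInvariant()
                  .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                  .Select(t => Abbreviations.TryGetValue(t, out var full) ? full : t.Trim('.', ','));
    }

    static void Main()
    {
        foreach (var word in Normalize("Dr. Smith"))
        {
            // Dictionary hit -> phoneme string; miss -> marked as out-of-vocabulary.
            var phones = Lexicon.TryGetValue(word, out var p) ? string.Join(" ", p) : "<OOV>";
            Console.WriteLine(word + " -> " + phones);
        }
    }
}
```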

History

Long before electronic signal processing was invented, there were those who

tried to build machines to create human speech. Some early legends of the

existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus

Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian

Academy of Sciences, built models of the human vocal tract that could produce

the five long vowel sounds (in International Phonetic Alphabet notation, they

are [aː], [eː], [iː], [oː] and [uː]).[5] This was followed by the bellows-operated

"acoustic-mechanical speech machine" by Wolfgang von

Kempelen of Vienna, Austria, described in a 1791 paper.[6] This machine added

models of the tongue and lips, enabling it to produce consonants as well as

vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on

von Kempelen's design, and in 1857, M. Faber built the "Euphonia".

Wheatstone's design was resurrected in 1923 by Paget.[7]

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated

electronic speech analyzer and synthesizer that was said to be clearly

intelligible. Homer Dudley refined this device into the VODER, which he exhibited

at the 1939 New York World's Fair.

The Pattern Playback was built by Dr. Franklin S. Cooper and his colleagues

at Haskins Laboratories in the late 1940s and completed in 1950. There were

several different versions of this hardware device but only one currently survives.

The machine converts pictures of the acoustic patterns of speech in the form of a

spectrogram back into sound. Using this device, Alvin Liberman and colleagues

were able to discover acoustic cues for the perception of phonetic segments

(consonants and vowels).

Dominant systems in the 1980s and 1990s were the MITalk system, based

largely on the work of Dennis Klatt at MIT, and the Bell Labs system;[8] the latter

was one of the first multilingual language-independent systems, making

extensive use of Natural Language Processing methods.

Early electronic speech synthesizers sounded robotic and were often barely

intelligible. The quality of synthesized speech has steadily improved, but output

from contemporary speech synthesis systems is still clearly distinguishable from

actual human speech.

As improving cost-performance ratios make speech synthesizers cheaper
and more accessible, more people will benefit from the use of text-
to-speech programs.[9]

Electronic devices

The first computer-based speech synthesis systems were created in the late

1950s, and the first complete text-to-speech system was completed in 1968. In

1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[10] used

an IBM 704 computer to synthesize speech, an event among the most prominent

in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated

the song "Daisy Bell", with musical accompaniment from Max Mathews.

Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce

at the Bell Labs Murray Hill facility. Clarke was so impressed by the

demonstration that he used it in the climactic scene of his screenplay for his

novel 2001: A Space Odyssey,[11] where the HAL 9000 computer sings the same

song as it is being put to sleep by astronaut Dave Bowman.[12] Despite the

success of purely electronic speech synthesis, research is still being conducted

into mechanical speech synthesizers.[13]

Handheld electronics featuring speech synthesis began emerging in the 1970s.

One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable

calculator for the blind in 1976.[14][15] Other devices were produced primarily for

educational purposes, such as Speak & Spell, produced by Texas

Instruments[16] in 1978. The first multi-player game using voice synthesis

was Milton from Milton Bradley Company, which produced the device in 1980.

Synthesizer technologies

The most important qualities of a speech synthesis system

are naturalness and intelligibility. Naturalness describes how closely the output

sounds like human speech, while intelligibility is the ease with which the output is

understood. The ideal speech synthesizer is both natural and intelligible. Speech

synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms

are concatenative synthesis and formant synthesis. Each technology has

strengths and weaknesses, and the intended uses of a synthesis system will

typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of

segments of recorded speech. Generally, concatenative synthesis produces the

most natural-sounding synthesized speech. However, differences between

natural variations in speech and the nature of the automated techniques for

segmenting the waveforms sometimes result in audible glitches in the output.

There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During

database creation, each recorded utterance is segmented into some or all of the

following: individual phones, diphones, half-

phones, syllables, morphemes, words, phrases, and sentences. Typically, the

division into segments is done using a specially modified speech recognizer set

to a "forced alignment" mode with some manual correction afterward, using

visual representations such as the waveform and spectrogram.[17] An index of the

units in the speech database is then created based on the segmentation and

acoustic parameters like the fundamental frequency (pitch), duration, position in

the syllable, and neighboring phones. At runtime, the desired target utterance is

created by determining the best chain of candidate units from the database (unit

selection). This process is typically achieved using a specially weighted decision

tree.

Unit selection provides the greatest naturalness, because it applies only a small

amount of digital signal processing (DSP) to the recorded speech. DSP often

makes recorded speech sound less natural, although some systems use a small

amount of signal processing at the point of concatenation to smooth the

waveform. The output from the best unit-selection systems is often

indistinguishable from real human voices, especially in contexts for which the

TTS system has been tuned. However, maximum naturalness typically requires

unit-selection speech databases to be very large, in some systems ranging into

the gigabytes of recorded data, representing dozens of hours of speech.[18] Also,

unit selection algorithms have been known to select segments from a place that

results in less than ideal synthesis (e.g. minor words become unclear) even when

a better choice exists in the database.[19]
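The following C# sketch illustrates the idea of unit selection under simplifying assumptions: hypothetical target and join cost functions over pitch only, and a tiny exhaustive search instead of the Viterbi-style search used over real, gigabyte-scale databases.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: pick one recorded candidate per target position so that the
// sum of target costs (match to the wanted pitch) and join costs (smoothness of
// the concatenation point) is minimized.
class UnitSelectionSketch
{
    class Unit { public string Phone; public double Pitch; }

    static double TargetCost(Unit u, double wantedPitch) { return Math.Abs(u.Pitch - wantedPitch); }
    static double JoinCost(Unit a, Unit b) { return Math.Abs(a.Pitch - b.Pitch); }

    static List<Unit> Select(List<List<Unit>> candidates, double[] wantedPitch)
    {
        List<Unit> best = null;
        double bestCost = double.MaxValue;

        void Search(int i, List<Unit> chain, double cost)
        {
            if (cost >= bestCost) return;                       // prune hopeless chains
            if (i == candidates.Count) { best = new List<Unit>(chain); bestCost = cost; return; }
            foreach (var u in candidates[i])
            {
                double c = cost + TargetCost(u, wantedPitch[i])
                         + (chain.Count > 0 ? JoinCost(chain[chain.Count - 1], u) : 0.0);
                chain.Add(u);
                Search(i + 1, chain, c);
                chain.RemoveAt(chain.Count - 1);
            }
        }

        Search(0, new List<Unit>(), 0.0);
        return best;
    }

    static void Main()
    {
        var candidates = new List<List<Unit>>
        {
            new List<Unit> { new Unit { Phone = "HH", Pitch = 110 }, new Unit { Phone = "HH", Pitch = 140 } },
            new List<Unit> { new Unit { Phone = "AY", Pitch = 115 }, new Unit { Phone = "AY", Pitch = 180 } }
        };

        foreach (var u in Select(candidates, new[] { 120.0, 120.0 }))
            Console.WriteLine(u.Phone + " pitch=" + u.Pitch);
    }
}
```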

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all

the diphones (sound-to-sound transitions) occurring in a language. The number

of diphones depends on the phonotactics of the language: for example, Spanish

has about 800 diphones, and German about 2500. In diphone synthesis, only

one example of each diphone is contained in the speech database. At runtime,

the target prosody of a sentence is superimposed on these minimal units by

means of digital signal processing techniques such as linear predictive

coding, PSOLA[20] or MBROLA.[21] The quality of the resulting speech is generally

worse than that of unit-selection systems, but more natural-sounding than the

output of formant synthesizers. Diphone synthesis suffers from the sonic glitches

of concatenative synthesis and the robotic-sounding nature of formant synthesis,

and has few of the advantages of either approach other than small size. As such,

its use in commercial applications is declining, although it continues to be used in

research because there are a number of freely available software

implementations.

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to

create complete utterances. It is used in applications where the variety of texts

the system will output is limited to a particular domain, like transit schedule

announcements or weather reports.[22] The technology is very simple to

implement, and has been in commercial use for a long time, in devices like

talking clocks and calculators. The level of naturalness of these systems can be

very high because the variety of sentence types is limited, and they closely

match the prosody and intonation of the original recordings.[citation needed]

Because these systems are limited by the words and phrases in their databases,

they are not general-purpose and can only synthesize the combinations of words

and phrases with which they have been preprogrammed. The blending of words

within naturally spoken language however can still cause problems unless the

many variations are taken into account. For example, in non-rhotic dialects of

English the "r" in words like "clear" /ˈkliːə/ is usually only pronounced when the

following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌkliːəɹ

ˈɑʊt/). Likewise in French, many final consonants are no longer silent if

followed by a word that begins with a vowel, an effect called liaison.

This alternation cannot be reproduced by a simple word-concatenation system,

which would require additional complexity to be context-sensitive.
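A domain-specific synthesizer can be little more than a lookup from a phrase template to prerecorded clips. The C# sketch below plays hypothetical WAV files for a talking clock; the file names are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Media;   // SoundPlayer: simple synchronous WAV playback on Windows

// Domain-specific synthesis in miniature: a talking clock built by concatenating
// prerecorded word clips. The WAV file names here are hypothetical placeholders.
class TalkingClock
{
    static void SpeakTime(int hour, int minute)
    {
        var clips = new List<string> { "its.wav", hour + ".wav" };
        clips.Add(minute == 0 ? "oclock.wav" : minute + ".wav");

        foreach (var clip in clips)
            new SoundPlayer(clip).PlaySync();   // play each recording back to back
    }

    static void Main()
    {
        SpeakTime(9, 30);   // "it's ... nine ... thirty"
    }
}
```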

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the

synthesized speech output is created using additive synthesis and an acoustic

model (physical modelling synthesis).[23] Parameters such as fundamental

frequency, voicing, and noise levels are varied over time to create a waveform of

artificial speech. This method is sometimes called rules-based synthesis;

however, many concatenative systems also have rules-based components. Many

systems based on formant synthesis technology generate artificial, robotic-

sounding speech that would never be mistaken for human speech. However,

maximum naturalness is not always the goal of a speech synthesis system, and

formant synthesis systems have advantages over concatenative systems.

Formant-synthesized speech can be reliably intelligible, even at very high

speeds, avoiding the acoustic glitches that commonly plague concatenative

systems. High-speed synthesized speech is used by the visually impaired to

quickly navigate computers using a screen reader. Formant synthesizers are

usually smaller programs than concatenative systems because they do not have

a database of speech samples. They can therefore be used in embedded

systems, where memory and microprocessor power are especially limited.

Because formant-based systems have complete control of all aspects of the

output speech, a wide variety of prosodies and intonations can be output,

conveying not just questions and statements, but a variety of emotions and tones

of voice.

Examples of non-real-time but highly accurate intonation control in formant

synthesis include the work done in the late 1970s for the Texas

Instruments toy Speak & Spell, and in the early 1980s Sega arcade machines[24] and in many Atari, Inc. arcade games[25] using the TMS5220 LPC chips.

Creating proper intonation for these projects was painstaking, and the results

have yet to be matched by real-time text-to-speech interfaces.[26]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech

based on models of the human vocal tract and the articulation processes

occurring there. The first articulatory synthesizer regularly used for laboratory

experiments was developed at Haskins Laboratories in the mid-1970s by Philip

Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was

based on vocal tract models developed at Bell Laboratories in the 1960s and

1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into

commercial speech synthesis systems. A notable exception is the NeXT-based

system originally developed and marketed by Trillium Sound Research, a spin-off

company of the University of Calgary, where much of the original research was

conducted. Following the demise of the various incarnations of NeXT (started

by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the

Trillium software was published under the GNU General Public License, with

work continuing as gnuspeech.

The system, first marketed in 1994, provides full articulatory-based text-to-

speech conversion using a waveguide or transmission-line analog of the human

oral and nasal tracts controlled by Carré's "distinctive region model".

HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models,

also called Statistical Parametric Synthesis. In this system, the frequency

spectrum (vocal tract), fundamental frequency (vocal source), and duration

(prosody) of speech are modeled simultaneously by HMMs.

Speech waveforms are generated from HMMs themselves based on

the maximum likelihood criterion.[27]

Sine wave synthesis

Sine wave synthesis is a technique for synthesizing speech by replacing

the formants (main bands of energy) with pure tone whistles.[28]

Challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full

of heteronyms, numbers, and abbreviations that all require expansion into a

phonetic representation.

There are many spellings in English which are pronounced differently based on

context. For example, "My latest project is to learn how to better project my

voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate semantic representations of

their input texts, as processes for doing so are not reliable, well understood, or

computationally effective. As a result, various heuristic techniques are used to

guess the proper way to disambiguate homographs, like examining neighboring

words and using statistics about frequency of occurrence.

Recently TTS systems have begun to use HMMs (discussed above) to generate

"parts of speech" to aid in disambiguating homographs. This technique is quite

successful for many cases such as whether "read" should be pronounced as

"red" implying past tense, or as "reed" implying present tense. Typical error rates

when using HMMs in this fashion are usually below five percent. These

techniques also work well for most European languages, although access to

required training corpora is frequently difficult in these languages.

Deciding how to convert numbers is another problem that TTS systems have to

address. It is a simple programming challenge to convert a number into words (at

least in English), like "1325" becoming "one thousand three hundred twenty-five."

However, numbers occur in many different contexts; "1325" may also be read as

"one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five".

A TTS system can often infer how to expand a number based on surrounding

words, numbers, and punctuation, and sometimes the system provides a way to

specify the context if it is ambiguous.[29] Roman numerals can also be read

differently depending on context. For example "Henry VIII" reads as "Henry the

Eighth", while "Chapter VIII" reads as "Chapter Eight".

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for

"inches" must be differentiated from the word "in", and the address "12 St John

St." uses the same abbreviation for both "Saint" and "Street". TTS systems with

intelligent front ends can make educated guesses about ambiguous

abbreviations, while others provide the same result in all cases, resulting in

nonsensical (and sometimes comical) outputs.
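The following C# fragment sketches the kind of heuristics described above for numbers and the ambiguous abbreviation "St."; the rules are deliberately simplistic and only hint at what a real front end does.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Deliberately simplistic heuristics: they only decide HOW a token should be
// read, leaving the routine number-to-words lookup aside.
class NormalizationHints
{
    // In "12 St John St", the first "St" precedes a capitalised name ("Saint John"),
    // while the second ends the phrase and is read as "Street".
    static string ExpandSt(string[] tokens, int i)
    {
        bool beforeName = i + 1 < tokens.Length && char.IsUpper(tokens[i + 1].FirstOrDefault());
        return beforeName ? "Saint" : "Street";
    }

    // "1325" after "in" is probably a year ("thirteen twenty-five");
    // after a currency sign it is a quantity; otherwise read digit by digit.
    static string NumberStyle(string token, string previousToken)
    {
        if (previousToken == "$") return "quantity";
        if (Regex.IsMatch(token, @"^(1[1-9]|20)\d\d$")) return "year-style pairs";
        return "digit by digit";
    }

    static void Main()
    {
        var tokens = "12 St John St".Split(' ');
        Console.WriteLine(ExpandSt(tokens, 1));            // Saint
        Console.WriteLine(ExpandSt(tokens, 3));            // Street
        Console.WriteLine(NumberStyle("1325", "in"));      // year-style pairs
        Console.WriteLine(NumberStyle("1325", "$"));       // quantity
    }
}
```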

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the

pronunciation of a word based on its spelling, a process which is often called

text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term

used by linguists to describe distinctive sounds in a language). The simplest

approach to text-to-phoneme conversion is the dictionary-based approach, where

a large dictionary containing all the words of a language and their correct

pronunciations is stored by the program. Determining the correct pronunciation of

each word is a matter of looking up each word in the dictionary and replacing the

spelling with the pronunciation specified in the dictionary.

The other approach is rule-based, in which pronunciation rules are applied to

words to determine their pronunciations based on their spellings. This is similar

to the "sounding out", or synthetic phonics, approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach

is quick and accurate, but completely fails if it is given a word which is not in its

dictionary.[citation needed] As dictionary size grows, so too do the memory space

requirements of the synthesis system. On the other hand, the rule-based

approach works on any input, but the complexity of the rules grows substantially

as the system takes into account irregular spellings or pronunciations. (Consider

that the word "of" is very common in English, yet is the only word in which the

letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use

a combination of these approaches.
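A minimal C# sketch of this combined approach, with an invented exception dictionary and naive one-letter-per-phone rules, is shown below; real systems use far richer rule sets and lexica.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hybrid letter-to-sound sketch: look the word up in a small exception
// dictionary first, and fall back to naive spelling rules only if it is missing.
class HybridG2P
{
    static readonly Dictionary<string, string> Exceptions = new Dictionary<string, string>
    {
        { "of", "AH V" },        // the lone English word where "f" is pronounced [v]
        { "one", "W AH N" }
    };

    static readonly Dictionary<char, string> LetterRules = new Dictionary<char, string>
    {
        { 'c', "K" }, { 'a', "AE" }, { 't', "T" }, { 's', "S" }, { 'o', "AA" }, { 'f', "F" }
    };

    static string Pronounce(string word)
    {
        word = word.ToLowerInvariant();
        if (Exceptions.TryGetValue(word, out var phones))
            return phones;                                   // dictionary path: fast and exact
        return string.Join(" ", word.Select(c =>             // rule path: works on any input
            LetterRules.TryGetValue(c, out var p) ? p : "?"));
    }

    static void Main()
    {
        Console.WriteLine(Pronounce("cats"));  // K AE T S   (rules)
        Console.WriteLine(Pronounce("of"));    // AH V       (exception dictionary)
    }
}
```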

Languages with a phonemic orthography have a very regular writing system, and

the prediction of the pronunciation of words based on their spellings is quite

successful. Speech synthesis systems for such languages often use the rule-

based method extensively, resorting to dictionaries only for those few words, like

foreign names and borrowings, whose pronunciations are not obvious from their

spellings. On the other hand, speech synthesis systems for languages

like English, which have extremely irregular spelling systems, are more likely to

rely on dictionaries, and to use rule-based methods only for unusual words, or

words that aren't in their dictionaries.

Evaluation challenges

The consistent evaluation of speech synthesis systems may be difficult because

of a lack of universally agreed objective evaluation criteria. Different

organizations often use different speech data. The quality of speech synthesis

systems also depends to a large degree on the quality of the production

technique (which may involve analogue or digital recording) and on the facilities

used to replay the speech. Evaluating speech synthesis systems has therefore

often been compromised by differences between production techniques and

replay facilities.

Recently, however, some researchers have started to evaluate speech synthesis

systems using a common speech dataset.[30]

Prosodics and emotional content

A study reported in the journal "Speech Communication" by Amy
Drahota and colleagues at the University of Portsmouth, UK, found that

listeners to voice recordings could determine, at better than chance levels,

whether or not the speaker was smiling.[31] It was suggested that identification of

the vocal features which signal emotional content may be used to help make

synthesized speech sound more natural.

Dedicated hardware

Votrax SC-01A (analog formant) and SC-02 / SSI-263 / "Arctic 263"
General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
Magnevation SpeakJet (www.speechchips.com TTS256)
Savage Innovations SoundGin
National Semiconductor DT1050 Digitalker (Mozer)
Silicon Systems SSI 263 (analog formant)
Texas Instruments LPC speech chips: TMS5110A, TMS5200
Oki Semiconductor ML22825 (ADPCM) and ML22573 (HQADPCM)
Toshiba T6721A
Philips PCF8200
TextSpeak Embedded TTS Modules

Computer operating systems or outlets with speech synthesis

Atari

Arguably, the first speech system integrated into an operating system was the

1400XL/1450XL personal computers designed by Atari, Inc. using the Votrax

SC01 chip in 1983. The 1400XL/1450XL computers used a Finite State Machine

to enable World English Spelling text-to-speech synthesis.[32] Unfortunately, the

1400XL/1450XL personal computers never shipped in quantity.

The Atari ST computers were sold with "stspeech.tos" on floppy disk.

Apple

The first speech system integrated into an operating system that shipped in

quantity was Apple Computer's MacInTalk in 1984. Since the 1980s Macintosh

Computers offered text to speech capabilities through The MacinTalk software. In

the early 1990s Apple expanded its capabilities offering system wide text-to-

speech support. With the introduction of faster PowerPC-based computers they

included higher quality voice sampling. Apple also introduced speech

recognition into its systems which provided a fluid command set. More recently,

Apple has added sample-based voices. Starting as a curiosity, the speech

system of Apple Macintosh has evolved into a fully-supported

program, PlainTalk, for people with vision problems. VoiceOver was for the first

time featured in Mac OS X Tiger (10.4).

During 10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one

standard voice shipping with Mac OS X. Starting with 10.6 (Snow Leopard), the

user can choose from a wide range of voices. VoiceOver voices
take realistic-sounding breaths between sentences, as well as offering
improved clarity at high read rates over PlainTalk. Mac OS X also includes say,

a command-line based application that converts text to audible speech.

The AppleScript Standard Additions includes a say verb that allows a script to

use any of the installed voices and to control the pitch, speaking rate and

modulation of the spoken text.

AmigaOS

The second operating system with advanced speech synthesis capabilities

was AmigaOS, introduced in 1985. The voice synthesis was licensed

by Commodore International from a third-party software house (Don't Ask

Software, now Softvoice, Inc.) and it featured a complete system of voice

emulation, with both male and female voices and "stress" indicator markers,

made possible by advanced features of the Amiga hardware audio chipset.[33] It

was divided into a narrator device and a translator library. Amiga Speak

Handler featured a text-to-speech translator. AmigaOS considered speech

synthesis a virtual hardware device, so the user could even redirect console

output to it. Some Amiga programs, such as word processors, made extensive

use of the speech system.

Microsoft Windows

Modern Windows systems use SAPI4- and SAPI5-based speech systems that

include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-

based operating systems as a third-party add-on for systems like Windows

95 and Windows 98. Windows 2000 added a speech synthesis program

called Narrator, directly available to users. All Windows-compatible programs

could make use of speech synthesis features, available through menus once

installed on the system. Microsoft Speech Server is a complete package for voice

synthesis and recognition, for commercial applications such as call centers.

Text-to-Speech (TTS) capability for a computer refers to the ability to play
back text in a spoken voice; that is, TTS is the ability of the operating system to play
back printed text as spoken words.[34]

An internal driver (installed with the operating system), called a TTS engine,
recognizes the text and speaks it using a synthesized voice chosen from several pre-
generated voices. Additional engines, often specialized for a certain
jargon or vocabulary, are also available through third-party manufacturers.[34]
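Assuming the managed System.Speech.Synthesis wrapper over SAPI is available, a C# program can list the installed engines and speak through one of them, as in the sketch below.

```csharp
using System;
using System.Speech.Synthesis;   // managed wrapper over the SAPI 5 TTS engines

class ListVoices
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            // Engines installed with Windows or by third parties show up here.
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
                Console.WriteLine($"{voice.VoiceInfo.Name} ({voice.VoiceInfo.Culture})");

            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("This sentence is spoken by the default installed voice.");
        }
    }
}
```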

Android

Version 1.6 of Android added support for speech synthesis (TTS).[35]

Internet

The most recent TTS development in the web browser is the JavaScript Text to

Speech work of Yury Delendik, which ports the Flite C engine to pure JavaScript.

This allows web pages to convert text to audio using HTML5 technology. The

ability to use Yury's TTS port currently requires a custom browser build that uses

Mozilla's Audio-Data-API. However, much work is being done in the context of

the W3C to move this technology into the mainstream browser market through

the W3C Audio Incubator Group with the involvement of The BBC and Google

Inc.

Currently, there are a number of applications, plugins and gadgets that can
read messages directly from an e-mail client and web pages from a web
browser or Google Toolbar, such as voice, an add-on to Firefox. Some

specialized software can narrate RSS-feeds. On one hand, online RSS-narrators

simplify information delivery by allowing users to listen to their favorite news

sources and to convert them to podcasts. On the other hand, on-line RSS-

readers are available on almost any PC connected to the Internet. Users can

download generated audio files to portable devices, e.g. with the help
of a podcast receiver, and listen to them while walking, jogging or commuting to

work.

A growing field in internet based TTS is web-based assistive technology, e.g.

'Browsealoud' from a UK company and ReadSpeaker. It can deliver TTS

functionality to anyone (for reasons of accessibility, convenience, entertainment

or information) with access to a web browser. The non-

profit project Pediaphon was created in 2006 to provide a similar web-based TTS

interface to the Wikipedia.[36] Additionally, SPEAK.TO.ME from Oxford Information

Laboratories is capable of delivering text to speech through any browser without

the need to download any special applications, and includes smart delivery

technology to ensure only what is seen is spoken and the content is logically

pathed.

Others

Some models of Texas Instruments home computers produced in 1979

and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-

to-phoneme synthesis or reciting complete words and phrases (text-to-

dictionary), using a very popular Speech Synthesizer peripheral.

TI used a proprietary codec to embed complete spoken phrases into

applications, primarily video games.[37]

IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.

Systems that operate on free and open source software systems

including Linux are various, and include open-source programs such as

the Festival Speech Synthesis System which uses diphone-based

synthesis (and can use a limited number of MBROLA voices),

and gnuspeech which uses articulatory synthesis[38] from the Free

Software Foundation.

Companies which developed speech synthesis systems but which are no

longer in this business include BeST Speech (bought by L&H), Eloquent

Technology (bought by SpeechWorks), Lernout & Hauspie (bought by

Nuance), SpeechWorks (bought by Nuance), and Rhetorical Systems (bought
by Nuance).

Speech synthesis markup languages

A number of markup languages have been established for the rendition of text as

speech in an XML-compliant format. The most recent is Speech Synthesis

Markup Language (SSML), which became a W3C recommendation in 2004.

Older speech synthesis markup languages include Java Speech Markup

Language (JSML) and SABLE. Although each of these was proposed as a

standard, none of them has been widely adopted.

Speech synthesis markup languages are distinguished from dialogue markup

languages. VoiceXML, for example, includes tags related to speech recognition,

dialogue management and touchtone dialing, in addition to text-to-speech

markup.
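As an illustration, the following C# sketch submits a small SSML document to the .NET SpeechSynthesizer (assuming the Windows System.Speech library); the markup, rather than the engine defaults, controls emphasis, pauses and rate.

```csharp
using System.Speech.Synthesis;

class SsmlDemo
{
    static void Main()
    {
        // A small SSML document: the markup, not the engine, decides emphasis and rate.
        string ssml =
            "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">" +
            "  Speech markup can <emphasis>stress</emphasis> a word," +
            "  <break time=\"300ms\"/> or <prosody rate=\"slow\">slow the voice down</prosody>." +
            "</speak>";

        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();
            synth.SpeakSsml(ssml);
        }
    }
}
```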

Applications

Speech synthesis has long been a vital assistive technology tool and its

application in this area is significant and widespread. It allows environmental

barriers to be removed for people with a wide range of disabilities. The longest

application has been in the use of screen readers for people with visual

impairment, but text-to-speech systems are now commonly used by people

with dyslexia and other reading difficulties as well as by pre-literate children.

They are also frequently employed to aid those with severe speech

impairment usually through a dedicated voice output communication aid.

Sites such as Ananova and YAKiToMe! have used speech synthesis to convert

written news to audio content, which can be used for mobile applications.

Speech synthesis techniques are also used in entertainment productions
such as games and anime. In 2007, Animo Limited announced the

development of a software application package based on its speech synthesis

software FineSpeech, explicitly geared towards customers in the entertainment

industries, able to generate narration and lines of dialogue according to user

specifications.[39] 

The application reached maturity in 2008, when NEC Biglobe announced a web

service that allows users to create phrases from the voices of Code Geass:

Lelouch of the Rebellion R2 characters.[40]

TTS applications such as YAKiToMe! and Speakonia are often used to add

synthetic voices to YouTube videos for comedic effect, as in Barney Bunch

videos. YAKiToMe! is also used to convert entire books for personal podcasting

purposes, RSS feeds and web pages for news stories, and educational texts for

enhanced learning.

Software such as Vocaloid can generate singing voices via lyrics and melody.

This is also the aim of the Singing Computer project (which uses GNU

LilyPond and Festival) to help blind people check their lyric input.[41]

Text-to-speech software is also popular
in Interactive Voice Response systems, often in combination with speech

recognition. Examples of such voices can be found

at speechsynthesissoftware.com or Nextup.

C Sharp (programming language)

C# (pronounced "see sharp")[6] is a multi-paradigm programming

language encompassing imperative, declarative, functional, generic, object-

oriented (class-based), and component-oriented programming disciplines. It was

developed by Microsoft within the .NET initiative and later approved as a

standard by Ecma (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the

programming languages designed for the Common Language Infrastructure.

C# is intended to be a simple, modern, general-purpose, object-oriented

programming language.[7] Its development team is led by Anders Hejlsberg. The

most recent version is C# 4.0, which was released on April 12, 2010.
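For context, a first program of the kind a learner might write in C# is shown below; in the tool described in this paper, each such line and the compiler's messages would be read aloud.

```csharp
using System;

// A minimal C# console program: compile, debug and run it, with every stage voice enabled.
class Hello
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello from a voice-enabled editor!");
    }
}
```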

Microsoft Visual Studio

Microsoft Visual Studio is an integrated development environment (IDE)

from Microsoft. It can be used to develop console and graphical user

interface applications along with Windows Forms applications, web sites, web

applications, and web services in both native code together with managed

code for all platforms supported by Microsoft Windows, Windows

Mobile, Windows CE, .NET Framework, .NET Compact Framework and Microsoft

Silverlight.

Visual Studio includes a code editor supporting IntelliSense as well as code

refactoring. The integrated debugger works both as a source-level debugger and

a machine-level debugger. Other built-in tools include a forms designer for

building GUI applications, web designer, class designer, and database

schema designer. It accepts plug-ins that enhance the functionality at almost

every level—including adding support for source-control systems

(like Subversion and Visual SourceSafe) and adding new toolsets like editors and

visual designers for domain-specific languages or toolsets for other aspects of

the software development lifecycle (like the Team Foundation Server client:

Team Explorer).

Visual Studio supports different programming languages by means of language

services, which allow the code editor and debugger to support (to varying

degrees) nearly any programming language, provided a language-specific

service exists. Built-in languages include C/C++ (via Visual C++),
VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual

Studio 2010[2]). Support for other languages such as M, Python, and Ruby among

others is available via language services installed separately. It also

supports XML/XSLT, HTML/XHTML, JavaScript and CSS. Individual language-

specific versions of Visual Studio also exist which provide more limited language

services to the user: Microsoft Visual Basic, Visual J#, Visual C#, and Visual C++.

Microsoft provides "Express" editions of its Visual Studio 2010 components

Visual Basic, Visual C#, Visual C++, and Visual Web Developer at no cost.

Visual Studio 2010, 2008 and 2005 Professional Editions, along with language-

specific versions (Visual Basic, C++, C#, J#) of Visual Studio 2005 are available

for free to students as downloads via Microsoft's DreamSpark program. The 90-

day trial version of Visual Studio can be downloaded by the general public at no

cost.

Text-to-Speech Technology-Based Programming Tool

Introduction

According to the World Health Organization (WHO), globally an estimated 40 to
45 million people are blind and 135 million have low vision [1]. In Australia, over
480,000 people are vision impaired in both eyes, while over 50,000 are blind.

This number is expected to increase to more than 87,000 people within 20 years

[2]. Currently, there are screen reader tools such as JAWS [3], Brailliant Braille

[4] and Window-Eyes Screen Reader [5].

However, the costs for these tools are high and there is no tool that integrates

the environment for compiling and debugging programs. Furthermore, there is

not enough assistance to help blind people learn to program in the leading-edge
language C#. Blind programmers could compete in the IT industry when the
infrastructure was mainframe-based [6]. These days, with computers throughout
the workplace, graphical Windows applications are far more common. This

means that blind programmers are now at a competitive disadvantage in the

workplace and require special tools to be productive.

Blind and vision impaired people require two things to become programmers.

They need up to date knowledge of leading technology, and tools that meet

their own requirements [7]. This affects employment levels for blind and low

vision people. With the current unemployment rate for blind and vision impaired
people at almost 70%, over four times the national average, specialized tools
could help a great many people [8]. Our research project is to design an audio

programming tool that meets specific needs of blind and vision impaired people

in learning C# programming language.


There are different forms of visual impairment: some people are blind from birth or from a very early age, while others lose their sight as a result of accidents, disease or some effects of medication [10]. We therefore concentrate on text-to-speech technology and assume that blind and vision impaired people are not hearing impaired. Text-to-speech technology is used to make all components in the programming tool voice-enabled. Text and other graphical features such as control size, location, and color that a normal vision user can see on the screen will be spoken out by a speech synthesizer.

This tool opens up a real possibility for blind and vision impaired users to become programmers in the future. Currently, blind and vision impaired people have little access to the current tools and assistance required for them to learn programming languages. Our aim is to help them achieve equality of access and opportunity in information technology education that will ensure meaningful and equitable employment throughout their lives.

We have invited blind and vision impaired people to evaluate our programming tool. Evaluations have shown that the tool can help them design and implement programs effectively. Our research project can potentially impact the lives of blind and low vision people. This, coupled with the impending labor shortage as the baby boomers retire, means that anything that can give blind people an opportunity to acquire practical, technical qualifications could greatly benefit both blind people and the whole economy. A tool that teaches programming is also a


programming tool, and it can potentially give jobs to people who were previously unemployable. Our research project may also encourage software development companies, governments, and educational institutions to develop software packages, educational programs and policies that meet the needs of blind and vision impaired people.

2 Current Applications and Projects

Optical character recognition and text-to-speech technologies are currently used in software applications for blind and vision impaired people. The first application is reading books or newspapers. Optical character recognition technology is applied in scanners that scan text and read it aloud. Typical devices for this application, such as Extreme Reader [11], Ovation and SARA (Scanning and Reading Appliance) [12], provide blind users with access to printed and electronic materials.

The scanned text is converted to speech and read aloud. Kurzweil systems scan documents, store them in files, and convert them to audio output [13]. Furthermore, Optical Braille Recognition (OBR) allows a user to scan a Braille page and convert it into text [14]. This is a Windows software application that retrieves information and presents it as text usable in all types of Windows applications. The Braille information in a small letter can be retrieved into computer form in the same easy way. For reading text materials on a computer, the most popular software for blind users is JAWS [3].


This software provides speech and Braille access to the Windows operating system and applications, including Internet Explorer, without the need for special configuration. JAWS also provides a way to access Web pages. A research project has been undertaken by Curtin University, Cisco Systems and the Western Australia Association [10].

The project was to identify tools and techniques appropriate for vision impaired students studying computing at tertiary level. Its recommended improvements included the need for professional development for lecturers and improved student access to electronic educational materials.

A computer education project recognized by the Stockholm Challenge [15] aims to reduce the digital divide and provide education and learning tools in digital format to blind people in Vietnam, for whom paper materials such as school books, newspapers and other reading material are not accessible.

This project aims to create a generation of blind computer users at different levels nationwide, and to provide a community place to acquire computer skills and share information. However, there is no existing software application designed to help blind and vision impaired people learn programming subjects in information technology and engineering. This motivates us to design and implement a simple yet efficient programming tool for blind and vision impaired users to develop software applications. In the next section, we present our proposed


programming tool and show how we can implement it. Testing and evaluation are also presented.

3 Proposed Audio Programming Tool

It has been observed that the more formats of material people can access, the higher their employment opportunities are. There is a higher need for technical skills amongst people who are blind or have low vision, and blind people require supporting tools that meet their specific needs. The programming tool is designed not only for blind users but also for vision impaired and normal vision users. The interface should be designed in a way that complies with W3C standards for vision impaired users and should be user-friendly. The programming tool should be able to help a blind user edit, save, compile, debug and run a program. Moreover, the tool should have program templates and intellisense (auto-completion) options for user convenience. In order to achieve these objectives, an iterative approach was used. Each part was developed, tested, then improved upon and tested again.

This meant that usability issues were continually found and addressed. The tool has been designed to provide voice output for blind users and to display suitable fonts, font sizes and color schemes for vision impaired and normal vision people.

3.1 Audio Code Editor

A user starts editing a program or loads an existing program using the audio code editor. The program in the editor can be saved to a file or can be compiled, debugged and run. For each character entered, the code editor speaks it out. The user can use the left, right, up and down arrow keys to check any character in the program by voice. Some of the key requirements for the code editor are as follows (a short code sketch follows the list):

• Tell the user whenever it is loaded or activated.

• Ask for the user's confirmation before closing the editor, saving a file or opening a file.

• Tell the user the current line number.

• Provide an option for the user to specify a line number and go to that line.

• Provide templates created in advance for Console applications and Windows applications.

• Speak all characters on a line of code.

• For Windows applications, let the user design the graphical user interface by typing details (size, location, text, name, etc.) into the code editor. The code editor will convert these details to C# code and place the code in a file.

• Allow the user to write C# code for event handlers.

• Help the user write code quickly and correctly by speaking out properties, classes, etc.
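The following C# sketch illustrates how such an audio code editor could echo each keystroke and read the current line using the .NET SpeechSynthesizer class. It is a minimal sketch only; the AudioCodeEditor class name and the F2 shortcut are assumptions made for this example, not the tool's actual implementation.

    using System;
    using System.Speech.Synthesis;   // requires a reference to System.Speech.dll
    using System.Windows.Forms;

    public class AudioCodeEditor : Form
    {
        private readonly TextBox codeBox =
            new TextBox { Multiline = true, Dock = DockStyle.Fill, AcceptsReturn = true };
        private readonly SpeechSynthesizer voice = new SpeechSynthesizer();

        public AudioCodeEditor()
        {
            Controls.Add(codeBox);

            // Tell the user whenever the editor is loaded.
            Load += (s, e) => voice.SpeakAsync("Code editor loaded");

            // Speak out each character as it is typed.
            codeBox.KeyPress += (s, e) => voice.SpeakAsync(e.KeyChar.ToString());

            // Read the current line number and its text when F2 is pressed (assumed shortcut).
            codeBox.KeyDown += (s, e) =>
            {
                if (e.KeyCode == Keys.F2 && codeBox.Lines.Length > 0)
                {
                    int line = codeBox.GetLineFromCharIndex(codeBox.SelectionStart);
                    voice.SpeakAsync("Line " + (line + 1) + ". " + codeBox.Lines[line]);
                }
            };
        }

        [STAThread]
        public static void Main()
        {
            Application.Run(new AudioCodeEditor());
        }
    }

The same pattern extends naturally to the other requirements above, for example confirming before the editor is closed or jumping to a spoken line number.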


3.2 Audio Compiler and Debugger

The code compiler uses the C# software development toolkit (SDK) to compile the program. However, to obtain voice output, a code modifier first adds the corresponding voice code to the current program, and the C# SDK then compiles the modified program. For a Console application, adding voice code is performed by identifying the code that produces text output and adding the corresponding voice code. For a Windows program, adding voice code is more complex. Mouse and key event handlers are added so that the user can use the mouse or keyboard to design a Windows form. Voice is output when a control on the Windows form receives focus, to let the user know what the control is. The compiler also lets the user know whether the compilation is successful or whether there is a compiling error.
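As a hedged illustration of the Console case, the sketch below redirects Console.WriteLine calls to a helper class that both prints and speaks the text. The VoiceConsole and CodeModifier names, and the simple text substitution, are assumptions made for this example; the tool's actual code modifier is not published in this form.

    using System;
    using System.Speech.Synthesis;   // requires a reference to System.Speech.dll

    // Helper that the modified program calls instead of Console.WriteLine:
    // it writes the text as usual and also speaks it.
    public static class VoiceConsole
    {
        private static readonly SpeechSynthesizer voice = new SpeechSynthesizer();

        public static void WriteLine(object value)
        {
            Console.WriteLine(value);               // normal text output
            voice.Speak(Convert.ToString(value));   // spoken copy of the same output
        }
    }

    // Naive code modifier: rewrites the user's source so that every
    // Console.WriteLine call goes through VoiceConsole.WriteLine instead.
    public static class CodeModifier
    {
        public static string AddVoice(string source)
        {
            return source.Replace("Console.WriteLine", "VoiceConsole.WriteLine");
        }
    }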

When there is a compiling error, the compiler tells the user that there are compiling errors and then reads out all the error details, with the file name and line number. If the user presses predefined shortcut keys, it stops reading, jumps to that line in that file and reads that line to the user. The user can then fix the code and press the shortcut keys again to hear the next error, if any.
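A minimal sketch of this compile-and-report step is shown below, assuming the modified source is compiled with the CodeDom C# provider and each error is spoken in turn. The AudioCompiler class name is an assumption made for this example.

    using System.CodeDom.Compiler;
    using System.Speech.Synthesis;
    using Microsoft.CSharp;

    public static class AudioCompiler
    {
        public static void CompileAndReport(string sourceFile)
        {
            var voice = new SpeechSynthesizer();
            var provider = new CSharpCodeProvider();
            var options = new CompilerParameters { GenerateExecutable = true };
            options.ReferencedAssemblies.Add("System.dll");
            options.ReferencedAssemblies.Add("System.Speech.dll");

            CompilerResults results = provider.CompileAssemblyFromFile(options, sourceFile);

            if (!results.Errors.HasErrors)
            {
                voice.Speak("Compilation successful");
                return;
            }

            // Note: the Errors collection also contains warnings; a real tool would filter them.
            voice.Speak("There are " + results.Errors.Count + " compiling errors");
            foreach (CompilerError error in results.Errors)
            {
                // File name, line number and message, as described above.
                voice.Speak(error.FileName + ", line " + error.Line + ": " + error.ErrorText);
            }
        }
    }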

3.3 Audio Output

As with the compiler, voice code is added to the program before it is compiled, and this is done for any program that produces non-graphics or graphics output. Mouse or key event handlers are added to provide audio output when the user moves the mouse over a control or presses the Tab key to focus on that control.
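The sketch below illustrates this idea, assuming a helper that walks a form's controls and speaks a short description when a control receives focus or the mouse enters it. The AudioOutput class, the AttachVoice helper and the description format are assumptions made for this example.

    using System.Speech.Synthesis;   // requires a reference to System.Speech.dll
    using System.Windows.Forms;

    public static class AudioOutput
    {
        private static readonly SpeechSynthesizer voice = new SpeechSynthesizer();

        // Attach spoken descriptions to every control on the form, recursively.
        public static void AttachVoice(Control root)
        {
            foreach (Control child in root.Controls)
            {
                Control c = child; // local copy captured by the lambdas below
                c.GotFocus += (s, e) => voice.SpeakAsync(Describe(c));
                c.MouseEnter += (s, e) => voice.SpeakAsync(Describe(c));
                AttachVoice(c);    // descend into containers such as panels and group boxes
            }
        }

        private static string Describe(Control c)
        {
            return c.GetType().Name + " " + c.Name + ", " + c.Text;
        }
    }

In a generated Windows program, the code modifier could insert a call such as AudioOutput.AttachVoice(this) in the form's constructor so that every control announces itself when focused.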

3.4 System Architecture


Figure 1 presents the architectural design of the audio programming tool. The C# and text-to-speech software development toolkits (SDKs) are used. The user can start a new project by choosing a template from a list of available templates. If the project is a Windows application, then the user can use the built-in GUI builder to create Windows controls by entering property values such as location, name, text, size, etc. When the user writes code, the built-in code auto-completer helps the user write long class or method names.
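As an illustration of the GUI builder step, the sketch below turns a single typed description line into C# form code. The one-line input format ("Button name=btnOk text=OK location=10,20 size=75,23") and the GuiBuilder name are assumptions made for this example, not the tool's actual input syntax.

    using System.Text;

    public static class GuiBuilder
    {
        // Converts e.g. "Button name=btnOk text=OK location=10,20 size=75,23"
        // into the C# statements that create and configure the control.
        public static string ToCSharp(string line)
        {
            string[] parts = line.Split(' ');
            string type = parts[0];
            string name = "control1";
            var code = new StringBuilder();

            foreach (string part in parts)
            {
                string[] pair = part.Split('=');
                if (pair.Length == 2 && pair[0] == "name") name = pair[1];
            }

            code.AppendLine("var " + name + " = new " + type + "();");
            code.AppendLine(name + ".Name = \"" + name + "\";");

            foreach (string part in parts)
            {
                string[] pair = part.Split('=');
                if (pair.Length != 2) continue;
                if (pair[0] == "text")
                    code.AppendLine(name + ".Text = \"" + pair[1] + "\";");
                else if (pair[0] == "location")
                    code.AppendLine(name + ".Location = new System.Drawing.Point(" + pair[1] + ");");
                else if (pair[0] == "size")
                    code.AppendLine(name + ".Size = new System.Drawing.Size(" + pair[1] + ");");
            }

            code.AppendLine("this.Controls.Add(" + name + ");");
            return code.ToString();
        }
    }

The generated statements can then be placed into the form's source file, and the code editor can read them back to the user in the usual way.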

When the user finishes the program and wants to compile and run it, the compiler analyzes the program and adds code to produce voice accordingly. The modified program is then compiled and debugged. Errors, if any, are output to a file, and the speech SDK reads out one error at a time and guides the user to the line of code that contains the error. This procedure is repeated until there are no errors in the program, and the C# SDK then runs it. Voice and text or graphics are output, and the user can use the mouse or shortcut keys to check the outputs.

[Figure 1. Architectural design of the audio programming tool]

It is noted that if the blind user saves the project to files and runs it in the normal Visual Studio .NET, the output will be text or graphics only. Voice output is only available if the user runs the project in the audio programming tool.

4 Testing and Evaluation

The proposed audio programming tool has been tested and evaluated by normal vision users and then by blind and vision impaired users. In the first test, normal vision users were required not to watch the computer monitor while they tested the programming tool. It was observed that they were able to complete all stages of writing a program by listening to the voice output from the tool. In the second test, standard keyboards and the built-in text-to-speech tools were used. We found that vision impaired and blind users were also able to perform the same task. However, vision impaired users preferred applications driven by the mouse while blind users preferred those driven by the keyboard. Most blind and vision impaired people are familiar with the shortcut keys defined in JAWS, so adding new shortcut keys to the programming tool is not recommended. The tool's shortcut keys have therefore been changed to meet their specific needs. More programming lessons need to be provided to help users become familiar with programming in .NET.


5 Conclusion

We have presented our design and implementation of an audio programming tool

for blind and vision impaired people to learn programming in C#, a .NET

language. The programming tool was designed not only for blind and vision

impaired users but also for normal vision users. The programming tool was able

to help a blind user edit, save, compile, debug and run a program.

Moreover, the tool also had program templates and auto-completion options for user convenience.

The tool opens up a real possibility for blind and vision impaired users to become programmers in the future and to achieve equality of access and opportunity in information technology education, which will ensure meaningful and equitable employment throughout their lives.


References:

[1] World Health Organization (2003). Retrieved from http://www.who.int/mediacentre/news/releases/2003/pr73/en/

[2] Access Economics (2004). Clear Insight: The Economic Impact and Cost of Vision Loss in Australia. http://www.bca.org.au/natpol/statistics/

[3] JAWS (2007). Retrieved from http://www.freedomscientific.com/fs_products/software_jaws.asp

[4] Brailliant Braille (2007). Retrieved from http://humanware.ca/web/en/p_OP_Brailliant.asp

[5] Window-Eyes Screen Reader. http://www.tandt-consultancy.com/window_eyes.html

[6] Alexander, Steve (1998). Blind programmers face an uncertain future. Retrieved from CNN: http://www.cnn.com/TECH/computing/9811/06/blindprog.idg/index.html

[7] Elkes, J. G. (1982). Designing Software for Blind Programmers. Public Utilities Commission of Ohio. Retrieved from http://delivery.acm.org/10.1145/970000/964173/p15-elkes.pdf?key1=964173&key2=4640659711&coll=GUIDE&dl=GUIDE&CFID=22945606&CFTOKEN=95515984

[8] Vision Australia (2007). Results and Observations from Research into Employment Levels in Australia. Retrieved from http://www.visionaustralia.org.au/docs/news_events/Employment_Overview.doc

[9] Kopecek & Jergova (1998). Programming and visually impaired people. In Proceedings of ICCHP'98, Wien-Budapest.

[10] Ian Murray and Helen Armstrong (2004). "A Computing Education Vision for the Sight Impaired". In Proceedings of the Sixth Australasian Computing Education Conference.

[11] Extreme Reader. Retrieved from http://www.brailler.com/extrdr.htm

[12] Ovation and SARA. Retrieved from http://www.abledata.com/abledata.cfm?pageid=19327&ksectionid=19327&top=13293

[13] Kurzweil Education System. Retrieved from http://www.kurzweiledu.com/

[14] Optical Braille Recognition. Retrieved from http://www.neovision.cz/prods/obr/

[15] Computer Education for blind people. Retrieved from http://www.stockholmchallenge.se/data/computer_education_and_it