Text-to-Speech
Technology-Based
Programming Tool
ABSTRACT
This paper presents an audio programming tool based on text-to-speech
technology for blind and vision impaired people to learn programming. The tool
can help users edit a program then compile, debug and run it. All of these stages
are voice enabled. The programming language for evaluation is C# and the tool
is developed in Visual Studio .NET. Evaluations have shown that the
programming tool can help blind and vision impaired people implement software
applications and achieve equality of access and opportunity in information
technology education.
Introduction
Blindness is the condition of lacking visual perception due
to physiological or neurological factors.
Various scales have been developed to describe the extent of vision loss and
define blindness.[1] Total blindness is the complete lack of form and visual light
perception and is clinically recorded as NLP, an abbreviation for "no light
perception."[1] Blindness is frequently used to describe severe visual
impairment with residual vision. Those described as having only light perception
have no more sight than the ability to tell light from dark and the general direction
of a light source.
In order to determine which people may need special assistance because of their
visual disabilities, various governmental jurisdictions have formulated more
complex definitions referred to as legal blindness.[2] In North America and most
of Europe, legal blindness is defined as visual acuity (vision) of 20/200 (6/60) or
less in the better eye with best correction possible. This means that a legally
blind individual would have to stand 20 feet (6.1 m) from an object to see it—
with corrective lenses—with the same degree of clarity as a normally sighted
person could from 200 feet (61 m). In many areas, people with average acuity
who nonetheless have a visual field of less than 20 degrees (the norm being 180
degrees) are also classified as being legally blind. Approximately ten percent of
those deemed legally blind, by any measure, have no vision.
The rest have some vision, from light perception alone to relatively good
acuity. Low vision is sometimes used to describe visual acuities from 20/70 to
20/200.[3]
By the 10th Revision of the WHO International Statistical Classification of
Diseases, Injuries and Causes of Death, low vision is defined as visual acuity of
less than 20/60 (6/18), but equal to or better than 20/200 (6/60), or corresponding
visual field loss to less than 20 degrees, in the better eye with best possible
correction. Blindness is defined as visual acuity of less than 20/400 (6/120), or
corresponding visual field loss to less than 10 degrees, in the better eye with best
possible correction.[4][5]
Blind people with undamaged eyes may still register light non-visually for the
purpose of circadian entrainment to the 24-hour light/dark cycle. Light signals for
this purpose travel through the retinohypothalamic tract, so a damaged optic
nerve beyond where the retinohypothalamic tract exits it is no hindrance.
Causes
Serious visual impairment has a variety of causes:
Diseases
According to WHO estimates, the most common causes of blindness around the
world in 2002 were:
1. cataracts (47.9%),
2. glaucoma (12.3%),
3. age-related macular degeneration (8.7%),
4. corneal opacity (5.1%),
5. diabetic retinopathy (4.8%),
6. childhood blindness (3.9%),
7. trachoma (3.6%), and
8. onchocerciasis (0.8%).[13]
In terms of the worldwide prevalence of blindness, the vastly greater number of
people in the developing world and the greater likelihood of their being affected
mean that the causes of blindness in those areas are numerically more
important. Cataract is responsible for more than 22 million cases of blindness
and glaucoma 6 million, while leprosy and onchocerciasis each blind
approximately 1 million individuals worldwide. The number of individuals blind
from trachoma has dropped dramatically in the past 10 years from 6 million to 1.3
million, putting it in seventh place on the list of causes of blindness worldwide.
Xerophthalmia is estimated to affect 5 million children each year; 500,000
develop active corneal involvement, and half of these go blind. Central corneal
ulceration is also a significant cause of monocular blindness worldwide,
accounting for an estimated 850,000 cases of corneal blindness every year in the
Indian subcontinent alone. As a result, corneal scarring from all causes is now
the fourth greatest cause of global blindness (Vaughan & Asbury's General
Ophthalmology, 17e).
People in developing countries are significantly more likely to experience visual
impairment as a consequence of treatable or preventable conditions than are
their counterparts in the developed world. While vision impairment is most
common in people over age 60 across all regions, children in poorer communities
are more likely to be affected by blinding diseases than are their more affluent
peers.
The link between poverty and treatable visual impairment is most obvious when
conducting regional comparisons of cause. Most adult visual impairment in North
America and Western Europe is related to age-related macular degeneration and
diabetic retinopathy. While both of these conditions are subject to treatment,
neither can be cured.
In developing countries, wherein people have shorter life expectancies, cataracts
and water-borne parasites—both of which can be treated effectively—are most
often the culprits (see river blindness, for example). Of the estimated 40 million
blind people located around the world, 70–80% can have some or all of their
sight restored through treatment.
In developed countries where parasitic diseases are less common and cataract
surgery is more available, age-related macular degeneration, glaucoma, and
diabetic retinopathy are usually the leading causes of blindness.[14]
Childhood blindness can be caused by conditions related to pregnancy, such
as congenital rubella syndrome and retinopathy of prematurity.
Abnormalities and injuries
Eye injuries, most often occurring in people under 30, are the leading cause of
monocular blindness (vision loss in one eye) throughout the United States.
Injuries and cataracts affect the eye itself, while abnormalities such as optic
nerve hypoplasia affect the nerve bundle that sends signals from the eye to the
back of the brain, which can lead to decreased visual acuity.
People with injuries to the occipital lobe of the brain can, despite having
undamaged eyes and optic nerves, still be legally or totally blind.
Genetic defects
People with albinism often have vision loss to the extent that many are legally
blind, though few of them actually cannot see. Leber's congenital amaurosis can
cause total blindness or severe sight loss from birth or early childhood.
Recent advances in mapping of the human genome have identified other genetic
causes of low vision or blindness. One such example is Bardet-Biedl syndrome.
Poisoning
Rarely, blindness is caused by the intake of certain chemicals. A well-known
example is methanol, which is itself only mildly toxic and minimally
intoxicating. When it is not competing with ethanol for metabolism, however,
methanol breaks down into formaldehyde and formic acid, which in turn can cause
blindness, an array of other health complications, and death.[15] Methanol is
commonly found in methylated spirits (denatured ethyl alcohol), where it is added
so that the product avoids the taxes levied on ethanol intended for human
consumption. Methylated spirits are sometimes used by alcoholics as a desperate
and cheap substitute for regular alcoholic beverages.
Willful actions
Blinding has been used as an act of vengeance and torture in some instances, to
deprive a person of a major sense by which they can navigate or interact within
the world, act fully independently, and be aware of events surrounding them. An
example from the classical realm is Oedipus, who gouges out his own eyes after
realizing that he fulfilled the awful prophecy spoken of him.
In 2003, a Pakistani anti-terrorism court sentenced a man to be blinded after he
carried out an acid attack against his fiancee that resulted in her blinding.[16] The
same sentence was given in 2009 for the man who blinded Ameneh Bahrami.
Comorbidities
Blindness can occur in combination with such conditions as mental
retardation, autism, cerebral palsy, hearing impairments, and epilepsy.[17][18] In a
study of 228 visually impaired children in metropolitan Atlanta between 1991 and
1993, 154 (68%) had an additional disability besides visual impairment.[17]
Blindness in combination with hearing loss is known as deafblindness.
Management
A 2008 study published in the New England Journal of Medicine[19] tested the
effect of using gene therapy to help restore the sight of patients with a rare form
of inherited blindness, known as Leber Congenital Amaurosis or LCA. Leber
Congenital Amaurosis damages the light receptors in the retina and usually
begins affecting sight in early childhood, with worsening vision until complete
blindness around the age of 30.
The study used a common cold virus to deliver a normal version of the gene
called RPE65 directly into the eyes of affected patients. Remarkably, all three
patients (aged 19, 22, and 25) responded well to the treatment and reported
improved vision following the procedure. Given the age of the patients and the
degenerative nature of LCA, the improvement of vision in gene therapy patients is
encouraging for researchers. It is hoped that gene therapy may be even more
effective in younger LCA patients who have experienced limited vision loss, as
well as in other blind or partially blind individuals.
Two experimental treatments for retinal problems include a cybernetic
replacement and transplant of fetal retinal cells.[20]
Adaptive techniques and aids
Mobility
[Figure: Folded long cane]
Many people with serious visual impairments can travel independently, using a
wide range of tools and techniques. Orientation and mobility specialists are
professionals who are specifically trained to teach people with visual impairments
how to travel safely, confidently, and independently in the home and the
community. These professionals can also help blind people to practice travelling
on specific routes which they may use often, such as the route from one's house
to a convenience store. Becoming familiar with an environment or route can
make it much easier for a blind person to navigate successfully.
Tools such as the white cane with a red tip - the international symbol of blindness
- may also be used to improve mobility. A long cane is used to extend the user's
range of touch sensation. It is usually swung in a low sweeping motion, across
the intended path of travel, to detect obstacles.
However, techniques for cane travel can vary depending on the user and/or the
situation. Some visually impaired persons do not carry these kinds of canes,
opting instead for the shorter, lighter identification (ID) cane. Still others require a
support cane. The choice depends on the individual's vision, motivation, and
other factors.
A small number of people employ guide dogs to assist in mobility. These dogs
are trained to navigate around various obstacles, and to indicate when it
becomes necessary to go up or down a step. However, the helpfulness of guide
dogs is limited by the inability of dogs to understand complex directions. The
human half of the guide dog team does the directing, based upon skills acquired
through previous mobility training. In this sense, the handler might be likened to
an aircraft's navigator, who must know how to get from one place to another, and
the dog to the pilot, who gets them there safely.
In addition, some blind people use software using GPS technology as a mobility
aid. Such software can assist blind people with orientation and navigation, but it
is not a replacement for traditional mobility tools such as white canes and guide
dogs.
Government actions are sometimes taken to make public places more accessible
to blind people. Public transportation is freely available to the blind in many
cities. Tactile paving and audible traffic signals can make it easier and safer for
visually impaired pedestrians to cross streets. In addition to making rules about
who can and cannot use a cane, some governments mandate the right-of-way be
given to users of white canes or guide dogs.
Reading and magnification
[Figure: Watch for the blind]
Most visually impaired people who are not totally blind read print, either of a
regular size or enlarged by magnification devices. Many also read large-print,
which is easier for them to read without such devices. A variety of magnifying
glasses, some handheld, and some on desktops, can make reading easier for
them.
Others read Braille (or the infrequently used Moon type), or rely on talking
books and readers or reading machines, which convert printed text to speech
or Braille. They use computers with special hardware such
as scanners and refreshable Braille displays as well as software written
specifically for the blind, such as optical character recognition applications
and screen readers.
Some people access these materials through agencies for the blind, such as
the National Library Service for the Blind and Physically Handicapped in the
United States, the National Library for the Blind or the RNIB in the United
Kingdom.
Closed-circuit televisions, equipment that enlarges and contrasts textual items,
are a more high-tech alternative to traditional magnification devices.
There are also over 100 radio reading services throughout the world that provide
people with vision impairments with readings from periodicals over the radio. The
International Association of Audio Information Services provides links to all of
these organizations.
Computers
Access technology such as screen readers, screen magnifiers and refreshable
Braille displays enables the blind to use mainstream computer applications
and mobile phones. The availability of assistive technology is increasing,
accompanied by concerted efforts to ensure the accessibility of information
technology to all potential users, including the blind. Later versions of Microsoft
Windows include an Accessibility Wizard and Magnifier for those with partial vision,
and Microsoft Narrator, a simple screen reader. Linux distributions (as live CDs)
for the blind include Oralux and Adriane Knoppix, the latter developed in part
by Adriane Knopper, who has a visual impairment. Mac OS also comes with a
built-in screen reader, called VoiceOver.
The movement towards greater web accessibility is opening a far wider number
of websites to adaptive technology, making the web a more inviting place for
visually impaired surfers.
Experimental approaches in sensory substitution are beginning to provide access
to arbitrary live views from a camera.
Other aids and techniques
[Figure: A tactile feature on a Canadian banknote]
Blind people may use talking equipment such as thermometers, watches,
clocks, scales, calculators, and compasses. They may also enlarge or mark dials
on devices such as ovens and thermostats to make them usable. Other
techniques used by blind people to assist them in daily activities include:
Adaptations of coins and banknotes so that the value can be determined by touch.
For example:
In some currencies, such as the euro, the pound sterling and the Indian rupee, the
size of a note increases with its value.
Among US coins, pennies and dimes are similar in size, as are nickels and
quarters. The larger denomination in each pair (the dime and the quarter) has
ridges along its edge (historically used to prevent the "shaving" of precious
metals from the coins), which can now be used for identification.
Epidemiology
The WHO estimates that in 2002 there were 161 million visually impaired people
in the world (about 2.6% of the total population). Of this number 124 million
(about 2%) had low vision and 37 million (about 0.6%) were blind.[22] In order of
frequency the leading causes were cataract, uncorrected refractive errors
(nearsightedness, farsightedness, or astigmatism), glaucoma, and age-related macular
degeneration.[23] In 1987, it was estimated that 598,000 people in the United
States met the legal definition of blindness.[24] Of this number, 58% were over the
age of 65.[24] In 1994-1995, 1.3 million Americans reported legal blindness.[25]
Speech synthesis
Speech synthesis is the artificial production of human speech. A computer
system used for this purpose is called a speech synthesizer, and can be
implemented in software or hardware. A text-to-speech (TTS) system converts
normal language text into speech; other systems render symbolic linguistic
representations like phonetic transcriptions into speech.[1]
Synthesized speech can be created by concatenating pieces of recorded speech
that are stored in a database. Systems differ in the size of the stored speech
units; a system that stores phones or diphones provides the largest output
range, but may lack clarity.
For specific usage domains, the storage of entire words or sentences allows for
high-quality output. Alternatively, a synthesizer can incorporate a model of
the vocal tract and other human voice characteristics to create a completely
"synthetic" voice output.[2]
The quality of a speech synthesizer is judged by its similarity to the human voice
and by its ability to be understood. An intelligible text-to-speech program allows
people with visual impairments or reading disabilities to listen to written works
on a home computer. Many computer operating systems have included speech
synthesizers since the early 1980s.
Overview of text processing
[Figure: Overview of a typical TTS system]
A text-to-speech system (or "engine") is composed of two parts[3]: a front-end and
a back-end. The front-end has two major tasks. First, it converts raw text
containing symbols like numbers and abbreviations into the equivalent of written-
out words. This process is often called text normalization, pre-processing,
or tokenization. The front-end then assigns phonetic transcriptions to each word,
and divides and marks the text into prosodic units, like phrases, clauses,
and sentences. The process of assigning phonetic transcriptions to words is
called text-to-phoneme or grapheme-to-phoneme conversion.
Phonetic transcriptions and prosody information together make up the symbolic
linguistic representation that is output by the front-end. The back-end, often
referred to as the synthesizer, then converts the symbolic linguistic
representation into sound. In certain systems, this part includes the computation
of the target prosody (pitch contour, phoneme durations[4]), which is then imposed
on the output speech.
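
The front-end stages described above can be illustrated with a minimal sketch. It is not the tool's implementation: the abbreviation table, the tiny pronunciation lexicon, the phoneme symbols and the function names are all assumptions invented for this example, and a real front-end would use far larger resources.

```python
import re

# Hypothetical lookup tables, invented for this sketch.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "no.": "number"}
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "lives": ["L", "IH", "V", "Z"],
    "here": ["HH", "IY", "R"],
}


def normalize(text):
    """Expand known abbreviations and strip punctuation from other tokens."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        cleaned = re.sub(r"[^a-z0-9']", "", token)
        if cleaned:
            words.append(cleaned)
    return words


def to_phonemes(words):
    """Attach a phonetic transcription to each word.

    Words missing from the toy lexicon are spelled out letter by letter,
    a crude stand-in for real letter-to-sound rules.
    """
    return [(word, LEXICON.get(word, list(word.upper()))) for word in words]


def front_end(text):
    """Return one list of (word, phonemes) pairs per prosodic phrase."""
    # Phrases are split on commas, semicolons, '!' and '?' only; periods are
    # left alone here because they also end abbreviations such as "Dr.".
    phrases = [p for p in re.split(r"[,;!?]+", text) if p.strip()]
    return [to_phonemes(normalize(p)) for p in phrases]


for phrase in front_end("Dr. Smith lives here, I think."):
    print(phrase)
```

The output of `front_end` plays the role of the symbolic linguistic representation: a back-end would take each phrase and turn its phoneme sequence into sound.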
History
Long before electronic signal processing was invented, there were those who
tried to build machines to create human speech. Some early legends of the
existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus
Magnus (1198–1280), and Roger Bacon (1214–1294).
In 1779, the Danish scientist Christian Kratzenstein, working at the Russian
Academy of Sciences, built models of the human vocal tract that could produce
the five long vowel sounds (in International Phonetic Alphabet notation, they
are [aː], [eː], [iː], [oː] and [uː]).[5] This was followed by the bellows-operated
"acoustic-mechanical speech machine" by Wolfgang von
Kempelen of Vienna, Austria, described in a 1791 paper.[6] This machine added
models of the tongue and lips, enabling it to produce consonants as well as
vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on
von Kempelen's design, and in 1857, M. Faber built the "Euphonia".
Wheatstone's design was resurrected in 1923 by Paget.[7]
In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated
electronic speech analyzer and synthesizer that was said to be clearly
intelligible. Homer Dudley refined this device into the VODER, which he exhibited
at the 1939 New York World's Fair.
The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues
at Haskins Laboratories in the late 1940s and completed in 1950. There were
several different versions of this hardware device but only one currently survives.
The machine converts pictures of the acoustic patterns of speech in the form of a
spectrogram back into sound. Using this device, Alvin Liberman and colleagues
were able to discover acoustic cues for the perception of phonetic segments
(consonants and vowels).
Dominant systems in the 1980s and 1990s were the MITalk system, based
largely on the work of Dennis Klatt at MIT, and the Bell Labs system;[8] the latter
was one of the first multilingual language-independent systems, making
extensive use of Natural Language Processing methods.
Early electronic speech synthesizers sounded robotic and were often barely
intelligible. The quality of synthesized speech has steadily improved, but output
from contemporary speech synthesis systems is still clearly distinguishable from
actual human speech.
As the cost-performance ratio improves, speech synthesizers become cheaper
and more accessible, so more people will benefit from the use of
text-to-speech programs.[9]
Electronic devices
The first computer-based speech synthesis systems were created in the late
1950s, and the first complete text-to-speech system was completed in 1968. In
1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[10] used
an IBM 704 computer to synthesize speech, an event among the most prominent
in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated
the song "Daisy Bell", with musical accompaniment from Max Mathews.
Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce
at the Bell Labs Murray Hill facility. Clarke was so impressed by the
demonstration that he used it in the climactic scene of his screenplay for his
novel 2001: A Space Odyssey,[11] where the HAL 9000 computer sings the same
song as it is being put to sleep by astronaut Dave Bowman.[12] Despite the
success of purely electronic speech synthesis, research is still being conducted
into mechanical speech synthesizers.[13]
Handheld electronics featuring speech synthesis began emerging in the 1970s.
One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable
calculator for the blind in 1976.[14][15] Other devices were produced primarily for
educational purposes, such as Speak & Spell, produced by Texas
Instruments[16] in 1978. The first multi-player game using voice synthesis
was Milton from Milton Bradley Company, which produced the device in 1980.
Synthesizer technologies
The most important qualities of a speech synthesis system
are naturalness and intelligibility. Naturalness describes how closely the output
sounds like human speech, while intelligibility is the ease with which the output is
understood. The ideal speech synthesizer is both natural and intelligible. Speech
synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms
are concatenative synthesis and formant synthesis. Each technology has
strengths and weaknesses, and the intended uses of a synthesis system will
typically determine which approach is used.
Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of
segments of recorded speech. Generally, concatenative synthesis produces the
most natural-sounding synthesized speech. However, differences between
natural variations in speech and the nature of the automated techniques for
segmenting the waveforms sometimes result in audible glitches in the output.
There are three main sub-types of concatenative synthesis.
Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During
database creation, each recorded utterance is segmented into some or all of the
following: individual phones, diphones, half-phones, syllables, morphemes,
words, phrases, and sentences. Typically, the
division into segments is done using a specially modified speech recognizer set
to a "forced alignment" mode with some manual correction afterward, using
visual representations such as the waveform and spectrogram.[17] An index of the
units in the speech database is then created based on the segmentation and
acoustic parameters like the fundamental frequency (pitch), duration, position in
the syllable, and neighboring phones. At runtime, the desired target utterance is
created by determining the best chain of candidate units from the database (unit
selection). This process is typically achieved using a specially weighted decision
tree.
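
To make "the best chain of candidate units" concrete, the sketch below frames the search as a shortest-path problem over target costs and join costs, solved with a Viterbi-style dynamic program. This is a different technique from the weighted decision tree mentioned above, used here only because it is compact. The `Unit` fields, the cost functions and the sample data are assumptions invented for this example.

```python
from collections import namedtuple

# Hypothetical unit record: phone label, recorded pitch in Hz, source
# utterance, and position of the unit inside that utterance.
Unit = namedtuple("Unit", "phone pitch utterance position")


def target_cost(target, unit):
    """Mismatch between the desired pitch and the recorded pitch."""
    return abs(target[1] - unit.pitch)


def join_cost(left, right):
    """Units that were adjacent in one recording join for free."""
    if left.utterance == right.utterance and left.position + 1 == right.position:
        return 0.0
    return 10.0 + abs(left.pitch - right.pitch)


def select_units(targets, candidates):
    """Return the cheapest chain with one candidate per target position."""
    # best[i][j] = (cost of the cheapest chain ending in candidates[i][j],
    #               index of its predecessor in candidates[i - 1])
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)
    # Trace the back-pointers from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]


targets = [("h", 120), ("e", 125), ("l", 120)]  # (phone, desired pitch)
candidates = [
    [Unit("h", 120, "u1", 0), Unit("h", 140, "u2", 3)],
    [Unit("e", 150, "u1", 5), Unit("e", 135, "u2", 4)],
    [Unit("l", 120, "u3", 0), Unit("l", 130, "u2", 5)],
]
chain = select_units(targets, candidates)
print([(u.phone, u.utterance) for u in chain])
```

In this toy data the three units cut from utterance "u2" win even though none matches its target pitch exactly, because they were recorded contiguously and so incur no join cost.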
Unit selection provides the greatest naturalness, because it applies only a small
amount of digital signal processing (DSP) to the recorded speech. DSP often
makes recorded speech sound less natural, although some systems use a small
amount of signal processing at the point of concatenation to smooth the
waveform. The output from the best unit-selection systems is often
indistinguishable from real human voices, especially in contexts for which the
TTS system has been tuned. However, maximum naturalness typically requires
unit-selection speech databases to be very large, in some systems ranging into
the gigabytes of recorded data, representing dozens of hours of speech.[18] Also,
unit selection algorithms have been known to select segments from a place that
results in less than ideal synthesis (e.g. minor words become unclear) even when
a better choice exists in the database.[19]
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all
the diphones (sound-to-sound transitions) occurring in a language. The number
of diphones depends on the phonotactics of the language: for example, Spanish
has about 800 diphones, and German about 2500. In diphone synthesis, only
one example of each diphone is contained in the speech database. At runtime,
the target prosody of a sentence is superimposed on these minimal units by
means of digital signal processing techniques such as linear predictive
coding, PSOLA[20] or MBROLA.[21] The quality of the resulting speech is generally
worse than that of unit-selection systems, but more natural-sounding than the
output of formant synthesizers. Diphone synthesis suffers from the sonic glitches
of concatenative synthesis and the robotic-sounding nature of formant synthesis,
and has few of the advantages of either approach other than small size. As such,
its use in commercial applications is declining, although it continues to be used in
research because there are a number of freely available software
implementations.
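
As a small illustration of what a diphone database is indexed by, the sketch below turns a phone sequence into the diphone names a synthesizer would look up. The `_` silence marker, the `a-b` naming scheme and the function name are assumptions made for this example, not a convention taken from any particular system.

```python
def to_diphones(phones):
    """Pair each phone with its successor, padding with silence ('_')."""
    padded = ["_"] + list(phones) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["k", "a", "t"]))
```

A sequence of n phones always yields n + 1 diphones, and each one is a key into the database of recorded transitions.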
Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to
create complete utterances. It is used in applications where the variety of texts
the system will output is limited to a particular domain, like transit schedule
announcements or weather reports.[22] The technology is very simple to
implement, and has been in commercial use for a long time, in devices like
talking clocks and calculators. The level of naturalness of these systems can be
very high because the variety of sentence types is limited, and they closely
match the prosody and intonation of the original recordings.
Because these systems are limited by the words and phrases in their databases,
they are not general-purpose and can only synthesize the combinations of words
and phrases with which they have been preprogrammed. However, the blending of
words within naturally spoken language can still cause problems unless the
many variations are taken into account. For example, in non-rhotic dialects of
English the "r" in words like "clear" /ˈkliːə/ is usually only pronounced when the
following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌkliːəɹ
ˈɑʊt/). Likewise in French, many final consonants become no longer silent if
followed by a word that begins with a vowel, an effect called liaison.
This alternation cannot be reproduced by a simple word-concatenation system,
which would require additional complexity to be context-sensitive.
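
A talking clock is a convenient way to picture domain-specific synthesis. The sketch below only chooses which prerecorded clips to play and in what order; the clip names (such as `the_time_is` and `o_clock`) and the assumption that one recording exists per minute value are invented for this example.

```python
HOURS = ["twelve", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten", "eleven"]


def clock_prompts(hour, minute):
    """Return the ordered clip names that announce a 24-hour clock time."""
    clips = ["the_time_is", HOURS[hour % 12]]
    if minute == 0:
        clips.append("o_clock")
    elif minute < 10:
        clips += ["oh", str(minute)]  # e.g. "two oh five"
    else:
        clips.append(str(minute))
    clips.append("am" if hour < 12 else "pm")
    return clips


print(clock_prompts(14, 5))
```

Because the clip list is assembled without regard to neighboring words, a scheme like this cannot reproduce context-dependent effects such as liaison unless extra variants are recorded and selected by context.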
Formant synthesis
Formant synthesis does not use human speech samples at runtime. Instead, the
synthesized speech output is created using additive synthesis and an acoustic
model (physical modelling synthesis).[23] Parameters such as fundamental
frequency, voicing, and noise levels are varied over time to create a waveform of
artificial speech. This method is sometimes called rules-based synthesis;
however, many concatenative systems also have rules-based components. Many
systems based on formant synthesis technology generate artificial, robotic-
sounding speech that would never be mistaken for human speech. However,
maximum naturalness is not always the goal of a speech synthesis system, and
formant synthesis systems have advantages over concatenative systems.
Formant-synthesized speech can be reliably intelligible, even at very high
speeds, avoiding the acoustic glitches that commonly plague concatenative
systems. High-speed synthesized speech is used by the visually impaired to
quickly navigate computers using a screen reader. Formant synthesizers are
usually smaller programs than concatenative systems because they do not have
a database of speech samples. They can therefore be used in embedded
systems, where memory and microprocessor power are especially limited.
Because formant-based systems have complete control of all aspects of the
output speech, a wide variety of prosodies and intonations can be output,
conveying not just questions and statements, but a variety of emotions and tones
of voice.
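
The idea of building speech from parameters rather than recordings can be sketched with a static vowel. The code below is a deliberately simplified illustration, not a full formant synthesizer: it sums harmonics of a fundamental frequency and weights each one by how close it lies to a formant. The resonance formula, the 100 Hz bandwidth and the formant values (rough textbook figures for an "ah"-like vowel) are assumptions made for this example.

```python
import math


def formant_gain(freq, formants, bandwidth=100.0):
    """Sum of simple resonance peaks centered on each formant frequency."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)


def synthesize_vowel(f0, formants, duration=0.2, rate=8000):
    """Additive synthesis of a steady vowel, normalized to the range [-1, 1]."""
    n = round(duration * rate)
    # Harmonics of the fundamental up to the Nyquist frequency.
    harmonics = [k * f0 for k in range(1, int((rate / 2) // f0) + 1)]
    weights = [formant_gain(h, formants) for h in harmonics]
    samples = []
    for i in range(n):
        t = i / rate
        samples.append(sum(w * math.sin(2 * math.pi * h * t)
                           for h, w in zip(harmonics, weights)))
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples]


vowel = synthesize_vowel(120.0, [730.0, 1090.0, 2440.0])
print(len(vowel), "samples")
```

Changing `f0` over time would alter the pitch contour and changing the formant list would alter the vowel quality, which is the kind of control over the output that the paragraph above describes. The samples could be written to a file with Python's standard `wave` module.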
Examples of non-real-time but highly accurate intonation control in formant
synthesis include the work done in the late 1970s for the Texas
Instruments toy Speak & Spell, in the early 1980s for Sega arcade
machines,[24] and in many Atari, Inc. arcade games[25] using the TMS5220 LPC chips.
Creating proper intonation for these projects was painstaking, and the results
have yet to be matched by real-time text-to-speech interfaces.[26]
Articulatory synthesis
Articulatory synthesis refers to computational techniques for synthesizing speech
based on models of the human vocal tract and the articulation processes
occurring there. The first articulatory synthesizer regularly used for laboratory
experiments was developed at Haskins Laboratories in the mid-1970s by Philip
Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was
based on vocal tract models developed at Bell Laboratories in the 1960s and
1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models have not been incorporated into
commercial speech synthesis systems. A notable exception is the NeXT-based
system originally developed and marketed by Trillium Sound Research, a spin-off
company of the University of Calgary, where much of the original research was
conducted. Following the demise of the various incarnations of NeXT (started
by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the
Trillium software was published under the GNU General Public License, with
work continuing as gnuspeech.
The system, first marketed in 1994, provides full articulatory-based
text-to-speech conversion using a waveguide or transmission-line analog of the human
oral and nasal tracts controlled by Carré's "distinctive region model".
HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models,
also called Statistical Parametric Synthesis. In this system, the frequency
spectrum (vocal tract), fundamental frequency (vocal source), and duration
(prosody) of speech are modeled simultaneously by HMMs.
Speech waveforms are generated from HMMs themselves based on
the maximum likelihood criterion.[27]
Sine wave synthesis
Sine wave synthesis is a technique for synthesizing speech by replacing
the formants (main bands of energy) with pure tone whistles.[28]
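As a rough illustration of the idea, the sketch below renders a sequence of formant frames as a sum of pure tones. The frame layout, the function name and the default sample rate are assumptions made for this example only; they are not taken from any particular synthesizer.

```python
import math

def sine_wave_speech(formant_tracks, sample_rate=8000, frame_seconds=0.01):
    """Render formant tracks as a sum of pure tones.

    formant_tracks is a list of frames; each frame is a list of
    (frequency_hz, amplitude) pairs, one pair per formant.
    """
    if not formant_tracks:
        return []
    samples_per_frame = round(sample_rate * frame_seconds)
    # One running phase per formant keeps tones continuous across frames.
    phases = [0.0] * max(len(frame) for frame in formant_tracks)
    samples = []
    for frame in formant_tracks:
        for _ in range(samples_per_frame):
            value = 0.0
            for i, (freq, amp) in enumerate(frame):
                phases[i] += 2.0 * math.pi * freq / sample_rate
                value += amp * math.sin(phases[i])
            samples.append(value)
    return samples

# Ten frames of a steady vowel-like sound with three formants.
frames = [[(500.0, 0.5), (1500.0, 0.3), (2500.0, 0.2)]] * 10
audio = sine_wave_speech(frames)
```

Each formant is replaced by a single sinusoid whose frequency follows the formant track, which is why the result sounds like whistles rather than natural speech.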
Challenges
Text normalization challenges
The process of normalizing text is rarely straightforward. Texts are full
of heteronyms, numbers, and abbreviations that all require expansion into a
phonetic representation.
There are many spellings in English which are pronounced differently based on
context. For example, "My latest project is to learn how to better project my
voice" contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of
their input texts, as processes for doing so are not reliable, well understood, or
computationally effective. As a result, various heuristic techniques are used to
guess the proper way to disambiguate homographs, like examining neighboring
words and using statistics about frequency of occurrence.
Recently TTS systems have begun to use HMMs (discussed above) to generate
"parts of speech" to aid in disambiguating homographs. This technique is quite
successful for many cases such as whether "read" should be pronounced as
"red" implying past tense, or as "reed" implying present tense. Typical error rates
when using HMMs in this fashion are usually below five percent. These
techniques also work well for most European languages, although access to
required training corpora is frequently difficult in these languages.
Deciding how to convert numbers is another problem that TTS systems have to
address. It is a simple programming challenge to convert a number into words (at
least in English), like "1325" becoming "one thousand three hundred twenty-five."
However, numbers occur in many different contexts; "1325" may also be read as
"one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five".
A TTS system can often infer how to expand a number based on surrounding
words, numbers, and punctuation, and sometimes the system provides a way to
specify the context if it is ambiguous.[29] Roman numerals can also be read
differently depending on context. For example "Henry VIII" reads as "Henry the
Eighth", while "Chapter VIII" reads as "Chapter Eight".
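The different readings of "1325" mentioned above can be sketched as three small functions. This is a minimal example that assumes numbers from 0 to 9999 and English wording; the function names are hypothetical, and choosing which reading applies in a given context is left out.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def cardinal(n):
    """Words for 0..9999 read as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if low or not parts:
        parts.append(two_digits(low))
    return " ".join(parts)

def digit_by_digit(text):
    """Read each digit separately, as for a code or phone number."""
    return " ".join(ONES[int(ch)] for ch in text)

def in_pairs(text):
    """Read four digits as two two-digit groups, as for many years."""
    return f"{two_digits(int(text[:2]))} {two_digits(int(text[2:]))}"
```

A real front end would pick one of these readings from the surrounding words; the sketch also ignores special cases such as "nineteen oh five".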
Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for
"inches" must be differentiated from the word "in", and the address "12 St John
St." uses the same abbreviation for both "Saint" and "Street". TTS systems with
intelligent front ends can make educated guesses about ambiguous
abbreviations, while others provide the same result in all cases, resulting in
nonsensical (and sometimes comical) outputs.
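One simple "educated guess" for the "St" example is to look at the following word. The heuristic below is an assumption made for illustration (a capitalised word after "St" suggests "Saint", anything else suggests "Street") and will be wrong in some cases, for instance when "St." ends a sentence.

```python
def expand_st(text):
    """Expand 'St' or 'St.' as 'Saint' or 'Street' from the next word."""
    tokens = text.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token in ("St", "St."):
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # A capitalised name after the abbreviation suggests "Saint".
            expanded.append("Saint" if following[:1].isupper() else "Street")
        else:
            expanded.append(token)
    return " ".join(expanded)
```

With this rule, "12 St John St." is read as "12 Saint John Street".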
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the
pronunciation of a word based on its spelling, a process which is often called
text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term
used by linguists to describe distinctive sounds in a language). The simplest
approach to text-to-phoneme conversion is the dictionary-based approach, where
a large dictionary containing all the words of a language and their correct
pronunciations is stored by the program. Determining the correct pronunciation of
each word is a matter of looking up each word in the dictionary and replacing the
spelling with the pronunciation specified in the dictionary.
The other approach is rule-based, in which pronunciation rules are applied to
words to determine their pronunciations based on their spellings. This is similar
to the "sounding out", or synthetic phonics, approach to learning reading.
Each approach has advantages and drawbacks. The dictionary-based approach
is quick and accurate, but completely fails if it is given a word which is not in its
dictionary. As dictionary size grows, so too do the memory space
requirements of the synthesis system. On the other hand, the rule-based
approach works on any input, but the complexity of the rules grows substantially
as the system takes into account irregular spellings or pronunciations. (Consider
that the word "of" is very common in English, yet is the only word in which the
letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use
a combination of these approaches.
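The combined approach can be sketched as a dictionary lookup with a rule-based fallback. The lexicon entries, the phoneme labels and the letter-to-sound rules below are toy assumptions chosen for this example; a real system would use a full pronunciation dictionary and far richer rules.

```python
# Toy pronunciation dictionary for irregular or common words (hypothetical).
LEXICON = {
    "of": ["AH", "V"],
    "the": ["DH", "AH"],
}

# Toy letter-to-sound rules (hypothetical): digraphs first, then letters.
DIGRAPHS = {"ch": "CH", "sh": "SH", "th": "TH", "ph": "F",
            "ee": "IY", "oo": "UW"}
LETTERS = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
           "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
           "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
           "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
           "y": "Y", "z": "Z"}

def rule_based(word):
    """Apply the toy rules left to right, preferring two-letter matches."""
    phonemes, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in LETTERS:
            phonemes.extend(LETTERS[word[i]].split())
            i += 1
        else:
            i += 1  # skip characters the toy rules do not cover
    return phonemes

def to_phonemes(word):
    """Look the word up first; fall back to the rules if it is missing."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return rule_based(word)
```

The word "of" shows why both parts are needed: the rules alone give an [f] sound, while the dictionary entry supplies the irregular [v].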
Languages with a phonemic orthography have a very regular writing system, and
the prediction of the pronunciation of words based on their spellings is quite
successful. Speech synthesis systems for such languages often use the rule-
based method extensively, resorting to dictionaries only for those few words, like
foreign names and borrowings, whose pronunciations are not obvious from their
spellings. On the other hand, speech synthesis systems for languages
like English, which have extremely irregular spelling systems, are more likely to
rely on dictionaries, and to use rule-based methods only for unusual words, or
words that aren't in their dictionaries.
Evaluation challenges
The consistent evaluation of speech synthesis systems may be difficult because
of a lack of universally agreed objective evaluation criteria. Different
organizations often use different speech data. The quality of speech synthesis
systems also depends to a large degree on the quality of the production
technique (which may involve analogue or digital recording) and on the facilities
used to replay the speech. Evaluating speech synthesis systems has therefore
often been compromised by differences between production techniques and
replay facilities.
Recently, however, some researchers have started to evaluate speech synthesis
systems using a common speech dataset.[30]
Prosodics and emotional content
A study published in the journal "Speech Communication" by Amy
Drahota and colleagues at the University of Portsmouth, UK, reported that
listeners to voice recordings could determine, at better than chance levels,
whether or not the speaker was smiling.[31] It was suggested that identification of
the vocal features which signal emotional content may be used to help make
synthesized speech sound more natural.
Dedicated hardware
• Votrax
  • SC-01A (analog formant)
  • SC-02 / SSI-263 / "Arctic 263"
• General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
• Magnevation SpeakJet (www.speechchips.com TTS256)
• Savage Innovations SoundGin
• National Semiconductor DT1050 Digitalker (Mozer)
• Silicon Systems SSI 263 (analog formant)
• Texas Instruments LPC Speech Chips
  • TMS5110A
  • TMS5200
• Oki Semiconductor
  • ML22825 (ADPCM)
  • ML22573 (HQADPCM)
• Toshiba T6721A
• Philips PCF8200
• TextSpeak Embedded TTS Modules
Computer operating systems or outlets with speech synthesis
Atari
Arguably, the first speech system integrated into an operating system was the
1400XL/1450XL personal computers designed by Atari, Inc. using the Votrax
SC01 chip in 1983. The 1400XL/1450XL computers used a Finite State Machine
to enable World English Spelling text-to-speech synthesis.[32] Unfortunately, the
1400XL/1450XL personal computers never shipped in quantity.
The Atari ST computers were sold with "stspeech.tos" on floppy disk.
Apple
The first speech system integrated into an operating system that shipped in
quantity was Apple Computer's MacInTalk in 1984. Since the 1980s, Macintosh
computers have offered text-to-speech capabilities through the MacInTalk software. In
the early 1990s Apple expanded its capabilities offering system wide text-to-
speech support. With the introduction of faster PowerPC-based computers they
included higher quality voice sampling. Apple also introduced speech
recognition into its systems which provided a fluid command set. More recently,
Apple has added sample-based voices. Starting as a curiosity, the speech
system of Apple Macintosh has evolved into a fully-supported
program, PlainTalk, for people with vision problems. VoiceOver was for the first
time featured in Mac OS X Tiger (10.4).
During 10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one
standard voice shipping with Mac OS X. Starting with 10.6 (Snow Leopard), the
user can choose from a wide range of voices. VoiceOver voices
feature the taking of realistic-sounding breaths between sentences, as well as
improved clarity at high read rates over PlainTalk. Mac OS X also includes say,
a command-line based application that converts text to audible speech.
The AppleScript Standard Additions includes a say verb that allows a script to
use any of the installed voices and to control the pitch, speaking rate and
modulation of the spoken text.
AmigaOS
The second operating system with advanced speech synthesis capabilities
was AmigaOS, introduced in 1985. The voice synthesis was licensed
by Commodore International from a third-party software house (Don't Ask
Software, now Softvoice, Inc.) and it featured a complete system of voice
emulation, with both male and female voices and "stress" indicator markers,
made possible by advanced features of the Amiga hardware audio chipset.[33] It
was divided into a narrator device and a translator library. Amiga Speak
Handler featured a text-to-speech translator. AmigaOS considered speech
synthesis a virtual hardware device, so the user could even redirect console
output to it. Some Amiga programs, such as word processors, made extensive
use of the speech system.
Microsoft Windows
Modern Windows systems use SAPI4- and SAPI5-based speech systems that
include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based
operating systems as a third-party add-on for systems like Windows
95 and Windows 98. Windows 2000 added a speech synthesis program
called Narrator, directly available to users. All Windows-compatible programs
could make use of speech synthesis features, available through menus once
installed on the system. Microsoft Speech Server is a complete package for voice
synthesis and recognition, for commercial applications such as call centers.
Text-to-speech (TTS) capability refers to the ability of the operating system to play
back printed text as spoken words.[34]
A TTS engine, a driver installed with the operating system, recognizes the text and
speaks it using a synthesized voice chosen from several pre-generated voices.
Additional engines, often built around a particular jargon or vocabulary, are also
available from third-party manufacturers.[34]
Android
Version 1.6 of Android added support for speech synthesis (TTS).[35]
Internet
The most recent TTS development in the web browser is the JavaScript Text to
Speech work of Yury Delendik, which ports the Flite C engine to pure JavaScript.
This allows web pages to convert text to audio using HTML5 technology. The
ability to use Yury's TTS port currently requires a custom browser build that uses
Mozilla's Audio-Data-API. However, much work is being done in the context of
the W3C to move this technology into the mainstream browser market through
the W3C Audio Incubator Group with the involvement of The BBC and Google
Inc.
Currently, there are a number of applications, plug-ins and gadgets that can
read messages directly from an e-mail client and web pages from a web
browser or Google Toolbar such as voice which is an add-on to Firefox. Some
specialized software can narrate RSS-feeds. On one hand, online RSS-narrators
simplify information delivery by allowing users to listen to their favorite news
sources and to convert them to podcasts. On the other hand, on-line RSS-readers
are available on almost any PC connected to the Internet. Users can
download generated audio files to portable devices, e.g. with the help
of a podcast receiver, and listen to them while walking, jogging or commuting to
work.
A growing field in internet-based TTS is web-based assistive technology, e.g.
'Browsealoud' from a UK company and ReadSpeaker. It can deliver TTS
functionality to anyone (for reasons of accessibility, convenience, entertainment
or information) with access to a web browser. The non-profit
project Pediaphon was created in 2006 to provide a similar web-based TTS
interface to the Wikipedia.[36] Additionally, SPEAK.TO.ME from Oxford Information
Laboratories is capable of delivering text to speech through any browser without
the need to download any special applications, and includes smart delivery
technology to ensure only what is seen is spoken and the content is logically
pathed.
Others
Some models of Texas Instruments home computers produced in 1979
and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-
to-phoneme synthesis or reciting complete words and phrases (text-to-
dictionary), using a very popular Speech Synthesizer peripheral.
TI used a proprietary codec to embed complete spoken phrases into
applications, primarily video games.[37]
IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.
Systems that operate on free and open source software systems
including Linux are various, and include open-source programs such as
the Festival Speech Synthesis System which uses diphone-based
synthesis (and can use a limited number of MBROLA voices),
and gnuspeech which uses articulatory synthesis[38] from the Free
Software Foundation.
Companies which developed speech synthesis systems but which are no
longer in this business include BeST Speech (bought by L&H), Eloquent
Technology (bought by SpeechWorks), Lernout & Hauspie (bought by
Nuance), SpeechWorks (bought by Nuance), Rhetorical Systems (bought
by Nuance).
Speech synthesis markup languages
A number of markup languages have been established for the rendition of text as
speech in an XML-compliant format. The most recent is Speech Synthesis
Markup Language (SSML), which became a W3C recommendation in 2004.
Older speech synthesis markup languages include Java Speech Markup
Language (JSML) and SABLE. Although each of these was proposed as a
standard, none of them has been widely adopted.
Speech synthesis markup languages are distinguished from dialogue markup
languages. VoiceXML, for example, includes tags related to speech recognition,
dialogue management and touchtone dialing, in addition to text-to-speech
markup.
Applications
Speech synthesis has long been a vital assistive technology tool and its
application in this area is significant and widespread. It allows environmental
barriers to be removed for people with a wide range of disabilities. The longest-standing
application has been in the use of screen readers for people with visual
impairment, but text-to-speech systems are now commonly used by people
with dyslexia and other reading difficulties as well as by pre-literate children.
They are also frequently employed to aid those with severe speech
impairment usually through a dedicated voice output communication aid.
Sites such as Ananova and YAKiToMe! have used speech synthesis to convert
written news to audio content, which can be used for mobile applications.
Speech synthesis techniques are also used in entertainment productions
such as games and anime. In 2007, Animo Limited announced the
development of a software application package based on its speech synthesis
software FineSpeech, explicitly geared towards customers in the entertainment
industries, able to generate narration and lines of dialogue according to user
specifications.[39]
The application reached maturity in 2008, when NEC Biglobe announced a web
service that allows users to create phrases from the voices of Code Geass:
Lelouch of the Rebellion R2 characters.[40]
TTS applications such as YAKiToMe! and Speakonia are often used to add
synthetic voices to YouTube videos for comedic effect, as in Barney Bunch
videos. YAKiToMe! is also used to convert entire books for personal podcasting
purposes, RSS feeds and web pages for news stories, and educational texts for
enhanced learning.
Software such as Vocaloid can generate singing voices via lyrics and melody.
This is also the aim of the Singing Computer project (which uses GNU
LilyPond and Festival) to help blind people check their lyric input.[41]
Besides these applications, text-to-speech software is also popular
in Interactive Voice Response systems, often in combination with speech
recognition. Examples of such voices can be found
at speechsynthesissoftware.com or Nextup.
C Sharp (programming language)
C# (pronounced "see sharp")[6] is a multi-paradigm programming
language encompassing imperative, declarative, functional, generic, object-oriented
(class-based), and component-oriented programming disciplines. It was
developed by Microsoft within the .NET initiative and later approved as a
standard by Ecma (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the
programming languages designed for the Common Language Infrastructure.
C# is intended to be a simple, modern, general-purpose, object-oriented
programming language.[7] Its development team is led by Anders Hejlsberg. The
most recent version is C# 4.0, which was released on April 12, 2010.
Microsoft Visual Studio
Microsoft Visual Studio is an integrated development environment (IDE)
from Microsoft. It can be used to develop console and graphical user
interface applications along with Windows Forms applications, web sites, web
applications, and web services in both native code and managed
code for all platforms supported by Microsoft Windows, Windows
Mobile, Windows CE, .NET Framework, .NET Compact Framework and Microsoft
Silverlight.
Visual Studio includes a code editor supporting IntelliSense as well as code
refactoring. The integrated debugger works both as a source-level debugger and
a machine-level debugger. Other built-in tools include a forms designer for
building GUI applications, web designer, class designer, and database
schema designer. It accepts plug-ins that enhance the functionality at almost
every level, including adding support for source-control systems
(like Subversion and Visual SourceSafe) and adding new toolsets like editors and
visual designers for domain-specific languages or toolsets for other aspects of
the software development lifecycle (like the Team Foundation Server client:
Team Explorer).
Visual Studio supports different programming languages by means of language
services, which allow the code editor and debugger to support (to varying
degrees) nearly any programming language, provided a language-specific
service exists. Built-in languages include C/C++ (via Visual C++),
VB.NET (via Visual Basic .NET), C# (via Visual C#), and F# (as of Visual
Studio 2010[2]). Support for other languages such as M, Python, and Ruby among
others is available via language services installed separately. It also
supports XML/XSLT, HTML/XHTML, JavaScript and CSS. Individual language-specific
versions of Visual Studio also exist which provide more limited language
services to the user: Microsoft Visual Basic, Visual J#, Visual C#, and Visual C++.
Microsoft provides "Express" editions of its Visual Studio 2010 components
Visual Basic, Visual C#, Visual C++, and Visual Web Developer at no cost.
Visual Studio 2010, 2008 and 2005 Professional Editions, along with language-specific
versions (Visual Basic, C++, C#, J#) of Visual Studio 2005 are available
for free to students as downloads via Microsoft's DreamSpark program. The 90-day
trial version of Visual Studio can be downloaded by the general public at no
cost.
Text-to-Speech Technology-Based Programming Tool
Introduction
According to the World Health Organization (WHO), an estimated 40 to
45 million people worldwide are blind and 135 million have low vision [1]. In Australia, over
480,000 people are vision impaired in both eyes, while over 50,000 are blind.
This number is expected to increase to more than 87,000 people within 20 years
[2]. Currently, there are screen reader tools such as JAWS [3], Brailliant Braille
[4] and Window-Eyes Screen Reader [5].
However, the costs of these tools are high and no tool integrates
an environment for compiling and debugging programs. Furthermore, there is
not enough assistance to help blind and vision impaired people learn to program
in the leading-edge language C#. Blind programmers could compete in the IT industry when
infrastructure was more mainframe-oriented [6]. These days, with computers throughout
the workplace, graphical windows applications are far more common. This
means that blind programmers are now at a competitive disadvantage in the
workplace and require special tools to be productive.
Blind and vision impaired people require two things to become programmers.
They need up to date knowledge of leading technology, and tools that meet
their own requirements [7]. This affects employment levels for blind and low
vision people. With the current unemployment rate for blind and vision impaired people
at almost 70%, which is over four times the national average, specialized tools
could help a great many people [8]. Our research project is to design an audio
programming tool that meets the specific needs of blind and vision impaired people
in learning the C# programming language.
There are different forms of visual impairment: some people are blind from birth
or from a very early age, while others lose their sight as a result of accidents, disease
or some effects of medication [10]. Therefore we concentrate on text-to-speech
technology and we assume that blind and vision impaired people are not hearing
impaired. The text-to-speech technology is used to make all components in the
programming tool voice enabled. Text and other graphics features such as
control size, location, and color that a normal vision user can see on the screen
will be spoken out by a speech synthesizer.
This tool has opened a great possibility that allows blind and vision impaired
users to become programmers in the future. Currently, blind and vision impaired
people have little access to current tools and assistance required for them to
learn programming languages. Our aim is to help them achieve equality of
access and opportunity in information technology education that will ensure
meaningful and equitable employment for their lives.
We have invited blind and vision impaired people to evaluate our programming
tool. Evaluations have shown that the tool can help them design and implement
programs effectively. Our research project can potentially impact the lives of blind
and low vision people. This coupled with the impending labor shortage, as the
baby boomers retire, means that anything that can give blind people an
opportunity to acquire practical, technical qualifications could greatly benefit blind
people and the whole economy. A tool that teaches programming is also a
programming tool and it can potentially give jobs to people who were previously
unemployable. Our research project will also impact software development
companies, governments, and educational institutions to develop software
packages, educational programs and policies that meet the needs of blind and
vision impaired people.
Current Applications and Projects
Optical character recognition and text-to-speech technologies are currently used
in software applications for blind and vision impaired people. The first application
is for reading books or newspapers. The optical character recognition
technology is applied to scanners that scan text and read it aloud. Typical
devices for this application are Extreme Reader [11], Ovation and SARA
(Scanning and Reading Appliance) [12], which give blind users access to printed
and electronic materials.
These materials are converted from text to speech and read aloud. The Kurzweil system scans
documents, stores them in files, and converts them to audio output [13]. Furthermore,
Optical Braille Recognition (OBR) allows a user to scan a Braille page and
convert it into text [14]. This is a Windows software application that retrieves
information that can be presented as the text used in all types of Windows
applications. The Braille information in a small letter can be retrieved into
computer form in the same easy way. For reading text materials on a computer, the
most popular software for blind users is JAWS [3].
This software provides speech and Braille access to Windows operating system
and applications including Internet Explorer without the need of special
configurations. JAWS also provides a way to access Web pages. A research
project has been undertaken by Curtin University, Cisco Systems and the
Western Australia Association [10].
The project is to identify tools and techniques appropriate for vision impaired
students to study computing at tertiary level. The improvements it recommends
include professional development for lecturers and
improved student access to electronic educational materials.
A computer education project recognized by the Stockholm Challenge [15] aims to
reduce the digital divide by providing, in digital format, education and learning tools
that are not available to the blind in Vietnam on paper, such as school books,
newspapers and reading material.
This project aims to create a generation of blind computer users at different levels
nationwide, and to provide a community place to acquire computer skills and
share information. However, there is no existing software application designed to
help blind and vision impaired people learn programming subjects in information
technology and engineering. This motivates us to design and implement a simple
yet efficient programming tool for blind and vision impaired users to develop
software applications. In the next section, we will present our proposed
programming tool and show how we can implement it. Testing and evaluation are
also presented.
Proposed Audio Programming Tool
It is seen that the more formats of material people can access, the higher their
employment opportunities are. There is a higher need for technical skills amongst
people who are blind or have low vision. Blind people require supporting tools
that meet their specific needs. The programming tool is designed not only for
blind users but also for vision impaired and normal vision users. The interface
should be designed in a way that complies with W3C standards for vision impaired
users and should be user friendly. The programming tool should be able to help a
blind user edit, save, compile, debug and run a program. Moreover, the tool
should have program templates and IntelliSense (auto-completion) options for
user convenience. In order to achieve these objectives, an iterative approach
was used. Each part was developed, tested then improved upon and tested
again.
This meant that usability issues were always found and improved. The tool has
been designed to provide voice for blind users and display suitable font,
font sizes and color scheme for vision impaired and normal vision people.
3.1 Audio Code Editor
A user starts editing a program or loading an existing program using the audio code
editor. The program in the editor can be saved to a file or can be compiled,
debugged and run. For each character entered, the code editor can speak it out.
The user can use the left, right, up and down arrow keys to check any character in the
program by voice. Some of the key requirements for the code editor are as follows:
• Tell the user whenever it is loaded or activated.
• Ask for the user’s confirmation before closing, saving a file or opening a file.
• Tell the user the current line number.
• Provide an option for the user to specify a line number and go to that line.
• Provide templates created in advance for every Console application and Windows
application.
• Speak all characters on a line of code.
• For Windows applications, let the user design the graphical user interface by
typing details (size, location, text, name, etc.) in the code editor. The code editor
will convert the details to C# code and place the code in a file.
• Allow the user to write C# code for event handlers.
• Help the user write code quickly and correctly by speaking out properties,
classes, etc.
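The requirements to speak each character and to announce the current line number can be sketched as follows. The spoken labels and function names are assumptions made for this example; the sketch only builds the text that would be passed to a speech synthesizer and does not call any speech engine.

```python
# Spoken names for characters that are hard to hear (illustrative choice).
SPOKEN = {
    " ": "space", ";": "semicolon", ".": "dot", ",": "comma",
    "=": "equals", '"': "quote",
    "(": "open parenthesis", ")": "close parenthesis",
    "{": "open brace", "}": "close brace",
    "[": "open bracket", "]": "close bracket",
}

def spoken_characters(line):
    """Return one spoken label per character of a line of code."""
    labels = []
    for ch in line:
        if ch in SPOKEN:
            labels.append(SPOKEN[ch])
        elif ch.isupper():
            # Case matters in C#, so capitals are announced explicitly.
            labels.append(f"capital {ch.lower()}")
        else:
            labels.append(ch)
    return labels

def announce_line(number, line):
    """Build the phrase for the current line number and its characters."""
    return f"line {number}: " + ", ".join(spoken_characters(line))
```

For example, announce_line(3, "a;") yields "line 3: a, semicolon", which the editor could then hand to the text-to-speech engine.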
3.2 Audio Compiler and Debugger
The code compiler uses the C# software development toolkit (SDK) to compile
the program. However, to produce voice output, a code modifier first adds the
corresponding voice code to the current program, and the C# SDK then compiles the
modified program. For a Console application, voice code is added by
identifying the code for text output and adding voice code for it.
For a Windows program, adding code for voice is more complex. Mouse and key
event handlers are added so that the user can use the mouse or keyboard to design a
Windows form. Voice is output when a control on the Windows form is
focused to let the user know what the control is. The compiler also lets the user
know if the compilation is successful or if there is a compiling error.
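For the Console case, the code modifier step might look like the sketch below. It is only an illustration under stated assumptions: it handles Console.Write and Console.WriteLine calls that sit alone on one line, and Voice.Speak is a hypothetical helper (assumed to accept the same arguments) rather than a real API. The actual tool may work quite differently.

```python
import re

# Matches a whole line of the form: <indent>Console.Write[Line](<args>);
WRITE_CALL = re.compile(r"^(\s*)Console\.Write(?:Line)?\((.*)\);\s*$")

def add_voice(source, speak="Voice.Speak"):
    """Insert a voice call after each single-line console text output."""
    modified = []
    for line in source.splitlines():
        modified.append(line)
        match = WRITE_CALL.match(line)
        if match and match.group(2).strip():
            indent, args = match.groups()
            # Voice.Speak is a hypothetical helper, not a real library call.
            modified.append(f"{indent}{speak}({args});")
    return "\n".join(modified)

program = '    Console.WriteLine("Hello");\n    int x = 1;'
voiced = add_voice(program)
```

The modified source, not the original, is what would then be passed to the C# compiler, so the user's own file stays free of voice code.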
When there is a compiling error, it tells the user that there are compiling
errors, then reads out the details of all the errors, with the file name and line number. If
the user presses predefined shortcut keys, it stops reading, jumps to that line in
that file and reads that line to the user. The user can then fix the code and
press the key combination to hear the next error, if any.
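The error-reading step can be sketched as parsing the compiler output into file, line and message, then building phrases to be spoken. The sketch assumes diagnostics in the usual C# compiler form, for example "Program.cs(12,5): error CS1002: ; expected"; the function names and the wording of the phrases are illustrative.

```python
import re

DIAGNOSTIC = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\):\s+"
    r"(?P<kind>error|warning)\s+(?P<code>CS\d+):\s+(?P<message>.*)$"
)

def parse_errors(compiler_output):
    """Collect the errors (not warnings) found in the compiler output."""
    errors = []
    for raw in compiler_output.splitlines():
        match = DIAGNOSTIC.match(raw.strip())
        if match and match.group("kind") == "error":
            errors.append({
                "file": match.group("file"),
                "line": int(match.group("line")),
                "message": match.group("message"),
            })
    return errors

def spoken_summary(errors):
    """Build the phrases to be read out after a compilation."""
    if not errors:
        return ["Compilation succeeded."]
    phrases = [f"Compilation failed with {len(errors)} error(s)."]
    for error in errors:
        phrases.append(
            f"{error['file']}, line {error['line']}: {error['message']}")
    return phrases

output = ("Program.cs(12,5): error CS1002: ; expected\n"
          "Program.cs(3,1): warning CS0168: unused variable")
errors = parse_errors(output)
```

Keeping the line number as an integer lets the editor jump straight to that line when the user presses the shortcut key.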
3.3 Audio Output
The code compiler uses the C# software development toolkit (SDK) to compile
the program. However, to have voice output, we add code for voice accordingly
to the program before it is compiled. This is done for any program that provides
non-graphics or graphics output. Mouse or key event handlers will be added to
provide audio output when the user moves the mouse over a control or presses
the Tab key to focus on that control.
3.4 System Architecture
Figure 1 presents the architectural design of the audio programming tool. C# and
text-to-speech software development toolkits (SDK) are used. The user can start a
new project by choosing a template from a list of available templates. If the project
is a Windows application, then the user can use the built-in GUI builder to create
Windows controls by entering property values such as location, name, text, size,
etc. When the user writes code, the built-in code auto-completer will help the user
write long class or method names.
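The GUI builder step, turning typed property values into C# code, can be sketched as a small generator. The function name and parameter layout are assumptions for this example; the emitted statements use standard Windows Forms members (Name, Text, Location, Size), but the code the actual tool generates is not shown in this report.

```python
def control_to_csharp(kind, name, text, location, size):
    """Emit C# statements that create one Windows Forms control."""
    x, y = location
    width, height = size
    # Escape characters that would break a C# string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        f"this.{name} = new System.Windows.Forms.{kind}();",
        f'this.{name}.Name = "{name}";',
        f'this.{name}.Text = "{escaped}";',
        f"this.{name}.Location = new System.Drawing.Point({x}, {y});",
        f"this.{name}.Size = new System.Drawing.Size({width}, {height});",
        f"this.Controls.Add(this.{name});",
    ])

code = control_to_csharp("Button", "okButton", "OK", (10, 20), (75, 23))
```

Because the user only types values such as name, text, location and size, the layout can be described entirely from the keyboard without a visual designer.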
When the user finishes the program and wants to compile and run it, the compiler
will analyze the program and add code to produce voice accordingly. The
modified program will be compiled and debugged. Any errors will be output to a
file, and the speech SDK will read out one error at a time and guide the user to the
line of code that contains the error in the program. This procedure is
repeated until there is no error in the program, at which point the C# SDK runs it. Voice
and text or graphics will be output and the user can use the mouse or shortcut keys
to check the outputs.
Note that if the blind user saves the project to files and runs it in the normal
Visual Studio .NET, the output will be text or graphics only. Voice output is only
available if the user runs the project in the audio Studio.NET.
4 Testing and Evaluation
The proposed audio programming tool has been tested and evaluated by normal
vision users then by blind and vision impaired users. In the first test, normal
vision users were required not to watch the computer monitor when they tested
the programming tool. It was observed that they were able to do all stages in
writing a program by listening to voice output from the tool. In the second
test, standard keyboards and built-in text-to-speech tools were used. We found
that vision impaired and blind users were also able to perform the same task.
However, vision impaired users were interested in applications with a mouse and
blind users preferred those with a keyboard. Most blind and vision impaired people
are familiar with shortcut keys defined in JAWS, so adding new shortcut keys in
the programming tool is not recommended. Shortcut keys have been changed to
meet their specific needs. More programming lessons need to be provided to
help users become familiar with programming in .NET.
5 Conclusion
We have presented our design and implementation of an audio programming tool
for blind and vision impaired people to learn programming in C#, a .NET
language. The programming tool was designed not only for blind and vision
impaired users but also for normal vision users. The programming tool was able
to help a blind user edit, save, compile, debug and run a program.
The tool also provides program templates and auto-completion options for
user convenience.
The tool opens up the possibility for blind and vision impaired users to become
programmers in the future and to achieve equality of access and opportunity in
information technology education, which in turn supports meaningful and
equitable employment.
References:
[1] World Health Organization (2003). Retrieved from
http://www.who.int/mediacentre/news/releases/2003/pr73/en/
[2] Access Economics (2004) Clear Insight: The
Economic Impact and Cost of Vision Loss in
Australia http://www.bca.org.au/natpol/statistics/
[3] JAWS (2007), retrieved from the following site
http://www.freedomscientific.com/fs_products/software_jaws.asp
[4] Brailliant Braille (2007), retrieved from the site
http://humanware.ca/web/en/p_OP_Brailliant.asp
[5] Window-Eyes Screen Reader
http://www.tandt-consultancy.com/window_eyes.html
[6] Alexander Steve, (1998) Blind programmers face
an uncertain future. Retrieved from the CNN:
http://www.cnn.com/TECH/computing/9811/06/blindprog.idg/index.html
[7] Elkes, J. G. (1982) Designing Software for Blind
Programmers. Public Utilities Commission of
Ohio. Retrieved from an online article repository:
http://delivery.acm.org/10.1145/970000/964173/p15-elkes.pdf?key1=964173&key2=4640659711&coll=GUIDE&dl=GUIDE&CFID=22945606&CFTOKEN=95515984.
[8] Vision Australia (2007) Results and Observations
from Research into Employment Levels in
Australia. Retrieved from the following site
http://www.visionaustralia.org.au/docs/news_events/Employment_Overview.doc.
[9] Kopecek & Jergova (1998) Programming and
visually impaired people, in Proceedings of
ICCHP’98, Wien-Budapest.
[10] Ian Murray and Helen Armstrong (2004) “A
Computing Education Vision for the Sight
Impaired”, in Proceedings of the Sixth Australasian
Computing Education Conference.
[11] Extreme Reader, retrieved from the following web
site http://www.brailler.com/extrdr.htm
[12] Ovation and SARA, retrieved from the web site
http://www.abledata.com/abledata.cfm?pageid=19327&ksectionid=19327&top=13293
[13] Kurzweil Education System. Retrieved from the
web site http://www.kurzweiledu.com/
[14] Optical Braille Recognition, retrieved from the web
site http://www.neovision.cz/prods/obr/
[15] Computer Education for blind people. Retrieved
http://www.stockholmchallenge.se/data/computer_
education_and_it