American Journal of Computer Science and Engineering 2018; 5(4): 88-100
http://www.openscienceonline.com/journal/ajcse
The Processing Analysis of Effective Recognition Quality for Arabic Speech Based on Three-Layer Neural Networks
Mohamed Hamed
Faculty of Engineering, Port Said University, Port Said, Egypt
To cite this article Mohamed Hamed. The Processing Analysis of Effective Recognition Quality for Arabic Speech Based on Three-Layer Neural Networks.
American Journal of Computer Science and Engineering. Vol. 5, No. 4, 2018, pp. 88-100.
Received: July 16, 2018; Accepted: July 26, 2018; Published: September 1, 2018
Abstract
This paper studies the recognition rate (RR) for Arabic speech using various techniques based on both supervised and unsupervised learning concepts, and also examines the accuracy of the recognition rate. All networks used in the research are neural networks simulated in software supporting both supervised and unsupervised training. A set of the most common words is selected for the experiments. The recognition concept depends on a vital point: the segmentation of a word (utterance) divides every utterance into a fixed number of segments, while each segment may contain a different number of frame intervals. Several words spoken with various sounds were detected and the recognition rate was computed. The computational time is analyzed; although processing takes several hours on slow processors, this time can be reduced greatly in practical applications with modern high-speed systems. The error that appears during the training phase is tracked and illustrated. The results show a good recognition rate for some words, and the best number of units in the hidden layer of the neural network for Arabic speech recognition is derived both from the number of sweeps in the training phase and from the actual percentage recognition results.
Keywords
Arabic Speech Recognition Quality, Back Propagation, Neural Network, Segmentation Feature, Computational Training Time
1. Introduction
Many English nouns, verbs, and adjectives begin with a
stressed syllable, and listeners exploit this tendency to help
parse the continuous stream of speech into individual words.
However, the acoustic manifestation of stress depends on the
variety of English being spoken [1]. In two visual-world eye-tracking experiments, scientists tested whether Indian English-accented speech causes Canadian English listeners to make stress-based segmentation errors. Participants heard Canadian- or Indian-accented trisyllabic sequences that could be segmented in two ways, depending on the perceived location of stress. They suggested that Indian English-accented speech impairs segmentation in Canadian listeners, and that both accented pitch and other features of the Indian English accent contribute to segmentation difficulties. The findings are interpreted with respect to models of how similarity between two languages impacts the listener's ability to segment words from the speech stream [1].
The perception of British English vowels and consonants by native Saudi Arabic learners of English across a range of proficiency levels has been studied before [2]. Twenty-six participants completed consonant and vowel identification tasks in quiet and noise. The study predicted difficulties with vowel perception and production; participants also recorded vowels embedded in words and read a short story. It was concluded that all learners were better able to identify consonants than vowels in quiet and noise, with more experienced learners outperforming early learners [2].
Although learners were likely able to rely on mapping non-native to native categories when identifying consonants, there was some evidence that they had started to establish new vowel targets. This appeared to start early in learning, but even highly experienced learners continued to find vowels with no direct Arabic counterpart difficult. There is some evidence for a link between perception and production: vowel perception was better in those who had more accurate production. Overall, the results of past work shed light on problematic phonemic contrasts for Arabic learners and suggest that, though learners and trainers may be able to establish new phonetic categories early in learning, other contrasts remain difficult even for highly experienced learners [2].
Computers have now been introduced widely in many applications, even in military fields, which encourages others to adopt such modern techniques. Also, dependence on typing at the computer may consume a long time that must be minimized as much as possible, which has led to many trials, done especially for English [3].
Some studies investigated general recognition questions, such as whether, and to what degree, late bilinguals from divergent backgrounds are comparable to native speakers in the phonetic implementation of tonal targets [4]. These studies also asked whether speakers exhibit general patterns of acquisition irrespective of typological closeness, and whether learners' choice of accent contours and the alignment of the high tone develop in parallel with proficiency. More specifically, the acquisition of the nuclear contour composition of American English (i.e. the pitch accent and boundary tone combination) and its alignment were examined in initial-stressed and final-stressed words produced by Japanese and Spanish late bilingual speakers at varying proficiency levels in American English [4].
This investigation clarified that Spanish speakers were more comparable than Japanese speakers to the native English speakers in the phonological aspect of intonation (choice of pitch accent contour). In terms of peak alignment, the late bilinguals generally tended to realize significantly later alignment than the native speakers, although the precise manifestation of this varied according to the background of the speakers and the stress pattern of the words [4]. On the other hand, only a small number of studies related to Arabic speech can be found; they all help the present work, but more work should be implemented to make the Arabic language easy to treat [5]. New computer applications now appear daily rather than yearly, owing to the concentration of testing effort, and the resulting findings are vast in number, yielding many new applications.
It is unclear whether the association between the visual
attention (VA) span and reading differs across languages.
This relationship in Arabic was studied, where the use of
specific reading strategies depends on the number of
diacritics on words: reading vowelized and non-vowelized
Arabic scripts favor sub-lexical and lexical strategies,
respectively [6]. It was hypothesized that the size of the VA span and its association with reading would differ depending on individual “script preferences.” Children
who were more proficient in reading fully vowelized Arabic
than non-vowelized Arabic (VOW) were compared to
children for whom the opposite was true (NOVOW).
NOVOW children showed a crowding effect in the VA span
task, whereas VOW children did not. Moreover, the crowding
in the VA span task correlated with the reading performance
in the NOVOW group only. This depends on considering
individual differences on the use of reading strategies in
Arabic [6, 7]. Computers are very effective for quick applications, especially automatic control and eye detection, or the fine checks required in industry and elsewhere. Along this line of computer use, the present work is directed towards Arabic speech, covering either spoken or written words [7-9].
2. Problem Formulation
It is well established that, in speech addressed to adults, words are seldom realized in their canonical, or citation, form. For example, the word ‘green’ in the phrase ‘green beans’ can often be realized as ‘greem’ due to English place assimilation, where word-final coronals take on the place of articulation of neighboring velars [10]. In such a situation,
adult listeners readily ‘undo’ the assimilatory process and
perceive the underlying intended lexical form of ‘greem’ (i.e.
they access the lexical representation ‘green’). An interesting
developmental question is how children, with their limited
lexical knowledge, come to cope with phonologically
conditioned connected speech processes such as place
assimilation. Some scientists addressed this issue by examining the occurrence of place assimilation in the input to English-learning 18-month-olds [10]. Perceptual and acoustic
analyses of elicited speech, as well as analysis of a corpus of
spontaneous speech, all converge on the finding that
caregivers do not spoon-feed their children canonical tokens
of words. Rather, infant-directed speech contains just as
many non-canonical realizations of words in place
assimilation contexts as adult-directed speech [10].
Previous studies have shown that native English speakers
outperformed non-native English speakers in perceiving
English speech under quiet and noisy listening conditions.
The difference between native English speakers and native
Chinese speakers on using contextual cues to perceive speech
in quiet and multi-talker babble has been tested [11], where three types of sentences served as the speech stimuli: sentences with high predictability (including both semantic and syntactic cues), sentences with low predictability (including syntactic cues), and sentences with zero predictability (consisting of random sequences of words). These sentences were presented to native-English and native-Chinese listeners in quiet and in four-talker babble at signal-to-noise ratios of 0 and −5 dB. Preliminary results suggested that native Chinese speakers primarily rely on semantic information when perceiving speech in quiet, whereas native English speakers showed greater reliance on syntactic cues when perceiving speech in noisy situations. The difference between native English speakers and native Chinese speakers in exploiting syntactic and semantic information under various listening conditions was discussed [11].
The pairwise variability index (PVI), a rhythm metric that quantifies variability in speech rhythm, was applied to the classification of speech varieties [12]. The technique combines the Particle Swarm Optimization (PSO) algorithm with a generalization of several rhythm metrics that are based on the PVI. The performance of this optimization-oriented classification was compared with classification using conventional (both PVI-based and interval-based) rhythm metrics. It was applied to the classification of native and non-native Arabic speech using data from the West Point Arabic Speech Corpus; the experiments were based on segmental durations and used Support Vector Machine (SVM) classification. Results showed that the optimization-oriented classification provides better discrimination between native and non-native speech varieties than classification based on the conventional rhythm metrics. When added to different combinations of these conventional metrics, the optimization-oriented procedure consistently improved the classification rates [12].
Single channel speech separation (SCSS) is widely used in many real-time applications, such as a preprocessing stage for speech recognition to control humanoid robots, and in hearing aids. The performance of the separation is crucial for these applications. Some researchers proposed an innovative approach for unsupervised SCSS [13]. The separation relies on an optimization of the subspace separation by decomposing the mixed signal into three estimates, namely the sparse subspace, the sub-sparse subspace, and the low-rank subspace. A soft mask is used in the core of the proposed approach for the final decision. The proposed system generates two separated signals of different qualities, provided on two different channels [13].
The channel classification is done using fuzzy logic, which requires two parameters. The first is the quality of the separated signal, based on a non-intrusive metric for speech quality and intelligibility. The second is the gender of the speaker, determined using a tracking algorithm. The evaluation results of the proposed approach were reported and compared to others. The proposed method on average achieves a 67.9% improvement in PESQ, a 59.5% improvement in signal-to-interference ratio (SIR), and a 10.5% improvement in the target-related perceptual score (TPS) versus the benchmark methods [13].
Since neural networks can be simulated on computers, this advanced process has met with immense and continuing interest. The main item in the processing is training the network to find the correct reference for matching the required word, although more modern software may not depend on the supervision principle for these implementations. Thus, both unsupervised and supervised training techniques are explored to reach the most suitable one. In the unsupervised phase, only the training patterns must be registered; the system is then able to classify them according to the previously stored features [8].
Scientists investigated the use of multi-distribution deep neural networks (MD-DNNs) for automatic lexical stress detection and pitch accent detection, which are useful for suprasegmental mispronunciation detection and diagnosis in second-language English speech [14]. The features used
cover syllable-based prosodic features (including maximum
syllable loudness, syllable nucleus duration and a pair of
dynamic pitch values) as well as lexical and syntactic
features (encoded as binary variables). As stressed/accented
syllables are more prominent than their neighbors, the two
preceding and two following syllables were also taken into
consideration. Experimental results showed that the MD-DNN for lexical stress detection achieves an accuracy of 87.9% in syllable classification (primary/secondary/no stress) for words with three or more syllables. This performance is much better than that of previous work using Gaussian mixture models (GMMs) and the prominence model (PM), whose accuracies are 72.1% and 76.3%, respectively. Designed similarly to the lexical stress detector, the pitch accent detector obtained an accuracy of 90.2%, which was better than the results of the GMMs and PM by about 9.6% and 6.9%, respectively [14].
The given research depends on receiving the input word through a special electronic chip before going on to the processing phase, as shown in Figure 1 [15]. This is vital for finding the correct reference word with the help of the neural network, saving the computational time needed for this aim. It should be indicated that the chip used is a standard one, used without any modification, which is why its performance is described below.
The sound card, known as the Sound Blaster PRO (SBPro), is a commercial card with its own software [8]. Its performance and main specifications are listed in Table A1 (see the Appendix); it requires at least a 286 AT computer because of its 16-bit slot, while XT models may suffice for 8-bit cards. Recorded sound can be replayed at the original quality, so all necessary words may be stored and then played back individually or together. The card also permits varying the sound intensity through its volume controls, for either input or output. This card is a suitable module because of the manufacturer's thorough quality checks, which allow us to work with it without further examination and, consequently, to concentrate on the quality of the speech signal in the given frequency range.
Figure 1. A block scheme for the recognition processing system of Arabic
speech.
3. Word Detection
Since time is accounted for in both the silence and the spoken phases, separating them is very important. So a special technique should be supplied or chosen to draw a line separating the silence zone from the spoken one. This process is known as thresholding, since the silent region may not be completely silent due to the presence of noise or other sounds. The thresholding concept must also be suitable for separating the short gaps between words and sentences. The words used in the training and recognition tests are listed in Table A2 (see Appendix) [18].
This means detecting the beginning and end of speech, as well as the start and end moments of each word. Although this is an important technical point from the accuracy point of view, the work here depends on the known technique for this purpose, the Zero Crossing Rate (ZCR). At every moment, the zero-crossing rate of the measured signal is calculated, as shown in Figure 2 for the Arabic word expressing the digit “zero” [9].
Figure 2 presents many zero crossings concentrated together, with a few isolated ones at other intervals. This leads to the finding that the word is located at the time of concentrated ZCR and the noise at the isolated crossings, as illustrated for the actual signal shown in Figure 3, where the word energy is found at the same frame numbers. The others are small enough to be neglected.
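The frame-wise ZCR computation described above can be sketched as follows. The frame length and the sign-change counting rule are illustrative assumptions, since the paper does not state the exact values used.

```python
import numpy as np

def zero_crossing_rate(signal, frame_len=256):
    """Frame-wise zero-crossing count for a 1-D speech signal.

    frame_len is a hypothetical frame size; the paper does not
    state the frame length it used.
    """
    n_frames = len(signal) // frame_len
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        # A zero crossing occurs wherever consecutive samples change sign.
        zcr[i] = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
    return zcr
```

Frames whose count is well above the noise floor would then be taken as belonging to the word, matching the concentrated-ZCR criterion described above.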
Figure 2. The calculated (ZCR) for the word representing "zero".
It must be indicated that thresholding is a main parameter affecting the measurement accuracy in general, since it determines the exact length of a word. For the word tested in Figure 2, this thresholding at the word's start and end points is indicated and, consequently, repeated as illustrated in Figure 3 [15].
Figure 3. Energy signals of the word representing "zero".
3.1. Segmentation
It is necessary to state that a word is represented through the energy of each voice signal passed to the processing, and each word has its own individual length. This means that the word length (utterance) differs from one word to another, creating a new problem that must be treated. An utterance depends on both the speed of speech and the type of the word itself, and for Arabic speech it typically varies from 25 to 75 ms, while the number of segments per utterance would also vary [15].
However, a normalization concept may be needed to overcome the variation in the number of features caused by the different utterance lengths of words, and the known time warping could be used to give a fixed length for each utterance. Since the success of time warping in solving this point represents the target, the word duration, or input word length Wlength, is defined as a function of the number of frames Nframe in the utterance and the number of sections in each frame Nsections. This can be given mathematically by the formula:
Wlength = Nframe × Nsections (1)
Thus, if the number of frames per utterance is kept
constant, the number of features inside each frame will be
changed to keep the word utterance, the input word length, as
a fixed value (unvaried). This technique is used successfully
in the first approach tested in the work [15]. So, the
segmentation process may be a correct way to express the
signals of a word as input to the system to be treated and then
trained by the neural networks.
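The fixed-segment normalization implied by equation (1) can be sketched as below. Averaging the frame feature vectors inside each segment is an assumed reduction rule, and the default of 15 segments follows the value the experiments later find best; `segment_features` is a hypothetical helper name.

```python
import numpy as np

def segment_features(frames, n_segments=15):
    """Collapse a variable-length sequence of frame feature vectors into
    a fixed number of segments (linear time warping).

    Each segment averages the frames falling into its slice, so short
    and long utterances both yield n_segments feature vectors.
    """
    frames = np.asarray(frames, dtype=float)
    # Segment boundaries spread evenly over the whole utterance.
    bounds = np.linspace(0, len(frames), n_segments + 1).astype(int)
    segments = []
    for b, e in zip(bounds[:-1], bounds[1:]):
        if e > b:
            segments.append(frames[b:e].mean(axis=0))
        else:  # degenerate slice for very short utterances
            segments.append(frames[min(b, len(frames) - 1)])
    return np.stack(segments)
```

In this way a 25 ms and a 75 ms utterance both produce the same fixed-size input for the network, as the normalization requires.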
3.2. Algorithms
Three concepts are used and tested [15]. Figure 4 clarifies the first concept through a block diagram that includes back propagation (BP) as a second stage to check all results, even the words escaping from the first stage, the so-called self-organizing map (SOM). The software available is based on a simple neural network consisting of three layers (input, output, and a hidden layer in between) with a varying number of units in the hidden layer [15]. This system may sometimes depend on the biasing concept to accelerate the computational processing, while it is used here in its simplest form.
Figure 4. The combined system used in the experiments.
The second and third techniques are the well-known integrated neural networks (INN) and BP networks [15, 16]. Generally, in all methods the features are the main item for defining a certain word, that is, for finding a specific unique feature for each word that distinguishes it from any other. These features may be either parametric, such as linear predictive coding, or non-parametric, such as time- and frequency-domain representations (auto-correlation, energy level, or the ZCR illustrated above). The deduced feature marks the specified word and is supplied to the training procedure [18].
BP represents the supervised training concept, while SOM is the unsupervised one. BP depends on the continuous correction of the error deduced during processing, propagated backwards until the error falls within the permissible (predefined) limits. In the unsupervised method, the output node with minimum distance from the training pattern is determined, and the weights are adjusted according to the measured difference. The nodes can then be placed on a map to find the overall distribution of the network's nodes.
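The minimum-distance update described for the unsupervised method can be sketched as one SOM training step. The 1-D map topology, learning rate, and neighborhood radius are illustrative assumptions, not values from the paper.

```python
import numpy as np

def som_step(weights, pattern, lr=0.1, radius=1):
    """One unsupervised SOM update.

    Finds the output node with minimum distance from the training
    pattern (the winner) and pulls it, and its map neighbors, toward
    the pattern. lr and radius are hypothetical parameters.
    """
    dists = np.linalg.norm(weights - pattern, axis=1)
    winner = int(np.argmin(dists))            # best-matching node
    for j in range(len(weights)):
        if abs(j - winner) <= radius:         # nodes inside the neighborhood
            weights[j] += lr * (pattern - weights[j])
    return winner, weights
```

Repeating this step over all training patterns yields the map distribution of nodes that the text describes; similar words then end up in nearby map regions.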
The nodes of the output layer must be divided into different areas (nodes), each related to a certain word (or words). Classes of similar words would fall into the same area, leading to the failure of SOM to recognize them. Hence, another stage is connected to stop this overlapping phenomenon and to separate the similar cases. The error is then measured until the results reach the pre-specified error value.
The speaker-independent principle may be needed to generalize the ultimate results and, consequently, the conclusion. Two groups of speakers are included, with distinctive characteristics of tone, clarity, and intensity. This is considered for the selected sample rate of 10
kHz with 8 bit / sample at linear predictive coding of features
[16, 17]. The general classification of sample tests for the
experimental treatment can be deduced as listed in Table A3
in the Appendix.
A three-layer neural network is defined with an input layer of 156 nodes to receive the input signal, while the size of the output layer varies according to the type of input group. It has only ten units if the input is either the ten fundamental digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) or the ten chosen short commands, while it contains 28 nodes (the number of Arabic alphabet characters) if the input is the alphabet. It should be indicated that only four punctuation marks are considered. The internal layer is varied for examination, with the number of units in the hidden layer changed between 10 and 50. It must be remarked that increasing the number of hidden units causes over-fitting, so the network may lose its effectiveness while taking much effort and time for processing. The ten fundamental digits (0-9) are treated with respect to the variation in the number of hidden units (20, 30, 40), as illustrated in Table A4 (see Appendix) [15].
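The three-layer architecture just described (156 inputs, a variable hidden layer, ten outputs for the digits or 28 for the alphabet) can be sketched with a simple forward pass. The weight initialization and the sigmoid activation are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def make_mlp(n_in=156, n_hidden=40, n_out=10, seed=0):
    """Weights for a three-layer network: input -> hidden -> output.

    n_hidden varies between 10 and 50 in the experiments; n_out is 10
    for digits/commands and 28 for the Arabic alphabet.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
    W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
    return W1, W2

def forward(W1, W2, x):
    """Forward pass; each output unit scores one word class."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W1 @ x)       # hidden layer activations
    return sigmoid(W2 @ h)    # output layer activations
```

During BP training, the weights would be corrected from the output error propagated back through these two layers until the error falls within the permissible limits.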
It is seen that the maximum overall recognition appears at 40 units in the hidden layer with the BP network, so this value may be taken as a reference for the subsequent results. Therefore, the above experiment is repeated for the fundamental alphabet characters of the Arabic language (28 characters, as listed in Table A2 in the Appendix) over the same range of hidden units (Table A5 in the Appendix) [18].
4. Results
The training of isolated spoken Arabic words expressing the basic digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) for a single speaker is based on segmental BP, where the results depend on the number of segments covering the frame length (utterance). For a small number of segments, the network may not be able to extract the correct features; on the contrary, with a sizable number of segments, the possibility of representing the word exactly increases. This phenomenon is studied as shown in Figure 5, where maximum and minimum frame lengths are considered to evaluate the suitable number of segments when the hidden layer contains 50 units [15]. Figure 5 indicates that 15 segments may be the most suitable choice for the tested sample.
4.1. Recognition Rate (RR)
It should be illustrated that the results always depend on the sounds of the speakers relative to the application, but official (standard) Arabic has been used for all training and tests. This facilitates the recognition accuracy, since spoken characteristics differ from one country to another. This phenomenon is highly present in the Arabic language, although its written form is the same across all Arab countries.
Figure 5. Recognition percentage for different segments.
Consequently, the effect of the number of tested patterns is checked experimentally, as registered in Figure 6, where 5 patterns reach the highest recognition rate and are thus the most valid choice.
Figure 6. The recognition dependency on the number of patterns.
The results of Figure 6 illustrate the recognition quality for the tested segmentations of words and support the validity of equation (1). The recognition rate RR was then investigated for 50 utterances of the chosen words representing separate word groups such as digits, alphabet, short commands, and punctuation marks [15]. The ultimate results of the recognition-rate experiments are given in Figure 7, where the RR of the digit words appears lower than that of the other test samples, while the short commands achieve the best RR.
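The RR reported over a set of test utterances, as in Figure 7, is taken here to be the percentage of utterances whose recognized label matches the spoken word; this standard definition is an assumption, since the paper does not spell out the formula.

```python
def recognition_rate(predicted, actual):
    """Recognition rate (RR) as a percentage of correctly
    recognized utterances; assumed standard definition."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For example, 28 correct recognitions out of 30 digit utterances would give an RR of about 93.33%, matching the maximum value reported below.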
Since the spoken digit words gave the worst recognition rate (RR) among the samples, it may be necessary to find the recognition accuracy in a sequential-degree system. The single-speaker experiments were repeated for both the first and second groups of speakers, and the results were summarized as average values with respect to the variation in the number of units in the hidden layer. These average computational times and recognition rates for both conditions (single speaker and groups), as varied with respect to the number of hidden units, are developed as ultimate results listed in Table A6 (see Appendix) for the processing time consumed when testing the spoken words expressing the digits.
Figure 7. The average values of RR.
However, the comparison between the computational-time results can be simplified by transformation into the per-unit system (Figure 8), although the maximum and minimum points can be derived from Table A6 of the Appendix.
Similarly, the average recognition rate (%) for the sample of spoken words expressing digits in Arabic is listed in Table A6 in the Appendix. The maximum recognition rate is 93.33%, even though diverse teams of speakers were involved. The classified conditions are the single speaker and the two groups, in which both children and adults performed the test [15]. The results of Table A6 in the Appendix contain two principal measurements of the computational processing needed to reach high-quality recognition, so these results are investigated individually. The recognition part is therefore deferred until after the study of the computational processing time, since the latter represents a vital item for some applications; it can be improved through more efficient processors, and the deduced results may then be analyzed.
Figure 8. The per unit processing time dependency.
The per-unit computation clarifies the study for a good analysis; the estimated per-unit values are tabulated in Table 1. The base value for the per-unit system is the minimum value, which is 3.8730556.
Table 1. The per unit computational time.

Number of Hidden Units   Single speaker   First group   Second group
10                       2.3679265        1.078821      1.01471
20                       2.7225119        1.2367496     1
30                       3.560998         1.5729039     1.1541
40                       2.9687298        1.546367      1.8759951
50                       4.10184318       2.047407      2.2237682
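The per-unit conversion used in Table 1 divides each raw computational time by a base value, taken here as the minimum observed time (3.8730556). A minimal sketch:

```python
def per_unit(values, base=None):
    """Convert raw computational times to the per-unit system.

    The base defaults to the minimum of the values, matching the
    paper's choice of 3.8730556 as the base value.
    """
    base = min(values) if base is None else base
    return [v / base for v in values]
```

With this rule the fastest configuration maps to exactly 1.0 (the second group at 20 hidden units in Table 1) and every other entry expresses a slowdown factor relative to it.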
Turning back to Table A6 (see Appendix), it may be noticed that the recognition rate for the single speaker is very high relative to the group style, owing to the consistency of the reference features obtained during computational processing, although the first group gives a relatively higher recognition rate RR [15]. The persons inside this group have similar characteristics, which occupy the same region of sound waves. The results are translated into the curves drawn in Figure 9, with the second group showing the lower recognition rate (RR). The members differ in overall individual performance within the group, and the deduced values could change again if another person were introduced into the group, by replacement or addition [15].
The maximum recognition rate appears for the single speaker at the level of 93.33%, while the rates for the first and second groups are 86.1% and 84.2%, respectively.
Also, the maximum value for the recognition of punctuation marks appears to be 90%, as shown in Figure 9, where the maximum and minimum points for each category of the tested words are shown.
It should be mentioned that, if the system cannot discriminate between two words, the BP concept is inserted to separate them [15, 17]. This condition was found for the spoken words representing the two digits (4) and (9), with a system consisting of 10 input units and 2 output units [18]. The number of hidden units was varied between 5 and 15, and then the average per-unit recognition for the selected sample of input words, expressing the fundamental characters and referred to the above maximum value, was estimated. The results of the experiments are listed in Table 2, where the best recognition appears at 30 hidden units; thus the choice of 30 hidden units is the most efficient.
Figure 9. The deduced recognition rate for tested samples.
It is noticed that the selected sample of Arabic words consists of four categories (digits, alphabet, a few short commands, and some punctuation marks), and the same words have been tested for recognition ability. Accordingly, all these words are examined through the three methods used for the rate of recognition. It is seen that the recognition rate of the commands is the best for the SOM + BP method (Figure 4), where it reaches 99.33% recognition. This vital result is the best among the three methods, so the method can be recommended for robotic systems that depend on commands. Also, the concept of [15] has the maximum RR and is thus preferred, although it sometimes gives a lower recognition rate.
Table 2. The average P U recognition (Tested Characters).

No. of hidden units   INN group   INN word in group   BP final   BP network
20                    0.9357      0.9715              0.8571     0.6499
30                    0.9536      0.9929              0.8928     0.6648
40                    0.9500      0.9892              0.89280    0.6643
It must be remarked that the computational time for the recognition processing is high and consequently should be minimized as far as possible by parallel processing. This consideration also motivates converting the stereo sound signal (double-channel system) into mono (single-channel system) before processing, which reduces the computational time and effort. The principle is analogous to replacing colour images by grey-scale pictures unless colour is specifically required [17]. The reduction may fail if two words fall on the same node and cannot be distinguished; in that case a return to the stereo channels is necessary to reach efficient recognition. Thus, the mono implementation is proposed when quick processing is required.
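The stereo-to-mono reduction described above can be sketched in a few lines. The function name and the NumPy-based channel averaging are illustrative assumptions, since the paper does not specify how the downmix was performed.

```python
import numpy as np

def stereo_to_mono(stereo):
    """Collapse a two-channel (stereo) signal into one channel by
    averaging the channels sample-by-sample, halving the data volume
    that the recognizer must process.

    stereo: array-like of shape (n_samples, 2); returns shape (n_samples,).
    """
    stereo = np.asarray(stereo, dtype=float)
    if stereo.ndim != 2 or stereo.shape[1] != 2:
        raise ValueError("expected an (n_samples, 2) stereo signal")
    return stereo.mean(axis=1)
```

If two words later collide on the same map node, the original two-channel signal can be re-used, as the text suggests.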
4.2. Accuracy
It is deduced from Table A5 (see Appendix) that the recognition rate for the fundamental characters of the Arabic language is lower than that for the digits. This may occur because the amount of input data, and hence the size of the map distribution, is larger than for the digits, so the possibility of overlap between words increases considerably. For large groups of words, therefore, a grouping technique should be introduced to minimize the computational time and effort. Figure 10 plots the number of sweeps required in the training process for both the digit words and the commands. For the digits the number decreases approximately exponentially, while for the commands it oscillates, with some minima in the region between 20 and 40 units in the hidden (internal) layer of the network. The grouping curve shows that grouping makes the variation smoother than training without it.
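The grouping idea, reflected in the "group" and "word in group" columns of Tables 2 and A4, can be sketched as a two-stage decision: a group classifier narrows the search, then a smaller per-group classifier picks the word. The callables below are hypothetical placeholders for the trained networks; the paper does not give this interface.

```python
def two_stage_recognize(features, group_net, word_nets):
    """Two-stage (grouped) recognition: a group classifier first picks
    the word category, then a per-group classifier picks the word
    inside that category, so no single network must separate the whole
    vocabulary at once.

    group_net: callable mapping a feature vector to a group label.
    word_nets: dict mapping each group label to a word classifier.
    """
    group = group_net(features)
    word = word_nets[group](features)
    return group, word
```

Splitting the decision this way is what keeps the computational time and effort down for large vocabularies, as the text argues.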
On the other side, the accuracy of recognition, which can be specified through the permissible error, is a principal factor; it is given in Figure 11, where the curves show that the error is practically negligible, since its value approaches zero. The figure also shows that the minimum error is always obtained with 40 units in the hidden layer of the typical 3-layer neural network. Although all calculated errors are very small, the error for the processing of short commands is the smallest, owing to the generally short utterances of orders.
However, the results show the lowest number of sweeps for the group case, although it might be expected to be the largest. The commands also give a narrow margin of variation: the fewest sweeps appear at 50 units in the hidden layer, while the maximum number of sweeps occurs with 10 units. It should be mentioned that this conclusion holds for three-layer neural networks.
In contrast, the digits require the largest number of sweeps during training, with a maximum near 700 and a minimum of about 200 at 40 units in the hidden layer. The variation in the sweep characteristics may arise from the different voices and the variation in utterances, so the training process becomes more accurate with practice. This behaviour occurs because Arabic speech varies over a wide scale.
Figure 10. The deduced number of sweeps during training.
It can be concluded from Figure 3 that the training phase is a fundamental step in speech recognition processing in general, and it is especially important for Arabic speech recognition. The difficulty lies in the accent, which can be expressed in very different ways; the training phase may therefore have to be applied many times to obtain the required recognition accuracy. Commands (orders), digits and even alphabetic words can be treated in the same manner, although the results above show some differences between them.
On the other hand, a grouping system may present greater difficulty owing to the wide variation in speaking style within a group, whether male, female, combined, or even a mixture of children, men and women. In all cases, the training results reflect the real situation for Arabic speech recognition, so that quick recognition can be achieved.
Figure 11. The error during training.
Figure 11 presents the error in the results for the tested sample of digit and alphabet words, while Figure 12 gives the same information for the marks and short commands; a very small error is seen for both the marks and the commands. The error is so small that it may be neglected in all similar cases. The most accurate number of hidden units is 20 for both marks and commands, so 20 hidden units are proposed for applications in this field. A larger percentage error appears for the digits and the alphabet (Figure 11), where the variation lies within almost the same margin (0.001–0.02) depending on the number of hidden units. The errors of the two categories differ at any given number of hidden units, which means that recognition of the digits and recognition of the alphabet behave differently. All derived values are deduced from the data of Reference [18].
Figure 12. The error during training (Marks & Commands).
Since the alphabet words gave the worst RR in the region of 10-40 hidden units, their recognition accuracy is examined, as drawn in Figure 13 in a sequential-degree arrangement. The experiments for a single speaker and for the first and second groups of speakers are summarized as average values with respect to the number of units in the hidden layer. The resulting values are listed in Table A6 (see Appendix).
Table A6 shows that the recognition rate for a single speaker is very high relative to the grouped systems, owing to the consistency of the reference features obtained for processing. The first group gives a higher RR than the second because its members share the same characteristics of the sound waves, whereas the members of the second group differ in overall individual performance. These values could change again if another person is introduced into a group.
Figure 13. The recognition accuracy of Arabic alphabet.
Nevertheless, the results in Figure 13 prove that two words have a low recognition level and a few have a middle level, while most words have a prominent recognition level. It should be mentioned that if the system cannot discriminate between two words, the BP concept is used to separate them. This situation was found for the spoken words representing "4" and "9", leading to the study of a system with 10 input units, 2 output units and a hidden layer varied between 5 and 15 units. The results are listed in Table 3, which shows that the best recognition appears at both 8 and 10 hidden units, but the computational time for 10 units is less than that for 8 units.
Table 3. The data for the words expressing the digits 4 and 9.

No. of hidden units   Computational time (min)   Average RR (%)
5                     6.266                      93.33
8                     5.833                      96.67
10                    3.4166                     96.67
15                    4.183                      93.33
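A minimal back-propagation network of the kind used for the 4-versus-9 case (10 inputs, a small hidden layer, 2 outputs) can be sketched as follows. The sigmoid activations, learning rate, sweep count and the omission of bias terms are illustrative simplifications, not values from the paper.

```python
import numpy as np

def train_bp(X, Y, n_hidden=10, lr=0.5, sweeps=3000, seed=0):
    """Train a 3-layer (input-hidden-output) network with plain
    back-propagation on squared error, sigmoid units throughout.

    X: (n, 10) feature vectors; Y: (n, 2) one-hot targets, matching the
    10-input / 2-output discrimination network described in the text.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    W2 = rng.normal(0.0, 0.5, (n_hidden, Y.shape[1]))
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(sweeps):                  # one sweep = one pass over X
        H = sig(X @ W1)                      # hidden activations
        O = sig(H @ W2)                      # output activations
        dO = (O - Y) * O * (1.0 - O)         # output-layer delta
        dH = (dO @ W2.T) * H * (1.0 - H)     # hidden-layer delta (back-propagated)
        W2 -= lr * (H.T @ dO)
        W1 -= lr * (X.T @ dH)
    return W1, W2

def predict(W1, W2, X):
    """Return the index (0 or 1) of the winning output unit."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.argmax(sig(sig(X @ W1) @ W2), axis=1)
```

The trade-off reported in Table 3 (similar RR at 8 and 10 hidden units but lower time at 10) corresponds to varying `n_hidden` here.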
So, the choice of 10 units is the most suitable. All these words are examined with the three methods for the rate of recognition; the determined results are tabulated in Table A7 (see Appendix) to show the best method. The table shows that the commands are recognized best by all methods, so they can be recommended for use with command-driven robotic systems. Also, the new concept of [15] has the maximum RR among them, which leads us to prefer it for this purpose, although its results may sometimes be lower than the others [19]. This concept also has multi-purpose application benefits, as given for example in [20].
It must be remarked again that the computational time for recognition processing is high and should be minimized as far as possible through parallelism. This also motivates converting the stereo sound signal into mono before processing, which halves the computational time since only one channel is processed instead of two. The principle is analogous to replacing colour images by grey-scale pictures unless colour is required [17]. If two words fall on the same node and cannot be distinguished, a return to the stereo channels will be effective.
5. Conclusion
Since the target aim an exact and accurate testing for the
determination of Arabic speech, the proposed mathematical
product for the segmentation of utterances for words
represents a good appropriate tool to express, exactly and
easy, the word length. Thus, the accuracy of processing for
Arabic speech recognition with neural networks goes up.
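The fixed-segment idea, where every utterance is cut into the same number of segments while each segment may span a different number of frame intervals, can be sketched as below. The segment count of 8 and the proportional boundary rule are illustrative assumptions; the paper does not state its values.

```python
def segment_utterance(frames, n_segments=8):
    """Divide a variable-length utterance (a sequence of frames) into a
    fixed number of contiguous segments. The segment count is the same
    for every word, but each segment may hold a different number of
    frames, which normalizes word length before the network sees it.
    """
    n = len(frames)
    if n_segments < 1 or n < n_segments:
        raise ValueError("utterance too short for the requested segments")
    # proportional boundaries: segment i covers frames [i*n//k, (i+1)*n//k)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
```

Because the output always has `n_segments` parts, a network with a fixed input size can accept words of any spoken duration.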
However, since the recognition rate depends on the quality of the training phase, a significant analysis of the testing phase with different varieties of voices must in general be included. Implementations for several types of voice (male and female) as well as ages (children, adults and the elderly) may therefore be required, so that the training concept must be given high importance. This is a principal requirement for Arabic speech.
It is recommended, where high recognition quality is the target, that the recognition rate for words be raised by combining neural concepts. The per-unit analysis presented in this research proves that 3-layer neural networks are suitable and quite sufficient for effective Arabic speech recognition. Integrated neural networks (INN) are recommended for applications in this field, giving a high percentage recognition rate.
The multi-speaker system reduces the processing time for word recognition in Arabic speech, although it combines two fundamental neural-network components. The range of 20-40 units in the hidden layer of the network gives good limits for selection; more hidden units consume excess computational time and effort.
The computational time required for the recognition training of Arabic speech is relatively large in general, and thus parallelism in the processing circuits is needed. Also, since the output units depend on the input data, they may be grouped for minimization.
The concept of time warping, which overcomes the variation in the number of features between different utterances, is a vital direction for solving the normalization problem; with it, the normalization technique can be used successfully.
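The time warping referred to here is usually called dynamic time warping (DTW). A textbook sketch follows, with an absolute-difference frame cost standing in for whatever distance the real front end would use; the paper does not give its own formulation.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic-time-warping distance between two feature sequences of
    possibly different lengths: frames are aligned by the cheapest
    monotonic warp before per-frame costs are summed, so utterances
    with different numbers of frames become directly comparable.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = dist(a[i - 1], b[j - 1]) + step
    return D[n][m]
```

A stretched repetition of a frame costs nothing extra when the frames match, which is exactly the length-normalization property the text calls vital.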
However, the processing of stereo sound requires a long time, so transformation into mono before the recognition processing is effective in implementation. The stereo signal is not required as the processing medium and is transferred into mono; this transformation greatly reduces the computational effort as well as the computational time. Beyond this, the recognition processing of Arabic speech needs parallelism to minimize the computational time.
The multi-speaker system increases the rate of word recognition in Arabic speech because it accounts for the widespread field of tones. Parallel processing is the best way to apply speech recognition in real time in any practical application, as it reduces the processing time.
Three-layer neural networks (with only 40 hidden units) are quite sufficient to recognize Arabic speech.
The proposed concept is recommended for implementation in other fields, such as the early recognition and detection of pathological speakers in biomedicine and other relevant areas.
Acknowledgements
The author would like to express his appreciation and thanks to Dr. Dalia Wafik, The Higher Institute of Engineering, The Tenth of Ramadan City, Egypt, for her great help and strong support in processing the data.
Appendix
Table A1. The specifications of the audio card [18].

Item                     Value
A/D input (mono)         4.23 kHz
A/D output (mono)        4.44 kHz
Sample resolution        8 bits
Sample inputs (mono)     1. microphone; 2. line-in
Music / (FM - CMS)       11-voice mono / upgradable
CD-ROM connector         yes
Amplifier                4 W, 4 Ohms
Table A2. The words used in processing [18].

Alphabet words (English name - Arabic letter):
Alef (أ), Baa (ب), Taa (ت), Thaa (ث), Geem (ج), Hah (ح), Khah (خ), Dal (د), Zal (ذ), Reh (ر), Zean (ز), Seen (س), Sheen (ش), Sad (ص), Dad (ض), Dah (ط), Zah (ظ), Ean (ع), Ghean (غ), Faa (ف), Kkaf (ق), Kaf (ك), Laam (ل), Meem (م), Noon (ن), Heh (هـ), Waw (و), Yeh (ي)

Digits (English - Arabic): 0 (صفر), 1 (واحد), 2 (اثنان), 3 (ثلاثة), 4 (أربعة), 5 (خمسة), 6 (ستة), 7 (سبعة), 8 (ثمانية), 9 (تسعة)

Commands (English - Arabic): go (إذهب), stop (قف), start (إبدأ), right (يمين), left (يسار), up (إلى أعلى), down (إلى أسفل), come (تعال), open (إفتح), close (إغلق)

Marks: (.), (+), (=), (،)
Table A3. The samples of input words [18].

Phase         Item              Digits   Alphabet   Commands   Marks
Training      utterances        50       224        100        20
              words             10       28         10         4
              utterances/word   5        8          10         5
Recognition   utterances        150      420        150        40
              words             10       28         10         4
              utterances/word   15       15         15         10
Total         utterances        200      644        250        60
Table A4. The average recognition (%) for fundamental digits [18].

No. of hidden units   INN (group)   INN (word in group)   INN (final)   BP network
20                    90.00         99.05                 90.67         91.33
30                    90.67         99.05                 91.33         89.33
40                    92.67         99.05                 92.00         93.33
Table A5. The per-unit (P.U.) average recognition for fundamental characters [18].

No. of hidden units   INN (group)   INN (word in group)   INN (final)   BP network
20                    0.9357        0.9715                0.8571        0.6499
30                    0.9536        0.9929                0.8928        0.6648
40                    0.9500        0.9892                0.8928        0.6643
Table A6. The average time and RR of a single speaker and groups of speakers for digits [18].

No. of hidden units   Computational time (h:m:s)          Average RR (%)
                      Single     First      Second        Single   First   Second   Marks
10                    09:10:16   04:10:42   03:55:48      88.00    84.00   80.45    90
20                    10:32:40   04:47:24   03:52:23      91.33    86.10   81.00    90
30                    13:47:31   06:05:31   04:28:11      89.33    81.12   83.15    87.5
40                    11:29:53   05:59:21   07:15:57      93.33    86.00   84.20    87.5
50                    15:53:12   07:55:47   08:36:46      91.33    80.21   81.33    87.5
Table A7. The RR (%) for different methods [18].

Word type    BP network   INN     SOM+BP
digits       93.33        92.00   94.00
characters   72.86        80.00   78.57
commands     98.00        ----    99.33
References
[1] Kara Hawthorne, Juhani Järvikivi & Benjamin V. Tucker (2018): Finding word boundaries in Indian English-accented speech, Journal of Phonetics, Volume 66, January 2018, (145-160), https://doi.org/10.1016/j.wocn.2017.09.008
[2] Bronwen G. Evans & Wafaa Alshangiti (2018): The perception and production of British English vowels and consonants by Arabic learners of English, Journal of Phonetics, Volume 68, May 2018, (15-31), https://doi.org/10.1016/j.wocn.2018.01.002
[3] T. Kohonen (1988): The neural phonetic typewriter. IEEE Computer, Vol. 21, No. 3, (11-22).
[4] Calbert Graham & Brechtje Post (2018): Second language acquisition of intonation: Peak alignment in American English, Journal of Phonetics, Volume 66, January 2018, (1–14), https://doi.org/10.1016/j.wocn.2017.08.002
[5] Elizabeth K. Johnson, Amanda Seidl & Michael D. Tyler (2014): The Edge Factor in Early Word Segmentation: Utterance-Level Prosody Enables Word Form Extraction by 6-Month-Olds, PLoS ONE, https://doi.org/10.1371/journal.pone.0083546
[6] Marie Lallier, Reem Abu Malloih, Ahmed M. Mohammed, Batoul Khalifa, Manuel Perea & Manuel Carreiras Basque (2018): Does the Visual Attention Span Play a Role in Reading in Arabic? Scientific studies of reading Journal, Volume 22, issue 2, 2018, https://doi.org/10.1080/10888438.2017.1421958
[7] Charles Hulme and Margaret J. Snowling (2014): The interface between spoken and written language: developmental disorders, Philos Trans R Soc Lond B Biol Sci. 2014 Jan 19; 369 (1634): 20120395. DOI: 10.1098/rstb.2012.0395, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866425/
[8] A. Stolz (1993): The Sound Blaster Book. Abacus, MI, USA.
[9] Mathias Barthel, Sebastian Sauppe, Stephen C. Levinson and Antje S. Meyer (2016): The Timing of Utterance Planning in Task-Oriented Dialogue: Evidence from a Novel List-Completion Paradigm, December 2016, https://doi.org/10.3389/fpsyg.2016.01858 https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01858/full
[10] Helen Buckler, Huiwen Goy & Elizabeth K. Johnson (2018): What infant-directed speech tells us about the development of compensation for assimilation, Journal of Phonetics, Volume 66, January 2018, (45-62), https://doi.org/10.1016/j.wocn.2017.09.004
[11] Ling Zhong & Chang Liu (2018): Speech Perception for Native and Non-Native English Speakers: Effects of Contextual cues, The Journal of the Acoustical Society of America, Volume 143, 2018, https://doi.org/10.1121/1.5036397
[12] Soumaya Gharsellaoui, Sid Ahmed Selouani, Wladyslaw Cichocki, Yousef Alotaibi & Adel Omar Dahmane (2018): Application of the pairwise variability index of speech rhythm with particle swarm optimization to the classification of native and non-native accents, Journal of Computer Speech & Language, Volume 48, March 2018, (67-79), https://doi.org/10.1016/j.csl.2017.10.006
[13] Belhedi Wiem, Ben Messaoud, Mohamed anouar, Pejman Mowlaee and Bouzid Aicha (2018): Unsupervised single channel speech separation based on optimized subspace separation, Journal of Speech Communication, Volume 96, February 2018, (93-101), https://doi.org/10.1016/j.specom.2017.11.010
[14] Kun Li, Shaoguang Mao, Xu Li, Zhiyong Wu & Helen Meng (2018): Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks, Journal of Speech Communication, Volume 96, February 2018, (28-36), https://doi.org/10.1016/j.specom.2017.11.003
[15] R. E. Atta (1996): Arabic Speech to Text Translator. M. Sc. Thesis, Suez Canal University, Port Said, Egypt, pp. 162.
[16] Debbie Greenstreet & John Smrstik (2017): Voice as the user interface – a new era in speech processing, May 2017 (1–9), http://www.ti.com/lit/wp/slyy116/slyy116.pdf
[17] M. Hamed (1997): A quick neural network for computer vision of gray images. Circuits, Systems & Signal Processing Journal, USA, Vol. 16, No. 1. https://link.springer.com/content/pdf/10.1007/BF01183174.pdf
[18] Mohamed Hamed & Dalia Wafik (2018): A Multi-Speaker System for Arabic Speech Perception, Open Science Journal of Electrical and Electronic Engineering, Vol. 5, No. 2, pp. 11-17, Paper No. 7350160. Received: April 9, 2018; Accepted: May 4, 2018; Published: July 5, 2018. http://www.openscienceonline.com/journal/archive2?journalId=735&paperId=4309
[19] K. P. Braho, J. P. Pike & L. A. Pike (2018): Methods and systems for identifying errors in a speech recognition system, US Patent 9,928,829.
[20] Tobias Hodgson, Farah Magrabi & Enrico Coiera: Evaluating the usability of speech recognition to create clinical documentation using a commercial electronic health record, International Journal of Medical Informatics, Volume 113, May 2018, Pages 38-42, https://doi.org/10.1016/j.ijmedinf.2018.02.011