
American Journal of Computer Science and Engineering 2018; 5(4): 88-100

http://www.openscienceonline.com/journal/ajcse

The Processing Analysis of Effective Recognition Quality for Arabic Speech Based on Three-Layer Neural Networks

Mohamed Hamed

Faculty of Engineering, Port Said University, Port Said, Egypt

Email address

To cite this article Mohamed Hamed. The Processing Analysis of Effective Recognition Quality for Arabic Speech Based on Three-Layer Neural Networks.

American Journal of Computer Science and Engineering. Vol. 5, No. 4, 2018, pp. 88-100.

Received: July 16, 2018; Accepted: July 26, 2018; Published: September 1, 2018

Abstract

This paper studies the recognition rate (RR) for Arabic speech using various techniques based on both supervised and unsupervised learning concepts, together with the accuracy of that rate. All networks used in the research are implemented as neural-network simulations in software, covering both supervised and unsupervised training processing. A set of the most common words is selected. The recognition concept rests on a vital point: each word (utterance) is divided into a fixed number of segments, although the segments of different utterances may contain different numbers of frame intervals. Words spoken with various voices were detected and the recognition rate computed. The computational time is analyzed; although the processing can take several hours on slow hardware, this time can be reduced greatly in practical applications with modern high-speed processing systems. The error that appears during the training phase is tracked and illustrated. The results prove a good recognition rate for some words, and the best number of units in the hidden layer of neural networks for Arabic speech recognition is derived from either the number of sweeps in the training phase or the actual percentage recognition results.

Keywords

Arabic Speech Recognition Quality, Back Propagation, Neural Network, Segmentation Feature, Computational Training Time

1. Introduction

Many English nouns, verbs, and adjectives begin with a stressed syllable, and listeners exploit this tendency to help parse the continuous stream of speech into individual words. However, the acoustic manifestation of stress depends on the variety of English being spoken [1]. In two visual-world eye-tracking experiments, researchers tested whether Indian English-accented speech causes Canadian English listeners to make stress-based segmentation errors. Participants heard Canadian- or Indian-accented trisyllabic sequences that could be segmented in two ways, depending on the perceived location of stress. They suggested that Indian English-accented speech impairs segmentation in Canadian listeners, and that both accented pitch and other features of the Indian English accent contribute to segmentation difficulties. The findings are interpreted with respect to models of how similarity between two languages impacts the listener's ability to segment words from the speech stream [1].

The perception of British English vowels and consonants by native Saudi Arabic learners of English across a range of proficiency levels has been studied before [2]. Twenty-six participants completed consonant and vowel identification tasks in quiet and in noise. The study predicted difficulties with vowel perception; for production, participants also recorded vowels embedded in words and read a short story. It was concluded that all learners were better able to identify consonants than vowels in quiet and in noise, with more experienced learners outperforming early learners [2].

Although learners were likely able to rely on mapping non-native to native categories when identifying consonants, there was some evidence that they had started to establish new vowel targets. This appeared to start early in learning, but even highly experienced learners continued to find vowels with no direct Arabic counterpart difficult. There is some evidence for a link between perception and production: vowel perception was better in those who had more accurate production. Overall, the results of past work shed light on problematic phonemic contrasts for Arabic learners and suggest that, though learners and trainers may be able to establish new phonetic categories early in learning, other contrasts remain difficult even for highly experienced learners [2].

Computers have now been introduced widely in many applications, even in military fields, all of which encourages further adoption of such modern techniques. Moreover, typewritten input on computers may consume a long time that should be minimized as far as possible, and accordingly many trials have been carried out, especially for English [3].

Some studies investigated broader recognition questions, such as whether, and to what degree, late bilinguals of divergent backgrounds are comparable to native speakers in the phonetic implementation of tonal targets [4]. These questions also concerned whether speakers exhibit general patterns of acquisition irrespective of typological closeness, and whether learners' choice of accent contours and the alignment of the high tone proceed in parallel with proficiency. More specifically, the acquisition of the nuclear contour composition and alignment of American English (i.e., the pitch accent and boundary tone combination) was examined in initial-stressed and final-stressed words produced by Japanese and Spanish late bilingual speakers at varying proficiency levels in American English [4].

This investigation clarified that Spanish speakers were more comparable than Japanese speakers to native English speakers in the phonological aspect of intonation (choice of pitch accent contour). In terms of peak alignment, the late bilinguals generally tended to realize significantly later alignment than the native speakers, although the precise manifestation of this varied according to the speakers' background and the stress pattern of the words [4]. On the other hand, only a small number of studies related to Arabic speech are available; they all help in the present work, but more work should be implemented to make treating the Arabic language easy [5]. Meanwhile, news in the field of computer applications now arrives daily rather than yearly, owing to the concentration of testing, and the deduced results are vast, giving rise to many new applications.

It is unclear whether the association between the visual attention (VA) span and reading differs across languages. This relationship was studied in Arabic, where the use of specific reading strategies depends on the number of diacritics on words: reading vowelized and non-vowelized Arabic scripts favors sub-lexical and lexical strategies, respectively [6]. It was hypothesized that the size of the VA span and its association with reading would differ depending on individual “script preferences.” Children who were more proficient in reading fully vowelized Arabic than non-vowelized Arabic (VOW) were compared to children for whom the opposite was true (NOVOW). NOVOW children showed a crowding effect in the VA span task, whereas VOW children did not. Moreover, crowding in the VA span task correlated with reading performance in the NOVOW group only. This depends on considering individual differences in the use of reading strategies in Arabic [6, 7]. Computers are very effective for quick application, especially for automatic control and for eye detection or the fine checks required in industry and elsewhere. The present work is likewise directed towards Arabic speech, for both spoken and written words [7-9].

2. Problem Formulation

It is well established that, in speech addressed to adults, words are seldom realized in their canonical, or citation, form. For example, the word ‘green’ in the phrase ‘green beans’ can often be realized as ‘greem’ due to English place assimilation, where word-final coronals take on the place of articulation of neighboring velars [10]. In such a situation, adult listeners readily ‘undo’ the assimilatory process and perceive the underlying intended lexical form of ‘greem’ (i.e., they access the lexical representation ‘green’). An interesting developmental question is how children, with their limited lexical knowledge, come to cope with phonologically conditioned connected-speech processes such as place assimilation. Some scientists addressed this issue by examining the occurrence of place assimilation in the input to English-learning 18-month-olds [10]. Perceptual and acoustic analyses of elicited speech, as well as analysis of a corpus of spontaneous speech, all converge on the finding that caregivers do not spoon-feed their children canonical tokens of words. Rather, infant-directed speech contains just as many non-canonical realizations of words in place assimilation contexts as adult-directed speech [10].

Previous studies have shown that native English speakers outperform non-native English speakers in perceiving English speech under quiet and noisy listening conditions. The difference between native English speakers and native Chinese speakers in using contextual cues to perceive speech in quiet and in multi-talker babble has been tested [11], where three types of sentences served as the speech stimuli: sentences with high predictability (including both semantic and syntactic cues), sentences with low predictability (including syntactic cues), and sentences with zero predictability (consisting of random sequences of words). These sentences were presented to native-English and native-Chinese listeners in quiet and in four-talker babble at signal-to-noise ratios of 0 and -5 dB. Preliminary results suggested that native Chinese speakers rely primarily on semantic information when perceiving speech in quiet, whereas native English speakers show greater reliance on syntactic cues when perceiving speech in noisy situations. The difference between native English and native Chinese speakers in their utilization of syntactic and semantic information under various listening conditions was discussed [11].

The pairwise variability index (PVI), a rhythm metric that quantifies variability in speech rhythm, was applied to the classification of speech varieties [12]. This technique combines the Particle Swarm Optimization (PSO) algorithm with a generalization of several rhythm metrics that are based on the PVI. The performance of this optimization-oriented classification was compared with classification using conventional (both PVI-based and interval-based) rhythm metrics. The application was the classification of native and non-native Arabic speech using data from the West Point Arabic Speech Corpus; the experiments were based on segmental durations and used Support Vector Machine (SVM) classification. The results showed that the optimization-oriented classification provides better discrimination between native and non-native speech varieties than classification based on the conventional rhythm metrics. When added to different combinations of these conventional metrics, the optimization-oriented procedure consistently improved the classification rates [12].

Single-channel speech separation (SCSS) is widely used in many real-time applications, such as a preprocessing stage for speech recognition to control humanoid robots, and in hearing aids. The performance of the separation is crucial for these applications. Some researchers proposed an innovative approach for unsupervised SCSS [13]. The separation relies on an optimization of the subspace separation by decomposing the mixed signal into three estimates, namely the sparse subspace, the sub-sparse subspace, and the low-rank subspace. A soft mask is used at the core of the proposed approach for the final decision. The proposed system generates two separated signals of different qualities, provided on two different channels [13].

The channel classification is done using fuzzy logic, which requires two parameters. The first parameter is the quality of the separated signal, based on a non-intrusive metric for speech quality and intelligibility. The second parameter is the gender of the speaker, determined using a tracking algorithm. The evaluation results of the proposed approach were reported and compared to others. On average, the proposed method achieves a 67.9% improvement in PESQ, a 59.5% improvement in signal-to-interference ratio (SIR), and a 10.5% improvement in the target-related perceptual score (TPS) versus the benchmark methods [13].

Since neural networks can now be simulated on computers, this advanced approach has met with immense value and still does. The main item in the processing is the training needed to find the correct reference for measuring the required word in our case, although more modern software may not depend on the supervision principle for these implementations. Thus, both unsupervised and supervised training techniques are examined to reach the most suitable one. In the unsupervised phase, only the training patterns must be registered, and the system will then be able to classify them according to the previously stored features [8].

Scientists investigated the use of multi-distribution deep neural networks (MD-DNNs) for automatic lexical stress detection and pitch accent detection, which are useful for suprasegmental mispronunciation detection and diagnosis in second-language English speech [14]. The features used cover syllable-based prosodic features (including maximum syllable loudness, syllable nucleus duration, and a pair of dynamic pitch values) as well as lexical and syntactic features (encoded as binary variables). As stressed/accented syllables are more prominent than their neighbors, the two preceding and two following syllables were also taken into consideration. Experimental results showed that the MD-DNN for lexical stress detection achieves an accuracy of 87.9% in syllable classification (primary/secondary/no stress) for words with three or more syllables. This performance is much better than that of previous work using Gaussian mixture models (GMMs) and the prominence model (PM), whose accuracies are 72.1% and 76.3%, respectively. Built similarly to the lexical stress detector, the pitch accent detector obtained an accuracy of 90.2%, better than the results of the GMMs and PM by about 9.6% and 6.9%, respectively [14].

The given research depends on receiving the input word through a special electronic chip and passing it to the processing phase, as shown in Figure 1 [15]. This is vital for finding the correct reference word with the help of the neural network, saving the computational time needed for this aim. It should be noted that the chip used is a standard one, employed without modification, which is why its performance is described below.

The sound card, known as the Sound Blaster PRO (SBPro), is a commercial card with its own software [8]. Its performance and main specifications are listed in Table A1 (see the Appendix); it requires at least a 286 AT computer model because of its 16-bit slot, while XT models may be possible for 8-bit cards. The recorded sound can be replayed at the original quality, so all necessary words may be stored and then played back individually or all at once. The card also permits varying the sound intensity through the volume controls, for either input or output. This card is a suitable module owing to its thorough factory testing, which lets us work with it without any further checks and consequently concentrate on the speech-signal quality in the given frequency range.


Figure 1. A block scheme for the recognition processing system of Arabic speech.

3. Word Detection

Since time is accounted in both the silence and the spoken phases, the separation between them is very important. A special technique should therefore be supplied or chosen to draw a line separating the silence zone from the spoken one. This process is known as thresholding, since the region may not be completely silent owing to noise or other sounds. The thresholding concept must also be suitable for separating the small gaps between words and sentences. The words used in the training and recognition tests are listed in Table A2 (see Appendix) [18].

This means detecting the beginning and the end of speech, as well as the start and end moments of a word. Although this is an important technical point from the accuracy point of view, the work here relies on the known technique in this regard, defined by the zero-crossing rate (ZCR). At every moment, the zero-crossing rate of the measured signal is calculated, as shown in Figure 2 for the Arabic word expressing the “zero” digit [9].
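The frame-wise ZCR computation described above can be sketched as follows (a minimal illustration; the function names and frame length are assumptions for this sketch, not values taken from the paper):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1            # treat exact zeros as positive
    return np.count_nonzero(signs[:-1] != signs[1:]) / len(frame)

def frame_zcr(signal, frame_len):
    """ZCR of each non-overlapping frame of the signal."""
    n_frames = len(signal) // frame_len
    return [zero_crossing_rate(signal[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]
```

A voiced or fricative frame alternates in sign often and yields a high ZCR, while a near-silent frame yields a low one; plotting `frame_zcr` over time gives a curve like the one described for Figure 2.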

Figure 2 presents many zero crossings concentrated together, with only a few at other intervals. This indicates that the word is located at the time of the concentrated ZCR and that the sparse crossings are noise, as illustrated for the actual signal shown in Figure 3, where the word energy is found at the same frame numbers. The other components are small enough to be neglected.

Figure 2. The calculated (ZCR) for the word representing "zero".

It must be indicated that thresholding is a main parameter affecting the accuracy of measurement in general, through the correct determination of the exact length of a word. For the word tested in Figure 2, this thresholding at the word’s start and end points is indicated and, consequently, repeated as illustrated in Figure 3 [15].


Figure 3. Energy signals of the word representing "zero".
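A minimal sketch of the energy-based endpoint detection implied by Figures 2 and 3 (the threshold value and frame length here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def detect_word(signal, frame_len, threshold):
    """Return (first, last) frame indices whose short-time energy
    exceeds the threshold, or None if the signal is all 'silence'."""
    n_frames = len(signal) // frame_len
    energy = np.array([np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:
        return None
    return int(voiced[0]), int(voiced[-1])
```

In practice the energy test would be combined with the ZCR of the previous section, since low-energy fricatives at word edges may be missed by energy alone.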

3.1. Segmentation

It is necessary to state that; a word is represented through

the energy of each voice signal to be passed to the processing

and so each word consumes a certain individual self-length.

This means that, the word length (utterance) will be different

for a word and another one so that a new problem may be

created that would be treated. Then, an utterance depends on

either the speed of speech or the type of word itself and they

varied typically (for Arabic Speech) from 25 to 75 ms while

the number of segments for an utterance would be varied

[15].

However, the normalization concept may be needed to overcome the problem of variation in the number of features arising from the different utterance lengths of words, and the known time-warping technique can be used to give a fixed length for each utterance. Since the success of time warping in solving this point is the target, the word duration (the input word length Wlength) is defined as a function of the number of frames Nframe in the utterance and the number of sections in each frame Nsections. Mathematically:

Wlength = Nframe × Nsections (1)

Thus, if the number of frames per utterance is kept constant, the number of features inside each frame changes so as to keep the word utterance (the input word length) fixed. This technique is used successfully in the first approach tested in this work [15]. The segmentation process is therefore a correct way to express the signals of a word as input to the system, to be treated and then trained by the neural networks.
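The fixed-segment idea of Equation (1), a constant number of segments per utterance with a variable number of frames absorbed into each segment, might be sketched as follows (illustrative code, assuming per-frame feature vectors and an utterance with at least as many frames as segments):

```python
import numpy as np

def segment_features(frames, n_segments):
    """Average per-frame feature vectors into a fixed number of
    segments, whatever the utterance length in frames."""
    frames = np.asarray(frames, dtype=float)
    # Nearly equal groups of frame indices, one group per segment.
    bounds = np.linspace(0, len(frames), n_segments + 1).astype(int)
    return np.stack([frames[bounds[i]:bounds[i + 1]].mean(axis=0)
                     for i in range(n_segments)])
```

Whatever the utterance length, the output always has `n_segments` rows, which is what allows a fixed-size input layer in the network.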

3.2. Algorithms

Three concepts are used and tested [15]. Figure 4 clarifies the first concept through a block diagram combining the so-called self-organizing map (SOM) with back propagation (BP) as a second stage that checks all results, even words escaping from the first stage. It should be noted that the available software is based on a simple neural network consisting of three layers (input, output, and a hidden layer in between), with a varying number of units in the hidden layer [15]. This system may sometimes rely on the biasing concept to accelerate computational processing, although it is used here in its simplest form.

Figure 4. The combined system used in the experiments.

The second and third techniques are the well-known integrated neural networks (INN) and BP networks [15, 16]. Generally, in all methods the features are the main item used to define a certain word, or more generally to find a specific unique feature for each word that distinguishes it from any other. These features may be parametric, such as linear predictive coding, or non-parametric, such as time- and frequency-domain representations (auto-correlation, energy level, or the ZCR illustrated above). The deduced features act as the marks of the specified word and are supplied to the training procedure [18].

BP represents the supervised training concept, while SOM is the unsupervised one. BP depends on continuously correcting the error deduced during processing and propagating it back until the error falls within the permissible (predefined) limits. In the unsupervised method, the output node with minimum distance from the training pattern is determined and its weights are adjusted toward that pattern; the nodes can then be placed on a map to find the overall distribution of the network’s nodes.
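The unsupervised step just described can be sketched as a minimal self-organizing map (the map size, learning rate, and epoch count are hypothetical; the paper does not specify them, and a full SOM would also update the winner's neighbors):

```python
import numpy as np

def som_train(patterns, n_nodes, epochs=20, lr=0.3, seed=0):
    """Pull the best-matching node's weights toward each pattern."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(0.0, 0.1, (n_nodes, patterns.shape[1]))
    for _ in range(epochs):
        for p in patterns:
            bmu = np.argmin(np.linalg.norm(weights - p, axis=1))
            weights[bmu] += lr * (p - weights[bmu])
    return weights

def som_classify(weights, pattern):
    """Index of the output node nearest to the pattern."""
    return int(np.argmin(np.linalg.norm(weights - pattern, axis=1)))
```

After training, each output node sits near a cluster of training patterns, and `som_classify` maps an unknown utterance to its nearest node.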

The nodes of the output layer must be divided into different areas (nodes), each related to a certain word or words. Classes of similar words would fall into the same area, leading to a failure of the SOM to recognize them. Hence, another stage should be connected to stop this overlapping phenomenon and to separate the similar cases. Thus, the error is measured until the results reach the pre-specified error value.

The speaker-independent principle may be needed for the generalization of the ultimate results and, consequently, the conclusion. Two groups of speakers are included, having distinctive characteristics of tone, clarity, and intensity. This is considered for the selected sampling rate of 10 kHz with 8 bits per sample, using linear predictive coding of the features [16, 17]. The general classification of sample tests for the experimental treatment is listed in Table A3 in the Appendix.

A three-layer neural network is defined with an input layer of 156 nodes to receive the input signal, while the output layer varies according to the type of input group. It contains only ten units if the input is either the fundamental ten digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) or the ten chosen short commands, while it contains 28 nodes (the number of characters in the Arabic alphabet) if the input is the alphabet. It should be indicated that only four punctuation marks are considered. The hidden layer is varied for examination, with the number of hidden units changed between 10 and 50. It must be remarked that increasing the number of hidden units causes over-fitting, so the network may lose its effectiveness while taking much effort and time for processing. The ten fundamental digits (0-9) are also treated with respect to the variation in the number of hidden units (20, 30, 40), as illustrated in Table A4 (See Appendix) [15].
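A minimal sketch of such a three-layer network trained by back propagation (the sigmoid units, learning rate, and initialization are illustrative assumptions; the paper's exact training settings are not reproduced here):

```python
import numpy as np

class ThreeLayerNet:
    """Input -> hidden -> output, sigmoid units, squared-error BP."""

    def __init__(self, n_in=156, n_hidden=40, n_out=10, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, x):
        self.h = self._sigmoid(x @ self.w1)
        self.y = self._sigmoid(self.h @ self.w2)
        return self.y

    def backward(self, x, target):
        # Deltas for squared-error loss with sigmoid activations.
        d_out = (self.y - target) * self.y * (1.0 - self.y)
        d_hid = (d_out @ self.w2.T) * self.h * (1.0 - self.h)
        self.w2 -= self.lr * np.outer(self.h, d_out)
        self.w1 -= self.lr * np.outer(x, d_hid)
```

The hidden-layer size (defaulting here to 40 units, following the best digits result reported below) would be swept between 10 and 50 as in the experiments, with the output layer set to 10 or 28 units depending on the word group.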

It is seen that the maximum overall recognition appears at 40 units in the hidden layer with the BP network, so this value may be taken as a reference for the subsequent results. The above experiment is therefore repeated for the fundamental alphabet characters of the Arabic language (28 characters, as listed in Table A2 in the Appendix) over the same range of hidden units (Table A5 in the Appendix) [18].

4. Results

The training of isolated spoken Arabic words expressing the basic digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) for a single speaker is based on segmental BP, so the results depend on the number of segments covering the frame length (utterance). With a small number of segments, the network may not be able to extract the correct features; conversely, with a sizable number of segments, the possibility of representing the word exactly increases. This phenomenon is studied as shown in Figure 5, where maximum and minimum frame lengths are considered to evaluate the suitable number of segments when the hidden layer contains 50 units [15]. Figure 5 indicates that 15 segments may be the most suitable choice for the tested sample.

4.1. Recognition Rate (RR)

It is needed to illustrate that, the results depend always on

the sounds of speakers relative to applications, but the

official Arabic language has been implemented for all

trainings and tests. This may facilitate the recognition

accuracy since the spoken characteristics may be different

from a country to another. This phenomenon is highly

presented in the Arabic Language although its written

wording is the same between all Arab countries.

Figure 5. Recognition percentage for different segments.

Consequently, the effect of the number of tested patterns is checked experimentally, as registered in Figure 6, where 5 patterns prove the most valid for determining the recognition rate, reaching the highest value.

Figure 6. The recognition dependency on the number of patterns.

The results of Figure 6 illustrate the recognition quality for the tested segments of wording and prove the validity of Equation 1. The recognition rate RR was then investigated for 50 utterances of the chosen words representing separate wording groups, such as digits, the alphabet, short commands, and the punctuation marks [15]. The ultimate experimental results for the recognition rate are given in Figure 7, where the RR of the digit wordings appears lower than that of the other test samples, while the short commands achieve the best RR.

Since the spoken digit wordings gave the worst recognition rate (RR) among the groups, it may be necessary to find the recognition accuracy in a sequential-degree system. The single-speaker experiments are repeated for both the first and second groups of speakers, and the results are summarized as average values with respect to the number of units in the hidden layer. These average computational times and recognition rates for both conditions (single speaker and groups), as they vary with the number of hidden units, are listed in Table A6 (See Appendix) for the processing time consumed when testing the spoken words expressing the digits.

Figure 7. The average values of RR.

However, the comparison between the computational-time results can be simplified by transformation into the per-unit system (Figure 8), although the maximum and minimum points can be derived from Table A6 of the Appendix.

Similarly, the average recognition rate (%) for the sample of spoken words expressing digits in Arabic is listed in Table A6 in the Appendix. The maximum recognition rate is 93.33%, even though diverse speaker teams were recorded. The classified conditions differ between the single speaker and the two groups, in which children and adults performed the test [15]. The results of Table A6 contain two principal measurements of the computational processing needed to reach high-quality recognition, so these results are investigated individually. The recognition results are therefore discussed after the study of the computational processing time, since that time represents a vital item for some applications; it can be improved through more efficient processors, after which the deduced results may be analyzed.

Figure 8. The per unit processing time dependency.

The per-unit computation clarifies the analysis. The estimated per-unit values are tabulated in Table 1, where the base of the per-unit system is the minimum value, 3.8730556 hours (the 3:52:23 entry of Table A6 for the second group at 20 hidden units).

Table 1. The per unit computational time.

Number of Hidden Units   Single speaker   First group   Second group
10                       2.3679265        1.078821      1.01471
20                       2.7225119        1.2367496     1
30                       3.560998         1.5729039     1.1541
40                       2.9687298        1.546367      1.8759951
50                       4.10184318       2.047407      2.2237682
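The per-unit values of Table 1 can be reproduced directly from the h:m:s entries of Table A6; a minimal sketch, assuming the times are first converted to decimal hours and the base is the global minimum:

```python
def to_hours(hms):
    """Convert an 'h:m:s' string from Table A6 into decimal hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

# Computational times (h:m:s) for the second group of speakers (Table A6),
# listed in order of 10, 20, 30, 40, 50 hidden units.
second_group = ["03:55:48", "03:52:23", "04:28:11", "07:15:57", "08:36:46"]
hours = [to_hours(t) for t in second_group]
base = min(hours)                        # the base of the per-unit system
per_unit = [round(h / base, 7) for h in hours]
print(base)      # ≈ 3.8730556 hours
print(per_unit)  # reproduces the "Second group" column of Table 1
```

The same conversion with the same base applied to the single-speaker and first-group columns yields the remaining entries of Table 1.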

Returning to Table A6 (see Appendix), it may be noticed that the recognition rate for the single speaker is very high relative to the group style, because the reference features obtained for computational processing are inclined toward that speaker, although the first group gives a relatively higher RR than the second [15]. The persons in the first group have the same characteristics, which specify the same region of sound waves. The results are translated into the curves of Figure 9; the second group shows the lower recognition rate (RR) because its members differ in overall individual performance, and the deduced values could vary again if another person were introduced into the group, by replacement or addition [15].

The maximum recognition rate appears for the single speaker at 93.33%, while the rates for the first and second groups are 86.1% and 84.2%, respectively.


Also, the maximum value for the recognition of the marks appears to be 90%, as shown in Figure 9, where the maximum and minimum points for each category of the tested words are indicated.

It should be mentioned that if the system cannot discriminate between two words, the BP concept is inserted to separate them [15, 17]. This condition was found for the spoken words representing the two digits (4) and (9), using a network of 10 input units and 2 output units [18], with the number of hidden units varied between 5 and 15. The average per-unit recognition for the selected sample of input words expressing the fundamental characters, referred to the maximum value above, was then estimated. The experimental results are listed in Table 2, where the best recognition appears at 30 hidden units; the choice of 30 hidden units is therefore the most efficient.
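The BP separation of two confusable words can be illustrated with a small three-layer network of the stated shape (10 input units, a varied hidden layer, 2 output units). The sketch below uses synthetic feature clusters in place of the real utterance features of [18], which are not reproduced here, so the data and the training settings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-dimensional feature vectors standing in for utterances of
# the two confusable digit words "4" and "9"; the real features of [18] are
# not available, so two synthetic clusters are used instead.
X = np.vstack([rng.normal(0.3, 0.1, (30, 10)),
               rng.normal(0.7, 0.1, (30, 10))])
Y = np.repeat(np.eye(2), 30, axis=0)          # one output unit per word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, hidden=10, lr=0.5, sweeps=3000):
    """Plain back-propagation for a 10-hidden-2 three-layer network."""
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 2)); b2 = np.zeros(2)
    for _ in range(sweeps):                   # one sweep = one pass over the set
        H = sigmoid(X @ W1 + b1)              # forward: hidden activations
        O = sigmoid(H @ W2 + b2)              # forward: output activations
        dO = (O - Y) * O * (1 - O)            # output delta (squared error)
        dH = (dO @ W2.T) * H * (1 - H)        # hidden delta propagated back
        W2 -= lr * H.T @ dO / len(X); b2 -= lr * dO.mean(axis=0)
        W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

W1, b1, W2, b2 = train(X, Y)
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).argmax(axis=1)
print((pred == Y.argmax(axis=1)).mean())      # recognition rate on the training set
```

Varying the `hidden` argument between 5 and 15 mirrors the experiment reported in Table 3 later in the paper, where the trade-off between hidden-layer size and computational time is measured.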

Figure 9. The deduced recognition rate for tested samples.

It is noticed that the selected sample of Arabic words consists of four categories (digits, alphabet, a few short commands and some punctuation marks), and the same words were tested for recognition ability through the three methods used. The recognition rate of the commands is the best for the SOM + BP method (Figure 4), which gives 99.33% recognition. This vital result is the best among the three methods, so that method can be recommended for robotic systems driven by commands. Also, the concept of [15] attains the maximum RR and is to be preferred, although it sometimes gives a lower recognition rate.

Table 2. The average per-unit recognition (tested characters).

No. of hidden units   INN: group   INN: word in group   INN: final network   BP
20                    0.9357       0.9715               0.8571               0.6499
30                    0.9536       0.9929               0.8928               0.6648
40                    0.9500       0.9892               0.8928               0.6643

It must be remarked that the computational time for the recognition processing is high, and it must consequently be minimized as far as possible through parallel processing. The same consideration leads to transferring the stereo sound signal (double-channel system) into mono (single-channel system) before processing, which reduces the computational time and effort. This principle is analogous to replacing color images by gray pictures unless color is required [17]. If there are words at the same node and no way to distinguish them, a return to the stereo channels becomes necessary to reach efficient recognition; otherwise, the mono implementation is proposed when quick processing is required.
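The stereo-to-mono reduction mentioned above amounts to collapsing the two channels into one before feature extraction, which halves the samples to be processed; a minimal sketch, assuming 16-bit samples arranged as an (n_samples, 2) array:

```python
import numpy as np

def stereo_to_mono(stereo):
    """Average the two channels of an (n_samples, 2) int16 signal,
    halving the data volume before feature extraction."""
    # Widen to int32 first so the channel sum cannot overflow int16.
    return stereo.astype(np.int32).sum(axis=1) // 2

stereo = np.array([[100, 300], [-200, 400]], dtype=np.int16)
print(stereo_to_mono(stereo))  # → [200 100]
```

The reverse step, returning to the two original channels when two words collide at the same node, simply means keeping the stereo recording available rather than discarding it after conversion.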

4.2. Accuracy

It is deduced from Table A5 (see Appendix) that the recognition rate for the fundamental characters of the Arabic language is lower than that for the digits. This may arise from the increase in the amount of input data relative to the digits, and hence in the map distribution, where the possibility of sharing between words may increase greatly. This means that, for large groups of words, the grouping technique should be introduced to minimize the computational time and effort. The number of sweeps required in the training process is therefore drawn in Figure 10 for both the digit words and the commands. For the digit words the number decreases approximately exponentially, while for the commands it oscillates, with some minima in the region between 20 and 40 units in the hidden (internal) layer of the network. The grouping effect is also developed for that curve, indicating that the grouping system makes the variation smoother than it is without grouping.

On the other side, the recognition accuracy, which can be specified through the permissible error, is a principal factor; it is given in Figure 11, where the curves prove that the error can be neglected entirely, since its value practically approaches zero. It is seen from this figure that the minimum error always occurs with 40 units in the hidden layer of the typical three-layer neural network. Although all calculated errors are very small, the error for the processing of the short commands appears to be the minimum, owing to the short utterances of orders in general.

However, the results present the lowest number of sweeps for the group condition, although it would be expected to be the largest. The commands also give a low margin of variation, with the fewest sweeps appearing at 50 units in the hidden layer and the maximum number of sweeps at 10 units. It should be mentioned that this conclusion is determined for the three-layer neural networks.

In contrast, the final results for the digits present the largest number of sweeps during the training processing, with a maximum near 700 and a minimum of about 200 at 40 units in the hidden layer. The variation in the sweep characteristics may arise from the different voices and the variation in utterances, so that the training processing becomes accurate with practice. This phenomenon occurs because Arabic speech varies over a wide scale.

Figure 10. The deduced number of sweeps during training.

It can be concluded from Figure 3 that the training phase is a fundamental step in speech-recognition processing in general, and it is especially important for Arabic speech recognition. The importance here lies in the difficulty of the accent, which can be expressed in very different ways. The training phase may therefore be applied many times to obtain the required recognition accuracy. Thus, commands (orders), digit words, and even alphabet words can be treated in the same manner, although the conclusive results above showed some differences between them.

On the other hand, the grouping system may present a greater difficulty because of the wide variation in the style of the group, whether male, female, combined, or even a mixture of children, men and women. In all cases, the training results will reflect the real state of Arabic speech recognition, so that quick recognition can be achieved.

Figure 11. The error during training.

Figure 11 presents the error that appeared in the results for the tested sample of digit and alphabet words, while Figure 12 gives the same conditions for the marks and the short commands; a very small error is seen for both the marks and the commands. The error is so small that it may be neglected in all similar cases. The number of hidden units deduced as the most accurate is 20 for both marks and commands, so 20 units in the hidden layer should be proposed for application in this field. Otherwise, a larger percentage error appears for the digits and the alphabet (Figure 11), where the variation lies within almost the same margin (0.001–0.02) according to the number of units in the hidden layer. It should be mentioned that the errors of the two categories differ for any defined number of hidden units, which means that the recognition of the digits differs from that of the alphabet. All derived values are deduced from the data of Reference [18].

Figure 12. The error during training (Marks & Commands).


Since the alphabet words gave the worst RR among the categories in the region of 10-40 hidden units, it is necessary to find their recognition accuracy, as drawn in Figure 13 in the sequential degree system. The experiments for the single speaker, as well as for the first and second groups of speakers, were summarized as average values with respect to the number of units in the hidden layer; the developed final results are listed in Table A6 (see Appendix).

Table A6 shows that the recognition rate for the single speaker is very high relative to the grouping system, because the reference features obtained for processing are inclined toward that speaker, while the first group gives a higher RR since its members have the same characteristics, which specify the sound waves. The second group indicates the lower RR because its members differ in overall individual performance, but these deduced values could vary again when another person is introduced into the group.

Figure 13. The recognition accuracy of Arabic alphabet.

Nevertheless, the illustrated results in Figure 13 prove that two words have a low recognition level and some have a middle rate, while most of the words have a prominent recognition level. It should be mentioned that if the system cannot discriminate between two words, the BP concept is used to separate them. This condition was found for the spoken words representing "4" and "9", leading to the study of a network consisting of 10 input units and 2 output units, with the number of hidden units varied between 5 and 15. The experimental results are listed in Table 3, showing that the best recognition appears at both 8 and 10 hidden units, but the computational time for 10 units is less than that for 8 units.

Table 3. The data for words, expressing the digits 4 and 9.

No. of hidden units   Computational time (min)   Average RR (%)
5                     6.266                      93.33
8                     5.833                      96.67
10                    3.4166                     96.67
15                    4.183                      93.33

So, the choice of 10 hidden units is the most suitable. All these words were examined through the three methods used for the rate of recognition, and the determined results are tabulated in Table A7 (see Appendix) to show the best method. From Table A7 it is seen that the commands are the best in all methods, so they can be recommended for use with robotic systems driven by commands. Also, the new concept of [15] has the maximum RR among them, which leads us to prefer it for this purpose, although its results may sometimes be lower than the others [19]. This concept has multi-purpose application benefits, as given for example in [20].

It must be remarked that the computational time for the recognition processing is high and must consequently be minimized as far as possible through parallelism. This consideration leads to transferring the stereo sound signal into mono before processing, which reduces the computational time to about half, since only one channel is processed instead of two. The principle is analogous to replacing color images by gray pictures unless color is required [17]. If there are words at the same node and no way to distinguish them, the return to the stereo channels will be efficient.

5. Conclusion

Since the target is an exact and accurate test for the recognition of Arabic speech, the proposed mathematical product for the segmentation of word utterances represents an appropriate tool to express the word length exactly and easily. Thus, the accuracy of processing for Arabic speech recognition with neural networks increases. However, since the recognition rate depends on the quality of the training phase, a significant analysis of the testing phase with different varieties of voices must in general be included. The implementation of several types of voice (male and female) as well as ages (children, adults and the elderly) may therefore be required, so the training concept must be given the highest importance. This is a principal requirement for Arabic speech.

It is recommended that the recognition rate for words be raised by combining neural concepts when the target is a high quality of recognition. The per-unit analysis presented in this research proves that three-layer neural networks are suitable and quite sufficient for effective Arabic speech recognition. Integrated neural networks (INN) for Arabic speech recognition are recommended for applications in this field, giving a high recognition percentage.

The multi-speaker system reduces the processing time for the recognition of words in Arabic speech, although it combines two fundamental components of the neural networks. However, the range of 20-40 units in the hidden layer of the network gives good limits for selection; a larger number of hidden units would consume excess computational time and effort.

The computational time required for the recognition training of Arabic speech is relatively large in general, so parallelism in the processing circuits is needed. Moreover, since the output units depend on the input data, they may be grouped for minimization.

The concept of time warping, which overcomes the variation in the number of features between different utterances, is a vital tendency for solving the problem faced by normalization techniques; the normalization technique can then be used successfully.
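The time-warping idea can be illustrated by the classical dynamic time warping (DTW) distance, which aligns two utterances whose feature sequences have different lengths; this is a textbook sketch of the general technique, not the specific procedure of the paper:

```python
def dtw(a, b):
    """Dynamic time warping distance between two feature sequences of
    possibly different lengths, using an absolute local frame distance."""
    inf = float("inf")
    D = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local frame distance
            D[i][j] = cost + min(D[i - 1][j],        # insertion
                                 D[i][j - 1],        # deletion
                                 D[i - 1][j - 1])    # match
    return D[len(a)][len(b)]

# The same contour uttered at two different speeds aligns with zero cost:
print(dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # → 0.0
```

After such an alignment, sequences of unequal length can be compared on a common time base, which is exactly what the normalization step requires.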

However, the processing of stereo sounds needs a long time, so transformation into the mono sound mode before recognition processing is effective for implementation. The stereo sound is thus not required as the processing medium and is transferred into mono; this transformation greatly reduces the computational effort as well as the computational time. In addition, the recognition processing of Arabic speech needs parallelism to minimize the computational time.

The multi-speaker system increases the rate of recognition for words in Arabic speech because it accounts for the widespread field of tones. Parallel processing technology is the best way to apply speech recognition in real time in any possible practical application, reducing the processing time.

The three-layer neural networks (with only 40 hidden units) are quite sufficient to recognize Arabic speech.

The proposed concept is recommended for implementation in other fields, such as the early recognition and detection of pathological speakers in the biomedical field and other relevant areas.

Acknowledgements

The author would like to express his appreciation and thanks to D. Wafik, The Higher Institute of Engineering, The Tenth of Ramadan City, Egypt, for her good help and strong support in the processing of the data.

Appendix

Table A1. The specifications of audio card [18].

Item value

A/D input mono 4.23 kHz

A/D output mono 4.44 kHz

sample resolution 8 bits

sample inputs (mono) 1- mic 2- on line

music / (FM – CMS) 11-voice mono / upgradable

CD ROM connector Yes

Amplifier 4 W, 4 Ohms

Table A2. The used words in processing [18].

Alphabet words (English / Arabic):
Alef ألف      Dad ضاد
Baa باء       Dah طاء
Taa تاء       Zah ظاء
Thaa ثاء      Ean عين
Geem جيم      Ghean غين
Hah حاء       Faa فاء
Khah خاء      Kkaf قاف
Dal دال       Kaf كاف
Zal ذال       Laam لام
Reh راء       Meem ميم
Zean زاي      Noon نون
Seen سين      Heh هاء
Sheen شين     Waw واو
Sad صاد       Yeh ياء

Digits (English / Arabic):
0 صفر    1 واحد    2 اثنان    3 ثلاثة    4 أربعة
5 خمسة   6 ستة     7 سبعة     8 ثمانية   9 تسعة

Commands (English / Arabic):
go إذهب        stop قف          start إبدأ    right يمين    left يسار
up إلى أعلى    down إلى أسفل    come تعال     open إفتح     close إغلق

Marks (mark / Arabic word):
(.) نقطة    (+) زائد    (=) يساوي    (،) فاصلة

Table A3. The samples of input words [18].

Phase         Item         Digits   Alphabet   Commands   Marks
training      utterances   50       224        100        20
              words        10       28         10         4
              utter/word   5        8          10         5
recognition   utterances   150      420        150        40
              words        10       28         10         4
              utter/word   15       15         15         10
total         utterances   200      644        250        60


Table A4. The average recognition for fundamental digits [18].

No. of hidden units   INN: group   INN: word in group   INN: final network   BP
20                    90.00        99.05                90.67                91.33
30                    90.67        99.05                91.33                89.33
40                    92.67        99.05                92.00                93.33

Table A5. PUav recognition for fundamental characters [18].

No. of hidden units   INN: group   INN: word in group   INN: final network   BP
20                    0.9357       0.9715               0.8571               0.6499
30                    0.9536       0.9929               0.8928               0.6648
40                    0.9500       0.9892               0.8928               0.6643

Table A6. The average time and RR of singular and groups of speakers for digits [18].

Number of      Computational Time (h:m:s)        Average RR (%)            Marks
Hidden Units   Single     First      Second      Single   First    Second  RR (%)
10             09:10:16   04:10:42   03:55:48    88.00    84.00    80.45   90
20             10:32:40   04:47:24   03:52:23    91.33    86.10    81.00   90
30             13:47:31   06:05:31   04:28:11    89.33    81.12    83.15   87.5
40             11:29:53   05:59:21   07:15:57    93.33    86.00    84.20   87.5
50             15:53:12   07:55:47   08:36:46    91.33    80.21    81.33   87.5

Table A7. The RR for different methods [18].

word type BP- network INN SOM+BP

digits 93.33 92.00 94.00

characters 72.86 80.00 78.57

commands 98.00 ---- 99.33

References

[1] Kara Hawthorne, Juhani Järvikivi & Benjamin V. Tucker (2018): Finding word boundaries in Indian English-accented speech, Journal of Phonetics, Volume 66, January 2018, (145–160), https://doi.org/10.1016/j.wocn.2017.09.008

[2] Bronwen G. Evans & Wafaa Alshangiti (2018): The perception and production of British English vowels and consonants by Arabic learners of English, Journal of Phonetics, Volume 68, May 2018, (15-31), https://doi.org/10.1016/j.wocn.2018.01.002

[3] T. Kohonen (1988): The neural phonetic typewriter. IEEE on Computer, Vol. 21, No. 3, (11-22).

[4] Calbert Graham & Brechtje Post (2018): Second language acquisition of intonation: Peak alignment in American English, Journal of Phonetics, Volume 66, January 2018, (1–14), https://doi.org/10.1016/j.wocn.2017.08.002

[5] Elizabeth K. Johnson, Amanda Seidl & Michael D. Tyler (2014): The Edge Factor in Early Word Segmentation: Utterance-Level Prosody Enables Word Form Extraction by 6-Month-Olds, PLoS ONE, https://doi.org/10.1371/journal.pone.0083546

[6] Marie Lallier, Reem Abu Malloih, Ahmed M. Mohammed, Batoul Khalifa, Manuel Perea & Manuel Carreiras (2018): Does the Visual Attention Span Play a Role in Reading in Arabic? Scientific Studies of Reading, Volume 22, Issue 2, 2018, https://doi.org/10.1080/10888438.2017.1421958

[7] Charles Hulme and Margaret J. Snowling (2014): The interface between spoken and written language: developmental disorders, Philos Trans R Soc Lond B Biol Sci. 2014 Jan 19; 369 (1634): 20120395. DOI: 10.1098/rstb.2012.0395. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866425/

[8] A. Stolz (1993): The Sound Blaster Book. Abacus, MI, USA.

[9] Mathias Barthel, Sebastian Sauppe, Stephen C. Levinson and Antje S. Meyer (2016): The Timing of Utterance Planning in Task-Oriented Dialogue: Evidence from a Novel List-Completion Paradigm, December 2016, https://doi.org/10.3389/fpsyg.2016.01858 https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01858/full

[10] Helen Buckler, Huiwen Goy & Elizabeth K. Johnson (2018): What infant-directed speech tells us about the development of compensation for assimilation, Journal of Phonetics, Volume 66, January 2018, (45-62), https://doi.org/10.1016/j.wocn.2017.09.004

[11] Ling Zhong & Chang Liu (2018): Speech Perception for Native and Non-Native English Speakers: Effects of Contextual cues, The Journal of the Acoustical Society of America, Volume 143, 2018, https://doi.org/10.1121/1.5036397

[12] Soumaya Gharsellaoui, Sid Ahmed Selouani, Wladyslaw Cichocki, Yousef Alotaibi & Adel Omar Dahmane (2018): Application of the pairwise variability index of speech rhythm with particle swarm optimization to the classification of native and non-native accents, Journal of Computer Speech & Language, Volume 48, March 2018, (67-79), https://doi.org/10.1016/j.csl.2017.10.006

[13] Belhedi Wiem, Ben Messaoud, Mohamed anouar, Pejman Mowlaee and Bouzid Aicha (2018): Unsupervised single channel speech separation based on optimized subspace separation, Journal of Speech Communication, Volume 96, February 2018, (93-101), https://doi.org/10.1016/j.specom.2017.11.010


[14] Kun Li, Shaoguang Mao, Xu Li, Zhiyong Wu & Helen Meng (2018): Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks, Journal of Speech Communication, Volume 96, February 2018, (28-36), https://doi.org/10.1016/j.specom.2017.11.003

[15] R. E. Atta (1996): Arabic Speech to Text Translator. M. Sc. Thesis, Suez Canal University, Port Said, Egypt, pp. 162.

[16] Debbie Greenstreet & John Smrstik (2017): Voice as the user interface – a new era in speech processing, May 2017 (1–9), http://www.ti.com/lit/wp/slyy116/slyy116.pdf

[17] M. Hamed (1997): A quick neural network for computer vision of gray images. Circuits, Systems & Signal Processing Journal, USA, Vol. 16, No. 1. https://link.springer.com/content/pdf/10.1007/BF01183174.pdf

[18] Mohamed Hamed & Dalia Wafik (2018): A Multi-Speaker System for Arabic Speech Perception, Open Science Journal of Electrical and Electronic Engineering, Vol. 5, No. 2, 2018, pp. 11-17, Paper No. 7350160, http://www.openscienceonline.com/journal/archive2?journalId=735&paperId=4309

[19] KP Braho, JP Pike, LA Pike: US Patent 9,928,829, 2018 : Methods and systems for identifying errors in a speech recognition system.

[20] Tobias Hodgson, Farah Magrabi & Enrico Coiera: Evaluating the usability of speech recognition to create clinical documentation using a commercial electronic health record, International Journal of Medical Informatics, Volume 113, May 2018, Pages 38-42, https://doi.org/10.1016/j.ijmedinf.2018.02.011