
The Contribution of Prosody to the Identification of Persian Regional Accents

Ali Gholipour, Mohammad H. Sedaaghi and Mousa Shamsi
Sahand University of Technology
Sahand, Tabriz, Iran
{a_gholipour, sedaaghi, shamsi}@sut.ac.ir

Abstract—This paper focuses on the contribution of prosody to the identification of some Persian regional accents. To this end, we measure prosodic features including rhythm-related features and global statistics of the pitch contour, the energy contour and their derivatives. Sequential forward feature selection is then employed to identify the most discriminative attributes for classifying speakers according to their accents. The major identified accent-specific features are related to the derivative of the pitch contour. For this purpose, we have recorded a corpus containing speech in different Persian accents, including Turkish, Kurdish, Tehrani, Isfahani and Mazandarani. To assess the human ability to detect accents, a perceptual experiment has also been arranged. The automatic accent classification rate obtained with the identified best features and an SVM classifier is reported and compared to human performance.

Keywords- accent classification; Persian accents; perceptual experiment; prosody; feature selection;

I. INTRODUCTION

Accent classification is an emerging topic of interest in speech recognition, since accent is one of the most important factors, next to gender, that influence automatic speech recognition (ASR) performance. ASR performance degrades when the accent of a speech sample differs from that of the training samples, so correct identification of a speaker's accent can improve the performance of ASR systems.

In linguistics, an accent is a specific pattern of pronunciation of people who belong to a particular nation or geographical location, while a dialect refers to distinctive vocabulary, pronunciation and grammar; often an accent is a subset of a dialect [1]. The differences between accents can be attributed to five types of characteristics:

• Differences in the number or identity of the phonemes in the accents' phoneme inventories.

• Differences in the lexical representation of particular words.

• Differences in phonotactic distribution: a specific phoneme may be realized by different phones depending on the phonetic context.

• Differences in the prosody of the accents.

• Differences in the phonetic realizations of the accents, due to differences in the configuration, positioning, tension and movement of the sound-producing organs.

Prosody is the pattern of rhythm, stress and intonation of speech. In this paper, differences in the prosody of accents are used for the purpose of accent identification.

There are few studies on accent identification in the literature, especially for non-English languages. Most of them rely on acoustic features such as Mel-frequency cepstral coefficients (MFCCs), energy and their first and second derivatives, combined with pattern classifiers such as HMMs, SVMs and neural networks [2-4].

Analysis of formant frequencies, mainly the first and second formants of the most common standard vowel sounds of a language, has also been used for accent classification [5, 6].

Other approaches rely on statistical features such as the durations of vowels and consonants and the variation of the speaking rate (phonemes per second) [1].

Dimensions such as rhythm and intonation also contribute to revealing a speaker's accent. Following this approach, [7] verified that the initial rise or the final fall/rise of intonation contours is an indicator of accent differences. Reference [8] considers the proportion of vocalic intervals and the duration variability of consonantal intervals to account for speech rhythm variation.

At the suprasegmental level, researchers have shown that the prosody of the mother tongue tends to persist in non-native speakers [8], so prosodic features play an important role in identifying a speaker's accent. The parameters used to model the prosody of an accent are the pitch contour, the speaking rate and the intensity.

In this paper, we concentrate on the prosody of some Persian regional accents, including Turkish, Kurdish, Tehrani, Isfahani and Mazandarani, which have the most speakers in Iran. To this end, we have recorded a corpus of read speech in these accents, comprising 960 utterances from 40 different speakers. From this corpus we extract statistical features of the pitch contour, the intensity contour and the voiced regions.

The pattern of speech intonation is modeled by the pitch contour. To investigate intonation, we first estimate the pitch contour for the voiced regions of the speech and then extract statistics of this contour and its derivative.


We do the same for the energy contour to describe the intensity of the speech. In addition, the rhythm pattern of the speech is described by estimating the speaking rate as the inverse of the average length of the voiced parts and by computing the voicing rate.

After feature extraction, the sequential forward selection (SFS) algorithm is employed to identify which features are most relevant for automatic accent identification. This stage also removes unsuitable attributes to improve the performance of the learning algorithm. Classification experiments are carried out using three techniques: support vector machines (SVM), probabilistic neural networks (PNN) and k-nearest neighbors (KNN). Finally, the classification results achieved with the best feature set are reported and compared to human perception.

The paper is organized as follows. Section II describes the corpus used in this work and presents the perceptual test and its results. Section III describes the structure of the proposed algorithm, together with the feature selection technique and the classifiers. Section IV presents the experimental results. Finally, Section V concludes the paper.

II. ACCENTED SPEECH DATABASE

For this study, a speech corpus of more than 150 minutes of read speech in different accents was collected. It was recorded in an anechoic room at the “Artificial Intelligence and Information Analysis” lab, Department of Electrical Engineering, Sahand University of Technology, Tabriz, Iran, and is called the Sahand Accented Speech (SAS) database. All speech files were sampled at 8 kHz.

SAS includes five Persian accents: Turkish, Kurdish, Tehrani, Isfahani and Mazandarani, abbreviated TRK, KRD, TEH, ISF and MZN, respectively.

The collection includes 40 speakers: eight speakers (four male and four female) were recorded for each accent. All speakers were first-year students, aged 18 to 28 years. Each speaker produced 24 varied utterances, yielding a total of 960 utterances for the SAS database. The whole set of read speech recordings was checked for reading errors.

A. Perceptual Experiment

A perceptual experiment was conducted to determine the human ability to identify accents. It is still not entirely clear how humans can so quickly identify a speaker's native language [8]. Investigation of the impact of the listener's accent background confirms that this background improves perception and comprehensibility [9].

The perceptual experiment was carried out with 18 listeners using the SAS database. The listeners had no familiarity with the speakers.

The experiment involved a forced choice among five possibilities: TRK, KRD, TEH, ISF and MZN. The 960 speech samples were played in random order to each listener, who was asked to identify each speaker's mother tongue. After a sample was played, it was removed from the collection, so the listener had only one chance to hear each sample.

The results of this experiment are shown in Tables I and II. Table I shows the average sentence-dependent accent identification rates. As shown in Table I, the third and fourth sentences have the least discriminative ability; this is because they are the shortest sentences in the database.

TABLE I. PERCEPTUAL IDENTIFICATION RATE FOR EACH SENTENCE (%)

Sent.  Accuracy    Sent.  Accuracy    Sent.  Accuracy
1      70.2        9      67.7        17     76.9
2      67.5        10     83.9        18     86.6
3      42.7        11     74.4        19     76.9
4      27.5        12     84.7        20     84.1
5      68.8        13     83.8        21     80.2
6      70.8        14     88.0        22     84.4
7      71.3        15     78.0        23     83.6
8      78.6        16     76.3        24     81.9

Table II shows the results of the perceptual identification test for each accent. These results show that listeners identify the Turkish accent well and the Kurdish accent poorly. Furthermore, the confusion between the Kurdish and Mazandarani accents is greater than between the other pairs.

TABLE II. CONFUSION MATRIX OF THE PERCEPTUAL IDENTIFICATION TEST (%). ROWS CORRESPOND TO THE REFERENCE WHILE COLUMNS GIVE THE SUBJECTS' ANSWERS (WITH MAJORITY ANSWERS IN BOLDFACE).

        TRK    MZN    KRD    TEH    ESF
TRK     88.2   2.9    2.6    4.5    1.6
MZN     2.3    75.6   15.0   2.9    3.9
KRD     8.6    19.2   68.3   2.6    1.1
TEH     8.4    4.3    2.9    79.8   4.3
ESF     6.1    6.8    3.6    11.5   71.7

Average correct identification: 76.7

III. ALGORITHM STRUCTURE

The structure of the proposed algorithm is shown in Fig. 1. It operates in four main stages:

• Pre-processing: segmentation of the speech into overlapping windowed frames and silence/unvoiced/voiced (SUV) detection by thresholding the zero-crossing rate and energy.

• Feature extraction: estimation of the pitch and energy contours; extraction of statistics of the smoothed contours, their derivatives and the voiced/unvoiced regions.

• Feature selection: application of the sequential forward feature selection algorithm.

• Classification: classification using support vector machines, probabilistic neural networks and k-nearest neighbors.


Figure 1. The algorithm structure: the input speech is SUVed; the pitch and energy contours are estimated; statistics on the contours and their derivatives, together with rhythm-related features, are passed to sequential forward feature selection and then to classification using PNN, KNN and SVM.

A. Pre-processing

A classification of speech into silence, unvoiced and voiced regions is necessary for the subsequent processing, for example pitch contour and rhythm pattern estimation. First, the speech is segmented into windowed frames with a duration of 25 ms and an overlap of 15 ms between successive frames. Then the zero-crossing rate and the energy of each frame are calculated. SUV detection is done using the following rules: the zero-crossing rate is high and the energy is low for an unvoiced region; the reverse is true for voiced speech; and both are approximately zero for silence [10].
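As a rough illustration of this stage, the sketch below frames a signal and applies simple energy/zero-crossing thresholds. The threshold values, the 10 ms hop implied by the 15 ms overlap, and the exact decision rules are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def suv_labels(x, fs=8000, frame_ms=25, hop_ms=10,
               zcr_thresh=0.15, energy_thresh=1e-3):
    """Label each 25 ms frame as 'S' (silence), 'U' (unvoiced) or 'V' (voiced).

    Thresholds are illustrative; in practice they are tuned on the corpus.
    """
    frame_len = int(fs * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)           # 15 ms overlap -> 10 ms hop
    labels = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        if energy < energy_thresh and zcr < zcr_thresh:
            labels.append('S')   # both approximately zero -> silence
        elif zcr >= zcr_thresh:
            labels.append('U')   # high ZCR, low energy -> unvoiced
        else:
            labels.append('V')   # high energy, low ZCR -> voiced
    return labels
```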

B. Feature Extraction

Some of the distinctive characteristics of an accent are carried by its prosody, i.e., the pattern of rhythm, stress and intonation of speech. The prosody of speech can be characterized by the pitch contour, the speaking rate and the intensity.

In this paper, the pitch contour of the voiced segments is obtained with the autocorrelation function. It is then smoothed with a cubic spline to obtain a continuous form, which also enables us to measure the derivative of the contour.
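A minimal sketch of this step is given below, assuming frame-wise F0 estimation from the autocorrelation peak and SciPy's CubicSpline for smoothing; the F0 search range (60-400 Hz) and the 10 ms frame hop are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def frame_pitch_autocorr(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Estimate F0 of one voiced frame from the peak of its autocorrelation."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # admissible lag range
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

def smoothed_pitch_contour(voiced_frames, fs=8000, hop_ms=10):
    """Return the spline-smoothed pitch contour and its derivative."""
    t = np.arange(len(voiced_frames)) * hop_ms / 1000.0   # frame times (s)
    f0 = np.array([frame_pitch_autocorr(f, fs) for f in voiced_frames])
    spline = CubicSpline(t, f0)
    return spline, spline.derivative()
```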

Speaking rate is one of the important rhythm-related features; we estimate it as the inverse of the average length of the voiced parts. The other rhythm-related feature is the voicing rate, defined as the number of voiced regions divided by the total number of voiced and unvoiced regions.
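These two rhythm-related features could be computed from the frame-level S/U/V labels roughly as follows; treating a "region" as a maximal run of identical labels and assuming a 10 ms hop are interpretations on our part, not details given in the paper.

```python
import numpy as np

def rhythm_features(labels, hop_ms=10):
    """Speaking rate and voicing rate from a frame-level S/U/V label sequence."""
    runs = []                                  # maximal runs: [label, n_frames]
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    voiced = [n for lab, n in runs if lab == 'V']
    unvoiced = [n for lab, n in runs if lab == 'U']
    avg_voiced_sec = np.mean(voiced) * hop_ms / 1000.0
    speaking_rate = 1.0 / avg_voiced_sec       # inverse mean voiced-region length
    voicing_rate = len(voiced) / (len(voiced) + len(unvoiced))
    return speaking_rate, voicing_rate
```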

Finally, we measure a total of fifty statistical features in this work, grouped under the headings below (a computational sketch of the contour statistics follows the list):

• Statistics on the smoothed pitch contour: min, max, mean, median, standard deviation, range, and the rate of the initial rise and of the final fall.

• Statistics on the derivative of the pitch contour: min, max, mean, median and standard deviation of the derivative curve; durations of the positive and negative regions of the derivative curve; zero-crossing rate; and the durations of the positive and negative regions divided by the total duration of the positive and negative regions.

• Statistics on the smoothed energy contour: min, max, mean, median and standard deviation.

• Statistics on the derivative of the energy contour: min, max, mean, median and standard deviation of the derivative curve; durations of the positive and negative regions of the derivative curve; and the durations of the positive and negative regions divided by the total duration of the positive and negative regions.

• Statistics related to rhythm: speaking rate and voicing rate.
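As a rough sketch of how such statistics could be gathered from a sampled contour and its derivative, the helper below mirrors the groups listed above; it is hypothetical, does not reproduce all fifty features, and assumes a 10 ms frame hop.

```python
import numpy as np

def contour_statistics(c, dc, hop_ms=10, prefix='pitch'):
    """Global statistics of a smoothed contour c and its derivative dc."""
    feats = {}
    for name, v in ((prefix, c), (prefix + '_deriv', dc)):
        feats.update({name + '_min': np.min(v), name + '_max': np.max(v),
                      name + '_mean': np.mean(v), name + '_median': np.median(v),
                      name + '_std': np.std(v)})
    feats[prefix + '_range'] = np.max(c) - np.min(c)
    # durations (s) of positive and negative regions of the derivative
    pos = dc > 0
    pos_dur = np.sum(pos) * hop_ms / 1000.0
    neg_dur = np.sum(~pos) * hop_ms / 1000.0
    feats[prefix + '_deriv_pos_dur'] = pos_dur
    feats[prefix + '_deriv_neg_dur'] = neg_dur
    feats[prefix + '_deriv_pos_ratio'] = pos_dur / (pos_dur + neg_dur)
    feats[prefix + '_deriv_neg_ratio'] = neg_dur / (pos_dur + neg_dur)
    # zero-crossing rate of the derivative
    feats[prefix + '_deriv_zcr'] = np.mean(np.abs(np.diff(np.sign(dc))) > 0)
    return feats
```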

C. Feature Selection

Feature selection is the technique of selecting the most relevant features to improve the performance of learning algorithms. It is often used in high-dimensional domains to eliminate unsuitable features.

In this paper, we carry out our experiments with the sequential forward selection algorithm. In SFS, the selected feature subset is initialized as the empty set and, at each step, the best single feature is selected and added to this subset [11].

In this work, the criterion for selecting the best feature is the correct classification rate achieved by the SVM classifier.
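A minimal sketch of SFS with a classifier-accuracy criterion is shown below, assuming scikit-learn's SVC and cross-validated accuracy as the score; the greedy loop, the number of selected features and the fold count are generic illustrations, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, n_select=10, cv=5):
    """Greedily add the feature that most improves SVM accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        best_feat, best_score = None, -np.inf
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(SVC(kernel='rbf'), X[:, cols], y,
                                    cv=cv, scoring='accuracy').mean()
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```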

D. Classification

We have used several standard pattern recognition techniques for accent classification and compared their performance. The three methods are support vector machines (SVM), probabilistic neural networks (PNN) and k-nearest neighbors (KNN). Each is briefly discussed below.

The SVM is a non-probabilistic binary classifier trained with supervised learning. It constructs an optimal separating hyperplane, or a set of hyperplanes, in a high-dimensional space, which can then be used for classification. The SVM is inherently a two-class separator.


Multi-class recognition can be achieved by combining binary SVMs; a common method for this purpose is the one-against-all scheme. In this paper, for classification by SVM, we use the LIBSVM [12] software in its multi-class configuration with a Gaussian nonlinear kernel.
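As a minimal sketch of multi-class SVM classification with a Gaussian (RBF) kernel, the snippet below uses scikit-learn's SVC (which wraps LIBSVM) inside a one-vs-rest wrapper; the dummy data and the C and gamma values are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Dummy stand-ins for the selected prosodic features and the accent labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 10)), rng.integers(0, 5, size=800)
X_test, y_test = rng.normal(size=(160, 10)), rng.integers(0, 5, size=160)

clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma='scale'))
clf.fit(X_train, y_train)
print('accuracy:', clf.score(X_test, y_test))
```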

The PNN is a kind of supervised RBF neural network based on an exponential probability density function. It acts according to the Bayes decision rule. This network can be used in online applications because of its parallel computation, simple training and high speed.
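PNNs are not part of the common Python ML toolkits, so the following is one plausible minimal realization: a Parzen-window (Gaussian kernel) density per class with a maximum-score decision. The smoothing parameter sigma and the implicit uniform class priors are assumptions, not the authors' settings.

```python
import numpy as np

class SimplePNN:
    """Minimal probabilistic neural network: Gaussian kernels centred on the
    training samples, averaged per class, with a Bayes (max score) decision."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        self.classes_ = np.unique(self.y_)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            d2 = np.sum((self.X_ - x) ** 2, axis=1)        # squared distances
            k = np.exp(-d2 / (2.0 * self.sigma ** 2))      # kernel activations
            scores = [k[self.y_ == c].mean() for c in self.classes_]
            preds.append(self.classes_[int(np.argmax(scores))])
        return np.array(preds)
```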

The KNN algorithm classifies an unlabeled object by a majority vote over the labels of its closest training samples in the feature space.
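For completeness, a KNN classifier can be used directly from scikit-learn; the value k = 5 and the dummy data below are purely illustrative, since the paper does not report the chosen k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 10)), rng.integers(0, 5, size=800)
X_test = rng.normal(size=(160, 10))

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)   # majority vote over the 5 nearest neighbours
```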

IV. EXPERIMENTAL RESULTS

This section investigates whether the proposed algorithm described in Section III is effective for automatic accent identification. The algorithm extracts prosodic features related to each accent.

For evaluation, our data collection (SAS) is employed. SAS consists of 960 utterances expressed by 20 male and 20 female speakers in five regional Persian accents.

Two experimental configurations were defined. In both configurations, a leave-one-out paradigm is employed to select the training and test data: one speaker at a time is held out for testing and the rest are used for training. Since the held-out speaker is chosen at random, the experiments are repeated several times and the average of the results is reported. We discuss these configurations in detail below.
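A leave-one-speaker-out evaluation of this kind can be sketched with scikit-learn's LeaveOneGroupOut, treating the speaker identity as the group; the random feature matrix and labels below are placeholders for the real SAS features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

# Placeholder data: 960 utterances, 10 features, 40 speakers, 5 accents.
rng = np.random.default_rng(0)
X = rng.normal(size=(960, 10))
speakers = np.repeat(np.arange(40), 24)   # 24 utterances per speaker
accents = np.repeat(np.arange(5), 192)    # 8 speakers x 24 utterances per accent

scores = cross_val_score(SVC(kernel='rbf'), X, accents,
                         groups=speakers, cv=LeaveOneGroupOut(),
                         scoring='accuracy')
print('mean accuracy over held-out speakers:', scores.mean())
```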

A. Base Performance

The analysis stages of the proposed algorithm result in a feature set of 50 attributes, which can be summarized as follows: pitch features (26), intensity features (22) and rhythm-related features (2).

In the first experiment, we examine the contribution of the whole feature set to accent classification using the three classifiers. The results for male subjects, female subjects and the whole database are shown in Table III.

TABLE III. CORRECT ACCENT CLASSIFICATION USING THE FULL FEATURE SET (%)

       Whole SAS    Male subjects    Female subjects
SVM    48.0         42.8             55.8
KNN    46.4         48.6             49.6
PNN    41.6         46.4             46.4

As can be seen, the highest classification rate on the whole SAS is 48%, obtained with the SVM classifier. In the next subsection, we employ a feature selection algorithm to improve these results.

B. Accent Classification Based on Feature Selection

In this subsection, a feature selection algorithm is employed to identify the most discriminative features for accent classification. The best features are selected with the sequential forward selection algorithm, using the classification performance of the SVM classifier as the criterion, since it outperforms the other two classifiers on the whole SAS according to Table III.

The correct accent classification rate on the whole SAS achieved by the N best features, with progressively increasing N, is shown in Fig. 2.

Figure 2. Performance (%) of accent classification using SVM as a function of the number of attributes, with a reference level for the human classification performance. Dotted lines indicate 10 and 15 attributes.

Based on this graph, the classification rate is approximately stable at about 70% between N = 10 and N = 15. The N best features for N = 10 are: the maximum duration of the positive regions of the pitch derivative; the percentage of the duration of the positive regions of the energy derivative; the minimum duration of the positive regions of the pitch derivative; the median of the pitch derivative; the minimum duration of the negative regions of the pitch derivative; the median of the energy values; the mean of the pitch derivative; the speaking rate; the median duration of the positive regions of the pitch derivative; and the mean of the pitch values.

These results confirm that features related to the pitch derivative play a significant role in describing an accent. The classification results obtained with the 10 best features are shown in Table IV.

TABLE IV. CORRECT ACCENT CLASSIFICATION USING THE 10-BEST FEATURE SET (%)

       Whole SAS    Male subjects    Female subjects
SVM    70.6         68.0             74.0
KNN    67.3         60.0             82.0
PNN    68.3         59.0             78.7

As shown in Table IV, the restricted set of the 10 best features yields classification results higher than those obtained with the full feature set. The highest average classification rate is 70.6%. Although this result is far from ideal, it is relatively close to the average perceptual result (76.7% in Table II).

Table V shows the confusion matrix obtained by the SVM using the 10 best features on the whole SAS database.



TABLE V. CONFUSION MATRIX OBTAINED FOR THE WHOLE SAS USING THE 10-BEST FEATURES (%). ROWS CORRESPOND TO THE REFERENCE WHILE COLUMNS GIVE THE AUTOMATIC CLASSIFICATION USING SVM.

        TRK    MZN    TEH    KRD    ESF
TRK     53.3   0      26.6   0      20.0
MZN     0      90.0   0      15.0   0
TEH     0      0      63.3   23.3   13.3
KRD     3.3    6.7    10.0   80.0   0
ESF     6.7    0      13.3   13.3   66.7

This table reveals that MZN and KRD speakers are better identified by automatic accent classification than by human subjects’ perception.

V. CONCLUSION

This paper described a study of some Persian accents, including the Turkish, Kurdish, Tehrani, Isfahani and Mazandarani accents. A specially designed corpus of read speech from 40 speakers, called the Sahand Accented Speech database, was recorded.

The study investigated Persian accents from the perception, speech processing and data mining perspectives. Perceptual experiments were arranged to assess the human ability to detect accents. In the speech processing phase, prosodic features including rhythm-related features and statistics on the pitch contour, the energy contour and their derivatives were measured. Sequential forward feature selection was then used to identify the most discriminative attributes for classifying speakers according to their accents.

The results confirmed that the major identified accent-specific features are related to the derivative of the pitch contour. With the SVM classifier, the restricted set of the 10 best features yielded a classification rate higher than that obtained with the full feature set, and this rate is relatively close to the human perceptual result.

REFERENCES

[1] Q. Yan and S. Vaseghi, “Modeling and synthesis of English regional accents with pitch and duration correlates”, Computer Speech and Language, Vol. 24, No. 4, pp. 711-725, 2010.

[2] L. M. Arslan and J. H. L. Hansen, “Language accent classification in American English”, Speech Communication, Vol. 18, No. 4, pp. 353-367, 1996.

[3] C. Pedersen and J. Diederich, “Accent Classification Using Support Vector Machines”, 6th IEEE Conf. on Computer and Information Science, pp. 444-449, 2007.

[4] A. Rabiee and S. Setayeshi, “Persian Accents Identification Using an Adaptive Neural Network”, 2nd Int. Conf. on Education Technology and Computer Science, pp. 7-11, 2010.

[5] Q. Yan, S. Vaseghi, D. Rentzos and C.-H. Ho, “Analysis of Synthesis of Acoustic Correlates of British, Australian and American Accents”, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2004.

[6] S. Deshpande, S. Chikkerur and V. Govindaraju, “Accent Classification in Speech”, 4th IEEE Workshop on Automatic Identification Advanced Technologies, 2005.

[7] P. Boula de Mareüil and B. Vieru-Dimulescu, “The Perception of Foreign Accent”, Phonetica, Vol. 63, No. 4, pp. 247-267, 2006.

[8] B. Vieru, P. Boula de Mareüil and M. Adda-Decker, “Characterization and Identification of Non-native French Accents”, Speech Communication, Vol. 53, No. 3, pp. 292-310, 2011.

[9] A. Ikeno and J. H. L. Hansen, “The Effect of Listener Accent Background on Accent Perception and Comprehension”, EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2007, No. 3, 2007.

[10] M. Greenwood and A. Kinghorn, “SUVing: Automatic Silence/Unvoiced/Voiced Classification of Speech”, The University of Sheffield.

[11] L. Ladha and T. Deepa, “Feature Selection Methods and Algorithms”, International Journal on Computer Science and Engineering, Vol. 3, No. 5, pp. 1787-1797, 2011.

[12] C. C. Chang and C. J. Lin, LIBSVM Version 3.1, 2011, <http://www.csie.ntu.edu.tw/cjlin/libsvm>
