Kannada Text to Speech Synthesis Systems: Emotion Analysis

By

D.J. RAVI

Research Scholar,

JSS Research Foundation,

S.J. College of Engg., Mysore-06

Outline

Introduction

Phonetic Nature of Kannada language

Prosodic Feature Values

Time Duration

Intensity

Pitch

Result Analysis

Conclusions

References

Introduction

Including emotional aspects in speech improves the naturalness of a speech synthesis system.

Emotions such as sadness, anger and happiness are manifested in speech through prosodic elements such as time duration, pitch and intensity.

The prosodic values corresponding to different emotions are analyzed at word level as well as phonemic level, using the speech analysis and manipulation tool PRAAT.

This paper presents the emotional analysis of the prosodic features such as time duration, pitch and intensity of Kannada speech.

Our analysis shows that, at word level, time duration varies across emotions as Anger < Neutral < Happiness < Sadness: it is least for anger and highest for sadness.

Intensity, in contrast, varies as Anger > Happiness > Neutral > Sadness: it is highest for anger and least for sadness.

At phonemic level, the time duration variation is larger for vowels than for consonants.

The pitch contour is almost flat for neutral speech and shows larger variation for the different emotions.

Phonetic Nature of Kannada language

Kannada is a Dravidian language and is phonetic in nature: its written form has a direct correspondence to the spoken form.

The phonemes are divided into two types: vowels (swaras) and consonants (vyanjanas).

The Kannada language has 13 vowels and 34 basic consonants.

Vowels (Swaras): independently existing letters.

Consonants (Vyanjanas): dependent on vowels to take an independent form.

Consonant (Vyanjana) + Vowel (matra) --> Letter (Akshara)

Kagunitha:

The combination of a consonant phoneme and a vowel phoneme produces a syllable.

Consonant phoneme + Vowel phoneme --> Syllable

Unicode

A universal character set.

Provides a unique number for each character in a language.

Supports all platforms and all languages.

Kannada Unicode
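As a small illustration of the consonant + vowel sign --> akshara rule in Unicode terms, the sketch below joins the Kannada letter KA (U+0C95) with a dependent vowel sign and lists the code points involved. The helper names `make_akshara` and `describe` are hypothetical and chosen only for this example; the slides do not specify any implementation.

```python
import unicodedata

# Code points from the Kannada Unicode block (U+0C80 to U+0CFF).
KA = "\u0c95"  # KANNADA LETTER KA, a consonant carrying the inherent vowel /a/
VOWEL_SIGNS = {
    "aa": "\u0cbe",  # KANNADA VOWEL SIGN AA
    "i": "\u0cbf",   # KANNADA VOWEL SIGN I
    "u": "\u0cc1",   # KANNADA VOWEL SIGN U
}


def make_akshara(consonant: str, vowel_sign: str = "") -> str:
    """Hypothetical helper: consonant + dependent vowel sign gives one akshara."""
    return consonant + vowel_sign


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


ki = make_akshara(KA, VOWEL_SIGNS["i"])
print(ki, describe(ki))

# A short kagunitha-style series for KA: ka, kaa, ki, ku.
series = [make_akshara(KA)] + [make_akshara(KA, s) for s in VOWEL_SIGNS.values()]
print(" ".join(series))
```

The akshara is stored as two code points even though it is displayed as a single written unit, which is why analysis at the syllable level has to group characters rather than treat each code point separately.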

Basic units to Word (Pada)

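Consonants (vl = voiceless, vd = voiced; Un = unaspirated, As = aspirated)

| Manner | Bilabial vl | Bilabial vd | Labio-dental vd | Dental vl | Dental vd | Retroflex vl | Retroflex vd | Palatal vl | Palatal vd | Velar vl | Velar vd | Glottal vl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plosives (Un) | p | b | | t | d | ṭ | ḍ | | | k | g | |
| Plosives (As) | ph | bh | | th | dh | ṭh | ḍh | | | kh | gh | |
| Affricates (Un) | | | | | | | | č | j | | | |
| Affricates (As) | | | | | | | | čh | jh | | | |
| Nasals | | m | | | n | | ṇ | | | | ṅ | |
| Fricatives | | | | s | | ṣ | | š | | | | h |
| Liquids: Laterals | | | | | l | | ḷ | | | | | |
| Liquids: Trill | | | | | r | | | | | | | |
| Semi vowels | | | v | | | | | | y | | | |

Table 1: Phonemes categorized according to the method of speech production and articulation.

The column-wise arrangement follows the place of articulation, whereas the row-wise arrangement follows the manner of articulation (method of speech production). The phonetic nature of the language and the systematic categorization of the alphabet set can be used effectively for analysis and modeling.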

Prosody

Prosody, as related to language, refers to aspects such as rhythm, melody and stress.

The corresponding features are quantity (duration), stress (intensity) and intonation (pitch).

Phonemes need to be categorized into groups based on position and context.

Each syllable is broken down into combinations of vowels and consonants.

The durational patterns of the resultant phonemes at word-initial, medial and final positions are analyzed.

| Initial | Medial | Final |
|---|---|---|
| 11 ms | 9 ms | 8 ms |

Prosodic Feature Values

The waveform, pitch contour, time duration and average intensity of the utterance /ba illi/ (come here), spoken in different emotions by the same person, are shown in Figure 1.

From the plot it can be seen that the prosodic features show distinct variation for the different emotions in comparison with neutral speech.

Time Duration

Figure 1 shows that, for the sentence /ba illi/ (come here), the time duration is least for anger and highest for sadness.

In comparison with neutral speech (606 ms), the duration increases for happiness (750 ms) and sadness (1106 ms), but reduces considerably for anger (447 ms).

Anger < Neutral < Happiness < Sadness.

The duration pattern varies from person to person, but the different emotions show general trends.

Table 2: Duration of words uttered by different speakers in different emotions (% of neutral-speech duration)

| Word | Emotion | Speaker 1 | Speaker 2 | Speaker 3 |
|---|---|---|---|---|
| /yelli/ (Where) | Anger | 92 | 78 | 95 |
| /yelli/ (Where) | Happiness | 122 | 126 | 110 |
| /yelli/ (Where) | Sadness | 138 | 141 | 121 |
| /appa/ (Father) | Anger | 83 | 72 | 79 |
| /appa/ (Father) | Happiness | 112 | 101 | 102 |
| /appa/ (Father) | Sadness | 132 | 144 | 129 |

Table 2 gives the duration of two words uttered by three speakers in different emotions, as a percentage of the neutral-speech duration. Neutral speech is taken as 100%, and the duration with each emotion is expressed relative to it (% duration = duration with emotion x 100 / neutral duration). Even though the percentages differ across the three speakers, the general trend is the same for each emotion.
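The percentage formula given with Table 2 can be sketched in a few lines. The example below applies it to the /ba illi/ durations quoted earlier (606 ms neutral, 447 ms anger, 750 ms happiness, 1106 ms sadness); the function name `percent_of_neutral` is an assumption made for illustration only.

```python
def percent_of_neutral(emotion_duration_ms: float, neutral_duration_ms: float) -> float:
    """% duration = duration with emotion x 100 / neutral duration."""
    if neutral_duration_ms <= 0:
        raise ValueError("neutral duration must be positive")
    return emotion_duration_ms * 100.0 / neutral_duration_ms


# Durations (ms) of /ba illi/ as reported for Figure 1.
NEUTRAL_MS = 606.0
EMOTION_MS = {"anger": 447.0, "happiness": 750.0, "sadness": 1106.0}

percentages = {
    emotion: round(percent_of_neutral(duration, NEUTRAL_MS), 1)
    for emotion, duration in EMOTION_MS.items()
}
print(percentages)
```

With neutral speech at 100%, anger falls below 100% while happiness and sadness rise above it, which is the same ordering that Table 2 shows for single words.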

Table 3: Duration of the individual words in a sentence for different emotions (% of neutral-speech duration)

Sentence: /ninna hesaru enu/ (What is your name)

| Emotion | ninna | hesaru | enu |
|---|---|---|---|
| Anger | 96.25 | 98 | 78 |
| Happiness | 121.56 | 112.56 | 105.8 |
| Sadness | 185.62 | 129.26 | 121.65 |

Table 3 gives the duration of the individual words in the sentence /ninna hesaru enu/ (What is your name) in different emotions, as a percentage of the neutral-speech duration. Here too, the different emotions show general trends.

Table 4: Duration of phonemes (ms) in the word /appa/ (father) for different emotions

| Emotion | a (ms) | pp (ms) | a (ms) | Total duration (ms) |
|---|---|---|---|---|
| Anger | 85 | 140 | 205 | 430 |
| Neutral | 132 | 163 | 221 | 516 |
| Happiness | 173 | 170 | 236 | 579 |
| Sadness | 233 | 196 | 256 | 685 |

Table 4 gives the duration values of the phonemes in the word /appa/ (vowels /a/ and consonant /p/). It can be seen that the phonemes also follow the general trend of duration variation for different emotions.
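One way to compare how much vowels and consonants vary across emotions is to look at the spread of their durations relative to the neutral value. The sketch below does this with the Table 4 figures; summing the two /a/ vowels and using (max - min) / neutral as the spread measure are assumptions made for this example, not a method stated in the slides.

```python
# Durations (ms) copied from Table 4: (initial /a/, /pp/, final /a/).
TABLE_4 = {
    "anger": (85, 140, 205),
    "neutral": (132, 163, 221),
    "happiness": (173, 170, 236),
    "sadness": (233, 196, 256),
}


def relative_spread(values_by_emotion: dict[str, float]) -> float:
    """Spread across emotions, (max - min), as a percentage of the neutral value."""
    values = values_by_emotion.values()
    return (max(values) - min(values)) * 100.0 / values_by_emotion["neutral"]


# Assumption: both /a/ vowels are summed into one vowel duration per emotion.
vowels = {emotion: a1 + a2 for emotion, (a1, _, a2) in TABLE_4.items()}
consonant = {emotion: pp for emotion, (_, pp, _) in TABLE_4.items()}
totals = {emotion: sum(durations) for emotion, durations in TABLE_4.items()}

print("total duration (ms):", totals)
print("vowel spread (% of neutral):", round(relative_spread(vowels), 1))
print("consonant spread (% of neutral):", round(relative_spread(consonant), 1))
```

Under these assumptions the vowel spread comes out larger than the consonant spread, in line with the observation that vowels carry more of the emotional duration variation. Taken individually, the final /a/ varies less than the initial /a/, so the result depends on how the vowels are grouped.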

Figure 2: Duration (ms) change of the word /appa/ (father) for different emotions

Figure 3: Duration (ms) change of the vowels /a/ and consonant /p/ in the word /appa/ (father) for four different emotions

Intensity

From Figure 1, it is seen that anger is articulated with maximum intensity, whereas sadness has minimum intensity, i.e.

Anger > Happiness > Neutral > Sadness.

Table 5 confirms that the average intensity is least for sadness and maximum for anger.

Table 5: Average intensity (dB) variation for different emotions (% in comparison with neutral speech)

| Sample | Emotion | Intensity (% of neutral) |
|---|---|---|
| /ba illi/ (come here) | Anger | 113.50 |
| /ba illi/ (come here) | Happiness | 110.90 |
| /ba illi/ (come here) | Sadness | 98.90 |
| /basava bandidana/ (has basava come) | Anger | 115.26 |
| /basava bandidana/ (has basava come) | Happiness | 100.32 |
| /basava bandidana/ (has basava come) | Sadness | 94.98 |

Pitch

Table 6: Average pitch (Hz) variation for different emotions (% in comparison with neutral speech)

| Sample | Emotion | Pitch (% of neutral) |
|---|---|---|
| /ba illi/ (come here) | Anger | 101.970 |
| /ba illi/ (come here) | Happiness | 100.384 |
| /ba illi/ (come here) | Sadness | 120.519 |
| /basava bandidana/ (has basava come) | Anger | 131.240 |
| /basava bandidana/ (has basava come) | Happiness | 140.320 |
| /basava bandidana/ (has basava come) | Sadness | 142.590 |

From Figure 4, Figure 5 and Figure 6, the pitch contour of neutral speech is almost flat and has the minimum value. The following three figures show the pitch contour of each emotional sentence together with that of the corresponding emotionless sentence.

Figure 4: Pitch contours of the sentence (Why did you do this), emotionless and with anger

Figure 5: Pitch contours of the sentence (What a beautiful flower), emotionless and with happiness

Figure 6: Pitch contours of the sentence (I am extremely unhappy), emotionless and with sadness

Result Analysis

Due to the phonetic categorization of the alphabet set, rules need to be framed only for each category of phonemes. The phonemes in each category share similar phonetic features. This reduces the complexity of prosodic modeling as well as of framing rules for synthesis.

Rules for prosodic modification can be framed for different phonemes from the phonemic-level analysis.

For instance, to simulate anger, duration has to be reduced while pitch and intensity are increased. Similarly, to simulate sadness, duration and pitch have to be increased while intensity is reduced.

From the manner of articulation of the different emotions, it can be seen that the rise time and fall time can capture more emotion information than any other prosodic parameter.
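The anger and sadness rules described above could be sketched as a table of scale factors applied to neutral prosody. The example below is only an illustration: the factors are the /ba illi/ percentages reported in this deck (duration from the 606 ms neutral utterance, pitch from Table 6, intensity from Table 5), while the `Prosody` class, the `apply_emotion` function and the neutral pitch and intensity values are hypothetical. Scaling a dB value by a percentage simply mirrors how Table 5 reports intensity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    duration_ms: float
    pitch_hz: float
    intensity_db: float


# (duration %, pitch %, intensity %) relative to neutral, for /ba illi/.
EMOTION_RULES = {
    "neutral": (100.0, 100.0, 100.0),
    "anger": (447 * 100 / 606, 101.970, 113.50),
    "happiness": (750 * 100 / 606, 100.384, 110.90),
    "sadness": (1106 * 100 / 606, 120.519, 98.90),
}


def apply_emotion(neutral: Prosody, emotion: str) -> Prosody:
    """Scale neutral prosody by the percentage rule for the given emotion."""
    duration_pct, pitch_pct, intensity_pct = EMOTION_RULES[emotion]
    return Prosody(
        duration_ms=neutral.duration_ms * duration_pct / 100.0,
        pitch_hz=neutral.pitch_hz * pitch_pct / 100.0,
        intensity_db=neutral.intensity_db * intensity_pct / 100.0,
    )


# Neutral duration is the reported 606 ms; pitch and intensity are hypothetical.
neutral = Prosody(duration_ms=606.0, pitch_hz=120.0, intensity_db=70.0)
angry = apply_emotion(neutral, "anger")
sad = apply_emotion(neutral, "sadness")
print(angry)
print(sad)
```

A practical rule set would likely hold separate factors per phoneme category, as the phonemic-level analysis suggests, rather than one factor per utterance.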

Conclusions

For angry speech, duration is lowest and intensity is highest, whereas for sad speech, duration is highest and intensity is lowest.

The duration of different emotions, as a percentage of neutral speech, calculated for different words spoken by different speakers, shows that word duration is highest for sadness, followed by happiness and neutral, and smallest for anger. The pitch contour is almost flat for neutral speech. The average pitch value of emotional speech is higher than that of neutral speech. The intensity level of a word is lowest for sadness and highest for anger. The phoneme-level analysis of duration shows that vowels capture the emotional variation more than consonants.

These findings can be used effectively for framing rules for emotional speech synthesis. Incorporating these durational effects in a speech synthesis system will produce better speech than a system that does not use this knowledge.

References

I.R. Murray, M.D. Edgington, D. Campion, et al., "Rule-Based Emotion Synthesis Using Concatenated Speech," Proc. of ISCA Workshop on Speech and Emotion, Belfast, Northern Ireland, pp. 173-177, 2000.

X.J. Ma, W. Zhang, W.B. Zhu, et al., "Probability based Prosody Model for Unit Selection," Proc. of ICASSP'04, Montreal, Canada, pp. 649-652, May 2004.

Pascal van Lieshout, Ph.D., "PRAAT", Oral Dynamics Lab V. 4.2.1, October 7, 2003.

D.J. Ravi and Sudarshan Patilkulkarni, "Kannada Text-To-Speech Systems: Duration Analysis," Proc. of ISCO 2009, Coimbatore, pp. 53.

D.J. Ravi and Sudarshan Patilkulkarni, "Speaker Dependent Duration Analysis of Vowels and Consonants for Kannada Text-To-Speech Systems," Proc. of NICE 2009, Bangalore, pp. 95-99.

D.J. Ravi and Sudarshan Patilkulkarni, "Time Duration Variation Analysis of Vowels and Consonants for Kannada Text to Speech Systems," Journal of Advance Research in Computer Engineering: An International Journal, July to December 2009.

Deepa P. Gopinath, Sheeba P.S. and Achuthsankar S. Nair, "Emotional Analysis for Malayalam Text to Speech Synthesis Systems," SETIT 2007.
