Kannada Text to Speech Synthesis Systems: Emotion Analysis By D.J. RAVI Research Scholar, JSS Research Foundation, S.J College of Engg,

Upload: iris-melton

Post on 12-Jan-2016


Page 1: Kannada Text to Speech Synthesis Systems: Emotion Analysis By D.J. RAVI Research Scholar, JSS Research Foundation, S.J College of Engg, Mysore-06

Kannada Text to Speech Synthesis Systems: Emotion Analysis

By

D.J. RAVI

Research Scholar,

JSS Research Foundation,

S.J College of Engg, Mysore-06


Outline

Introduction

Phonetic Nature of Kannada language

Prosodic Feature Values

Time Duration

Intensity

Pitch

Result Analysis

Conclusions

References


Introduction

Including emotional aspects in speech improves the naturalness of a speech synthesis system.

Emotions such as sadness, anger and happiness are manifested in speech through prosodic elements such as time duration, pitch and intensity.

The prosodic values corresponding to different emotions are analyzed at the word level as well as the phonemic level, using the speech analysis and manipulation tool PRAAT.

This paper presents the emotional analysis of the prosodic features such as time duration, pitch and intensity of Kannada speech.


Our analysis shows that time duration at word level varies across emotions as: Anger < Neutral < Happiness < Sadness. Time duration is least for anger and highest for sadness.

Intensity, in contrast, varies as: Anger > Happiness > Neutral > Sadness. Intensity is highest for anger and least for sadness.

Also, the time duration variation at phonemic level is larger for vowels than for consonants.

The pitch contour is almost flat for neutral speech, so emotional speech shows larger pitch variation by comparison.


Phonetic Nature of Kannada language

Kannada is a Dravidian language and is phonetic in nature: its written form corresponds directly to its spoken form.

The phonemes are divided into two types: vowels (swaras) and consonants (vyanjanas).

The Kannada language has 13 vowels and 34 basic consonants.


Vowels (Swaras): independently existing letters.

Consonants (Vyanjanas): dependent on vowels to take an independent form.


Kagunitha:

Consonant (Vyanjana) + Vowel (matra) --> Letter (Akshara)

The combination of a consonant phoneme and a vowel phoneme produces a syllable.

Consonant phoneme + Vowel phoneme => Syllable
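In Unicode text, the consonant-plus-matra composition can be sketched as simple string concatenation. This is a minimal Python illustration, assuming the consonant is stored as its base letter and the vowel as a dependent vowel sign; the function name `make_akshara` is hypothetical and does not come from the slides.

```python
import unicodedata

KA = "\u0C95"        # KANNADA LETTER KA (consonant letter)
SIGN_AA = "\u0CBE"   # KANNADA VOWEL SIGN AA (dependent vowel sign, matra)
SIGN_I = "\u0CBF"    # KANNADA VOWEL SIGN I (dependent vowel sign, matra)


def make_akshara(consonant: str, matra: str) -> str:
    """Join a consonant letter and a vowel sign into one written syllable."""
    return consonant + matra


for matra in (SIGN_AA, SIGN_I):
    akshara = make_akshara(KA, matra)
    # Show the rendered syllable and the Unicode names of its two code points.
    print(akshara, [unicodedata.name(ch) for ch in akshara])
```

The rendered syllable is a single visual unit, but it is still stored as two code points, which is convenient when a synthesizer needs to recover the consonant and vowel phonemes separately.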


Unicode

A universal character set

Provides a unique number for each character in a language

Supports all platforms and all languages
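The "unique number for each character" idea can be illustrated with a short Python sketch that lists the code points of a Kannada word. It assumes the Unicode Kannada block range U+0C80 to U+0CFF; the helper names `code_points` and `is_kannada` are hypothetical.

```python
KANNADA_BLOCK = range(0x0C80, 0x0D00)  # Unicode Kannada block, U+0C80..U+0CFF


def code_points(text: str) -> list:
    """Return the code point of every character as a 'U+XXXX' string."""
    return [f"U+{ord(ch):04X}" for ch in text]


def is_kannada(ch: str) -> bool:
    """True if the character lies inside the Kannada block."""
    return ord(ch) in KANNADA_BLOCK


# The word "appa" (father): letter A, letter PA, virama, letter PA.
word = "\u0C85\u0CAA\u0CCD\u0CAA"
print(word, code_points(word))
print(all(is_kannada(ch) for ch in word))
```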


Kannada Unicode


Basic units to Word (Pada)


Consonants             Bilabial  Labio-dental  Dental  Retroflex  Palatal  Velar  Glottal

Plosives (Un)          p b       -             t d     ṭ ḍ        -        k g    -

Plosives (As)          ph bh     -             th dh   ṭh ḍh      -        kh gh  -

Affricates (Un)        -         -             -       -          č j      -      -

Affricates (As)        -         -             -       -          čh jh    -      -

Nasals                 m         -             n       ṇ          -        ṅ      -

Fricatives             -         -             s       ṣ          š        -      h

Liquids: laterals      -         -             l       ḷ          -        -      -

Liquids: trill         -         -             r       -          -        -      -

Semivowels             -         v             -       -          y        -      -

Un = unaspirated, As = aspirated. Where a cell holds two phonemes, the first is voiceless (vl) and the second is voiced (vd). Single entries: nasals, liquids and semivowels are voiced; fricatives are voiceless.

Table 1: The phonemes categorized according to the method of speech production and articulation. The column-wise arrangement follows the place of articulation, whereas the row-wise arrangement follows the manner of speech production. The phonetic nature of the language and the systematic categorization of the alphabet set can be used effectively for analysis and modeling.


Prosody

Prosody, as related to language, refers to aspects such as rhythm, melody and stress.

The corresponding features are quantity (duration), stress (intensity) and intonation (pitch).

Phonemes need to be categorized into groups based on position and context.

Each syllable is broken down into combinations of vowels and consonants.

The durational patterns of the resultant phonemes at word-initial, medial and final positions are analyzed.


Initial Medial Final

11 ms 9 ms 8 ms


Prosodic Feature Values

The waveform, pitch contour, time duration and average intensity of the utterance /ba illi/ (come here), spoken in different emotions by the same person, are shown in Figure 1.

From the plot it can be seen that the prosodic features show distinct variation for different emotions in comparison with neutral speech.


Time Duration

Figure 1 shows that, for the utterance /ba illi/ (come here), the time duration is least for anger and highest for sadness.

In comparison with neutral speech (606 ms), the duration increases for happiness (750 ms) and sadness (1106 ms), but reduces considerably for anger (447 ms).

Anger < Neutral < Happiness < Sadness.

The duration pattern varies from person to person, but each emotion shows a consistent general trend.


Word              Emotion    Speaker 1  Speaker 2  Speaker 3

/yelli/ (where)   Anger      92         78         95

                  Happiness  122        126        110

                  Sadness    138        141        121

/appa/ (father)   Anger      83         72         79

                  Happiness  112        101        102

                  Sadness    132        144        129

Table 2: Duration of words uttered by different speakers in different emotions, as a percentage of the neutral-speech duration (neutral = 100%)

Table 2 gives the duration of two words uttered by three speakers in different emotions, expressed as a percentage of the neutral-speech duration. Neutral speech is taken as 100%, and the duration with each emotion is given relative to it (% duration = duration with emotion × 100 / neutral duration). Although the percentages differ among the three speakers, the general trend is the same for each emotion.
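The percentage formula can be checked with a short Python sketch. It applies the formula to the /ba illi/ durations quoted on the Time Duration slide (neutral 606 ms, anger 447 ms, happiness 750 ms, sadness 1106 ms); the function and variable names are hypothetical.

```python
def percent_of_neutral(emotion_ms: float, neutral_ms: float) -> float:
    """% duration = duration with emotion x 100 / neutral duration."""
    return emotion_ms * 100.0 / neutral_ms


NEUTRAL_MS = 606.0
durations_ms = {"anger": 447.0, "happiness": 750.0, "sadness": 1106.0}

# Express each emotional duration relative to neutral speech (neutral = 100%).
percentages = {
    emotion: percent_of_neutral(ms, NEUTRAL_MS)
    for emotion, ms in durations_ms.items()
}

for emotion, pct in percentages.items():
    print(f"{emotion}: {pct:.1f}% of neutral")
```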


Sentence: /ninna hesaru enu/ (What is your name)

Emotion    ninna   hesaru  enu

Anger      96.25   98      78

Happiness  121.56  112.56  105.8

Sadness    185.62  129.26  121.65

Table 3: Duration of the individual words in a sentence for different emotions, as a percentage of the neutral-speech duration (neutral = 100%)

Table 3 gives the duration of each word in the sentence /ninna hesaru enu/ (What is your name) in different emotions, as a percentage of the neutral-speech duration. Here, too, the different emotions show the same general trends.
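The "general trend" claim can be checked mechanically. The following Python sketch copies the percentage values from Tables 2 and 3 and tests whether every column satisfies Anger < Neutral < Happiness < Sadness; the data layout and the helper `follows_trend` are assumptions made for illustration.

```python
NEUTRAL = 100.0  # neutral speech is the 100% reference in both tables

# Table 2: one value per speaker (1, 2, 3) for each word and emotion.
table2 = {
    "yelli": {"anger": [92, 78, 95], "happiness": [122, 126, 110], "sadness": [138, 141, 121]},
    "appa": {"anger": [83, 72, 79], "happiness": [112, 101, 102], "sadness": [132, 144, 129]},
}

# Table 3: one value per word (ninna, hesaru, enu) for each emotion.
table3 = {
    "anger": [96.25, 98, 78],
    "happiness": [121.56, 112.56, 105.8],
    "sadness": [185.62, 129.26, 121.65],
}


def follows_trend(rows: dict) -> bool:
    """True if every column obeys anger < neutral < happiness < sadness."""
    columns = zip(rows["anger"], rows["happiness"], rows["sadness"])
    return all(a < NEUTRAL < h < s for a, h, s in columns)


results = {word: follows_trend(rows) for word, rows in table2.items()}
results["ninna hesaru enu"] = follows_trend(table3)
print(results)
```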


Emotion    /a/  /pp/  /a/  Total duration (ms)

Anger      85   140   205  430

Neutral    132  163   221  516

Happiness  173  170   236  579

Sadness    233  196   256  685

Table 4: Duration of phonemes (ms) in the word /appa/ (father) for different emotions.

Table 4 gives the duration values of the phonemes in the word /appa/ (vowels /a/ and consonant /p/). It can be seen that the phonemes also follow the general trend of duration variation for different emotions.
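One way to compare vowels and consonants in Table 4 is to express each segment's duration as a percentage of its neutral value and measure the spread across emotions. The sketch below does this in Python; treating the geminate /pp/ as a single consonant segment and summing the two /a/ vowels are assumptions made here, and all names are hypothetical.

```python
# Phoneme durations (ms) from Table 4, ordered as (/a/, /pp/, /a/).
DURATIONS_MS = {
    "anger": (85, 140, 205),
    "neutral": (132, 163, 221),
    "happiness": (173, 170, 236),
    "sadness": (233, 196, 256),
}


def spread(segment_ms: dict) -> float:
    """Max minus min duration across emotions, in percent of neutral."""
    neutral = segment_ms["neutral"]
    percents = [ms * 100.0 / neutral for ms in segment_ms.values()]
    return max(percents) - min(percents)


# Total vowel duration (both /a/ segments) and consonant duration per emotion.
vowels = {emo: a1 + a2 for emo, (a1, pp, a2) in DURATIONS_MS.items()}
consonant = {emo: pp for emo, (a1, pp, a2) in DURATIONS_MS.items()}

vowel_spread = spread(vowels)
consonant_spread = spread(consonant)
print(f"vowel spread: {vowel_spread:.1f} points")
print(f"consonant spread: {consonant_spread:.1f} points")
```

With the Table 4 values, the vowels span roughly 56 percentage points against roughly 34 for the consonant, which is consistent with the observation that vowels vary more with emotion.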


Figure 2: Duration (ms) change of word /appa/ (father) for different emotions

Figure 3: Duration (ms) change of vowels /a/ and consonant /p/ in the word /appa/ (father) with four different emotions


Intensity

From Figure 1, it is seen that anger is articulated with maximum intensity, whereas sadness has minimum intensity, i.e.

Anger > Happiness > Neutral > Sadness.

Table 5 confirms that the average intensity is least for sadness and maximum for anger.

Sample                                 Emotion    Intensity

/ba illi/ (come here)                  Anger      113.50

                                       Happiness  110.90

                                       Sadness    98.90

/basava bandidana/ (has basava come)   Anger      115.26

                                       Happiness  100.32

                                       Sadness    94.98

Table 5: Average intensity (dB) variation for different emotions (% in comparison with neutral speech)


Pitch

Figures 4, 5 and 6 show that the pitch contour of neutral speech is almost flat and has the minimum value. Each of these three figures shows the pitch contour of an emotional sentence together with its corresponding emotionless sentence.

Sample                                 Emotion    Pitch

/ba illi/ (come here)                  Anger      101.970

                                       Happiness  100.384

                                       Sadness    120.519

/basava bandidana/ (has basava come)   Anger      131.240

                                       Happiness  140.320

                                       Sadness    142.590

Table 6: Average pitch (Hz) variation for different emotions (% in comparison with neutral speech)


Figure 4: Pitch contours of the sentence meaning "Why did you do this", spoken without emotion and with anger.

Figure 5: Pitch contours of the sentence meaning "What a beautiful flower", spoken with happiness and without emotion.


Figure 6: Pitch contours of the sentence meaning "I am extremely unhappy", spoken with sadness and without emotion.


Result Analysis

For instance, to simulate anger, duration has to be reduced while pitch and intensity are increased. Similarly, to simulate sadness, duration and pitch have to be increased while intensity is reduced.

Due to the phonetic categorization of the alphabet set, rules need to be framed only for each category of phonemes. The phonemes in each category share similar phonetic features. This reduces the complexity of prosodic modeling as well as the framing of rules for synthesis.

Rules can be framed for different phonemes for prosodic modifications from phonemic level analysis.
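A rule table of this kind can be sketched in Python as a set of scale factors applied to neutral prosody. This is only an illustration, not the authors' synthesis rules: the factors are the /ba illi/ percentages (duration derived from the 606 ms neutral duration, pitch from Table 6, intensity from Table 5), the neutral pitch and intensity values are hypothetical, and scaling a dB value by a percentage simply follows the tables' own convention.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    duration_ms: float
    pitch_hz: float
    intensity_db: float


# Percent-of-neutral factors for /ba illi/: (duration, pitch, intensity).
RULES = {
    "anger": (447 / 606 * 100, 101.970, 113.50),
    "happiness": (750 / 606 * 100, 100.384, 110.90),
    "sadness": (1106 / 606 * 100, 120.519, 98.90),
}


def apply_emotion(neutral: Prosody, emotion: str) -> Prosody:
    """Scale neutral prosody by the percentage factors for one emotion."""
    duration, pitch, intensity = RULES[emotion]
    return Prosody(
        duration_ms=neutral.duration_ms * duration / 100.0,
        pitch_hz=neutral.pitch_hz * pitch / 100.0,
        intensity_db=neutral.intensity_db * intensity / 100.0,
    )


# Neutral duration is from the slides; pitch and intensity are hypothetical.
neutral = Prosody(duration_ms=606.0, pitch_hz=120.0, intensity_db=70.0)
angry = apply_emotion(neutral, "anger")
sad = apply_emotion(neutral, "sadness")
print(angry)
print(sad)
```

A per-category version would hold one such factor set for each phoneme class in Table 1 rather than one per utterance.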


From the manner of articulation of different emotions, it can be recognized that the rise time and fall time capture more emotion information than any other prosodic parameter.

For angry speech, duration is lowest and intensity is highest.

For sad speech, duration is highest and intensity is lowest.


Conclusions

The percentage duration of different emotions in comparison with neutral speech, calculated for different words spoken by different speakers, shows that word duration is highest for sadness, followed by happiness and neutral, and is smallest for anger. The pitch contour is almost flat for neutral speech. The average pitch value for emotional speech is higher than for neutral speech. The intensity level of a word is lowest for sadness and highest for anger. The phoneme-level analysis of duration shows that vowels capture the emotional variation more than consonants do.


These findings can be used effectively for framing rules for emotional speech synthesis. Incorporating these durational effects in a speech synthesis system will produce better speech than a system that does not use this knowledge.


References

I.R. Murray, M.D. Edgington, D. Campion, et al., "Rule-Based Emotion Synthesis Using Concatenated Speech," Proc. of ISCA Workshop on Speech and Emotion, Belfast, Northern Ireland, pp. 173-177, 2000.

X.J. Ma, W. Zhang, W.B. Zhu, et al., "Probability based Prosody Model for Unit Selection," Proc. of ICASSP'04, Montreal, Canada, pp. 649-652, May 2004.

Pascal van Lieshout, "PRAAT," Oral Dynamics Lab, V. 4.2.1, October 7, 2003.

D.J. Ravi and Sudarshan Patilkulkarni, "Kannada Text-To-Speech Systems: Duration Analysis," Proc. of ISCO 2009, Coimbatore, pp. 53.

D.J. Ravi and Sudarshan Patilkulkarni, "Speaker Dependent Duration Analysis of Vowels and Consonants for Kannada Text-To-Speech Systems," Proc. of NICE 2009, Bangalore, pp. 95-99.

D.J. Ravi and Sudarshan Patilkulkarni, "Time Duration Variation Analysis of Vowels and Consonants for Kannada Text to Speech Systems," Journal of Advance Research in Computer Engineering: An International Journal, July to December 2009.

Deepa P. Gopinath, Sheeba P.S. and Achuthsankar S. Nair, "Emotional Analysis for Malayalam Text to Speech Synthesis Systems," SETIT 2007.
