
Expression of Speaker’s Intentions through Sentence-Final Particle/Intonation Combinations

in Japanese Conversational Speech Synthesis

Kazuhiko Iwata, Tetsunori Kobayashi

Perceptual Computing Laboratory, Waseda University, Japan

Abstract

Aiming to provide synthetic speech with the ability to express a speaker’s intentions and subtle nuances, we investigated the relationship between the speaker’s intentions that the listener perceived and sentence-final particle/intonation combinations in Japanese conversational speech. First, we classified F0 contours of sentence-final syllables in actual speech and found various distinctive contours: not only simple rising and falling ones but also rise-and-fall and fall-and-rise ones. Next, we conducted subjective evaluations to clarify what kinds of intentions listeners perceived depending on the sentence-final particle/intonation combination. The results showed that adequate sentence-final particle/intonation combinations should be used to convey an intention to listeners precisely. Whether the sentence was positive or negative also affected the listeners’ perception; for example, the sentence-final particle ‘yo’ with a falling intonation conveyed the intention of an “order” in a positive sentence but “blame” in a negative sentence. Furthermore, we found that specific nuances could be added to some major intentions by subtle differences in intonation. These different intentions and nuances could be conveyed simply by controlling the sentence-final intonation in synthetic speech.

Index Terms: speech synthesis, speaker’s intention, sentence-final particle, sentence-final intonation, conversational speech

1. Introduction

Speech synthesis technology has made remarkable progress in synthetic voice quality, and natural-sounding synthetic speech has recently become available. Moreover, various approaches to synthesizing expressive and conversational speech have also been reported in the past decade [1, 2, 3, 4]. However, owing to the diversity of conversational speech, numerous problems still need to be solved to build a useful speech synthesis system for robots and speech-enabled agents that communicate with humans through synthetic speech. We communicate with each other through linguistic and paralinguistic information [5]. An utterance is affected by the speaker’s intentions, attitudes, feelings, personal relationship with the listener, and so forth; accordingly, the paralinguistic features of utterances vary a great deal.

In spoken Japanese, a speaker’s intention is usually conveyed at the end of a sentence by sentence-final particles or auxiliary verbs [6]. The functions of sentence-final particles have been extensively studied in linguistics [7, 8, 9, 10, 11, 12, 13]. For example, the sentence-final particle ‘yo’ indicates a strong assertion, and ‘ne’ indicates a request for the listener’s agreement. In addition, the intonation of the sentence-final particle, namely the sentence-final intonation, plays a significant role in expressing the intention and has also been studied over the years [14, 15, 16, 17, 18, 19]. For instance, a sentence with the sentence-final particle ‘ka’ becomes a declarative sentence with a falling intonation, whereas it becomes an interrogative sentence with a rising intonation. Moreover, it can express various additional nuances, such as surprise, admiration, concern, doubt, or anger, through different intonations. The functions of sentence-final and phrase-final intonation have also been discussed in terms of turn-taking [20, 21]. However, not only the sentence-final intonation but also the intonation of the whole sentence varies depending on the speaker’s intention or attitude. For some languages other than Japanese, it has been reported that listeners could identify the speaker’s attitudes before the end of the sentence [22, 23]. In contrast, an experiment that used two utterances of the same sentence uttered with different intentions demonstrated the importance of the sentence-final intonation in Japanese: when the speech segments of the sentence-final particle were cut off and swapped, the listeners perceived the intention expressed by the segment of the sentence-final particle, even though the overall F0 contours of the utterances differed from each other [24].

On the other hand, there are as yet not many approaches to expressing the speaker’s intention by controlling the sentence-final intonation in speech synthesis. Boundary pitch movements in Tokyo Japanese have been analyzed and modeled [25]: five boundary pitch movements were chosen, and their meanings were examined through association with eight semantic scales. A prosody control method based on an analysis of intonations of the one-word utterance ‘n’ has also been proposed [26]; that study revealed that speaking attitudes were expressed by the average F0 height and dynamic pattern of the word ‘n’, which has no particular lexical meaning. However, none of the models above considered associating the expression of the speaker’s intention with sentence-final particles, despite the fact that the intention is verbally expressed by the sentence-final particle in Japanese spoken dialogue. Although we investigated the relationship between speakers’ intentions and sentence-final intonation contours in a previous study [27], we did not consider the relevance of the sentence-final particles.

In this study, we focus on listeners’ perception of the speaker’s intention in association with sentence-final particles and their intonation contours, in order to enable synthetic speech to express various intentions. In Section 2, we classify sentence-final intonation contours to find what kinds of sentence-final intonation are used in actual speech. In Section 3, we select several distinctive intonation contours from among the classification results and conduct a subjective evaluation to find suitable sentence-final particle/intonation combinations for conveying some specific intentions (hereafter “major intentions”).


Furthermore, in Section 4, we investigate sentence-final particle/intonation combinations that can add subtle nuances to the major intentions, aiming to give synthetic speech wide expressiveness. In Section 5, we conclude the paper with some remarks.

2. Classification of sentence-final intonation

2.1. Previous research in linguistics

As mentioned above, the functions of sentence-final intonation have been discussed and various views have been proposed by linguists. Table 1 shows some examples of categorizations of sentence-final intonation in terms of its contours. In most of them, the sentence-final intonation contours were classified into two categories, or up to five at most. However, there seems to be no generally accepted categorization.

Table 1: Examples of categorizations of sentence-final intonation in linguistics.

Number of categories   Categories
2                      Rise, Fall [15]
4                      Rise, Fall-and-rise, Fall, Rise-and-fall [16]
5                      Interrogative rise, Prominent rise, Fall, Rise-and-fall, Flat [17]

2.2. Speech data

We used speech data that were created with the aim of developing an HMM-based speech synthesis system that had multiple HMMs depending on the situation of the conversation [4]. To build the HMMs, we designed several situations and more than 2000 sentences derived from dialogues that our communication robot [28] performed. These sentences were uttered by a voice actress; we did not indicate any specific intention for each sentence but explained the situation in which the robot was supposed to utter it. The F0 contours were extracted using STRAIGHT [29], and the phonemes were manually segmented. The intonation contours at the end of these utterances varied a great deal and expressed subtle nuances and connotations. Of these data, 2092 utterances whose sentence-final vowel was not devoiced were used for the analysis.

The sentence-final intonation contour, that is, the F0 contour over the sentence-final syllable, was extracted by referring to the phoneme boundaries. Because the actual F0 values of the utterances differed from each other and were difficult to classify directly, the time and frequency axes were normalized. To remove F0 perturbation caused by jitter and microprosody, the logarithmic F0 contour was approximated by a third-order least-squares curve. The approximated curve was sampled at 11 points that divided the duration into 10 equal intervals. Finally, the starting point of the sampled curve was translated to the origin [27].
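For concreteness, this normalization can be sketched in a few lines of Python. This is a minimal reconstruction from the description above, not the authors’ implementation; in particular, treating the frequency normalization as the parallel translation of the starting point is our reading of the text.

```python
import numpy as np

def normalize_final_contour(log_f0, n_points=11):
    """Normalize one sentence-final log-F0 contour as described above:
    smooth it with a third-order least-squares polynomial, sample the
    fit at 11 points dividing the duration into 10 equal intervals,
    and translate the starting point to the origin."""
    t = np.linspace(0.0, 1.0, len(log_f0))   # normalized time axis
    coeffs = np.polyfit(t, log_f0, deg=3)    # third-order LS approximation
    samples = np.polyval(coeffs, np.linspace(0.0, 1.0, n_points))
    return samples - samples[0]              # start at the origin
```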

2.3. Classification of sentence-final intonation contours

The normalized F0 contours obtained by the above process were classified by Ward’s algorithm [30], a hierarchical clustering method that merges clusters so as to minimize the within-cluster variance.
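A sketch of this step with SciPy’s hierarchical clustering (Ward linkage); the random array here merely stands in for the 2092 normalized 11-point contours, which are of course not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for the (2092, 11) array of normalized contours from Section 2.2.
contours = np.random.default_rng(0).normal(size=(2092, 11))

Z = linkage(contours, method="ward")              # Ward's minimum-variance merges
labels = fcluster(Z, t=32, criterion="maxclust")  # cut the tree into 32 clusters

# Cluster centroids, i.e., the thick contours plotted in Figure 1.
centroids = {c: contours[labels == c].mean(axis=0) for c in np.unique(labels)}
```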

Figure 1 shows an example of the clustering results when the number of clusters was set to 32. The F0 contours denoted by thick circles and a thick line are the centroids of each cluster. The numbers in square brackets (e.g., [C2]) are expedient cluster IDs corresponding to the clustering sequence. Note that the lengths of the vertical lines do not represent the distances between the clusters, owing to the limitations of the page layout. Various sentence-final intonation contours were found, including not only simple rising and falling intonations but also rise-and-fall and fall-and-rise intonations.

2.4. Perceptual discrimination of intonation contours by centroids

We found that the sentence-final intonation contours were classified into distinctive clusters. However, we predicted that not all pairs of cluster centroids would be notably different perceptually, because the clustering was based only on the shapes of the F0 contours. Therefore, a preliminary evaluation was conducted.

A back-channel utterance ‘haa’, which had no specific linguistic meaning, was resynthesized with the STRAIGHT vocoder [29], and its F0 contour was replaced with the 127 centroid pairs obtained in the process of classifying the F0 contours into 128 clusters. Fifteen listeners were presented in random order with 254 ‘haa’ pairs, including the reverse order of each centroid pair, and were asked whether they perceived the two intonations as the same or different. The results of the evaluation are shown in Figure 2, where the numbers in parentheses indicate the number of responses in which the intonations produced by the centroids on both sides were perceived as different. This is how we obtained the criteria for sifting through and selecting the F0 contours to be used in the next experiments.

3. Major intentions conveyed by sentence-final particle/intonation combinations

3.1. Selection of representative intonation contours

We consulted previous studies prior to conducting the subjective evaluation to investigate what kinds of speaker’s intentions could be conveyed by sentence-final particles and their intonation contours. Referring to the results of the preliminary evaluation (Figure 2), we stopped dividing clusters whose child clusters received 28 or fewer responses that their intonations were different. The selected clusters were C5, C20, C6, C15, C16, and C19.
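This selection rule can be read as a top-down walk over the dendrogram. The sketch below uses assumed containers (the tuple-based tree and the per-split response counts are hypothetical, not from the paper); applied to the counts shown in Figure 2, it stops at the six clusters named above.

```python
def select_clusters(node, diff_counts, threshold=28):
    """Walk the cluster tree from the root. `node` is a
    (cluster_id, left_child, right_child) tuple with None children at
    the leaves, and diff_counts[cluster_id] is how many of the 30
    listener responses judged that cluster's two children as sounding
    different. Stop dividing when the count is `threshold` or fewer
    and keep that cluster as a representative."""
    cid, left, right = node
    if left is None or diff_counts.get(cid, 0) <= threshold:
        return [cid]
    return (select_clusters(left, diff_counts, threshold)
            + select_clusters(right, diff_counts, threshold))
```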

Compared with the categorization in the previous research shown in Table 1, the centroid of the cluster C5 seems to correspond to the interrogative rise intonation, C20 to the fall-and-rise, C6 to the fall, C15 to the rise-and-fall, C16 to the prominent rise, and C19 to the flat. These results indicate that specific intonation contours corresponding to the categories in the previous research could be obtained from the speech database.

3.2. Experimental setup

A subjective evaluation was conducted to clarify what kinds of intentions listeners could perceive through the sentence-final intonations produced by the six selected centroids.

We prepared 31 short sentences consisting of the verb ‘taberu’ (“eat”) followed by a sentence-final particle (‘yo’, ‘na’, ‘ne’, ‘datte’, etc.), an auxiliary verb (‘daroo’), or one of their concatenations (‘yone’, ‘yona’, ‘datteyo’, ‘daroone’, etc.). Synthetic voices for these sentences were generated by our HMM-based speech synthesis system [4]. The duration of the last vowel of each sentence was fixed at 313 ms, which was the mean duration of the last vowels of the sentences in the speech data.


[Figure 1: Result of clustering sentence-final intonation contours when the number of clusters was set to 32. Each panel shows a cluster centroid as a normalized log-F0 contour (vertical axis −1 to 1) over normalized time, labeled with cluster IDs such as [C2], [C5], [C6], [C15], [C16], [C19], and [C20].]

[Figure 2: Preliminary evaluation results of intonations generated by cluster centroids. The numbers in parentheses (ranging from 5 to 30 among the splits shown) indicate the number of responses in which the intonations produced by the centroids on both sides were perceived as different. The tree shows the splits around clusters C5, C20, C6, C15, C16, and C19.]

Then, the sentence-final F0 contour was replaced with the six centroids. We also designed 11 speaker’s intentions (“request”, “order”, “blame”, “hearsay”, “guess”, “question”, etc.) and situations in which these intentions could be indicated. We informed 20 listeners of the situations and speaker’s intentions and asked them to evaluate whether both the lexical and intonational expressions of the stimulus were suitable for conveying the intention on a five-level scale: −2 (unsuitable; suitable for a different intention), −1 (rather unsuitable), +1 (rather suitable), +2 (suitable), and 0 (none of the above).
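How a selected centroid is turned back into a frame-level F0 track for the fixed-duration final vowel is not spelled out in the paper; the sketch below makes one plausible choice, anchoring the contour at the F0 value reached just before the final syllable (the parameter `f0_anchor_hz` and the 5-ms frame shift are our assumptions).

```python
import numpy as np

def centroid_to_f0(centroid, f0_anchor_hz, duration_s=0.313, frame_s=0.005):
    """Expand an 11-point normalized log-F0 centroid into a frame-level
    F0 track for the 313-ms final vowel. Anchoring at the preceding F0
    value (f0_anchor_hz) is an assumption; the paper does not state how
    the replacement contour was scaled."""
    n_frames = int(round(duration_s / frame_s))
    t = np.linspace(0.0, 1.0, n_frames)
    t_c = np.linspace(0.0, 1.0, len(centroid))
    log_f0 = np.interp(t, t_c, centroid) + np.log(f0_anchor_hz)
    return np.exp(log_f0)  # Hz values handed to the vocoder
```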

3.3. Results and discussion

Figure 3 shows the key results of the subjective evaluation, with a particular focus on “request”, “order”, and “blame”.

• Sentence-final particle ‘ne’
Generally, the use of ‘ne’ signals a polite “request”. This was endorsed by the rising intonations C5 and C20 (Figures 3(a) and 3(f)). In the positive sentence ‘Tabete ne’, a “request” was also conveyed by the rising C16 intonation. On the other hand, in the negative sentence ‘Tabenaide ne’ with the rising C16 and flat C19 intonations, an “order” was perceived more clearly than in the positive sentence.

• Sentence-final particle ‘yo’
In the positive sentences ‘Tabete yo’ (Figure 3(b)) with the falling intonations C6 and C15 and ‘Tabero yo’ (Figure 3(d)) with C6, an “order” (“Eat.”) was conveyed. In contrast, in the negative sentences ‘Tabenaide yo’ (Figure 3(g)) and ‘Taberuna yo’ (Figure 3(i)), which expressed prohibition, “blame” (“Why did you eat even though I told you not to?”) was strongly conveyed by the falling intonations C6 and C15. With the rising and flat intonations C5, C16, and C19, they instead gave the impression of an “order”.

• Sentence-final particles ‘yone’ and ‘yona’
‘Yone’ is known to have lexical functions different from those of ‘ne’ and ‘yo’. “Blame”, which was not much perceived in the sentences with ‘ne’, was conveyed by the rising C16 and flat C19 intonations (Figures 3(c) and 3(h)). This tendency differs from the case of ‘yo’, where “blame” was conveyed by the falling intonations C6 and C15. “Blame” could also be conveyed by ‘yona’ with C16 and C19 (Figures 3(e) and 3(j)).

Summarizing the results yields Table 2, which shows the highest-scored intention for each combination. We can consult this table when generating synthetic speech. For example, we can express a “request” in a positive sentence with the sentence-final particle ‘ne’ by using C5 or C20 as the sentence-final intonation. When we need to express “blame” in a negative sentence with the sentence-final particle ‘yo’, we should use C6 or C15.
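In a synthesis front-end, Table 2 can be consulted as a simple lookup from particle and intended meaning to candidate contours. Below is a sketch with hypothetical names, populated with a few of the clearest cells of the table.

```python
# (sentence-final phrase, intention) -> suitable intonation clusters,
# distilled from the strongest entries of Table 2 (hypothetical container).
INTONATION_CHOICES = {
    ("Tabete ne", "request"):  ["C5", "C20"],   # R***, R**
    ("Tabenaide yo", "blame"): ["C6", "C15"],   # B**, B***
    ("Tabenaide yo", "order"): ["C16", "C19"],  # O*, O*
}

def choose_intonation(phrase, intention):
    """Return candidate sentence-final intonation clusters, if any."""
    return INTONATION_CHOICES.get((phrase, intention), [])
```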


[Figure 3: Subjective evaluation results of major intentions depending on sentence-final particle/intonation combinations. Each panel plots mean suitability scores (−2 to +2, with 95% confidence intervals) for “request”, “order”, and “blame” under the intonations C5, C20, C6, C15, C16, and C19. Panels: (a) ‘Tabete ne’, (b) ‘Tabete yo’, (c) ‘Tabete yone’, (d) ‘Tabero yo’, (e) ‘Tabero yona’, (f) ‘Tabenaide ne’, (g) ‘Tabenaide yo’, (h) ‘Tabenaide yone’, (i) ‘Taberuna yo’, (j) ‘Taberuna yona’. The sentences in (a)-(e) are positive (roughly, “Please eat.”), and the others are negative (“Please don’t eat.”).]

Table 2: Speaker’s intention conveyed by each sentence-final particle/intonation combination. The intention (“request” (R), “order” (O), or “blame” (B)) that received the highest positive score is shown; in the original table, underlining marks scores higher than or equal to 1.0 (rather suitable). *** p<0.001, ** p<0.01, * p<0.05, + p<0.1, where p is the maximum p-value among two comparisons.

Sentence-final phrase   C5     C20    C6     C15    C16    C19
‘Tabete ne’             R***   R**    –      –      R*     O
‘Tabenaide ne’          R      R      –      O/B    O*     O*
‘Tabete yo’             R**    R**    O+     O      R      B
‘Tabenaide yo’          O+     R*     B**    B***   O*     O*
‘Tabete yone’           R      O      B      B      O      O
‘Tabenaide yone’        O      R      –      B**    B      B
‘Tabero yo’             R      R**    O      B      –      R/O
‘Taberuna yo’           O      O      B***   B***   O+     O***
‘Tabero yona’           O      B      O      B      O/B    B
‘Taberuna yona’         B      O*     B**    B      B      B*

4. Additional nuances conveyed by sentence-final particle/intonation combinations

As the next step of this study, we investigated whether the listeners could perceive additional intentions, attitudes, or feelings (hereinafter collectively called “additional nuances”) through the sentence-final intonations.

4.1. Selection of representative intonation contours

We increased the number of representative intonation contours to be used for the subjective evaluation of additional nuances by dividing the six clusters selected in Section 3.1 into subclusters. This time, we merged clusters two at a time from the 128 leaf nodes up toward the six clusters. When two clusters received 27 or more out of 30 (at least 90%) responses that their intonations were different, they were not merged; additionally, their parent cluster was not merged with its paired cluster either. Thus, the cluster C5 was ultimately divided into 8 subclusters, C20 into 3, C6 into 11, C15 into 5, C16 into 7, and C19 into 7, as listed in Table 3.
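This subdivision rule mirrors the earlier selection sketch but runs bottom-up; again, the tree and response-count containers are assumed structures, not the authors’ code.

```python
def merge_leaves(node, diff_counts, threshold=27):
    """Merge sibling clusters bottom-up from the 128 leaves. `node` and
    `diff_counts` are as in the earlier selection sketch. A pair judged
    different in `threshold` or more of 30 responses stays split, and a
    node whose subtree contains such a pair is never merged with its
    own sibling. Returns (surviving subcluster IDs, may-merge-upward)."""
    cid, left, right = node
    if left is None:
        return [cid], True
    l_ids, l_ok = merge_leaves(left, diff_counts, threshold)
    r_ids, r_ok = merge_leaves(right, diff_counts, threshold)
    if l_ok and r_ok and diff_counts.get(cid, 0) < threshold:
        return [cid], True       # the two children collapse into this node
    return l_ids + r_ids, False  # keep the children as separate subclusters
```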

4.2. Experimental setup

We defined “request”, “order”, and “blame” as the major intentions for this evaluation. Considering the results of the previous section (Table 2), we chose sentence-final particle/intonation combinations from among those that received significantly different (p < 0.05) scores: ‘Tabete ne’ with the sentence-final intonations of the subclusters of C5, C20, and C16 for “request”; ‘Tabenaide yo’ with the subclusters of C16 and C19 for “order”; and ‘Tabenaide yo’ with the subclusters of C6 and C15 for “blame”. As stimuli, we generated 18 synthetic utterances with different intonations for the major intention “request”, 14 for “order”, and 16 for “blame”.

Three additional nuances were designed for each major intention (Table 4); one of the authors had perceived these nuances from the stimuli in advance.


Table 3: Selected subclusters as representative intonation contours to be used for the evaluation of additional nuances. Their centroids are shown in Figure 4.

Parent cluster   #    Subcluster IDs
C5               8    C12, C36, C61, C30, C43, C130, C184, C23
C20              3    C100, C135, C41
C6               11   C9, C70, C334, C282, C232, C139, C151, C233, C125, C65, C22
C15              5    C24, C329, C117, C74, C155
C16              7    C123, C346, C248, C71, C131, C104, C52
C19              7    C214, C269, C229, C101, C56, C112, C202

Table 4: Evaluated additional nuances for each major intention.

Major intention   Additional nuance (the speaker seems ...)
Request           (r1) to sincerely request the listener to eat.
                  (r2) not to really want the listener to eat.
                  (r3) to be slightly in a bad mood.
Order             (o1) to be afraid the listener will definitely eat.
                  (o2) not to be afraid the listener will definitely eat.
                  (o3) to be in a slightly bad mood.
Blame             (b1) to be in a hurry to stop the listener from eating.
                  (b2) to be in a rage after the listener has eaten.
                  (b3) to be disheartened after the listener has eaten.

[Figure 4: Subjective evaluation results of additional nuances to major intentions. Each panel plots, per subcluster centroid, the mean strength (0 to 3, with 95% confidence intervals) with which the three nuances were felt. (a) ‘Tabete ne’ with major intention “request” (subclusters of C5, C20, and C16; nuances (r1)-(r3)). (b) ‘Tabenaide yo’ with major intention “order” (subclusters of C16 and C19; nuances (o1)-(o3)). (c) ‘Tabenaide yo’ with major intention “blame” (subclusters of C6 and C15; nuances (b1)-(b3)).]

The stimuli were presented to 22 listeners, along with the situation of the dialogue and the additional nuances for each of the three major intentions. The listeners were asked to evaluate whether they felt each additional nuance on a four-level scale: 3 (strongly felt), 2 (felt), 1 (slightly felt), and 0 (not felt). In addition, they were asked to freely describe any other nuances they felt and to evaluate the stimuli in the same way.

4.3. Results and discussion

The results are shown in Figure 4. Several intonation contours were found to be able to convey some additional nuances.

• “Request” (Figure 4(a))
The additional nuances (r1) and (r2) have mutually exclusive meanings. C61, C23, C43, and C131 conveyed the nuance (r1) well. They all rise toward a considerably high frequency at the end of the sentence, a characteristic that can be considered to convey “sincerity” or “cordiality”. In contrast, C248 and C123, which were slightly rising and rather flat overall, implied (r2). C135 and C41, which were fall-and-rise contours with extreme lowering, insinuated (r3). In the free descriptions, some listeners noted that C248 and C346 created a sense of familiarity and gave the impression that the speaker was a familiar senior (such as a friend’s parent). However, whether others would perceive this similarly needs to be investigated further.

• “Order” (Figure 4(b))
The additional nuances (o1) and (o2) are mutually exclusive. C269, C101, and C229, which were undulating, conveyed (o1); C71, which was slightly rising and undulating, also conveyed (o1). On the other hand, (o2) was not conveyed very clearly, except by C248. C229, C214, and C101 conveyed (o3) in addition to (o1).

• “Blame” (Figure 4(c))
The additional nuance (b1) seemed to be expressed by a large rise-and-fall movement, as in C117 and C155. C24 and C329, which had a steep fall after a slight rise, clearly expressed (b2). C151 and C233, which had a slight fall, expressed (b3).

5. Conclusions

We investigated the expression of speaker’s intentions through sentence-final particle/intonation combinations in Japanese conversational speech. The results showed that the sentence-final intonation contours varied a great deal and that adequate sentence-final particle/intonation combinations should be used to convey an intention to listeners precisely. Furthermore, we found that specific nuances could be added to some major intentions by subtle differences in intonation.

We showed that not only major intentions but also subtle nuances can be expressed by sentence-final particle/intonation combinations. However, other prosodic features, such as the duration, power, and voice quality of the sentence-final syllable, also change in actual speech and contribute to conveying the intentions. These features need to be adequately controlled in order to express the intentions more precisely and more expressively. We intend to elucidate the relationship between these features and the intentions in future work.

6. Acknowledgements

This work was supported in part by the Global COE Program “Global Robot Academia”.

7. References

[1] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, “A corpus-based approach to expressive speech synthesis,” in Proc. 5th ISCA ITRW on Speech Synthesis, 2004, pp. 79–84.

[2] Y. Sagisaka, T. Yamashita, and Y. Kokenawa, “Generation and perception of F0 markedness for communicative speech synthesis,” Speech Communication, vol. 46, no. 3–4, pp. 376–384, July 2005.

[3] M. Schroder, “Expressive speech synthesis: Past, present, and possible futures,” in Affective Information Processing, J. Tao and T. Tan, Eds. London: Springer-Verlag, 2009, pp. 111–126.

[4] K. Iwata and T. Kobayashi, “Conversational speech synthesis system with communication situation dependent HMMs,” in Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, R.-C. Delgado and T. Kobayashi, Eds. New York: Springer, 2011, pp. 113–123.

[5] K. Maekawa, “Production and perception of ‘paralinguistic’ information,” in Proc. Speech Prosody, 2004, pp. 367–374.

[6] S. Makino and M. Tsutsui, A Dictionary of Basic Japanese Grammar. Tokyo: The Japan Times, Ltd., 1986.

[7] The National Language Research Institute, Bound Forms (‘Zyosi’ and ‘Zyodosi’) in Modern Japanese: Uses and Examples. Tokyo: Shuei Shuppan, 1951 [in Japanese].

[8] M. Tsuchihashi, “The speech act continuum: An investigation of Japanese sentence final particles,” J. Pragmatics, vol. 7, no. 4, pp. 361–387, August 1983.

[9] H. M. Cook, “The sentence-final particle ne as a tool for cooperation in Japanese conversation,” in Japanese/Korean Linguistics, H. Hoji, Ed. Stanford: The Stanford Linguistics Association, 1990, vol. 1, pp. 29–44.

[10] A. Kamio, “The theory of territory of information: The case of Japanese,” J. Pragmatics, vol. 21, no. 1, pp. 67–100, January 1994.

[11] S. K. Maynard, Japanese Communication: Language and Thought in Context. Honolulu: University of Hawai‘i Press, 1997.

[12] H. Saigo, The Japanese Sentence-Final Particles in Talk-in-Interaction. Amsterdam: John Benjamins Publishing Co., 2011.

[13] Y. Asano-Cavanagh, “An analysis of three Japanese tags: Ne, yone, and daroo,” Pragmatics & Cognition, vol. 19, no. 3, pp. 448–475, 2011.

[14] N. Yoshizawa, “Intoneeshon (Intonation),” in A Research for Making Sentence Patterns in Colloquial Japanese 1: On Materials in Conversation. Tokyo: Shuei Shuppan, 1960 [in Japanese], pp. 249–288.

[15] T. Moriyama, “Bun no imi to intoneeshon (Sentence meaning and intonation),” in Kooza Nihongo To Nihongo Kyooiku 1: Nihongogaku Yoosetsu, Y. Miyaji, Ed. Tokyo: Meiji Shoin, 1989 [in Japanese], pp. 172–196.

[16] T. Koyama, “Bunmatsushi to bunmatsu intoneeshon (Sentence-final particles and sentence-final intonation),” in Speech and Grammar, Spoken Language Working Group, Ed. Tokyo: Kurosio Publishers, 1997 [in Japanese], pp. 97–119.

[17] S. Kori, “Intoneeshon (Intonation),” in Asakura Nihongo Kooza 3: Onsei On’in (Asakura Japanese Series 3: Phonetics, Phonology), Z. Uwano, Ed. Tokyo: Asakura Publishing Co., Ltd., 2003 [in Japanese], pp. 109–131.

[18] Y. Katagiri, “Dialogue functions of Japanese sentence-final particles ‘yo’ and ‘ne’,” J. Pragmatics, vol. 39, no. 7, pp. 1313–1323, July 2007.

[19] E. Ofuka, J. D. McKeown, M. G. Waterman, and P. J. Roach, “Prosodic cues for rated politeness in Japanese speech,” Speech Communication, vol. 32, no. 3, pp. 199–217, October 2000.

[20] H. Tanaka, Turn-Taking in Japanese Conversation: A Study in Grammar and Interaction. Amsterdam: John Benjamins Publishing Co., 1999.

[21] C. T. Ishi, “The functions of phrase final tones in Japanese: Focus on turn-taking,” J. Phonetic Soc. Japan, vol. 10, no. 3, pp. 18–28, December 2006.

[22] V. Auberge, T. Grepillat, and A. Rilliard, “Can we perceive attitudes before the end of sentences? The gating paradigm for prosodic contours,” in Proc. EUROSPEECH, 1997, pp. 871–874.

[23] V. J. van Heuven, J. Haan, E. Janse, and E. J. van der Torre, “Perceptual identification of sentence type and the time-distribution of prosodic interrogativity markers in Dutch,” in Proc. ESCA Workshop on Intonation, 1997, pp. 317–320.

[24] M. Sugito, “Shuujoshi ‘ne’ no imi, kinoo to intoneeshon (Meanings, functions and intonation of sentence-final particle ‘ne’),” in Speech and Grammar III, Spoken Language Working Group, Ed. Tokyo: Kurosio Publishers, 2001 [in Japanese], pp. 3–16.

[25] J. J. Venditti, K. Maeda, and J. P. H. van Santen, “Modeling Japanese boundary pitch movements for speech synthesis,” in Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis, 1998, pp. 317–322.

[26] Y. Greenberg, N. Shibuya, M. Tsuzaki, H. Kato, and Y. Sagisaka, “A trial of communicative prosody generation based on control characteristic of one word utterance observed in real conversational speech,” in Proc. Speech Prosody, PS8-8-37, 2006.

[27] K. Iwata and T. Kobayashi, “Expressing speaker’s intentions through sentence-final intonations for Japanese conversational speech synthesis,” in Proc. Interspeech, Mon.P2b.03, 2012.

[28] S. Fujie, Y. Matsuyama, H. Taniyama, and T. Kobayashi, “Conversation robot participating in and activating a group communication,” in Proc. Interspeech, 2009, pp. 264–267.

[29] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3–4, pp. 187–207, April 1999.

[30] J. H. Ward, Jr., “Hierarchical grouping to optimize an objective function,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 236–244, March 1963.
