
Synopsis

Word segmentation is important in developing a text-to-speech (TTS) system for Cantonese for several reasons.

(1) For any type of synthesis, words must be identified in the text in order to model positional effects such as fusion of coda /p/ with initial /h/ into aspirated [ph] within a word.

(2) Concatenative synthesis also requires a list of words large enough to identify all the word-internal sequences that must be recorded to model such positional effects. The only way to get such a list is to segment a large corpus.

(3) Concatenative synthesis with a fixed inventory of units also requires such a word list to identify the best basic units, and determine the optimal inventory of such units.

This paper describes our use of the Segmentation Corpus (a lexicon of 33k words extracted from a large corpus of Cantonese newspapers) to define and constrain an inventory of concatenative units.

1. Some facts about Cantonese

The term “Cantonese” is used here to refer to standard Hong Kong Cantonese (i.e. not to the original Canton City standard or other regional varieties spoken in neighboring counties).

Cantonese is written with Chinese characters, which pose the usual problems for text analysis, plus some additional ones (see section 3).

Cantonese morphology and word-level phonology are not well studied relative to Mandarin varieties (see section 2), and there are no standard dictionaries of polysyllabic words comparable to the Xiandai Hanyu Cidian 現代漢語詞典. However, ... There are newspapers written in Hong Kong Cantonese, which provide a basis for developing a segmented word list with compiled text frequencies (see section 2).

2. The Segmentation Corpus

The Segmentation Corpus was created using word-segmentation criteria developed by researchers at the Chinese Language Centre and the Dept. of Chinese and Bilingual Studies, Hong Kong Polytechnic University. The Cantonese corpus that we used is part of this larger corpus of segmented Chinese texts.

The Cantonese corpus is an electronic database of around 33k Cantonese word types extracted from a 1.7 million character corpus of Hong Kong newspapers, along with a tokenized record of the text.

An example of segmented text:

^^ 此外 ^^, ^^^^** 懲教署 ^^ 人員 ^^ 亦 ^^ 將 ^^ 會 ^^ 加強 ^^ 在 ^^ 營 ^^ 內 ^^ 搜索 ^^ 武器 ^^ 的 ^^ 行動 ^^, ^^^^ 在 ^^ 有 ^^ 需要 ^^ 時 ^^ 會 ^^ 有 ^^ 警方 ^^ 支援 ^^ 。 ^^^^ 政府 ^^ 亦 ^^ 將 ^^ 儘快 ^^ 安排 ^^ 加強 ^^ 圍欄 ^^ 的 ^^ 穩堅性 ^^, ^^^^ 及 ^^ 加強 ^^ 船民 ^^ 中心 ^^ 外圍 ^^ 的 ^^ 保安 ^^ 。 ^^
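The `^^` delimiters make the tokenized record straightforward to process. The following is a minimal sketch, assuming that `^^` marks every word boundary and that whitespace around a token is not significant; the function name `split_segmented` is hypothetical and is not part of the corpus tools.

```python
from collections import Counter


def split_segmented(text: str, marker: str = "^^") -> list[str]:
    """Split one line of segmented text into tokens, dropping empty pieces."""
    tokens = []
    for piece in text.split(marker):
        token = piece.strip()
        if token:  # consecutive markers produce empty pieces; skip them
            tokens.append(token)
    return tokens


# A fragment of the example above (punctuation is kept as a token).
sample = "^^ 政府 ^^ 亦 ^^ 將 ^^ 儘快 ^^ 安排 ^^ 加強 ^^ 圍欄 ^^ 的 ^^ 穩堅性 ^^, ^^^^ 及 ^^ 加強 ^^"
tokens = split_segmented(sample)

# Token frequencies of the kind stored in the word list.
freq = Counter(tokens)
```

Counting tokens over the whole corpus in this way would give the token frequencies that accompany each word entry.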


A snippet of the resultant word list, where each word entry is a string of Chinese characters followed by a pronunciation field and the token frequency.

有 jau5 9292 ‘have’

警方 ging2 fong1 493 ‘police’

支援 zi1 wun4 45 ‘support’

政府 zing3 fu2 2051 ‘government’

亦 jik6 2716 ‘also’

將 zoeng1 4097 ‘will (aux.)’

儘快 zeon6 faai3 86 ‘as soon as possible’

安排 on1 paai4 364 ‘arrange’

加強 gaa1 koeng4 305 ‘strengthen’

圍欄 wai4 laan4 3 ‘fence’
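An entry in this format can be read mechanically. The sketch below assumes one entry per line, laid out exactly as in the snippet (characters, one Jyutping syllable per character, token frequency, gloss in curly quotes); the `Entry` type and `parse_entry` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    word: str                   # string of Chinese characters
    jyutping: tuple[str, ...]   # one romanized syllable per character
    freq: int                   # token frequency in the corpus
    gloss: str                  # English gloss


# characters, then one or more syllables ending in a tone digit 1-6,
# then the frequency, then the gloss in curly quotes.
ENTRY_RE = re.compile(r"^(\S+)\s+((?:[a-z]+[1-6]\s+)+)(\d+)\s+‘(.*)’$")


def parse_entry(line: str) -> Entry:
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    word, syllables, freq, gloss = match.groups()
    return Entry(word, tuple(syllables.split()), int(freq), gloss)


entry = parse_entry("警方 ging2 fong1 493 ‘police’")
```

A useful sanity check on each parsed entry is that the number of syllables equals the number of characters.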

Segmentation criteria

A “word” is a string of Chinese characters that:

(1) is an independent part of speech, e.g. 盒子 ‘box’ (noun)

(2) has a meaning that is not simply a sum of its parts, e.g. 火車 ‘train’ (noun) ≠ 火 ‘fire’ and 車 ‘vehicle’

(3) consists of no more than four characters

(4) either is listed in Xiandai Hanyu Cidian 現代漢語詞典 or Zhongguo Chengyu Da Cidian 中國成語大辭典, or meets a predetermined frequency threshold (for strings of text not listed in these two dictionaries).
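Criteria (1) and (2) call for linguistic judgement, but criteria (3) and (4) can be checked mechanically. The sketch below assumes that the criteria apply conjunctively; the function name, the `listed_words` set standing in for the two dictionaries, and the threshold value used in the example are all hypothetical, since the actual threshold is not given here.

```python
def passes_formal_criteria(candidate: str,
                           listed_words: set[str],
                           corpus_freq: int,
                           min_freq: int) -> bool:
    """Check criteria (3) and (4) only; (1) and (2) need a human judge."""
    # Criterion (3): no more than four characters.
    if len(candidate) > 4:
        return False
    # Criterion (4): listed in a reference dictionary, or frequent enough.
    return candidate in listed_words or corpus_freq >= min_freq


# Hypothetical stand-in for the two dictionaries.
listed = {"火車", "盒子"}

in_dictionary = passes_formal_criteria("火車", listed, corpus_freq=0, min_freq=10)
frequent_enough = passes_formal_criteria("船民", listed, corpus_freq=12, min_freq=10)
too_rare = passes_formal_criteria("船民", listed, corpus_freq=3, min_freq=10)
too_long = passes_formal_criteria("加強船民中心", listed, corpus_freq=50, min_freq=10)
```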

However, segmentation is only half the work of developing the word list, because of the nature of the writing system ...

3. The Cantonese writing system

Multiple readings of a character pose a problem:

Orthographic forms where the variation is stylistic, e.g. 支援 tsi:1wu:n4 ~ tsi:1jyu:n4 ‘support’

Orthographic forms where the variation in pronunciation corresponds to different words, with different meanings, e.g.

正當 tseŋ3tɔ:ŋ1 ‘while’ (function word)

正當 tseŋ3tɔ:ŋ3 ‘proper’ (content word)

Particles can be written with special Cantonese characters, e.g. 囉 lɔ:1, 咩 mɛ:1, 喎 wɔ:3, or, in more formal writing, they may be left to the reader to interpolate from a character “borrowed” from some other morpheme: e.g. 的 writes the second morpheme in 目的 ‘aim’, but suggests kɛ:3 ‘genitive particle’, because it also writes a genitive particle de (in Pinyin) in Mandarin.

Therefore, to use the Segmentation Corpus word list for TTS, the first author:

examined each entry in the word list of the Corpus;

corrected many transliterations; and

adjusted frequencies when a single orthographic form writes more than one (phonological) word.

As a result of this processing, approximately 90 original entries were split into separate entries. That is, 32,840 entries became 33,037 entries.

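4. Cantonese phonology

Syllable structure

Each syllable (σ) carries a lexical tone and consists of an optional onset consonant (C) and a rhyme. The rhyme is either a long vowel alone (V:) or a vowel (V or V:) followed by an off-glide (G), a stop coda (S), or a nasal coda (N).

σ = syllable
S = stop coda
N = nasal coda
V: = long nuclear vowel
V = short nuclear vowel
G = off-glide

Syllabic nasals: m, ŋ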

19 Consonants

labial dental palatal velar labiovelar

ph th, tsh kh khw

p t, ts k kw

f s h

m n, l j ŋ w

The consonants p, t, k, m, n, ŋ, j and w (marked in red on the original slide) can occur in syllable-final position.

11 Vowels

front central back

round round

i: y: u: high

e ɵ o mid (short)

ɛ: œ: ɔ: mid (long)

ɐ a: low vowels

6 Tones

[Figure: F0 contours (100 to 300 Hz) plotted against time (0 to 0.7 s), with one labelled contour for each of tones 1 to 6.]

F0 contours for six words [wɐj] with different tones. Numbers to the right identify the endpoints of the two rising tones (in grey) and numbers to the left identify starting points of the other four tones (in black). The discontinuities in [wɐj4] are where the speaker breaks into creaky voice. HK Cantonese has five tones (i.e. all tones except tone 5) in contrast on syllables closed with [p, t, k].

Onset and rhyme counts:

Lexical tones = 6

Onsets = 19

Rhymes = 11 vowels * 8 codas + 11 vowels + 2 syllabic nasals = 101 rhymes in theory, if there are no phonotactic restrictions on VC combinations
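The theoretical rhyme count can be checked with a few lines of arithmetic. This is only a restatement of the figures given above; the variable names are ours.

```python
VOWELS = 11
CODAS = 8
SYLLABIC_NASALS = 2

closed_rhymes = VOWELS * CODAS   # every vowel with every coda: 88
open_rhymes = VOWELS             # a vowel with no coda: 11

# Upper bound only: it ignores phonotactic restrictions on VC combinations.
theoretical_rhymes = closed_rhymes + open_rhymes + SYLLABIC_NASALS
```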

The simplicity of the syllable structure and the small number of phonotactically possible syllable types make the syllable an attractive candidate basic unit for TTS (cf. Chu & Ching 1997). However, ...

Syllable fusion and phrase-final effects

E.g. 1.

集 tsa:p6 ‘to collect’ ([p] unreleased)

集合 tsa:p6hɐp6 ‘to assemble’ ([p] “fuses” with [h] to become released and aspirated)

E.g. 2.

[Figure: signal view (100 to 250 Hz, 0 to 1.56807 s) with a labelling window above it. Top tier: o5 jyun4 loi4 hai6 wai3. Tone tier: 223 221 221 22 333 HL%. Lowest tier: laa221+22.]

An utterance of the sentence o5 jyun4loi4 hai6 wai3 ‘Oh, I get it. It was the character 慰!’ (The context is a dictation task.) The labelling window above the signal view shows a partial transcription in the annotation conventions proposed by Wong, Chan & Beckman (in press), with a syllable-by-syllable Jyutping (粵拼) transliteration (top tier), a transcription of the (canonical) lexical tones and boundary tone, and a phonetic transcription of fused forms (lowest tier). Notice the fused form [jy:n21la:212] for the phrase 原來係 jyun4loi4 hai6 ‘was’ (with the verb cliticized onto the preceding tense adverb). The HL% boundary tone is a pragmatic morpheme, which we have translated with the ‘Oh, I get it.’ phrase.

5. Choosing a basic unit for concatenative TTS

Compare 3 strategies of unit selection:

‘economist’ 經濟學家                                      basic units
Jyutping      ging1 zai3 hok6 gaa1                        (except. units)
Chu & Ching   keŋ tsɐj hɔ:k ka:#                          1042 (1042)
Law & Lee     #k eŋ$ts ɐj$h ɔ:k$k a:#                     1801
diphones      #ke eŋ ŋ$ts tsɐ ɐj j$hɔ: ɔ:k ka: a:#        1097

The table above illustrates the string of basic units and exceptional units (underlined on the original slide) that would be needed to synthesize an utterance of the word ‘economist’. (Tones ignored; last column shows the theoretically possible number of basic units.)

Chu & Ching (1997) use the syllable as the basic concatenative unit.

Law & Lee (2000) replace the syllable with a necessarily cross-syllabic unit, the “final-initial combination”, as the basic unit, augmented with word-initial onsets and word-final rhymes for the transitions out of and into a pause.

Our diphone model uses positionally sensitive diphones as the basic concatenative units.
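To make the Law & Lee unit type concrete, the sketch below derives “final-initial combination” units from a Jyutping string, ignoring tone as in the table. It is an illustration under simplifying assumptions, not Law & Lee's implementation: the function names are hypothetical, and a following syllable with no onset is left with an empty onset slot rather than being paired with its initial vowel.

```python
# The 19 Jyutping onsets, digraphs first so that "gw", "kw" and "ng"
# are matched before "g", "k" and "n".
ONSETS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n",
          "l", "g", "k", "h", "z", "c", "s", "j", "w"]


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a Jyutping syllable into (onset, rhyme), dropping the tone digit."""
    base = syllable.rstrip("123456")
    if base in ("m", "ng"):          # syllabic nasals have no onset
        return "", base
    for onset in ONSETS:
        if base.startswith(onset) and len(base) > len(onset):
            return onset, base[len(onset):]
    return "", base                  # vowel-initial syllable


def final_initial_units(word: str) -> list[str]:
    """Word-initial onset, rhyme$onset junctures, then word-final rhyme."""
    parts = [split_syllable(s) for s in word.split()]
    units = ["#" + parts[0][0]]
    for (_, rhyme), (next_onset, _) in zip(parts, parts[1:]):
        units.append(f"{rhyme}${next_onset}")
    units.append(parts[-1][1] + "#")
    return units


units = final_initial_units("ging1 zai3 hok6 gaa1")
# units == ["#g", "ing$z", "ai$h", "ok$g", "aa#"]
```

In Jyutping notation this reproduces the five units shown in the Law & Lee row for ‘economist’.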

The counts

(Rhyme counts in all three models follow the standard Jyutping syllabary: 52 rhyme types + 2 syllabic nasals = 54 rhymes.)

Chu & Ching model:
(19 onsets * 52 rhymes) + 52 rhymes + 2 syllabic nasals = 1042 syllable types

Law & Lee model:
onsets = 19 in initial position
rhymes = 54 in final position
cross-syllabic units = 54 rhymes * 32 ways to start a syllable [i.e. 19 initial onsets + 11 vowels + 2 syllabic nasals] = 1728
SUM(subtotals) = 19 + 54 + 1728 = 1801 unit types
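The two totals can be re-derived directly from the figures above. The sketch below only restates that arithmetic; the variable names are ours.

```python
ONSETS = 19
RHYME_TYPES = 52
SYLLABIC_NASALS = 2
VOWELS = 11

rhymes = RHYME_TYPES + SYLLABIC_NASALS                     # 54

# Chu & Ching: syllables with an onset, onsetless syllables, syllabic nasals.
chu_ching_units = ONSETS * RHYME_TYPES + RHYME_TYPES + SYLLABIC_NASALS

# Law & Lee: initial onsets + final rhymes + cross-syllabic units.
syllable_starts = ONSETS + VOWELS + SYLLABIC_NASALS        # 32
cross_syllabic = rhymes * syllable_starts                  # 1728
law_lee_units = ONSETS + rhymes + cross_syllabic
```

The Law & Lee sum agrees with the 1801 shown in the comparison table for ‘economist’.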

The counts (cont’d)

Our diphone model:
#(C)V = 209 combinations of consonant onsets followed by a vowel + 13 ways to begin a word without an onset [2 syllabic nasals included] = 222
word-final rhymes = 54 rhymes * 2 positions (non-final vs. phrase/word-final) = 108
cross-syllabic diphones after open syllables = 13 ways to end a syllable without a coda consonant * 42 onset types [i.e. 18 initial consonants other than /h/ + 11 qualities of /h/ before the different vowels + 11 vowels when there is no onset + 2 syllabic nasals] = 546
cross-syllabic diphones where the 1st syllable has a sonorant coda consonant = 5 sonorant coda consonants * 42 onset types [see above] = 210
p-fusion = /p/ coda * 11 vowel qualities of initial /h/ = 11
SUM(subtotals) = 222 + 108 + 546 + 210 + 11 = 1097 unit types
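The diphone subtotals can be checked in the same way. This sketch restates the arithmetic of the count above; the variable names are ours.

```python
ONSETS = 19
VOWELS = 11
SYLLABIC_NASALS = 2
RHYMES = 54
SONORANT_CODAS = 5

# Word-initial units: every onset with every vowel, plus onsetless starts.
onsetless_starts = VOWELS + SYLLABIC_NASALS                 # 13
word_initial = ONSETS * VOWELS + onsetless_starts           # 209 + 13 = 222

# Rhymes recorded in two positions (non-final and phrase/word-final).
rhyme_units = RHYMES * 2                                    # 108

# Onset types for cross-syllabic diphones: /h/ counted once per vowel quality.
onset_types = (ONSETS - 1) + VOWELS + VOWELS + SYLLABIC_NASALS  # 42
open_endings = VOWELS + SYLLABIC_NASALS                     # 13
after_open = open_endings * onset_types                     # 546
after_sonorant = SONORANT_CODAS * onset_types               # 210

p_fusion = 1 * VOWELS                                       # 11

diphone_units = word_initial + rhyme_units + after_open + after_sonorant + p_fusion
```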

Advantages of our diphone model

It differentiates codas from onset consonants, i.e. rhyme aak$ ≠ cross-syllabic diphone aa$k.

Spectral continuity between the initial and the rhyme is captured in the CV diphones (e.g. #gi and zai).

The diphones capture the dependency between the quality of the [h] and that of the following vowel (i.e. one records separate cross-syllable diphones for i$ho, i$hi, i$haa, and so on).

The number of theoretically possible units is smaller than in Law & Lee’s model, because we do not record consonant sequences that abut silence with silence. E.g. aak$ can be combined directly with $ka or $ta, so no cross-syllabic units need to be recorded for k$k and k$t.

Segmentation Corpus attested diphone types using tones: 2292

For comparison, the number of attested diphones ignoring tone: 634

Recording each diphone in a disyllabic carrier word, a Cantonese speaker could speak all of the words needed to make a new voice in a single recording session.

Why use tones? For naturalness.

In Cantonese, every syllable bears a (full) tone; tones are rarely deleted in running speech.

Voice quality is part of the tonal specification, as suggested by the contour for tone 4. Recording different units for rhymes with different tones should be desirable.

We need to ensure tonal continuity when sonorant segments of different tone sequences abut at syllable edges in different cross-syllabic units.

6. Conclusion

We have shown one way of using a segmented database to inform the design of a unit inventory for TTS.

We have augmented the Segmentation Corpus with transliterations that would let us predict more accurately the pronunciation that a Cantonese speaker adopting a careful speaking style would be likely to produce for a character sequence.

Judgements about the phonology of Cantonese, in combination with the new word list, and the associated word frequency data, can be used to assess the costs and likely benefits of different strategies for unit selection in Cantonese TTS.

We present data indicating the feasibility of a new diphone selection strategy that finesses some of the problems in modelling the interactions between tone and segmental identity.

It remains to be demonstrated that this strategy can actually deliver the results which it appears to promise.

7. References

Chan S. D. and Tang Z. X. (1999) Quantitative Analysis of Lexical Distribution in Different Chinese Communities in the 1990’s. Yuyan Wenzi Yingyong [Applied Linguistics], No.3, 10-18.

Chu M. and Ching P. C. (1997) A Cantonese synthesizer based on TD-PSOLA method. Proceedings of the 1997 International Symposium on Multimedia Information Processing. Academia Sinica, Taipei, Taiwan, Dec. 1997.

Law K. M. and Lee T. (2000) Using cross-syllable units for Cantonese speech synthesis. Proceedings of the 2000 International Conference on Spoken Language Processing, Beijing, China, Oct. 2000.

Wong W. Y. P., Chan M. K-M., and Beckman M. E. (in press) An autosegmental-metrical analysis and prosodic conventions for Cantonese. To appear in S-A. Jun, ed. Prosodic Models and Transcription: Towards Prosodic Typology. Oxford University Press.
