Center for Speech and Language Technologies, Tsinghua University Dialectal Chinese Speech Recognition Thomas Fang Zheng Aug. 24, 2007 @ Cambridge University, UK


Center for Speech and Language Technologies, Tsinghua University

Dialectal Chinese Speech Recognition

Thomas Fang Zheng

Aug. 24, 2007

@ Cambridge University, UK


Motivation Goal Knowledge Data Collection Workshop Conclusion I SDPBMM Conclusion II


Outline

Motivation

Dialectal Chinese database collection
– Wu
– Min
– Chuan

Approaches
– Chinese syllable mapping
– Lexicon adaptation
– State-dependent phoneme-based model merging (SDPBMM)
– Integration of SDPBMM with adaptation

Remarks


Motivation

Chinese ASR faces an issue that is bigger than in any other language: dialect.

There are 8 major dialectal regions in addition to Mandarin (Northern China):
– Wu (Southern Jiangsu, Zhejiang, and Shanghai);
– Yue (Guangdong, Hong Kong, Nanning Guangxi);
– Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan);
– Hakka (Meixian Guangdong, Hsin-chu Taiwan);
– Gan (Jiangxi);
– Xiang (Hunan);
– Hui (Anhui);
– Jin (Shanxi, Hohhot Inner Mongolia).

These can be further divided into over 40 sub-categories.


Chinese dialects share the same written language:
– the same Chinese pinyin set (canonically),
– the same Chinese character set (canonically), and
– the same vocabulary (canonically).

Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China.

However, speech is strongly influenced by the native dialect. Most Chinese people speak both standard Chinese and their own dialect, which results in dialectal Chinese: Putonghua influenced by the native dialect.

In dialectal Chinese:
– Word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect.
– ASR relies to a great extent on consistent pronunciation and usage of words within a language.
– ASR systems built for PTH perform poorly for the great majority of the population.


Research Goal

To develop a general framework for modeling, in dialectal Chinese ASR tasks:
– phonetic variability,
– lexical variability, and
– pronunciation variability.

To find suitable methods for modifying the baseline PTH recognizer into a dialectal Chinese recognizer for the specific dialect of interest, employing:
– dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and
– training/adaptation data (in relatively small quantities).

Expectation: the resulting recognizer should also work for PTH; in other words, it should perform well on a mixture of PTH and dialectal Chinese.

This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop, from tens of proposals submitted by universities and companies around the world. The project was postponed to 2004 due to SARS.


Dialectal Chinese Speech Recognition Framework

[Diagram: Standard Chinese Speech Recognizer + Dialectal Chinese Related Knowledge & Resources → Dialectal Chinese Speech Recognizer]


For practical reasons, during the summer we focused on a single dialect, the Wu dialect (Shanghai area); the target language was Wu dialectal Chinese (WDC for short).

Why the Wu dialect?
– Population: more than 70 million people speak the Wu dialect, making it the second most popular dialect in China.
– Economy: Shanghai is one of the most advanced cities in China.
– The Wu dialect is a fully developed language:
  – its syntax is very complex;
  – its vocabulary is even larger than that of Mandarin;
  – many literary masterpieces were historically influenced by the Wu dialect.

Number of phonemes:
Wu | Mandarin | Cantonese
50 | 37 | <33


Useful Dialect-Related Knowledge

Chinese Syllable Mapping (CSM)

The CSM is dialect-related.

Two types:
– Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on.
– Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in the word '中国 (China)', but only the tone is changed in the word '过去 (past)'.
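As a rough illustration of the word-independent case, the sketch below applies the Initial mappings zh→z, ch→c, sh→s, n→l and the Final mappings eng→en, ing→in to a toneless pinyin syllable. The function names, the Initial inventory, and the treatment of y and w as Initials are assumptions made for this example only; word-dependent mappings would need a per-word table instead.

```python
# Word-independent Chinese syllable mapping (CSM), a minimal sketch.
# The rule tables follow the Southern Chinese examples; everything else
# (names, Initial inventory) is an assumption for illustration.
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

# Pinyin Initials, two-letter ones first so "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (Initial, Final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-Initial syllable such as "an"


def apply_csm(syllable):
    """Return the dialect-influenced syllable predicted by the rules."""
    initial, final = split_syllable(syllable)
    return INITIAL_MAP.get(initial, initial) + FINAL_MAP.get(final, final)
```

For example, `apply_csm("sheng")` applies both an Initial and a Final rule, while a syllable that matches no rule is returned unchanged.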


The CSM is not exact. For a mapping A→B, the resulting pronunciation is usually not exactly B, but something quite similar to B, and more similar to B than to any other syllable.

[Diagram: syllable A maps into the neighbourhood of B, which contains the variants B1, B2, B3, and B4]

Bi is a variation of B, such as nasalization, centralization, voiced, voiceless, rounding, syllabic, pharyngealization, or aspiration.

[Diagram: mapping between the Standard Chinese syllable set (ke, kuo, kui) and the Chuan dialect (kei, kuo, kui, ...), with the example words 上[课] and [克]服 for ke, and [扩]大 and [魁]梧 for kuo and kui]

The CSM could be N→1, 1→N, or crossed.


Lexicon

Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60~70%.

A dialect-related lexicon contains two parts:
– a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and
– a dialect-related part (several hundred words).

In this lexicon:
– each word has one pinyin string for its standard Chinese pronunciation and a representation of its dialectal Chinese pronunciation, and
– each dialect-related word corresponds to a word in the common part with the same meaning.
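One way to picture such a two-part lexicon is sketched below. The class and field names are hypothetical, and the dialectal pronunciation is left optional because its representation is not specified here; the only constraint enforced is that every dialect-related word points to a common-part word with the same meaning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    word: str
    pth_pron: str                       # pinyin string (standard Chinese)
    dialect_pron: Optional[str] = None  # dialectal pronunciation, if known
    pth_synonym: Optional[str] = None   # set only for dialect-related words


def build_lexicon(common, dialect_related):
    """Combine both parts, checking that each dialect word has a synonym
    in the common part."""
    lexicon = {entry.word: entry for entry in common}
    for entry in dialect_related:
        if entry.pth_synonym not in lexicon:
            raise ValueError(f"no common-part synonym for {entry.word}")
        lexicon[entry.word] = entry
    return lexicon
```

With this layout, a recognized dialect word can be traced back to its common-part synonym through `pth_synonym`.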


Language

Although dialect texts are difficult to collect, dialect-related lexical entry replacement rules can be learned in advance; language post-processing or language model adaptation techniques can therefore be adopted.


[Diagram: (1) in the word sequence w1 w2 w3 …, the word w2 is replaced by a dialectal word V2; (2) the words w2 and w3 swap positions]

1. Dialectal words substitute for some words:
我 做饭 给 你 吃 (PTH)
我 烧饭 给 你 吃 (Wu)

2. Word-order changes:
你 先 走 (PTH)
你 走 先 (Wu)
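A minimal sketch of rule-based language post-processing for the two phenomena illustrated here (word substitution and word-order change). The rule tables hold only the two example pairs, and the function name and the restriction to adjacent word pairs are assumptions made for illustration; real rules would be learned in advance from dialect knowledge.

```python
# Hypothetical rule tables built from the two examples.
WORD_SUBSTITUTIONS = {"烧饭": "做饭"}  # Wu word -> PTH synonym
WORD_ORDER_SWAPS = {("走", "先")}      # Wu order (a, b) -> PTH order (b, a)


def normalize_to_pth(words):
    """Map a segmented dialectal sentence to its PTH form."""
    # Step 1: replace dialectal words with their PTH synonyms.
    out = [WORD_SUBSTITUTIONS.get(w, w) for w in words]
    # Step 2: undo dialectal word-order changes on adjacent word pairs.
    i = 0
    while i < len(out) - 1:
        if (out[i], out[i + 1]) in WORD_ORDER_SWAPS:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out
```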


AM0 = AM for standard Chinese
AM1 = AM with accent
AM2 = AM with dialect
LM0 = LM for standard Chinese
LM1 = LM with dialectal lexicon
LM2 = LM with dialectal lexicon/syntax

[Diagram: grid of AM0, AM1, AM2 against LM0, LM1, LM2, running from "Standard Chinese" to "Dialect"; regions of the grid are marked "Our focus" and "Seldom seen in dialectal Chinese"]


Database Collection: Data Creation for WDC

[Diagram: data creation for WDC, in two branches. Database: database collection (read speech with PTH words only or with PTH + Wu words; spontaneous speech on topics) and speech transcription (IF & syllable set definition). e-Dictionary: PTH words, Wu dialect words, and misc info (IFs/GIFs, syllables), with the fields C-chars, PTH pron., Wu dialect pron., and PTH synonym]

IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese


Wu Dialectal Chinese (WDC) Database Collection (1)

Collection: 11 hours in total, half read (R) and half spontaneous (S):
– 100 Shanghai speakers × (3R + 3S) minutes per speaker
– 10 Beijing speakers × 6S minutes per speaker

Read speech with well-balanced prompting sentences:
– Type I: each sentence contains PTH words only (5-6k)
– Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the other words are PTH words

Spontaneous speech with pre-defined talking topics:
– Conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology

Balanced speakers (gender, age, education, PTH level, …)


WDC Data Diversity

Actual:
Num of speakers | Male | Female | Total
Age 26-40 | 27 | 25 | 52
Age 41-50 | 23 | 25 | 48
Education: Well | 41 | 41 | 82
Education: Ordinary | 9 | 9 | 18

Goal:
Gender | Male: 50% | Female: 50%
Age | 26-40: 50% | 41-50: 50%
Education | Ordinary: 20% | Well: 80%


Accent Assessment by experts

[Bar chart: distribution of speakers over the accent levels 1A, 1B, 2A, 2B, 3A, 3B; vertical axis from 0 to 70]

Accent levels: 1A. CCTV-level radio broadcaster; 1B. Province-level radio broadcaster; 2A. Quite good; 2B. Less accented; 3A. More accented; 3B. Hard to understand, but recognizable as PTH


Accent Assessment according to age

[Bar chart: distribution of speakers over the accent levels 1A, 1B, 2A, 2B, 3A, 3B for the age groups 26-40 and 41-50; vertical axis from 0 to 35]


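Accent Assessment according to education level

[Bar chart: distribution of speakers over the accent levels 1A, 1B, 2A, 2B, 3A, 3B for the education levels Ordinary and Well; vertical axis from 0 to 50]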


Accent Assessment according to gender

[Bar chart: distribution of speakers over the accent levels 1A, 1B, 2A, 2B, 3A, 3B for Male and Female speakers; vertical axis from 0 to 35]


Wu Dialectal Chinese (WDC) Database Collection (2)

Transcriptions include:

For the 100 Wu dialectal Chinese speakers:
– canonical Chinese Initial/Final labels, and
– generalized IF (GIF) labels.

For the 10 Beijing speakers:
– Chinese character and pinyin transcriptions only.


Dialectal Lexicon Construction

Establish a 50k-word electronic dialect dictionary, with each word having:
– its PTH pronunciation as a PTH IF string, and
– its Wu dialect pronunciation as a Wu IF string.

Purpose: summarizing dialect-related knowledge.

Figure out Chinese syllable mappings:
– same written form (character), different pronunciations;
– both word-independent and word-dependent.

Find dialect-related word variations:
– same meanings in the Chinese language;
– different written forms (characters);
– uttered in the standard Chinese manner;
– for LM adaptation/modification.


e-Dictionary Word Examples

Word No. | Word | Pronunciation in PTH | Pronunciation in Wu Dialect | Remark
1644 | 本金 | ben3 jin1 | b en3 j in1 | unchanged
1646 | 本科 | ben3 ke1 | b en3 k u1 | Final changed only
1652 | 本领 | ben3 ling3 | b en3 l in2 | Final & tone
1656 | 本末倒置 | ben3 mo4 dao4 zhi4 | b en3 m ek5 d o^3 z ii3 | Entering Sound, Final change, CI Initial change, CD Final change
1659 | 本票 | ben3 piao4 | b en3 p voe3 | Final & tone changes
1660 | 本期 | ben3 qi1 | b en3 jj i2 | Voiced Initial, tone change
1661 | 本钱 | ben3 qian2 | b en3 jj i2 | 1660 & 1661: different in PTH, same in Wu
1662 | 本人 | ben3 ren2 | b en3 n in2 | CD Initial & Final change
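The remarks attached to these e-Dictionary entries can be derived mechanically by comparing the two pronunciation strings syllable by syllable. The sketch below is one hypothetical way to do so; it assumes that a PTH syllable is pinyin plus a tone digit, that a Wu syllable is written as an Initial followed by a Final with a tone digit, and that every syllable has an explicit Initial. The function names and the Initial inventory are not from the dictionary itself.

```python
# Pinyin Initials, two-letter ones first so "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def parse_pth(pinyin):
    """'ben3 ke1' -> [('b', 'en', '3'), ('k', 'e', '1')]."""
    result = []
    for syl in pinyin.split():
        body, tone = syl[:-1], syl[-1]
        initial = next((i for i in INITIALS if body.startswith(i)), "")
        result.append((initial, body[len(initial):], tone))
    return result


def parse_wu(if_string):
    """'b en3 k u1' -> [('b', 'en', '3'), ('k', 'u', '1')]."""
    tokens = if_string.split()
    pairs = zip(tokens[0::2], tokens[1::2])
    return [(ini, fin[:-1], fin[-1]) for ini, fin in pairs]


def describe_changes(pth, wu):
    """For each syllable, list which of Initial/Final/tone differ."""
    names = ("Initial", "Final", "tone")
    report = []
    for p, w in zip(parse_pth(pth), parse_wu(wu)):
        report.append([n for n, a, b in zip(names, p, w) if a != b])
    return report
```

For entry 1646, `describe_changes("ben3 ke1", "b en3 k u1")` reports no change for the first syllable and a Final change for the second.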


Post-workshop Database Collection -- Min and Chuan

* With the aid of the Chinese Academy of Social Sciences (CASS)


Name: Min-dialectal Chinese database
Dialectal accent: Xiamen city, Fujian province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 18~30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final


Name: Chuan-dialectal Chinese database
Dialectal accent: Chengdu city, Sichuan province
Sampling rate: 22,050 Hz
Channels: 3 (two conventional microphones, one USB microphone)
Speakers: 36
Age: 20~30
Gender: 18 females, 18 males
Constituent: 200 long sentences, 10 digits, 26 English letters per speaker
Transcription: Chinese character/syllable/Initial-Final


Accent distribution for Min/Chuan-dialectal Chinese corpora


Workshop Experiments

Experiment conditions:

Toolkit: HTK 3.2.1.

Data set division:
– Spontaneous speech data only.
– Data were split according to age (younger, older), education (higher, lower), and PTH level into:
  – Training Set: 80 speakers
  – devTest Set: 20 speakers (a part of devTrain)
  – Test Set: 20 speakers

Acoustic model:
– trained from Mandarin Broadcast News (MBN);
– 39-dimensional MFCC_E_D_A_Z;
– diagonal covariance matrix;
– 4 states per unit;
– 103,041 units (triIF), 10,641 real units (triIF);
– 3,063 different states (after state tying);
– 16 mixtures per state, 28 mixtures per state for the silence unit.

Language model:
– built on HKUST 100-hour CTS data, plus Hub5, plus Wu-dialectal training data transcriptions.


Observation on WDC Data

IF-mapping / Syllable-mapping:
– Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces any of a certain set of IFs as another IF, and there are rules to follow, such as zh -> z, ch -> c, sh -> s, and so on.

Observations on three sets, Train (80 speakers), devTest (20), and Test (20):
– Mapping pairs are almost the same among all three sets.
– Mapping pairs are almost identical to experts' knowledge.
– Mapping probabilities are also almost equal.

Remarks:
– Experts' knowledge could be useful.
– Mapping rules can be learned from less data.
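Mapping pairs and mapping probabilities of this kind can be estimated by counting over aligned canonical and surface IF labels. The sketch below assumes the alignment is already available as (canonical, surface) pairs; the function name and the `min_prob` pruning parameter are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_if_mappings(aligned_pairs, min_prob=0.05):
    """Estimate P(surface IF | canonical IF) from aligned label pairs.

    aligned_pairs: iterable of (canonical_if, surface_if) tuples.
    min_prob: mappings rarer than this are pruned (assumed parameter).
    """
    counts = defaultdict(Counter)
    for canonical, surface in aligned_pairs:
        counts[canonical][surface] += 1
    mappings = {}
    for canonical, surface_counts in counts.items():
        total = sum(surface_counts.values())
        mappings[canonical] = {
            surface: count / total
            for surface, count in surface_counts.items()
            if count / total >= min_prob
        }
    return mappings
```

Running the same estimate on two data sets and comparing the resulting dictionaries is one way to check whether the mapping pairs and probabilities agree across sets.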


Using only devTest set + dialect-based knowledge

Step 1: Apply PTH-IF mapping rules;

Step 2: Apply WDC-IF mapping rules;

Step 3: Apply syllable-dependent mapping rules;

Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.


Why try this method?

– "IF-mapping" in dialectal Chinese is a fact (humans use it).
– "In-domain data training" will surely give a good result, but collecting data is a huge task, especially for 40 sub-dialects of Chinese.
– "Mere adaptation" would be easier and better, but might make it hard to distinguish the IFs in a mapping pair: each pair tends to become a single IF.
– This is not practical in applications such as call centers, where no further information about the speakers is available and a mixture of WDC and PTH is used.
– A knowledge-based method is expected to give good overall performance for both WDC and PTH.


Step 1: Applying PTH-IF mapping rules

Rules are based on experts' knowledge (with the AM unchanged):
(zh, z) (z, zh) (ch, c) (c, ch) (sh, s) (s, sh) (eng, en) (en, eng) (ing, in) (in, ing) (r, l)

The gain is not very significant: 0.5% Chinese Character Error Rate (CER) reduction.

Pronunciation entry probability does not help improve performance.
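To make the effect of these rules on the lexicon concrete, the sketch below expands a word's canonical IF sequence into all variants licensed by the rule pairs listed for this step. Reading each pair as (canonical IF, alternative IF) is an assumption, as are the function name and the choice to generate the full cross-product of alternatives.

```python
from itertools import product

# Rule pairs from this step, read here as (canonical IF, alternative IF).
PTH_IF_RULES = [("zh", "z"), ("z", "zh"), ("ch", "c"), ("c", "ch"),
                ("sh", "s"), ("s", "sh"), ("eng", "en"), ("en", "eng"),
                ("ing", "in"), ("in", "ing"), ("r", "l")]


def expand_pronunciation(ifs):
    """Return all pronunciation variants of a word given as a list of IFs.

    Each IF either stays canonical or is replaced by its mapped IF; the
    canonical pronunciation is always the first variant returned.
    """
    alternatives = []
    for unit in ifs:
        options = [unit] + [dst for src, dst in PTH_IF_RULES if src == unit]
        alternatives.append(options)
    return [list(variant) for variant in product(*alternatives)]
```

A word with two rule-affected IFs, such as the sequence sh eng, yields four variants, which shows how quickly unrestricted expansion can grow the lexicon.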


Step 2: Applying WDC-IF mapping rules

There are indeed some IFs specific to Wu dialectal Chinese, such as iao -> io^.

– Rules are learned from devTest.
– The newly introduced WDC-specific IFs are trained from devTest using an adaptation method.
– 8.66% absolute CER reduction.
– MLLR adaptation outperforms MLLR+MAP:
  – about 10% difference;
  – possibly due to the small amount of data.

We refer to this as surface-form (WDC) MLLR adaptation. For comparison, base-form (PTH) MLLR adaptation, in which only canonical IFs are used, is also evaluated.


Step 3: Apply syllable-dependent mapping rules

Assumption: most IF-mappings are context-independent, but some are syllable-dependent (such as iii|(sh iii) -> ii|(s ii)), and we believe there are others.

– Rules are learned from devTest.
– We did not succeed in improving the accuracy; on the contrary, character accuracy dropped by about 6%.
– We do not have a clear explanation yet.
– So we keep using context-free mapping rules.


Step 4: Multi-pronunciation expansion (MPE) based on unigram probability

Motivation: more pronunciations help model pronunciation variation but also lead to more confusion, so there should be a tradeoff.

Accumulated unigram probability (AccProb) is used as the criterion:
– only words with higher unigram probabilities have multiple pronunciations each;
– words with lower unigram probabilities have a single standard pronunciation each.
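A minimal sketch of the AccProb criterion, assuming unigram probabilities are available as a dictionary: words are ranked by descending probability and selected for multi-pronunciation expansion until the accumulated probability would exceed the threshold. The function name is hypothetical, and words with tied probabilities are cut wherever the sort happens to place them, which is a simplification.

```python
def select_words_for_expansion(unigram_probs, acc_prob_threshold):
    """Return the set of words that receive multiple pronunciations.

    unigram_probs: dict mapping word -> unigram probability.
    acc_prob_threshold: AccProb cut-off in [0, 1]; 0 means no expansion,
    1 means every word is expanded.
    """
    ranked = sorted(unigram_probs.items(), key=lambda kv: kv[1], reverse=True)
    selected, accumulated = set(), 0.0
    for word, prob in ranked:
        # Small tolerance so that rounding error does not drop a word
        # that lands exactly on the threshold.
        if accumulated + prob > acc_prob_threshold + 1e-12:
            break
        accumulated += prob
        selected.add(word)
    return selected
```

All remaining words keep their single standard pronunciation.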


The Multi-Pronunciation Expansion Criterion

Word | Prob. (descending) | Acc. Prob. | Note
(none) | | 0.000 |
</s> | 0.10782136 | 0.108 |
的 | 0.03608752 | 0.144 |
你 | 0.02161165 | 0.194 |
是 | 0.01907339 | 0.213 |
标准 | 0.00005742 | 0.899 | Actual minimum
… | 0.00005742 | |
团 | 0.00005742 | 0.900 | Desired point
… | 0.00005742 | |
最多 | 0.00005742 | 0.901 | Actual maximum
鲫鱼 | 0.00000124 | 1.000- |
黛 | 0.00000124 | 1.000- |

Words above the cut-off point: multi-pronunciation expansion. Words below it: single pronunciation (standard).


Base-form MLLR + PTH-IF mapping + MPE (CER)

AccProb | VocSizeRatio | CER-B
0% | 1.00 | 63.9
80% | 1.01 | 62.98
90% | 1.05 | 62.95
92% | 1.07 | 62.97
94% | 1.10 | 63.07
96% | 1.15 | 63.15
100% | 1.87 | 63.55

[Chart: CER-B (axis from 62.0 to 65.5) and VocSizeRatio (axis from 1.00 to 2.00) plotted against AccProb (0%, 80%, 90%, 92%, 94%, 96%, 100%)]

AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion. Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.10.


Surface-form MLLR + WDC-IF mapping + MPE (CER)

AccProb | VocSizeRatio | CER-S
0% | 1.00 | 65.47
80% | 1.04 | 62.32
90% | 1.12 | 62.23
92% | 1.17 | 62.29
94% | 1.24 | 62.15
96% | 1.35 | 62.38
100% | 3.03 | 63.77

[Chart: CER-S (axis from 62.0 to 65.5) and VocSizeRatio (axis from 1.00 to 3.50) plotted against AccProb (0%, 80%, 90%, 92%, 94%, 96%, 100%)]

AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion. Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.24.


[Chart: CER (axis from 62.0 to 66.0) plotted against AccProb (0%, 80%, 90%, 92%, 94%, 96%, 100%) for two systems: base-form MLLR + PTH-IF mapping + MPE, and surface-form MLLR + WDC-IF mapping + MPE]

AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion. Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.24.


Performance improvement comparison: overall, and in terms of speaker clusters

[Chart: CER% (axis from 55 to 85) for the methods Baseline, PTH-Mapping, WDC-Mapping, and MPE, with one series for each of AO, AY, GM, GF, EL, EH, MA, MS, and Total]


Q: How well does the resulting WDC recognizer recognize PTH?

– We obtain the WDC recognizer from the PTH recognizer.
– On average, we get a CER reduction of over 10% when recognizing WDC.
– How about using it to recognize PTH?


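[Diagram: two panels showing the models for sh and s. Adaptation (conventional method): the sh and s models move together into a combined sh/s region. Rule+MPE (our method): the sh and s models stay separate]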


When the WDC recognizer is used to recognize PTH, we can expect performance to degrade, but we would expect it not to degrade too much.

Results: using the WDC recognizer, you get
– over 10% CER reduction when recognizing WDC;
– 0.62% CER increase when recognizing PTH.


Conclusions:

Using dialect knowledge is useful and effective.

This project involves several problems: channel, speaking style, dialect background, and domain.

It is easier to address all of these problems at once by simply using the adaptation method.

Our method focuses only on the dialect problem.

The results of our method could be better if we integrated methods that address channel and speaking style.


State-Dependent Phoneme-Based Model Merging (SDPBMM)

At the acoustic level, approaches include:
– retraining the AM on standard speech plus a certain amount of dialectal speech;
– interpolation between standard-speech-based HMMs and their corresponding dialectal-speech-based HMMs;
– combination of the AM with state-level pronunciation modeling;
– adaptation of the standard-speech-based AM with a certain amount of dialectal speech.

Existing problems:
– A large amount of dialectal speech is needed to build dialect-specific acoustic models.
– The acoustic model cannot perform well on both standard and dialectal speech recognition.
– Some acoustic modeling methods are too complicated to be deployed readily.


What we proposed:

Taking a precise context-dependent HMM from the standard speech and its corresponding less precise context-independent HMM from dialectal speech into consideration simultaneously

Merging HMMs on a state-level basis according to certain criteria


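Illustration for SDPBMM

[Diagram: a decision tree for state 2 of the standard Chinese tri-XIF *-an+*[2], with the yes/no questions L_Stop?, R_Nasal?, L_Bilabial?, and L_Labial? leading to tied states such as b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], and b-an+m[2]; these are paired with the dialectal Chinese mono-XIF states an[2] / ang[2]]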


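pdf for merged state:

\[
\begin{aligned}
p'\bigl(x \mid s_i^{(sc)}\bigr)
 &= \lambda\, p\bigl(x \mid s_i^{(sc)}\bigr)
  + (1-\lambda) \sum_{m=1}^{M} p\bigl(x \mid s_{im}^{(dc)}\bigr)\, P\bigl(s_{im}^{(dc)} \mid s_i^{(sc)}\bigr) \\
 &= \lambda \sum_{k=1}^{K} w_{ik}^{(sc)} N_{ik}^{(sc)}(\cdot)
  + (1-\lambda) \sum_{m=1}^{M} \sum_{n=1}^{N} P\bigl(s_{im}^{(dc)} \mid s_i^{(sc)}\bigr)\, w_{imn}^{(dc)} N_{imn}^{(dc)}(\cdot),
\end{aligned}
\]

where each state pdf is a Gaussian mixture,

\[
p(x \mid s_i) = \sum_{k=1}^{K} w_{ik}\, N(x;\, \mu_{ik};\, \Sigma_{ik}),
\]

the superscripts (sc) and (dc) denote standard Chinese and dialectal Chinese, and \lambda is the interpolation weight.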
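The merged-state pdf can be sketched as a single larger Gaussian mixture. The code below assumes the merged pdf is the interpolation of the standard-Chinese state pdf (weight lam) with the dialectal state pdfs (weight 1 - lam, each scaled by P(dialectal state | standard state)), and that all components have diagonal covariances. The function names, the data layout, and the parameter name lam are assumptions made for this illustration.

```python
import math


def diag_gaussian(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, var)."""
    return sum(w * diag_gaussian(x, m, v) for w, m, v in components)


def merge_states(sc_state, dc_states, dc_probs, lam):
    """Merge one standard-Chinese state with its dialectal states.

    sc_state: list of (weight, mean, var) components.
    dc_states: list of such component lists, one per dialectal state.
    dc_probs: P(dialectal state | standard state), summing to 1.
    lam: interpolation weight in [0, 1] (assumed hyperparameter).
    """
    merged = [(lam * w, m, v) for w, m, v in sc_state]
    for state, prob in zip(dc_states, dc_probs):
        merged += [((1 - lam) * prob * w, m, v) for w, m, v in state]
    return merged
```

The merged state contains every component of every input state, which is why the number of Gaussian mixtures grows after merging.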


The disadvantage seen so far: the number of Gaussian mixtures in the merged state is expanded.

Is it possible to reduce this number? A straightforward criterion is a distance measure: the larger the distance, the more acoustic coverage is gained.
– Merging, if distance(d, s) ≥ threshold
– No merging, if distance(d, s) < threshold


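Pseudo-divergence (PD) based distance measure between two states is defined as follows:

$$\mathrm{distance}(A, B) = \frac{PD(A, B) + PD(B, A)}{2}, \qquad PD(P, Q) = \frac{\mathrm{Dispersion}(P, Q)}{\mathrm{Dispersion}(P, P)}$$

where

$$\mathrm{Dispersion}(P, Q) = \sum_{i=1}^{M} \sum_{j=1}^{N} w_{Pi}\, w_{Qj}\, d(P_i, Q_j)$$

and

$$d(P, Q) = \frac{1}{8} (\mu_P - \mu_Q)^T \left[\frac{\Sigma_P + \Sigma_Q}{2}\right]^{-1} (\mu_P - \mu_Q) + \frac{1}{2} \ln \frac{\left|\frac{\Sigma_P + \Sigma_Q}{2}\right|}{|\Sigma_P|^{1/2}\, |\Sigma_Q|^{1/2}}$$

is the Bhattacharyya distance measure.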
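A state-to-state distance of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the authors' code: it takes the PD to be the ratio Dispersion(P, Q) / Dispersion(P, P), symmetrizes it by averaging both directions, and restricts the Bhattacharyya distance to diagonal covariances so that only the standard library is needed. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[List[float], List[float]]  # (mean, diagonal variance)
Mixture = List[Tuple[float, Gaussian]]      # (mixture weight, Gaussian)


def bhattacharyya(p: Gaussian, q: Gaussian) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean_p, var_p, mean_q, var_q):
        avg = 0.5 * (v1 + v2)
        dist += 0.125 * (m1 - m2) ** 2 / avg              # mean term
        dist += 0.5 * math.log(avg / math.sqrt(v1 * v2))  # covariance term
    return dist


def dispersion(p: Mixture, q: Mixture) -> float:
    """Weight-averaged pairwise distance between the components of p and q."""
    return sum(wp * wq * bhattacharyya(gp, gq)
               for wp, gp in p for wq, gq in q)


def pseudo_divergence(p: Mixture, q: Mixture) -> float:
    """Cross-dispersion normalized by the within-dispersion of p."""
    within = dispersion(p, p)
    if within == 0.0:
        raise ValueError("Dispersion(P, P) is zero; PD is undefined")
    return dispersion(p, q) / within


def distance(a: Mixture, b: Mixture) -> float:
    """Symmetrized pseudo-divergence."""
    return 0.5 * (pseudo_divergence(a, b) + pseudo_divergence(b, a))


# Toy 1-D states with two components each (values are illustrative).
state_s = [(0.5, ([0.0], [1.0])), (0.5, ([2.0], [1.0]))]
state_d = [(0.5, ([1.0], [1.0])), (0.5, ([4.0], [2.0]))]
d_sd = distance(state_s, state_d)
```

Under this reading the PD is not symmetric, because the denominator depends only on the first argument, which is why the two directions are averaged. A state with a single Gaussian has zero within-dispersion, so a practical implementation would need a fallback for that case.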


[Figure: Distinguishable states. The same Standard Chinese tri-XIF decision tree for state *-an+*[2] (questions L_Stop?, R_Nasal?, L_Bilabial?, L_Labial?; leaf states including b-an+d[2], l-an+d[2], l-an+m[2], f-an+m[2], b-an+m[2]) is shown together with the Dialectal Chinese mono-XIF states an[2] / ang[2].]


Data division

| Data set | Database | Details | Usage |
|---|---|---|---|
| PTH_Train | Standard Chinese | 120 speakers, 20 hours, 24,000 long sentences | To build Putonghua AM |
| PTH_Test | Standard Chinese | 12 speakers, 2.5 hours, 2,400 long sentences | Putonghua test set |
| Min_Dev | Min-dialectal Chinese | 20 speakers, 1.0 hour, 1,000 long sentences | Adaptation/SDPBMM/pronunciation modeling etc. |
| Min_Test | Min-dialectal Chinese | 16 speakers, 50 minutes, 800 long sentences | Dialectal Chinese test set |
| Wu_Dev | Wu-dialectal Chinese | 10 speakers, 40 minutes, 510 long sentences | Adaptation/SDPBMM/pronunciation modeling etc. |
| Wu_Test | Wu-dialectal Chinese | 20 speakers, 1.0 hour, 910 long sentences | Dialectal Chinese test set |


Standard Chinese-based HMMs (Baseline)

| Item | Setting |
|---|---|
| Training set | Approximately 30 hours from MBN (HUB-4); 34,493 utterances in total |
| Modeling method | HMM-based; decision-tree-based state-clustered cross-word tri-XIF |
| Topology | 3 left-to-right states per tri-XIF, 14 mixtures per state |
| Number of tri-XIFs | 7,411 |
| Number of states | 3,230 |
| Number of mixtures | 45,220 |
| Features | 39 (MFCC + Δ, ΔΔ), with CMN |
| Lexicon | 406 toneless Chinese syllables |
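The counts in the baseline configuration are mutually consistent, which helps when reading "Number of mixtures" as the total number of Gaussians. As a quick check, assuming every tri-XIF has exactly 3 states before tying and every tied state has exactly 14 mixtures:

$$7{,}411 \times 3 = 22{,}233 \ \text{logical states} \;\rightarrow\; 3{,}230 \ \text{tied states}, \qquad 3{,}230 \times 14 = 45{,}220 \ \text{Gaussians}$$

Decision-tree clustering therefore shares each tied state among roughly seven logical states on average.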


Evaluations on Putonghua and Wu-dialectal Chinese

| Acoustic model | Putonghua | SDPBMM+PDBDM |
|---|---|---|
| Gaussians | 45,220 | 58,786 |
| SER on Wu_Test | 49.8% | 43.9% (-5.9%) |
| SER on PTH_Test | 30.5% | 31.1% (+0.6%) |
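The bracketed changes in the evaluation are absolute differences in SER. Relative changes give a complementary view of the trade-off between error rate and model size. The short sketch below computes them from the reported figures only; the helper name `relative_change` is hypothetical.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed change of `new` with respect to `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Figures as reported for the Putonghua baseline and SDPBMM+PDBDM.
wu_rel = relative_change(49.8, 43.9)        # SER on Wu_Test
pth_rel = relative_change(30.5, 31.1)       # SER on PTH_Test
gauss_rel = relative_change(45220, 58786)   # number of Gaussians

print(f"Wu_Test SER change:  {wu_rel:+.1%}")
print(f"PTH_Test SER change: {pth_rel:+.1%}")
print(f"Gaussian count:      {gauss_rel:+.1%}")
```

This works out to roughly an 11.8% relative SER reduction on Wu_Test and about a 2.0% relative SER increase on PTH_Test, for about 30% more Gaussians than the baseline.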


Integration of SDPBMM with adaptation


Conclusions:

Simple but effective acoustic modeling approach using only a small amount of dialectal speech data

Significantly effective for dialectal Chinese speech recognition

Good performance for both standard and dialectal speech recognition

Comparable to adaptation methods

Additive and complementary to adaptation methods




Center for Speech and Language Technologies, Tsinghua University

Thanks!

http://cslt.riit.tsinghua.edu.cn/~fzheng