
Study on Keyword Spotting using Prosodic Attribute Detection for Conversational Speech

Yu-Jui Huang
Department of Computer Science and Information Engineering
National Chia-Yi University
[email protected]

Yin-Wei Chung
Department of Computer Science and Information Engineering
National Chia-Yi University
[email protected]

Jui-Feng Yeh
Department of Computer Science and Information Engineering
National Chia-Yi University
[email protected]

(SVM)

SVM

Abstract

Extracting keywords from conversational speech is essential for understanding speakers' utterances. This study addresses keyword spotting in spontaneous speech. We propose prosodic features for keyword detection. Prosodic words are segmented from the speaker's utterance using a pre-trained decision tree, and a support vector machine (SVM) then classifies each prosodic word as a keyword or not. The decision-tree-based algorithm for prosodic word boundary segmentation is described. Besides the data-driven features, knowledge obtained from corpus observation is integrated into the decision tree. Finally, the keywords in the focus part are extracted from the prosodic features by the SVM. The experimental results show that the proposed method outperforms the phone verification approach, especially in recall and accuracy, which indicates that the approach is effective for keyword detection.

Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

Keywords: Keyword spotting, prosodic feature, prosody word, spoken language.

(Keyword spotting)

( )(Spontaneous speech)

(Dialogue system) (Speaking style)(Grammar)

(Real time)Kawahara (Keyword extraction)

(Verification)(Key-phrase detection) (Key-phrase verification) (Sentence parsing)

(sentence verification)(Incremental understanding) [1]

Chater [2]

(Spoken Language Understanding, SLU)

(Knowledge based) [3](Prosodic attribute)

(Hierarchical Prosodic Phrase Grouping, HPG)[4][5] (Prosodic word)

Ali[1] Wieland

Bi-gram Beam-search Viterbi[6] Bitar

HMM[7] Rabiner 1989


[8] Tatsuya Kawahara Chin-Hui Lee Key-Phrase Detection Verification

[9]

Rose[10] HMM

(filler) Zhang[11]

Bahi[12]

HMMBazzi

HMM [13]

Lee C.H.[14]Kim[15]

[16][17]

Haizhou Li, Bin Ma, and Chin-Hui Lee [18]

AuToBi [19]POS HMM

Conkie [20]delta HMM

Sridhar[21] HMMHMM

Erteschik-shir [22]

[23]

[24][25]

[26]MFCC


[27] SVM

HPGFujisaki Model

[4][5][28][29][30] HPG

1

1:

(Prosodic Attributes Extraction)(Pitch) (Intensity) (Duration)

(HPG)(Prosodic Word Boundary) (Boundary Decision Tree)


SVM(Keyword Detector)

(Prosodic Word Detection)(Keyword Detection)

(syllable) (prosodic word)(intonation phrase)

(Hierarchical Prosodic Phrase Grouping, HPG)[4][5]

(syllable, Syl)(prosodic word, PW) (prosodic phrase, PPh) (breath-group)

(prosodic phrase group, PG)B1 B2 B3 B4 B5

B5 B1

B2

2 B

2 (Prosody word)


[4][5]

3 9

[Figure 3: HPG (Case 1 to Case 9; labels in the figure: Pause, Pitch Reset)]

9(Pitch reset)

1

1 HPG

Case 1 >

Case 2

Case 3

Case 4 (Pitch reset)


Case 5

Case 6

Case 7

Case 8 (Pitch reset)

Case 9

(1) =0.040.03 0.05

0.04

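(2) The pitch slope of syllable $i$ is obtained from the linear fit in Equation (1):

$$P_i(t) = \alpha_i t + \beta_i \qquad (1)$$

Here $P_i(t)$ is the pitch of syllable $i$ at time $t$, $\alpha_i$ is the slope of syllable $i$, and $b_i$ and $e_i$ are the begin and end times of syllable $i$. The slope is computed as in Equation (2):

$$\alpha_i = \frac{\sum_{t=b_i}^{e_i} (t - \bar{t})\,(P_i(t) - \bar{P}_i)}{\sum_{t=b_i}^{e_i} (t - \bar{t})^2}, \quad t \in [b_i, e_i] \qquad (2)$$

The mean time $\bar{t}$ is given by Equation (3) and the mean pitch $\bar{P}_i$ by Equation (4), where $n$ is the number of pitch samples:

$$\bar{t} = \frac{1}{2}(e_i + b_i) \qquad (3)$$

$$\bar{P}_i = \frac{1}{n} \sum_{t=b_i}^{e_i} P_i(t) \qquad (4)$$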
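The slope computation in Equations (1) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pitch_slope` and the list-based input are hypothetical, and the mean time is taken as the sample mean, which coincides with the midpoint $(e_i + b_i)/2$ when the pitch samples are evenly spaced over $[b_i, e_i]$.

```python
from typing import Sequence


def pitch_slope(times: Sequence[float], pitch: Sequence[float]) -> float:
    """Least-squares slope of one syllable's pitch contour.

    times: sample times within [b_i, e_i] (hypothetical input format)
    pitch: pitch values P_i(t) at those times
    """
    n = len(times)
    if n != len(pitch) or n < 2:
        raise ValueError("need at least two (time, pitch) samples of equal length")

    # Sample means; for evenly spaced samples the mean time is the midpoint.
    t_mean = sum(times) / n
    p_mean = sum(pitch) / n

    # Numerator and denominator of the least-squares slope.
    numerator = sum((t - t_mean) * (p - p_mean) for t, p in zip(times, pitch))
    denominator = sum((t - t_mean) ** 2 for t in times)
    if denominator == 0:
        raise ValueError("all sample times are identical")
    return numerator / denominator
```

A rising contour gives a positive slope and a falling contour a negative one, which is the kind of cue a pitch-reset test between adjacent syllables can build on.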


i upper bound lower boundi upper bound i lower bound

upper bound i lower boundcase 4 pitch reset case 8 pitch reset

SVM(predict) +1

-1 5

$$T_i = \begin{cases} +1, & \text{if } T_i \text{ is a semantic object} \\ -1, & \text{otherwise} \end{cases} \qquad (5)$$

SVM

(1)

01-10 ini ijP i j

1 2{ , ,... }i i in iP P P PW� DurijP i j Bi

Ei _ iSyl N i _ijSyl bi j _ijSyl e i j

(2)

bpauseepause 11

(3)

1312-13

13


(Keyword spotting) (Speech act)(Semantic slot)

DA pair Erteschik-shir [23]

(Topic) (Focus)

(Pragmatics)

4 54

5

4: DA pairs 5: DA pairs

52 247 56873

173 1061 850211

850 2498 211660

(True Positive, TP)(False Negative, FN)


(True Negative, TN)(False Positive, FP) 6

6

(accuracy) (precision)(recall) 6 7 8

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (6)$$

$$\text{precision} = \frac{TP}{TP + FP} \qquad (7)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (8)$$
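Equations (6) to (8) translate directly into code. The sketch below is a minimal illustration; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions, not part of the paper.

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Guard against empty denominators (assumed convention: report 0.0).
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative counts only, not taken from the paper's experiments.
acc, prec, rec = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With keywords as the positive class, precision drops when non-keyword prosodic words are accepted, and recall drops when true keywords are missed.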

(1) SVMSVM

23-5% 58% 80%

10 858%

Table 2: SVM

| Features | c | g | accuracy | precision | recall |
|---|---|---|---|---|---|
| 4, 5, 12, 13 | 1 | 8 | 77.16% | 57.83% | 68.25% |
| 4, 5, 12, 13 | 10 | 16 | 74.10% | 52.90% | 69.19% |
| 4, 5, 11, 12, 13 | 1 | 8 | 77.42% | 58.17% | 69.19% |
| 4, 5, 11, 12, 13 | 10 | 16 | 74.10% | 52.94% | 68.45% |
| 3, 5, 6-9, 12 | 1 | 8 | 75.83% | 54.69% | 80.0% |
| 3, 5, 6-9, 12 | 10 | 16 | 73.04% | 51.25% | 77.73% |
| 4, 5, 6-8, 12 | 1 | 8 | 74.90% | 54.01% | 70.14% |
| 4, 5, 6-8, 12 | 10 | 16 | 71.58% | 49.5% | 70.62% |

(2) SVMSVM

3100%

SVMTP

Table 3: SVM

| Features | c | g | accuracy | precision | recall |
|---|---|---|---|---|---|
| 4, 5, 12, 13 | 1 | 8 | 83.38% | 70.95% | 75.33% |
| 4, 5, 12, 13 | 10 | 16 | 81.40% | 65.41% | 78.03% |
| 4, 5, 11, 12, 13 | 1 | 8 | 83.51% | 70.83% | 75.56% |
| 4, 5, 11, 12, 13 | 10 | 16 | 81.35% | 64.91% | 77.48% |
| 3, 5, 6-9, 12 | 1 | 8 | 82.45% | 66.33% | 85.15% |
| 3, 5, 6-9, 12 | 10 | 16 | 80.61% | 63.00% | 84.00% |
| 4, 5, 6-8, 12 | 1 | 8 | 80.47% | 65.02% | 75.33% |
| 4, 5, 6-8, 12 | 10 | 16 | 76.65% | 58.42% | 75.22% |


[14] HTK forced alignmentHMM

(filler) 4

15%

Table 4

| Method | accuracy | precision | recall |
|---|---|---|---|
| Reference | 68% | 70.22% | 68.45% |
| Label + SVM | 77.42% | 58.17% | 80% |
| Decision Tree + SVM | 83.51% | 70.95% | 85.15% |

HPG SVM

SVM51%~58% 68%~80% 51%~59%

(True Positive, TP) 76%~83%

58%~71% 75%~85%

(NSC 99-2221-E-415-006-MY3) .

[1] Ali, J. Van der Spiegel, P. Mueller, G. Haentjens, and J. Berman, “An Acoustic-Phonetic Feature-Based System for Automatic Phoneme Recognition in Continuous Speech,” ISCAS 1998.

[2] N. Chater, M. Pickering, and D. Milward. “What is incremental interpretation? ” Edinburgh Working Papers in Cognitive Science, 11:1–22, 1995.


[3] J. Li, Y. Tsao, and C.H. Lee, “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition,” ICASSP, IEEE International Conference, vol. 1, pp. 837-840, 2005.

[4] , , 11(2):183-218, 2010.

[5] , , 9.3:659-719, 2008.

[6] E. Wieland, F. Gallwitz, and H. Niemann, “Combining stochastic and linguistic language models for recognition of spontaneous speech,” in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, vol. 1, Atlanta, May, pp. 423–426, 1996.

[7] N. N. Bitar and C. Y. Espy-Wilson, “Knowledge-based Parameters for HMM Speech Recognition,” ICASSP 1996.

[8] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.

[9] T. Kawahara, C.H. Lee, and B.H. Juang, “Flexible Speech Understanding Based on Combined Key-Phrase Detection and Verification,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 6, pp. 558-568, 1998.

[10] R. C. Rose and D. B. Paul, “A Hidden Markov Model Based Keyword Recognition System,” Acoustics, Speech, and Signal Processing, ICASSP, vol. 1, pp. 129-132, 1990.

[11] P. Zhang, J. Han, J. Shao, and Y. Yan, “A New Keyword Spotting Approach for Spontaneous Mandarin Speech,” Signal Processing, 8th International Conference on, vol. 1, 2006.

[12] H. Bahi and N. Benati, “A New Keyword Spotting Approach,” Multimedia Computing and Systems, ICMCS, International Conference, pp. 77–80, 2009.

[13] I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition,” Proc. ICSLP, Beijing, 2000.

[14] H. Jiang and C.H. Lee, “A new approach to utterance verification based on neighborhood information in model space,” IEEE Trans. Speech Audio Process., 11(5), pp. 425-434, 2003.

[15] T.-Y. Kim and H. Ko, “Bayesian Fusion of Confidence Measures for Speech Recognition,” IEEE Signal Processing Letters, vol. 12, no. 12, Dec. 2005.

[16] Y. BenAyed, D. Fohr, J. P. Haton, and G. Chollet, “Improving the Performance of a Keyword Spotting System by Using Support Vector Machines,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), St. Thomas, U.S. Virgin Islands, Dec. 2003.

[17] R. Rose, “Confidence measures for the Switchboard database,” Proc. of International Conference on Acoustics, Speech and Signal Processing, pp. 511-514, 1996.

[18] H. Li, B. Ma, and C.H. Lee, “A Vector Space Modeling Approach to Spoken Language Identification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 1, January, pp. 271-284, 2007.

[19] AuToBi. http://eniac.cs.qc.cuny.edu/andrew/autobi/index.html

[20] A. Conkie, G. Riccardi, and R. Rose, “Prosody recognition from speech utterances using acoustic and linguistic based models of prosodic events,” in Eurospeech, 1999.

[21] V. R. Sridhar, S. Bangalore, and S. Narayanan, “Exploiting acoustic and syntactic features for prosody labeling in a maximum entropy framework,” IEEE Transactions on Audio, Speech & Language Processing, 16(4):797–811, 2008.

[22] N. Erteschik-shir, Information Structure: The Syntax-Discourse Interface, 2007.

[23] , , , 89

[24] , , , 89

[25] , , , 93

[26] , , , 95

[27] , , , 96

[28] C.Y. Tseng, “Discourse Speech Tempo,” JAIST Symposium on Modeling of Speech and Audiovisual Mechanism, Ishikawa, Japan, 2011.

[29] C.Y. Tseng and C.H. Chang, “Pause or No Pause? Phrase Boundaries Revisited,” The 9th National Conference on Man-Machine Speech Communication (NCMMSC), , , 2007.

[30] . 280-312. , , 2008


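| No. | Feature | Definition |
|---|---|---|
| 01 | $P^{Num}(PW_i)$ | $n_i$ |
| 02 | $P^{Dur}(PW_i)$ | $\sum_{j=1}^{n_i} P^{Dur}_{ij}$ |
| 03 | $P^{Dur\_Max}(PW_i)$ | $\mathrm{Max}\{P^{Dur}_{i1}, P^{Dur}_{i2}, \ldots, P^{Dur}_{in}\}$ |
| 04 | $P^{Dur\_Min}(PW_i)$ | $\mathrm{Min}\{P^{Dur}_{i1}, P^{Dur}_{i2}, \ldots, P^{Dur}_{in}\}$ |
| 05 | $Dur(PW_i)$ | computed from $B_i$, $E_i$, and $Pause(PW_i)$ |
| 06 | $Syl(PW_i)$ | $Syl\_N_i$ |
| 07 | $Dur(Syl_{i1})$ | $Syl\_e_{i1} - Syl\_b_{i1}$ |
| 08 | $Dur(Syl_{i2})$ | $Syl\_e_{i2} - Syl\_b_{i2}$ |
| 09 | $Dur(Syl_{i3})$ | $Syl\_e_{i3} - Syl\_b_{i3}$ |
| 10 | $Dur(Syl_{i4})$ | $Syl\_e_{i4} - Syl\_b_{i4}$ |
| 11 | $Pause(PW_i)$ | $pause_e - pause_b$ |
| 12 | $pos(PW_i)$ | $BE_i$ |
| 13 | $N(Speech)$ | $N$ |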
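Features 06 to 11 in the list above are simple arithmetic on syllable and pause boundaries, so a short sketch may help. It assumes times in seconds; the `Syllable` class, the function name `syllable_features`, the dictionary keys, and the zero-padding of prosodic words with fewer than four syllables are all illustrative assumptions, not details given in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Syllable:
    begin: float  # Syl_b: syllable begin time
    end: float    # Syl_e: syllable end time


def syllable_features(
    syllables: List[Syllable],
    pause_begin: float,
    pause_end: float,
    max_syllables: int = 4,
) -> Dict[str, Union[int, float, List[float]]]:
    """Compute the syllable count, per-syllable durations, and pause length."""
    # Features 07-10: duration of each of the first four syllables (end - begin).
    durations = [s.end - s.begin for s in syllables[:max_syllables]]
    # Assumed convention: pad with 0.0 when fewer syllables are present.
    durations += [0.0] * (max_syllables - len(durations))
    return {
        "syl_n": len(syllables),             # feature 06
        "syl_dur": durations,                # features 07-10
        "pause": pause_end - pause_begin,    # feature 11
    }
```

The fixed-length duration list keeps the feature vector the same size for every prosodic word, which is what an SVM input requires.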
