di ss2013 fillerspt_presentation_final

Jorge Proença 1 Dirce Celorico 1 Arlindo Veiga 1,2 Sara Candeias 1 Fernando Perdigão 1,2 1 Instituto de Telecomunicações, Coimbra, Portugal 2 University of Coimbra, DEEC, Portugal Acoustical Characterization of Vocalic Fillers in European Portuguese The 6th Workshop on Disfluency in Spontaneous Speech Stockholm, Sweden August 21-23, 2013


Page 1: Di ss2013 fillerspt_presentation_final

Jorge Proença 1

Dirce Celorico 1

Arlindo Veiga 1,2

Sara Candeias 1

Fernando Perdigão 1,2

1 Instituto de Telecomunicações, Coimbra, Portugal; 2 University of Coimbra, DEEC, Portugal

Acoustical Characterization of

Vocalic Fillers in European Portuguese

The 6th Workshop on Disfluency in Spontaneous Speech

Stockholm, Sweden August 21-23, 2013

Page 2: Di ss2013 fillerspt_presentation_final


DiSS 2013

Stockholm, Sweden - August 21-23, 2013

Scope

Objectives

Characterization of the corpus

Formant frequencies determination

Results: duration, F1 and F2, variation rates

Conclusions and future work

SUMMARY


Page 4: Di ss2013 fillerspt_presentation_final


Studying events that characterize the spontaneity of speech has become increasingly relevant as speech technologies develop.

Studies on hesitations (so-called disfluencies), as well as on vowel reduction, have gained importance in recent years.

Hesitation phenomena:

repetitions, truncated words, word fillers;

vocalic extensions within words;

filled pauses.

SCOPE

Page 5: Di ss2013 fillerspt_presentation_final


Our previous studies say that filled pauses are:

the most frequent hesitation phenomenon, mainly in spontaneous speech;

produced without any lexical support;

mostly realized as relatively stable vocalic segments.

They are referred to as filled pause vocalizations or vocalic fillers (VFs).


SCOPE

Page 6: Di ss2013 fillerspt_presentation_final


Why vocalic fillers (VFs)? They:

represent an insertion at any moment during spontaneous speech;

carry multiple functions in communication performance:

announcing upcoming discursive topics,

planning and delaying speech.

SCOPE

To develop an automatic detector of fillers

from continuous speech.


Page 8: Di ss2013 fillerspt_presentation_final


Studying the two most common VFs in European Portuguese:

the near-open central vowel [ɐ],

mid-central vowel [ə].

OBJECTIVE

How? By acoustically characterizing VFs.

Analyzing:

first and second formant frequencies (F1 and F2),

duration, and

variation rates.

Comparing with lexical vowels (LVs) of similar timbre.


Page 10: Di ss2013 fillerspt_presentation_final


HESITA Database - manually annotated filled pauses

CORPUS

Page 11: Di ss2013 fillerspt_presentation_final


[Diagram: TV Broadcast News MP4 podcasts → daily download → extract audio stream and downsample from 44.1 kHz to 16 kHz → Broadcast News audio corpus]


Source:

30 daily news programs (~ 27 hours)

CORPUS

Page 12: Di ss2013 fillerspt_presentation_final


The [ɐ] and [ə] VFs are the most common in the database and were chosen for analysis: 808 [ɐ] and 344 [ə] occurrences.

CORPUS

Page 13: Di ss2013 fillerspt_presentation_final


Curiosity: the next most common VF is the nasal [ɐ], with 155 occurrences.

CORPUS

Page 14: Di ss2013 fillerspt_presentation_final


Studying the two most common VFs in European Portuguese:

the near-open central vowel [ɐ],

mid-central vowel [ə].

Acoustically characterizing VFs.

Analyzing:

first and second formant frequencies,

duration and

variation rates.

Comparing with lexical vowels (LVs) of similar timbre.

OBJECTIVE - revisited

Page 15: Di ss2013 fillerspt_presentation_final


A control corpus was used to estimate the acoustic characteristics of the vocalic sounds [ɐ] and [ə] occurring within complete words (the LVs), such as:

[ɐ] in <para> [pɐrɐ] (‘for’ in English),

[ə] in <devolver> [dəvolver] (‘to give back’).

CORPUS

Page 16: Di ss2013 fillerspt_presentation_final


LVs were extracted from a read speech database:

recordings from 7 adult native speakers of European Portuguese;

sentences and command words;

segmentation and phone-level transcription performed automatically through forced alignment, using in-house tools.

CORPUS

Page 17: Di ss2013 fillerspt_presentation_final


The total number of extracted LVs was 7426, of which 4411 were [ɐ] and 3015 were [ə].

Type  Gender  #[ɐ]  #[ə]
VFs   Male     605   301
VFs   Female   203    43
LVs   Male    2674  1771
LVs   Female  1737  1244

Number (#) of extracted VFs and LVs by gender and timbre.

CORPUS


Page 19: Di ss2013 fillerspt_presentation_final


Praat tool

The base recommended ceilings for estimating 5 formants are 5500 Hz for female speakers and 5000 Hz for male speakers but, through observation, these values do not always estimate F1 and F2 successfully.

FORMANT FREQUENCY DETERMINATION

Page 20: Di ss2013 fillerspt_presentation_final


Different vowels and speakers need different formant ceilings for automatic

calculation.

FORMANT FREQUENCY DETERMINATION

Iterative method:

candidate ceilings are taken from the 4000-5500 Hz range (for males) or the 4800-6500 Hz range (for females) in 50 Hz steps, with formant values computed every 10 ms;

the optimal ceiling for a given VF is the one that provides the smallest variance of the F1 and F2 pairs of values of that VF, calculated as the sum of the variances of 20 log(F1) and 20 log(F2).
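The selection step above can be sketched as follows. This is only an illustration under stated assumptions: the formant tracks per ceiling are taken as already estimated (the Praat step is not reproduced), the logarithm is assumed to be base 10, the population variance is used, and all function names and toy values are hypothetical.

```python
import math
from statistics import pvariance


def log_variance_cost(track):
    """Sum of the variances of 20*log10(F1) and 20*log10(F2) over one segment."""
    f1_db = [20 * math.log10(f1) for f1, _ in track]
    f2_db = [20 * math.log10(f2) for _, f2 in track]
    return pvariance(f1_db) + pvariance(f2_db)


def candidate_ceilings(gender):
    """Ceilings in 50 Hz steps: 4000-5500 Hz (male) or 4800-6500 Hz (female)."""
    low, high = (4000, 5500) if gender == "male" else (4800, 6500)
    return list(range(low, high + 1, 50))


def select_ceiling(tracks_by_ceiling):
    """Return the ceiling whose (F1, F2) track has the smallest cost."""
    return min(tracks_by_ceiling,
               key=lambda c: log_variance_cost(tracks_by_ceiling[c]))


# Toy tracks with made-up values, one (F1, F2) pair per 10 ms frame.
toy_tracks = {
    4500: [(520, 1480), (900, 2100), (510, 1500)],  # jumpy track
    5000: [(515, 1490), (520, 1500), (518, 1495)],  # stable track
}
best = select_ceiling(toy_tracks)
print(best)
```

Working on a logarithmic scale makes the criterion sensitive to relative rather than absolute jumps, so F1 and F2 contribute on a comparable footing.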

No speaker information was kept for most of the news broadcast VFs: each was treated as if it belonged to a different speaker.

Page 21: Di ss2013 fillerspt_presentation_final


Utterances with high clipping of the audio signal were discarded.

For each utterance, only the formant values where the energy level was above 10% of the maximum energy were considered, specifically to discard possible unvoiced boundary segments.

Utterances with highly variable formant values, probably indicating a failure in detecting F1 and F2, were not considered.

The same analysis was conducted for the LVs [ɐ] and [ə], with the additional restriction of only considering segments longer than 50 ms.
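Two of these restrictions, the 10% energy threshold and the 50 ms minimum duration for LVs, can be sketched as below. The frame layout (energy, F1, F2), the 10 ms frame step used to derive duration, and the function names are assumptions made for illustration, not the authors' implementation.

```python
def energy_gate(frames, ratio=0.10):
    """Keep frames whose energy exceeds `ratio` times the segment's maximum energy."""
    peak = max(energy for energy, _, _ in frames)
    return [(e, f1, f2) for e, f1, f2 in frames if e > ratio * peak]


def long_enough(n_frames, frame_step_ms=10, min_duration_ms=50):
    """Duration restriction for LVs: segment strictly longer than 50 ms."""
    return n_frames * frame_step_ms > min_duration_ms


# Toy segment with made-up values: (energy, F1 in Hz, F2 in Hz) per frame.
toy_frames = [
    (0.02, 0, 0),        # low-energy boundary frame
    (0.50, 510, 1500),
    (1.00, 520, 1490),
    (0.60, 515, 1495),
    (0.05, 300, 2500),   # low-energy boundary frame
]
kept = energy_gate(toy_frames)
print(len(kept), long_enough(len(toy_frames)))
```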

OTHER APPLIED RESTRICTIONS


Page 23: Di ss2013 fillerspt_presentation_final


After applying the restrictions:

VFs: 520 [ɐ] and 244 [ə]

LVs: 1517 [ɐ] and 385 [ə]

A very large number of LVs were cut from the analysis: mostly short-duration or low-energy segments, barely recognized during alignment. The loss was most drastic for [ə], as a consequence of the nature of continuous speech.

RESULTS

Page 24: Di ss2013 fillerspt_presentation_final

RESULTS

[Figure: Normalized histogram of the duration of VFs and LVs; x-axis: duration (s), from 0 to 1.]

DURATION

As expected, VFs are longer than LVs.
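A normalized histogram of durations can be computed as in the sketch below. It assumes "normalized" means a density (bar heights times bin width sum to 1), which the slide does not state; the bin width and the duration values are made up.

```python
def normalized_histogram(values, bin_width, upper):
    """Density histogram on [0, upper): heights times bin_width sum to 1."""
    n_bins = round(upper / bin_width)
    counts = [0] * n_bins
    kept = 0
    for v in values:
        if 0 <= v < upper:
            counts[min(int(v / bin_width), n_bins - 1)] += 1
            kept += 1
    if kept == 0:
        raise ValueError("no values inside the histogram range")
    return [c / (kept * bin_width) for c in counts]


# Made-up durations in seconds.
toy_durations = [0.04, 0.06, 0.12, 0.35, 0.48]
density = normalized_histogram(toy_durations, bin_width=0.1, upper=1.0)
print(density)
```

Normalizing this way lets the VF and LV histograms be compared even though the two sets differ greatly in size.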

Page 25: Di ss2013 fillerspt_presentation_final


[Figure: F1 (Hz) versus F2 (Hz) for VF-[ə], VF-[ɐ], LV-[ə] and LV-[ɐ], with the reference vowels [i], [ɛ], [a], [ɔ] and [u]. [ə] in blue and [ɐ] in green, for VFs and LVs of males (left) and females (right).]

RESULTS

F1 and F2 of [ɐ] and [ə]: means and 2-sigma concentration ellipses

A ‘triangle’ of [i], [ɛ], [a], [ɔ] and [u] from the read speech corpus (calculated in a similar fashion to the method described) was included to show the centrality of [ɐ] and [ə].

On average, F1 is higher and F2 is lower for VFs than for LVs.

The distributions overlap.
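The mean and 2-sigma concentration ellipse of a set of (F2, F1) points can be obtained from the sample covariance matrix, as in this sketch. The function name and the toy points are hypothetical, and the sample (n - 1) covariance is an assumption.

```python
import math


def concentration_ellipse(points, n_sigma=2.0):
    """Return (mean, semi-axes, major-axis angle in radians) of the n-sigma ellipse."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of the 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axes = (n_sigma * math.sqrt(lam_major), n_sigma * math.sqrt(lam_minor))
    return (mx, my), axes, angle


# Made-up (F2, F1) pairs in Hz.
toy_points = [(1400, 500), (1500, 500), (1600, 500), (1500, 450), (1500, 550)]
center, axes, angle = concentration_ellipse(toy_points)
print(center, axes, angle)
```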


Page 26: Di ss2013 fillerspt_presentation_final


RESULTS

F1 and F2 of [ɐ] and [ə]

LVs show the highest variances, owing to:

- a high dependence on phonetic context and

- the related coarticulation phenomena.

[ɐ] and [ə] VFs are hard to distinguish (both lie around the middle point).


Page 27: Di ss2013 fillerspt_presentation_final


RESULTS

VARIATION RATES

A linear fit was applied to each formant track (although the change can be non-linear).

High variability.

On average, small negative changes, but no trend can be discerned.

LVs are less stable.

No correlation between the simultaneous variation of F1 and F2.

[Figure: Normalized histograms of the variation rates of F1 (left) and F2 (right), in Hz/s from -6000 to 6000, from a linear fit to each utterance, for VFs and LVs.]
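A variation rate of this kind is the slope of a least-squares line fitted to a formant track over time. The sketch below assumes one formant value per 10 ms frame; the function name and the track values are made up.

```python
def variation_rate(times_s, formant_hz):
    """Least-squares slope (Hz/s) of a straight line fitted to one formant track."""
    n = len(times_s)
    if n < 2:
        raise ValueError("need at least two frames")
    t_mean = sum(times_s) / n
    f_mean = sum(formant_hz) / n
    num = sum((t - t_mean) * (f - f_mean) for t, f in zip(times_s, formant_hz))
    den = sum((t - t_mean) ** 2 for t in times_s)
    return num / den


# Made-up F1 track falling by 2 Hz every 10 ms.
toy_times = [0.00, 0.01, 0.02, 0.03, 0.04]
toy_f1 = [520, 518, 516, 514, 512]
rate = variation_rate(toy_times, toy_f1)
print(rate)
```

A single slope summarizes only the overall drift of the segment; as the slide notes, the actual change can be non-linear.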


Page 29: Di ss2013 fillerspt_presentation_final


Each speaker could have their own personal preference on how to fill a pause vocalically.

Choosing mainly sounds of the central vowel system, speakers appear to adapt the filler to their own specific production, possibly even at a middle point between [ɐ] and [ə].

The main characteristic of VFs is long, stable segments.

Future work:

New data of sustained vowel productions (including [ɐ] and [ə]), to better distinguish the fillers.

A perceptual study to confirm that some vocalic fillers can be understood differently with and without context, or by different listeners.

Based on the knowledge attained from this study, development of an automatic detector of fillers and extensions in continuous speech.

Thank You

CONCLUSIONS AND FUTURE WORK
