
AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY

R.J.J.H. van Son, Barbertje M. Streefkerk, and Louis C.W. Pols

Institute of Phonetic Sciences / ACLC, University of Amsterdam, Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: +31 20 5252183; fax: +31 20 5252197
email: [email protected]

ICSLP2000, Beijing, China, Oct. 20, 2000

INTRODUCTION

• Speech is "efficient": Important components are emphasized, less important ones are de-emphasized

• Two mechanisms:
1) Prosody: Lexical Stress and Sentence Accent (Prominence)
2) Predictability: Frequency of Occurrence (tested) and Context (not tested)

MECHANISMS FOR EFFICIENT SPEECH

Speech emphasis should mirror importance, which largely corresponds to unpredictability

• Prosodic structure distributes emphasis according to importance (lexical stress, sentence accent / prominence)

• Speakers can (de-)emphasize according to supposed (un)importance

• Speech production mechanisms can facilitate redundant speech or hamper unpredictable speech

QUESTIONS

• Can the distribution of emphasis or reduction be completely explained from Prosody (Lexical Stress and Sentence Accent / Prominence)?

• If not, can we identify a speech production mechanism that would assist efficiency in speech, e.g. preprogrammed articulation of redundant and / or high-frequency syllable-like segments?

SPEECH MATERIAL (DUTCH)

• Single Male Speaker: Vowels and Consonants. Matched Informal and Read speech, 791 matched VCV pairs

• Polyphone: Vowels only. 273 speakers (out of 5000), telephone speech, 1244 read sentences. Segmented with a modified HMM recognizer (Xue Wang)

• Corpora sizes: Number of realizations of vowels and consonants

                             Unstressed        Stressed
Corpus            Accent:    –       +         –       +        Total
Single speaker consonants    550     180       569     283      1582
Single speaker vowels        812     461       528     224      2025
Polyphone vowels             4435    4942      9603    3516     22496

• Accent: Sentence accent / Prominence
• Stressed/Unstressed: Lexical stress

METHODS: SPEECH PREPARATION

• Single speaker corpus
– All 2 x 791 VCV segments hand-labeled
– Sentence accent also determined by hand
– 22 native listeners identified consonants from this corpus

• Polyphone corpus
– Automatically labeled using a pronunciation lexicon and a modified HMM recognizer
– 10 judges marked prominent words (prominence 1-10)

• Word and syllable -log2(Frequencies) for both corpora were determined from Dutch CELEX
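The -log2(frequency) measure can be illustrated with a minimal sketch. It assumes the frequency is a relative frequency (count divided by corpus size); the slides do not state the normalization, and the function name and the counts below are hypothetical, not taken from CELEX.

```python
import math

def neg_log2_frequency(count, total):
    """Return -log2 of the relative frequency count/total, in bits.

    Assumption: relative frequency is used; the slides do not say
    how the CELEX counts were normalized.
    """
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return -math.log2(count / total)

# Hypothetical counts for illustration only (not CELEX data).
syllable_counts = {"de": 4096, "van": 1024, "sprak": 16}
corpus_size = 65536

information = {
    syllable: neg_log2_frequency(count, corpus_size)
    for syllable, count in syllable_counts.items()
}
```

Rare items get high values and frequent items get low values, so the measure grows with unpredictability.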

METHODS: ANALYSIS
Single Speaker Corpus

Consonants and Vowels

• Duration in ms (vowels and consonants)

• Contrast (vowels only): F1 / F2 distance to (300, 1450) Hz in semitones

• Spectral Center of Gravity (CoG) (V and C): Weighted mean frequency in semitones at point of maximum energy

• Log2(Perplexity) from consonant identification: Calculated from confusion matrices
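Two of these measures can be sketched in a few lines. The sketch makes two assumptions that the slides do not confirm: vowel contrast is taken as the Euclidean distance in the (F1, F2) plane after converting each formant to semitones relative to the (300, 1450) Hz center, and log2(Perplexity) is taken as the entropy of the response distribution in one row of a confusion matrix. All function names are hypothetical.

```python
import math

def semitones(freq_hz, ref_hz):
    """Distance from ref_hz to freq_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(freq_hz / ref_hz)

def vowel_contrast(f1_hz, f2_hz, center=(300.0, 1450.0)):
    """Assumed definition: Euclidean F1/F2 distance to the center, in semitones."""
    d1 = semitones(f1_hz, center[0])
    d2 = semitones(f2_hz, center[1])
    return math.hypot(d1, d2)

def log2_perplexity(confusion_row):
    """Entropy in bits of the listener responses to one stimulus.

    confusion_row holds response counts per answer category.
    Assumption: perplexity is 2**entropy, so log2(perplexity) is the entropy.
    """
    total = sum(confusion_row)
    if total <= 0:
        raise ValueError("confusion row must contain at least one response")
    entropy = 0.0
    for count in confusion_row:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

With these definitions a vowel at the center has contrast 0, a consonant that all 22 listeners identify the same way has log2(Perplexity) 0, and responses split evenly over two categories give 1 bit.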

METHODS: ANALYSIS
Polyphone Corpus

Vowels only

• Loudness in sone

• Spectral Center of Gravity (CoG): Weighted mean frequency in semitones averaged over the segment

• Prominence (1-10): The number of 'PROMINENT' listener judgements
0 – 5 is considered Unaccented
6 – 10 is considered Accented
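The CoG and the accent split can be sketched as follows. This is an illustration under stated assumptions: the weighted mean is computed in Hz over one power spectrum and then converted to semitones, and the semitone reference `ref_hz` is a hypothetical parameter, since the slides do not give the reference or say whether the averaging was done on a Hz or semitone scale. The accent threshold follows the 0 – 5 / 6 – 10 split stated above.

```python
import math

def center_of_gravity_semitones(freqs_hz, powers, ref_hz=1.0):
    """Power-weighted mean frequency of one spectrum, in semitones re ref_hz.

    Assumptions: averaging in Hz before conversion; ref_hz is a
    hypothetical reference frequency.
    """
    total_power = sum(powers)
    if total_power <= 0:
        raise ValueError("spectrum has no energy")
    cog_hz = sum(f * p for f, p in zip(freqs_hz, powers)) / total_power
    return 12.0 * math.log2(cog_hz / ref_hz)

def accent_label(prominent_votes, threshold=5):
    """Map the number of 'PROMINENT' judgements (0-10) to '+' or '-'."""
    if not 0 <= prominent_votes <= 10:
        raise ValueError("expected between 0 and 10 judgements")
    return "+" if prominent_votes > threshold else "-"
```

For a segment-level value as described for the Polyphone corpus, one option is to average the per-frame results over the segment; that step is left out here.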

CONSISTENCY OF MEASUREMENTS
Correlation coefficients between factors

• Duration in ms
• Loudness in sones
• CoG: Spectral Center of Gravity (semitones)
• Px: log2(Perplexity), plotted is –R
• Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)

[Figure: correlation coefficient R (axis 0 to 0.7) for Unstressed and Stressed, each split by Accent (– / +), and Total. Single Speaker panels: Consonants (n=1582) and Vowels (n=2025). Polyphone panel (n=22496). Plotted pairs: Duration x CoG, Duration x Px, CoG x Px, Duration x Contrast, Duration x CoG, Loudness x CoG, Contrast x CoG. Filled symbols: p<=0.01]

CONSONANT REDUCTION VERSUS FREQUENCY OF OCCURRENCE
(correlation coefficients)

Single speaker corpus (n=1582)

• CoG: Spectral Center of Gravity (semitones)

• Perplexity: log2(Perplexity), plotted is –R

• Syllable and word frequencies were correlated (R=0.230, p=0.01)

[Figure: correlation coefficient R (axis 0 to 0.35) of Duration, CoG and Perplexity with frequency of occurrence. Panels: Syllable frequencies and Word frequencies, each showing Unstressed and Stressed split by Accent (– / +), and Total. Filled symbols: p<=0.01]

VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE

Single speaker corpus (n=2025)

• Duration in ms

• Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)

• CoG: Spectral Center of Gravity (semitones)

• Syllable and word frequencies were correlated (R=0.280, p<=0.01)

[Figure: correlation coefficient R (axis 0 to 0.35) of Duration, CoG and Contrast with frequency of occurrence. Panels: Syllable frequencies and Word frequencies. Filled symbols: p<=0.01]

DISCUSSION OF SINGLE SPEAKER DATA

• There are consistent correlations between frequency of occurrence and “acoustic reduction” (duration, CoG and contrast), but not for consonant identification (perplexity)

• Correlations for syllable frequencies tend to be larger than those for word frequencies (p<=0.01)

• Correlations were found after accounting for Phoneme identity, Lexical Stress and Sentence Accent
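One way to correlate two measures "after accounting for" categorical factors is to subtract the mean of each factor combination before computing the correlation. The slides do not say which procedure was actually used, so the sketch below is a substituted illustration of that idea; the function names and the data are hypothetical.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def center_within_groups(values, groups):
    """Subtract from each value the mean of its own group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group]
            for value, group in zip(values, groups)]

def within_group_correlation(xs, ys, groups):
    """Correlation after removing group means from both variables."""
    return pearson_r(center_within_groups(xs, groups),
                     center_within_groups(ys, groups))

# Hypothetical data: a group key could be (phoneme, stress, accent).
durations_ms = [100.0, 110.0, 120.0, 60.0, 70.0, 80.0]
information_bits = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]
groups = ["A", "A", "A", "B", "B", "B"]

raw_r = pearson_r(durations_ms, information_bits)
within_r = within_group_correlation(durations_ms, information_bits, groups)
```

In this toy data the raw correlation is negative because the two groups differ in their means, while the within-group correlation is positive. This shows why group differences such as phoneme identity or stress need to be removed before interpreting the frequency effect.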

PROMINENCE VERSUS VOWEL REDUCTION AND FREQUENCY OF OCCURRENCE

Polyphone corpus (n=22496)

• Loudness (sone)

• CoG: Spectral Center of Gravity (semitones)

• Syllable and word frequencies (-log2(freq))

[Figure: correlation coefficient R (axis 0 to 0.5) of prominence with Loudness, CoG, Syllable freq. and Word freq., plotted by Lexical stress (–, +, Total). Filled symbols: p<=0.01]

VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE

Polyphone corpus (n=22496)

• Loudness (sone)

• CoG: Spectral Center of Gravity (semitones)

• Syllable and word frequencies were correlated (R=0.316, p<=0.01)

• Accent: + is Prom > 5, – is Prom <= 5

[Figure: correlation coefficient R (axis 0 to 0.10) of Loudness and CoG with frequency of occurrence. Panels: Syllable frequencies and Word frequencies. Filled symbols: p<=0.01]

DISCUSSION OF POLYPHONE DATA

• Perceived prominence correlates with “acoustic vowel reduction” (loudness, CoG) and frequency of occurrence (syllable and word)

• There are small but consistent correlations between “acoustic vowel reduction” and frequency of occurrence

• Correlations were found after accounting for Vowel identity, Lexical Stress and Prominence

CONCLUSIONS

• LEXICAL STRESS and SENTENCE ACCENT / PROMINENCE cannot explain all of the “efficiency” of speech: FREQUENCY OF OCCURRENCE and possibly CONTEXT in general are needed for a full account

• A SYLLABARY which speeds up (and reduces) the articulation of “stored”, high-frequency, syllables with respect to “computed”, rare, syllables might explain at least part of our data

SPOKEN LANGUAGE CORPUS
How Efficient is Speech

• 8-10 speakers: ~60 minutes of speech each (fixed and variable materials)

• Informal story telling and retold stories ~15 min
• Reading continuous texts ~15 min
• Reading Isolated (Pseudo-) sentences ~20 min
• Word lists ~5 min
• Syllable lists ~5 min

MEASURING SPEECH EFFICIENCY

• Speaking Style differences (Informal, Retold, Read, Sentences, Lists)

• Predictability
– Frequency of Occurrence (words and syllables)
– In Context (language models)
– Cloze-tests
– Shadowing (RT or delay)

• Acoustic Reduction
– Segment identification
– Duration
– Spectral reduction
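Predictability "in context" could, for instance, be expressed as the -log2 probability of a word given the preceding word. The sketch below uses an add-one smoothed bigram model purely as an illustration. The slides only mention "language models" without naming a type, so the model choice, the function names and the toy sentences are all assumptions.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count contexts and bigrams; '<s>' marks the sentence start."""
    contexts = Counter()
    bigrams = Counter()
    vocabulary = set()
    for words in sentences:
        tokens = ["<s>"] + list(words)
        vocabulary.update(words)
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocabulary

def surprisal_bits(prev, word, contexts, bigrams, vocabulary):
    """-log2 P(word | prev) with add-one smoothing (an assumed choice)."""
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocabulary))
    return -math.log2(p)

# Hypothetical toy corpus, for illustration only.
toy_sentences = [["de", "man"], ["de", "vrouw"], ["een", "man"]]
contexts, bigrams, vocabulary = train_bigrams(toy_sentences)

frequent = surprisal_bits("<s>", "de", contexts, bigrams, vocabulary)
rare = surprisal_bits("<s>", "een", contexts, bigrams, vocabulary)
```

A word that often follows its context gets a lower value than one that rarely does, which parallels the -log2(frequency) measure used for isolated words and syllables.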
