AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY
R.J.J.H. van Son, Barbertje M. Streefkerk, and Louis C.W. Pols
Institute of Phonetic Sciences / ACLC, University of Amsterdam
Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: +31 20 5252183; fax: +31 20 5252197
email: [email protected]
ICSLP2000, Beijing, China, Oct. 20, 2000
INTRODUCTION
• Speech is "efficient":
  Important components are emphasized
  Less important ones are de-emphasized
• Two mechanisms:
  1) Prosody: Lexical Stress and Sentence Accent (Prominence)
  2) Predictability: Frequency of Occurrence (tested) and Context (not tested)
MECHANISMS FOR EFFICIENT SPEECH
Speech emphasis should mirror importance, which largely corresponds to unpredictability
• Prosodic structure distributes emphasis according to importance (lexical stress, sentence accent / prominence)
• Speakers can (de-)emphasize according to supposed (un)importance
• Speech production mechanisms can facilitate redundant speech or hamper unpredictable speech
QUESTIONS
• Can the distribution of emphasis or reduction be completely explained from Prosody (Lexical Stress and Sentence Accent / Prominence)?
• If not, can we identify a speech production mechanism that would assist efficiency in speech, e.g. preprogrammed articulation of redundant and/or high-frequency syllable-like segments?
SPEECH MATERIAL (DUTCH)
• Single Male Speaker: Vowels and Consonants
  Matched Informal and Read speech, 791 matched VCV pairs
• Polyphone: Vowels only
  273 speakers (out of 5000), telephone speech, 1244 read sentences
  Segmented with a modified HMM recognizer (Xue Wang)
• Corpora sizes: Number of realizations of vowels and consonants

Corpus                       Unstressed        Stressed       Total
                 Accent:      –       +        –       +
Single Speaker consonants    550     180      569     283      1582
Single Speaker vowels        812     461      528     224      2025
Polyphone vowels            4435    4942     9603    3516     22496

• Accent: Sentence accent / Prominence
• Stressed/Unstressed: Lexical stress
METHODS: SPEECH PREPARATION
• Single speaker corpus
  – All 2 x 791 VCV segments hand-labeled
  – Also sentence accent determined by hand
  – 22 Native listeners identified consonants from this corpus
• Polyphone corpus
  – Automatically labeled using a pronunciation lexicon and a modified HMM recognizer
  – 10 Judges marked prominent words (prominence 1-10)
• Word and Syllable -log2(Frequencies) for both corpora were determined from Dutch CELEX
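The -log2(Frequency) measure can be sketched as follows. This is a minimal illustration that assumes relative frequencies are taken from raw counts; the toy syllable list and the function name neg_log2_frequencies are hypothetical and do not come from CELEX or from the slides.

```python
import math
from collections import Counter


def neg_log2_frequencies(tokens):
    """Map each distinct token to -log2 of its relative frequency (in bits)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}


# Toy list standing in for lexicon counts (illustrative only, not CELEX data).
toy_syllables = ["de", "de", "de", "de", "van", "van", "schaap", "het"]
info = neg_log2_frequencies(toy_syllables)

# Frequent items get low values, rare items get high values.
for syllable, bits in sorted(info.items(), key=lambda item: item[1]):
    print(f"{syllable}: {bits:.2f} bits")
```

With this scaling, each halving of the frequency adds one bit, so rare (less predictable) words and syllables receive higher values.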
METHODS: ANALYSIS
Single Speaker Corpus
Consonants and Vowels
• Duration in ms (vowels and consonants)
• Contrast (vowels only)
  F1 / F2 distance to (300, 1450) Hz in semitones
• Spectral Center of Gravity (CoG) (V and C)
  Weighted mean frequency in semitones at point of maximum energy
• Log2(Perplexity) from consonant identification
  Calculated from confusion matrices
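Two of the measures above can be sketched in a few lines. The slides give the reference point (300, 1450) Hz and the semitone unit, but not the distance metric or the exact perplexity calculation, so the sketch assumes a Euclidean distance in the semitone plane and the entropy of one row of response counts; the function names are hypothetical.

```python
import math

CENTER_F1_HZ, CENTER_F2_HZ = 300.0, 1450.0  # reference point given on the slide


def semitones(f_hz, ref_hz):
    """Interval between f_hz and ref_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def vowel_contrast(f1_hz, f2_hz):
    """F1/F2 distance to the reference point, in semitones.

    Assumption: Euclidean distance; the slides do not name the metric.
    """
    return math.hypot(semitones(f1_hz, CENTER_F1_HZ),
                      semitones(f2_hz, CENTER_F2_HZ))


def log2_perplexity(response_counts):
    """Entropy in bits of one row of a confusion matrix.

    Assumption: perplexity is 2**entropy of the listeners' responses,
    so log2(perplexity) equals the entropy itself.
    """
    total = sum(response_counts)
    return -sum((n / total) * math.log2(n / total)
                for n in response_counts if n > 0)


print(vowel_contrast(600.0, 1450.0))   # F1 one octave above the reference
print(log2_perplexity([11, 11]))       # 22 listeners split over two responses
```

A token that all listeners identify the same way gets log2(Perplexity) = 0; the value grows as responses spread over more consonants.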
METHODS: ANALYSIS
Polyphone Corpus
Vowels only
• Loudness in sone
• Spectral Center of Gravity (CoG)
  Weighted mean frequency in semitones averaged over the segment
• Prominence (1-10)
  The number of 'PROMINENT' listener judgements
  0 – 5 is considered Unaccented
  6 – 10 is considered Accented
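The CoG and prominence definitions above can be sketched as follows. Several details are assumptions, not taken from the slides: the weighting uses spectral power, the mean is taken in Hz before conversion to semitones, and the semitone reference REF_HZ is set to 1 Hz. All names are hypothetical.

```python
import math

REF_HZ = 1.0  # assumed semitone reference; not stated in the slides


def center_of_gravity_semitones(freqs_hz, powers):
    """Power-weighted mean frequency of one spectrum, in semitones re REF_HZ."""
    total = sum(powers)
    mean_hz = sum(f * p for f, p in zip(freqs_hz, powers)) / total
    return 12.0 * math.log2(mean_hz / REF_HZ)


def segment_cog(frames):
    """Average the per-frame CoG over all frames of a segment.

    `frames` is a list of (freqs_hz, powers) pairs, one per analysis frame.
    """
    values = [center_of_gravity_semitones(f, p) for f, p in frames]
    return sum(values) / len(values)


def is_accented(prominent_judgements):
    """A word counts as accented when 6 or more judges marked it prominent."""
    return prominent_judgements >= 6


# Toy spectrum with equal power at 500 Hz and 1500 Hz (illustrative only).
cog = center_of_gravity_semitones([500.0, 1500.0], [1.0, 1.0])
print(round(cog, 2), is_accented(5), is_accented(6))
```

For the single speaker corpus the same per-frame function would be applied to one frame only, the one at the point of maximum energy.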
• Duration in ms
• Loudness in sones
• CoG: Spectral Center of Gravity (semitones)
• Px: log2(Perplexity); plotted is –R
• Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)
CONSISTENCY OF MEASUREMENTS
Correlation coefficients between factors
[Figure: correlation coefficient R (axis 0 to 0.7) by lexical stress (Unstressed / Stressed) and Accent (– / +), plus Total. Filled symbols: p <= 0.01.
  Single Speaker, consonants (n=1582): Duration x CoG, Duration x Px, CoG x Px
  Single Speaker, vowels (n=2025): Duration x Contr., Duration x CoG, Contrast x CoG
  Polyphone (n=22496): Loudness x CoG]
CONSONANT REDUCTION VERSUS FREQUENCY OF OCCURRENCE
(correlation coefficients)
• CoG: Spectral Center of Gravity (semitones)
• Perplexity: log2(Perplexity), plotted is –R.
• Syllable and word frequencies were correlated (R=0.230, p=0.01)
Single speaker corpus (n=1582)
[Figure: correlation coefficient R (axis 0 to 0.35) of Duration, CoG and Perplexity with syllable frequencies and with word frequencies, by lexical stress (Unstressed / Stressed) and Accent (– / +), plus Total. Filled symbols: p <= 0.01.]
• Duration in ms
• Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.280, p<=0.01)
VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE
(correlation coefficients)
Single speaker corpus (n=2025)
[Figure: correlation coefficient R (axis 0 to 0.35) of Duration, CoG and Contrast with syllable frequencies and with word frequencies, by lexical stress (Unstressed / Stressed) and Accent (– / +), plus Total. Filled symbols: p <= 0.01.]
DISCUSSION OF SINGLE SPEAKER DATA
• There are consistent correlations between frequency of occurrence and “acoustic reduction” (duration, CoG and contrast), but not for consonant identification (perplexity)
• Correlations for syllable frequencies tend to be larger than those for word frequencies (p<=0.01)
• Correlations were found after accounting for Phoneme identity, Lexical Stress and Sentence Accent
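The slides do not say how Phoneme identity, Lexical Stress and Sentence Accent were accounted for. One common approach, sketched here purely as an assumption, is to subtract the mean of each (phoneme, stress, accent) cell from both variables and correlate what remains. The function names and the toy numbers are hypothetical.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def center_within_cells(values, cells):
    """Subtract from each value the mean of the cell it belongs to."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, cell in zip(values, cells):
        sums[cell] += value
        counts[cell] += 1
    return [value - sums[cell] / counts[cell]
            for value, cell in zip(values, cells)]


def within_cell_correlation(xs, ys, cells):
    """Correlate x and y after removing the cell means from both."""
    return pearson_r(center_within_cells(xs, cells),
                     center_within_cells(ys, cells))


# Toy data: cell = (phoneme, stress, accent); values are illustrative only.
cells = [("n", "-", "-")] * 3 + [("t", "+", "+")] * 3
neg_log2_freq = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
duration_ms = [50.0, 60.0, 70.0, 40.0, 50.0, 60.0]

raw_r = pearson_r(neg_log2_freq, duration_ms)
within_r = within_cell_correlation(neg_log2_freq, duration_ms, cells)
print(round(raw_r, 3), round(within_r, 3))
```

In the toy data the raw correlation is negative because the two cells differ in their means, while inside each cell duration rises with -log2(frequency). Removing the cell means keeps the prosodic factors from masking or mimicking a frequency effect.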
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies (-log2(freq))
PROMINENCE VERSUS VOWEL REDUCTION AND FREQUENCY OF OCCURRENCE
(correlation coefficients)
Polyphone corpus (n=22496)
[Figure: correlation coefficient R (axis 0 to 0.5) of prominence with Loudness, CoG, Syllable freq. and Word freq., by Lexical stress (–, +) and Total. Filled symbols: p <= 0.01.]
VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE
(correlation coefficients)
Polyphone corpus (n=22496)
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.316, p<=0.01)
[Figure: correlation coefficient R (axis 0 to 0.10) of Loudness and CoG with syllable frequencies and with word frequencies, by lexical stress (Unstressed / Stressed) and Accent (+: Prom > 5; –: Prom <= 5), plus Total. Filled symbols: p <= 0.01.]
DISCUSSION OF POLYPHONE DATA
• Perceived prominence correlates with “acoustic vowel reduction” (loudness, CoG) and frequency of occurrence (syllable and word)
• There are small but consistent correlations between “acoustic vowel reduction” and frequency of occurrence
• Correlations were found after accounting for Vowel identity, Lexical Stress and Prominence
CONCLUSIONS
• LEXICAL STRESS and SENTENCE ACCENT / PROMINENCE cannot explain all of the “efficiency” of speech: FREQUENCY OF OCCURRENCE and possibly CONTEXT in general are needed for a full account
• A SYLLABARY which speeds up (and reduces) the articulation of “stored”, high-frequency syllables with respect to “computed”, rare syllables might explain at least part of our data
SPOKEN LANGUAGE CORPUS
How Efficient is Speech
• 8-10 speakers: ~60 minutes of speech each (fixed and variable materials)
• Informal story telling and retold stories ~15 min
• Reading continuous texts ~15 min
• Reading Isolated (Pseudo-) sentences ~20 min
• Word lists ~5 min
• Syllable lists ~5 min
MEASURING SPEECH EFFICIENCY
• Speaking Style differences (Informal, Retold, Read, Sentences, Lists)
• Predictability
  – Frequency of Occurrence (words and syllables)
  – In Context (language models)
  – Cloze-tests
  – Shadowing (RT or delay)
• Acoustic Reduction
  – Segment identification
  – Duration
  – Spectral reduction
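Predictability "In Context (language models)" could be quantified, for example, as the surprisal of each word given the previous one. The sketch below assumes a bigram model with add-one smoothing; the slides do not specify a model, and the function name and the toy Dutch sentences are hypothetical.

```python
import math
from collections import Counter


def bigram_surprisal(train_sentences, sentence):
    """Return -log2 P(word | previous word) for each word of `sentence`.

    Assumption: a bigram model with add-one (Laplace) smoothing.
    """
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sent in train_sentences:
        words = ["<s>"] + sent          # "<s>" marks the sentence start
        vocab.update(words)
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    v = len(vocab)
    words = ["<s>"] + sentence
    return [-math.log2((bigram_counts[(prev, word)] + 1)
                       / (context_counts[prev] + v))
            for prev, word in zip(words, words[1:])]


# Toy training material (illustrative only).
train = [["de", "kat", "slaapt"],
         ["de", "hond", "slaapt"],
         ["de", "kat", "eet"]]

s_kat = bigram_surprisal(train, ["de", "kat"])[1]
s_hond = bigram_surprisal(train, ["de", "hond"])[1]
print(round(s_kat, 3), round(s_hond, 3))
```

Words that are more expected after their context get lower surprisal values, which can then be compared with the acoustic reduction measures in the same way as the -log2(frequency) values.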