IIT Bombay
17th National Conference on Communications, 28-30 Jan. 2011, Bangalore, India Sp Pr. 1, P3
Detection of Burst Onset Landmarks in Speech Using
Rate of Change of Spectral Moments
A. R. Jayan, P. S. Rajath Bhat, P. C. Pandey
{arjayan, rajathbhat, pcpandey}@ee.iitb.ac.in
EE Dept, IIT Bombay
30th January, 2011
PRESENTATION OUTLINE
1. Introduction
▪ Speech landmarks
▪ Landmark detection
▪ Clear speech
▪ Automated speech intelligibility enhancement
2. Methodology
▪ Band energy parameters
▪ Spectral moments
▪ Rate of change function
3. Evaluation and results
▪ VCV utterances
▪ Sentences
4. Conclusion
1. INTRODUCTION
Speech landmarks
Regions, associated with spectral transitions, containing important information for speech perception
Landmarks and related events [Park, 2008]

Segment type   Landmark       Description
Vowel          Vowel (V)      Vowel nucleus
Glide          Glide (G)      Slow formant transitions
Consonant      Glottis (g)    Vocal fold vibration
Consonant      Sonorant (s)   Nasal closure / release
Consonant      Burst (b)      Turbulence noise
Landmark detection
Processing
▪ Extraction of parameters characterizing the landmark
▪ Computation of the rate of change (ROC) of parameters
▪ Locating the landmark using ROC(s)
Applications
▪ Intelligibility enhancement
▪ Speech recognition
▪ Vocal tract shape estimation
Clear speech
Speech produced with clear articulation when talking to a hearing-impaired listener, or in a noisy environment
More intelligible for
▪ Hearing impaired listeners (~17% higher, Picheny et al., 1985)
▪ Listeners in noisy environments (Payton et al., 1994)
▪ Non-native listeners (Bradlow and Bent, 2002)
▪ Children with learning disabilities (Bradlow et al., 2003)
Pronounced acoustic landmarks
Example: ‘The book tells a story’, conventional (Conv.) vs. clear speech (recordings from http://www.acoustics.org/press/145th/clr-spch-tab.htm)
Automated speech intelligibility enhancement
Automated detection of landmarks
▪ High detection rate with low false detections
▪ Good temporal accuracy (5-10 ms)
▪ Computational efficiency
Modification of speech characteristics
▪ Intensity / duration / spectral modifications around landmarks with minimal perceptual distortions of the acoustic cues in the speech signal
Problems in stop consonant perception
▪ Transient sound with low intensity
▪ Severely affected by noise / hearing impairment
Stop landmarks
▪ Closure
▪ Burst onset
▪ Onset of voicing
Example: /apa/
Some of the earlier landmark detection techniques
▪ Liu (1996): Rate-of-rise measures of parameters from a set of fixed spectral bands (Speech recognition, g, s, b landmarks, 80 TIMIT sentences, detection rate: 84 % at 20-30 ms, 50 % at 5-10 ms)
▪ Salomon et al. (2002): Temporal parameters related to periodicity, envelope, spectral fine structure (Speech recognition, onsets and offsets of vowels, sonorants, & consonants, 120 TIMIT sentences, detection rate: 90 % at 20 ms)
▪ Sainath and Hazan (2006): Sinusoidal model parameters (Speech segmentation, 453 TIMIT sentences, word error rates: 20 %)
▪ Niyogi & Sondhi (2002): Stop landmark detection using total energy, energy above 3 kHz & Wiener entropy (Speech recognition, stop consonants, 320 TIMIT sentences, detection rate: 90 % at 20 ms)
▪ Jayan & Pandey (2009): Stop landmark detection using GMM parameters (Speech enhancement, 50 TIMIT sentences, detection rate: 73 % at 5 ms)
Improving landmark detection
Parameters
▪ Capturing spectral transitions
▪ Adaptation to speech variability
Rate of change measure
▪ Range of parameter variations
▪ Correlation among parameters
Adaptive time steps
▪ Small time step for abrupt variations
▪ Large time step for slow variations
Objective of the present investigation
Detection of burst landmarks for automated intelligibility enhancement
2. METHODOLOGY
Band energy parameters
Log of spectral peaks in three bands
▪ b1: 1.2-2.0 kHz
▪ b2: 2.0-3.5 kHz
▪ b3: 3.5-5.0 kHz
Magnitude spectrum (10 kHz sampling) computed using a 512-point DFT with a 6 ms Hanning window at 1 frame per ms, and smoothed by a 20-point moving average.
Smoothed magnitude spectrum X(n, k) used for calculating the log of the spectral peak in band i:
$$E_{b_i}(n) = 20 \log_{10} \Big( \max_{k_{\min,i} \le k \le k_{\max,i}} X(n, k) \Big)$$
n = time index, k = frequency index
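The band energy computation can be sketched in Python with NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the 20-point moving average runs along the time (frame) axis, which the slide does not state.

```python
import numpy as np

FS = 10_000    # sampling rate in Hz (from the slide)
NFFT = 512     # DFT size (from the slide)
WIN = 60       # 6 ms Hanning window at 10 kHz
HOP = 10       # 1 frame per ms at 10 kHz
BANDS_HZ = [(1200, 2000), (2000, 3500), (3500, 5000)]  # b1, b2, b3


def smoothed_spectrum(x, smooth=20):
    """Return the smoothed magnitude spectrum X(n, k), shape (frames, NFFT//2 + 1).

    Assumption: the moving average is applied along time, and the signal
    yields at least `smooth` frames.
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack([x[i * HOP:i * HOP + WIN] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=NFFT, axis=1))
    kernel = np.ones(smooth) / smooth
    # Smooth each frequency bin's trajectory over time.
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, mag)


def band_energies(X):
    """Return [E_b1, E_b2, E_b3] per frame in dB, shape (frames, 3)."""
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    columns = []
    for lo, hi in BANDS_HZ:
        in_band = (freqs >= lo) & (freqs <= hi)
        peak = X[:, in_band].max(axis=1)
        # Floor the peak to avoid log of zero in silent frames.
        columns.append(20.0 * np.log10(np.maximum(peak, 1e-12)))
    return np.stack(columns, axis=1)
```

With one frame per millisecond, the row index of the returned array is also the time in ms, which keeps the later time steps (3 ms, 6 ms) equal to frame offsets.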
Example: band energy parameters for /aga/. Panels: (a) speech waveform, (b) band energies; x-axis: time (ms).
Spectral moments
Normalized spectrum
$$p(n, k) = \frac{X(n, k)}{\sum_{k'=1}^{N/2} X(n, k')}$$
Centroid: frequency of energy concentration
$$F_c(n) = \sum_{k=1}^{N/2} p(n, k)\, f_k$$
Variance: spread of energy around the centroid
$$F_\sigma(n) = \Big[ \sum_{k=1}^{N/2} \big(f_k - F_c(n)\big)^2\, p(n, k) \Big]^{1/2}$$
Skewness: measure of spectral symmetry
$$F_s(n) = \Big[ \sum_{k=1}^{N/2} \big(f_k - F_c(n)\big)^3\, p(n, k) \Big]^{1/3}$$
Kurtosis: measure of spectral peakiness
$$F_k(n) = \Big[ \sum_{k=1}^{N/2} \big(f_k - F_c(n)\big)^4\, p(n, k) \Big]^{1/4}$$
n = time index, k = frequency index, N = DFT size, f_k = frequency of bin k
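The four moments above can be computed per frame as in the sketch below. The function name is hypothetical, and one detail is an assumption: the third moment can be negative, so the sketch takes a signed cube root (`np.cbrt`), which the slide does not specify.

```python
import numpy as np


def spectral_moments(X, freqs):
    """Return [F_c, F_sigma, F_k, F_s] per frame, shape (frames, 4).

    X     : magnitude spectrum, shape (frames, bins)
    freqs : bin frequencies f_k in Hz, shape (bins,)
    """
    X = np.asarray(X, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Normalized spectrum p(n, k); the floor guards against all-zero frames.
    p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    f_c = p @ freqs                                # centroid
    d = freqs[None, :] - f_c[:, None]              # f_k - F_c(n)
    f_sigma = np.sqrt((d ** 2 * p).sum(axis=1))    # spread
    f_s = np.cbrt((d ** 3 * p).sum(axis=1))        # skewness (signed root, assumed)
    f_k = ((d ** 4 * p).sum(axis=1)) ** 0.25       # kurtosis
    return np.stack([f_c, f_sigma, f_k, f_s], axis=1)
```

Because every moment is taken to the matching root, all four values are in Hz, which keeps their ranges comparable before any distance measure is applied.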
Example: band energy parameters & spectral moments for /aga/. Panels: (a) waveform, (b), (c), (d); x-axis: time (ms).
Measures of rate of change
● First difference based rate of change (ROC)
$$\mathrm{ROC}(n) = E_b(n) - E_b(n - K)$$
K = time step
● Mahalanobis distance based rate of change (ROC-MD)
A single measure indicative of the overall variation, taking care of parameter range and correlation effects
$$\mathrm{ROC}_{md}(n) = \Big[ \big(\mathbf{y}(n) - \mathbf{y}(n-K)\big)^{T}\, \Sigma^{-1}\, \big(\mathbf{y}(n) - \mathbf{y}(n-K)\big) \Big]^{0.5}$$
y(n) = parameter set at time n, K = time step, Σ = covariance matrix, pre-calculated using the parameter set from segments with energy above a threshold
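Both rate-of-change measures can be sketched as below. The function names are hypothetical, and two choices are assumptions of this sketch rather than details from the slides: the first K frames, which have no predecessor, are set to zero, and the covariance is inverted with a pseudo-inverse so that a near-singular matrix does not raise an error.

```python
import numpy as np


def roc_first_difference(E, K):
    """ROC(n) = E(n) - E(n - K) along axis 0, for K >= 1."""
    E = np.asarray(E, dtype=float)
    roc = np.zeros_like(E)
    roc[K:] = E[K:] - E[:-K]
    return roc


def roc_mahalanobis(Y, K, cov):
    """ROC-MD(n) for a parameter track Y of shape (frames, params), K >= 1.

    cov is the pre-calculated covariance matrix of the parameters.
    """
    Y = np.asarray(Y, dtype=float)
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    d = np.zeros_like(Y)
    d[K:] = Y[K:] - Y[:-K]                       # y(n) - y(n - K)
    # Quadratic form d^T * cov_inv * d, evaluated for every frame at once.
    q = np.einsum("ni,ij,nj->n", d, cov_inv, d)
    return np.sqrt(np.maximum(q, 0.0))
```

One way to obtain `cov` is `np.cov(Y[mask], rowvar=False)`, where `mask` is a hypothetical boolean array selecting the frames whose energy exceeds the chosen threshold.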
Detection of voicing offset and onset
▪ Band energy in 0-400 Hz
▪ ROC(n) computed with time step 50 ms
▪ Voicing offset [g-]: ROC(n) below -12 dB
▪ Voicing onset [g+]: ROC(n) above +12 dB
Burst onset landmark detection
▪ Most prominent peak in ROC-MD(n) between g- and g+
Example: /aga/. Panels: (a) waveform, (b) ROC-MD, (c) ROC; x-axis: time (ms).
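The search for the burst onset between the voicing landmarks can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, g- is taken as the first frame at or below -12 dB, g+ as the first later frame at or above +12 dB, and the "most prominent peak" is approximated by the maximum of ROC-MD in that interval.

```python
import numpy as np


def locate_burst(roc_low, roc_md, thresh_db=12.0):
    """Return (g_minus, burst, g_plus) frame indices, or None if not found.

    roc_low : ROC of the 0-400 Hz band energy (dB), one value per frame
    roc_md  : ROC-MD track, one value per frame
    """
    roc_low = np.asarray(roc_low, dtype=float)
    roc_md = np.asarray(roc_md, dtype=float)

    offsets = np.flatnonzero(roc_low <= -thresh_db)
    if offsets.size == 0:
        return None
    g_minus = int(offsets[0])

    onsets = np.flatnonzero(roc_low[g_minus + 1:] >= thresh_db)
    if onsets.size == 0:
        return None
    g_plus = g_minus + 1 + int(onsets[0])

    # Largest ROC-MD value between g- and g+ (inclusive) is taken as the burst.
    burst = g_minus + int(np.argmax(roc_md[g_minus:g_plus + 1]))
    return g_minus, burst, g_plus
```

Restricting the search to the closure interval means that large ROC-MD values elsewhere in the utterance, for example at vowel onsets, cannot be picked as the burst.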
3. EVALUATION & RESULTS
Effects of rate of change functions & parameters on burst detection
ROC and parameters
1) ROC(BE): Sum of normalized ROCs of [Eb1, Eb2, Eb3]
2) ROC-MD(BE): ROC-MD of [Eb1, Eb2, Eb3]
3) ROC-MD(SM): ROC-MD of [Fc, Fσ, Fk, Fs]
4) ROC-MD(BE,SM): ROC-MD of [Eb1, Eb2, Eb3, Fc, Fσ, Fk, Fs]
Material: VCV utterances, TIMIT sentences
Time steps: 3, 6 ms
Temporal accuracies: 3, 5, 10, 15, 20 ms
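A detection rate at a given temporal accuracy can be computed as in the sketch below. The matching rule is an assumption, since the slides do not define it: a reference burst counts as detected when at least one detected landmark lies within the tolerance on either side of it. The function name is hypothetical.

```python
def detection_rate(reference_ms, detected_ms, tolerance_ms):
    """Percentage of reference bursts with a detection within +/- tolerance_ms."""
    if not reference_ms:
        raise ValueError("reference_ms must not be empty")
    hits = sum(
        any(abs(d - r) <= tolerance_ms for d in detected_ms)
        for r in reference_ms
    )
    return 100.0 * hits / len(reference_ms)
```

Evaluating the same detections at 3, 5, 10, 15, and 20 ms gives one point per tolerance, which is how a rate-versus-accuracy curve can be built.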
VCV utterances
▪ 6 stop consonants (b, d, g, p, t, k)
▪ 3 vowel contexts (a, i, u)
▪ 10 speakers (5 M, 5 F)
▪ 180 tokens
Figure: detection rate (%) vs. temporal accuracy (3, 5, 10, 15, 20 ms) for ROC(BE), ROC-MD(BE), ROC-MD(SM), and ROC-MD(BE, SM), with time steps of 3 ms and 6 ms. Data labels on the chart: 81, 87, 86, 97, 76, 90, 90, 99.
TIMIT sentences
▪ 5 speakers (2 M, 3 F)
▪ 10 sentences from each speaker
▪ 238 tokens
Figure: detection rate (%) vs. temporal accuracy (3, 5, 10, 15, 20 ms) for ROC(BE), ROC-MD(BE), ROC-MD(SM), and ROC-MD(BE, SM), with a time step of 3 ms. Data labels on the chart: 49, 74, 58, 86, 45, 71, 58, 88.
Insertion rates (%)

Error type               ROC(BE)   ROC-MD(BE)   ROC-MD(SM)   ROC-MD(BE,SM)
Vowel / semivowel        13        11           13           11
Frication                5         11           10           9
Glottal stops / clicks   4         3            3            4
4. CONCLUSION
Increasing the time step reduced detection accuracy.
Mahalanobis distance based ROC was more effective than first-difference based rate of change.
Spectral moments were useful as additional parameters in improving burst-onset detection.
Thank you