SPEECH ENHANCEMENT
Sefik Emre Eskimez
Dept. of Electrical and Computer Engineering
University of Rochester, Rochester, NY
Motivation
Corruption in a speech signal reduces the performance of automatic processes such as:
Automatic speech recognition (ASR)
Automatic speaker identification/verification (ASID/ASV)
Automatic speech emotion recognition (ASER)
Try it with Amazon’s Alexa and Google’s Assistant
Hearing-implant performance suffers in noisy conditions
Ideal Cases
Problem Definition – Additive Noise
$s(t)$ is the speech signal
$n(t)$ is the noise signal
$m(t) = s(t) + n(t)$
Given $m(t)$, estimate $s(t)$!
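The additive model can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sine wave stands in for speech, the noise is white Gaussian, and the helper `mix_at_snr` (a hypothetical name, not from the slides) rescales the noise so that the mixture has a chosen signal-to-noise ratio.

```python
import numpy as np

def mix_at_snr(s, n, snr_db):
    """Return (m, n_scaled) with m = s + n_scaled at the requested SNR in dB."""
    s = np.asarray(s, dtype=float)
    n = np.asarray(n, dtype=float)
    p_s = np.mean(s ** 2)                      # average speech power
    p_n = np.mean(n ** 2)                      # average noise power
    # Gain that makes p_s / (g^2 * p_n) equal to 10^(snr_db / 10)
    g = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    n_scaled = g * n
    return s + n_scaled, n_scaled

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
s = np.sin(2 * np.pi * 220.0 * t)              # stand-in for a speech signal (assumption)
n = rng.standard_normal(t.size)                # white Gaussian noise (assumption)
m, n_scaled = mix_at_snr(s, n, snr_db=5.0)
measured_snr_db = 10 * np.log10(np.mean(s ** 2) / np.mean(n_scaled ** 2))
```

The enhancement task is then to recover `s` when only `m` is observed.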
Approaches
1. Spectral Subtraction
Estimate the noise spectrum and subtract it from the noisy speech spectrum.
a. Wiener Filtering
A linear time-invariant (LTI) filter that estimates the clean speech.
b. Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA) Estimator
A short-time spectral amplitude (STSA) estimator that minimizes the mean-square error of the log-spectra.
2. Non-negative Dictionary Learning
Utilizes sparse coding and a voice activity detector to find which frames belong to noise and which belong to speech. Usually two dictionaries are built, one for speech and one for noise.
3. Deep Learning Approaches
Early work: 1979-1984
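Spectral Subtraction
Taking the Fourier transform yields:
$m(t) = s(t) + n(t) \longleftrightarrow M(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$
The speech spectrum estimate $\hat{S}(e^{j\omega})$ can be represented as:
$\hat{S}(e^{j\omega}) = \left[\,|M(e^{j\omega})| - N_\mu(e^{j\omega})\,\right] e^{j\theta_M}$,
where $\hat{S}$ and $N_\mu$ are the speech and noise estimates, $\theta_M$ is the phase of the noisy spectrum, and
$N_\mu(e^{j\omega}) = E\{|N(e^{j\omega})|\}$
The noise estimate is usually calculated using the first few frames of the input signal.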
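A minimal sketch of the subtraction step, assuming the noisy signal has already been converted to a complex STFT matrix of shape (frequency bins, frames). The function name `spectral_subtraction`, the default of five leading noise-only frames, and the clipping of negative magnitudes to zero are assumptions made for this illustration and are not stated on the slide.

```python
import numpy as np

def spectral_subtraction(M, n_noise_frames=5):
    """Subtract a per-bin noise magnitude estimate and reuse the noisy phase.

    M: complex STFT of the noisy signal, shape (n_freq, n_frames).
    """
    mag = np.abs(M)
    phase = np.angle(M)                                   # theta_M, the noisy phase
    # Noise magnitude estimate: mean over the leading frames, assumed speech-free
    n_mu = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Assumption: negative differences are clipped to zero
    s_mag = np.maximum(mag - n_mu, 0.0)
    return s_mag * np.exp(1j * phase)

rng = np.random.default_rng(0)
# Random complex matrix used only as a stand-in for a real STFT
M = rng.standard_normal((129, 40)) + 1j * rng.standard_normal((129, 40))
S_hat = spectral_subtraction(M, n_noise_frames=5)
```

The estimate keeps the phase of the noisy input, so only the magnitudes change.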
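Wiener Filtering
$M(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$
A filter can be defined as follows:
$H(e^{j\omega}) = \dfrac{S(e^{j\omega})}{M(e^{j\omega})}$
The filter can be estimated using the noise estimate:
$H(e^{j\omega}) = \dfrac{M(e^{j\omega}) - \hat{N}(e^{j\omega})}{M(e^{j\omega})}$
[Block diagram: $s(t)$ and $n(t)$ are summed and passed through $H(e^{j\omega})$ to give $\hat{s}(t)$]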
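The estimated filter can be illustrated as a per-bin gain. This is a sketch under assumptions: it works on magnitude spectra, clips the gain to the range [0, 1], and adds a small constant to avoid division by zero. None of these choices is specified on the slide, and `wiener_like_gain` is a hypothetical helper name.

```python
import numpy as np

def wiener_like_gain(M_mag, N_hat_mag, eps=1e-12):
    """Per-bin gain H = (M - N_hat) / M, clipped to [0, 1]."""
    H = (M_mag - N_hat_mag) / (M_mag + eps)   # eps avoids division by zero
    return np.clip(H, 0.0, 1.0)               # clipping is an assumption

# Toy magnitudes: 2 frequency bins x 2 frames
M_mag = np.array([[2.0, 1.0],
                  [0.5, 4.0]])
# One noise estimate per frequency bin, broadcast over frames
N_hat_mag = np.array([[1.0],
                      [1.0]])
H = wiener_like_gain(M_mag, N_hat_mag)
S_hat_mag = H * M_mag                          # filtered magnitude estimate
```

Bins where the noise estimate exceeds the mixture magnitude receive a gain of zero in this sketch.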
Log-Minimum Mean Square Error (MMSE) Short-Time Spectral Amplitude (STSA)
Let’s simplify the notation: $S(e^{j\omega}) \to S$
Log-MMSE STSA minimizes the logarithmic mean square error
$E\{(\log_{10} S - \log_{10} \hat{S})^2\}$
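To make the criterion concrete, the sketch below evaluates an empirical version of the logarithmic mean square error on toy magnitudes. It only measures the error. The estimator that minimizes it is derived in Ephraim and Malah (1985) and is not reproduced here. The helper name `log_mse` and the small `eps` guard are assumptions.

```python
import numpy as np

def log_mse(S_mag, S_hat_mag, eps=1e-10):
    """Mean squared difference of log10 magnitudes (eps guards log10(0))."""
    diff = np.log10(S_mag + eps) - np.log10(S_hat_mag + eps)
    return float(np.mean(diff ** 2))

S = np.array([1.0, 10.0, 100.0])
S_hat = np.array([1.0, 100.0, 10.0])    # second and third bins are each off by one decade
err = log_mse(S, S_hat)                 # approximately (0 + 1 + 1) / 3
```

Because the error is measured on a log scale, a factor-of-ten mistake counts the same in a quiet bin as in a loud one.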
Non-negative Dictionary Learning
Let us denote the basis matrices of speech and noise as $W_s$ and $W_n$, respectively.
The basis matrix for the noisy signal is
$W = [W_s \; W_n]$
The noisy spectrogram can be represented as $M \approx WH$, where the noisy NMF coefficients are defined as
$H = [H_s^T \; H_n^T]^T$
Non-negative Dictionary Learning
The mask can be obtained as follows:
$m_s = \dfrac{W_s H_s}{W_s H_s + W_n H_n}$
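The following sketch shows one way the mask could be computed when the two dictionaries are already available. Several things here are assumptions rather than content from the slides: the dictionaries are random matrices standing in for trained ones, the coefficients are estimated with Euclidean multiplicative updates while the dictionaries stay fixed, and the division in the mask is taken element-wise. The function names are hypothetical.

```python
import numpy as np

def nmf_activations(M, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 such that M ~ W @ H, with W held fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], M.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; keeps H non-negative
        H *= (W.T @ M) / (W.T @ W @ H + eps)
    return H

def nmf_speech_mask(M, W_s, W_n, eps=1e-12):
    W = np.hstack([W_s, W_n])                    # W = [W_s W_n]
    H = nmf_activations(M, W)                    # H = [H_s^T H_n^T]^T
    k = W_s.shape[1]
    H_s, H_n = H[:k], H[k:]
    speech, noise = W_s @ H_s, W_n @ H_n
    return speech / (speech + noise + eps)       # element-wise ratio

rng = np.random.default_rng(1)
W_s = rng.random((64, 8))                        # stand-in for a trained speech dictionary
W_n = rng.random((64, 4))                        # stand-in for a trained noise dictionary
M = W_s @ rng.random((8, 30)) + W_n @ rng.random((4, 30))   # synthetic noisy spectrogram
m_s = nmf_speech_mask(M, W_s, W_n)
S_hat = m_s * M                                  # masked estimate of the speech magnitudes
```

In practice the dictionaries would be learned beforehand from speech-only and noise-only data.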
Time-Frequency (T-F) Masks
T-F masks operate on the magnitude spectra of the signal.
Let $S_t(f)$, $N_t(f)$ and $M_t(f)$ be the magnitude spectra of the speech, noise and mixture signals, respectively.
Time-Frequency (T-F) Masks
[Figure: spectrograms of the speech $S$, the noise $N$ and the mixture $M$]
T-F Masks
Ideal Binary Masks (IBM), 0 dB
$IBM_t(f) = \begin{cases} 1, & \text{if } S_t(f) > N_t(f) \\ 0, & \text{otherwise} \end{cases}$
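A small worked example of the IBM definition, using toy magnitude values. Treating the mixture magnitude as the sum of the speech and noise magnitudes is a simplifying assumption made only for this illustration.

```python
import numpy as np

def ideal_binary_mask(S_mag, N_mag):
    """1 where speech magnitude exceeds noise magnitude (0 dB criterion), else 0."""
    return (S_mag > N_mag).astype(float)

# Toy magnitudes: 2 frequency bins x 2 frames
S = np.array([[0.9, 0.1],
              [0.2, 0.7]])
N = np.array([[0.3, 0.4],
              [0.2, 0.1]])
M = S + N                       # simplifying assumption: magnitudes add
ibm = ideal_binary_mask(S, N)   # ties (S == N) fall into the "otherwise" case
S_hat = ibm * M                 # element-wise masking of the mixture
```

Each time-frequency bin is either kept in full or discarded, which is what makes the mask binary.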
Ideal Binary Masks (IBM)
[Figure: spectrograms of $S$, $M$, the IBM and the masked mixture $IBM \odot M$]
Ideal Binary Masks (IBM)
The problem becomes a binary classification task!
Given $M_t(f)$, determine whether it belongs to speech or noise
Can be estimated with any machine learning classifier
Problem: the results obtained from the ground-truth IBM contain “musical noise”
T-F Masks
Amplitude Soft Masks (ASM) or Ideal Ratio Masks (IRM)
$IRM_t(f) = \dfrac{S_t(f)}{S_t(f) + N_t(f)}$
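The ratio mask can be computed the same way on toy magnitudes. The small `eps` term is an assumption added to keep the division defined when both speech and noise are zero in a bin.

```python
import numpy as np

def ideal_ratio_mask(S_mag, N_mag, eps=1e-12):
    """Soft mask in [0, 1]: the speech share of each time-frequency bin."""
    return S_mag / (S_mag + N_mag + eps)

# Toy magnitudes: 2 frequency bins x 2 frames
S = np.array([[0.9, 0.1],
              [0.2, 0.0]])
N = np.array([[0.3, 0.4],
              [0.2, 0.0]])
irm = ideal_ratio_mask(S, N)    # close to [[0.75, 0.2], [0.5, 0.0]]
```

Unlike the binary mask, bins dominated by noise are attenuated rather than removed outright.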
Ideal Ratio Masks (IRM)
[Figure: spectrograms of $S$, $M$, the IRM and the masked mixture $IRM \odot M$]
Predicting Masks – System Overview
Predicting Masks – Features
Mel-Frequency Cepstrum (MFC)
Magnitude Spectra
Raw waveform
Can be supplemented by traditional
features
Autoencoder based methods
Two types:
1. Trained with only clean speech
Network learns speech
representation
2. Trained with noisy-clean speech
pairs
Network learns transfer function from
noisy to clean speech
Recurrent Neural Network (RNN)
RNNs are useful for modeling temporal relations
Huang et al. (Huang, Kim et al. 2015) proposed predicting masks with the following cost function:
$\min \; \|\hat{m}_s - m_s\|^2 + \|\hat{m}_n - m_n\|^2 - \|\hat{m}_s - m_n\|^2 - \|\hat{m}_n - m_s\|^2$
where $m_s$ and $m_n$ are the speech and noise masks, respectively, and $\hat{m}_s$ and $\hat{m}_n$ are their predicted counterparts.
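The cost function above can be evaluated directly on small arrays to see how it behaves. This sketch assumes the four terms are squared Euclidean norms with equal weight and that the first argument of each pair is the network prediction. The function name is hypothetical and no network is trained here.

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance between two arrays."""
    return float(np.sum((a - b) ** 2))

def discriminative_mask_cost(m_s_hat, m_n_hat, m_s, m_n):
    """Reward matching the correct target, penalize resembling the other source."""
    return (sq_dist(m_s_hat, m_s) + sq_dist(m_n_hat, m_n)
            - sq_dist(m_s_hat, m_n) - sq_dist(m_n_hat, m_s))

m_s = np.array([1.0, 0.0, 1.0])     # toy target speech mask
m_n = 1.0 - m_s                     # toy target noise mask
perfect = discriminative_mask_cost(m_s, m_n, m_s, m_n)   # predictions equal targets
swapped = discriminative_mask_cost(m_n, m_s, m_s, m_n)   # predictions assigned to the wrong source
```

With these toy masks the perfect prediction gives a lower cost than the swapped one, because the subtracted terms push each prediction away from the other source's mask.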
Redundant Convolutional Encoder-Decoder (R-CED)
Park et al. (Park and Lee 2016) proposed a convolutional network with 1-dimensional convolutions that operate along the frequency axis
Convolutional networks have fewer parameters than RNNs, which makes them feasible for small devices, such as hearing implants!
Predicting Masks – Our Methods
Convolutional Encoder-Decoder (CED)
network with skip connections
Bidirectional Long Short-Term Memory
(BLSTM) network
Convolutional Encoder-Decoder (CED)
[Figure: CED architecture. Input spectrogram; Conv + BN + ReLU layers with 64, 128, 256 and 512 filters; Deconv + BN + ReLU layers with 256, 128 and 64 filters; skip connections; two output Conv + BN + ReLU layers with 1 filter each, producing the speech mask and the noise mask]
Bidirectional Long Short-Term Memory (BLSTM)
Predicting Masks – Comparison with other methods
(a) Noisy spectrogram
(b) Clean spectrogram
(c) Enhanced (SS) spectrogram
(d) Enhanced (Log-MMSE) spectrogram
(e) Enhanced (RNN) spectrogram
(f) Enhanced (R-CED) spectrogram
(g) Enhanced (BLSTM) spectrogram
(h) Enhanced (CED) spectrogram
Evaluation metrics
• Objective measures:
• Perceptual evaluation of speech quality (PESQ) – Ranges from -0.5 to 4.5
• Short-time Objective Intelligibility (STOI) – Ranges from 0 to 1
• Segmental SNR (SSNR, in dB)
• Log-spectral distortion (LSD, in dB)
• Hearing-aid speech quality index (HASQI)
• Hearing-aid speech perception index (HASPI)
• Speech distortion index (SDI)
• Subjective measures:
• Listening tests
The most important metrics!
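Of the objective measures listed, segmental SNR is simple enough to sketch directly. The version below is one common formulation and rests on assumptions not given in the slides: non-overlapping frames of 256 samples, per-frame SNR clamped to the range -10 dB to 35 dB, and a small `eps` guard. The test signals are synthetic.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0, eps=1e-10):
    """Average of per-frame SNR values in dB, each clamped to [lo_db, hi_db]."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    values = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
        values.append(np.clip(snr_db, lo_db, hi_db))
    return float(np.mean(values))

t = np.arange(4096) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)                      # synthetic reference signal
enhanced = clean + 0.1 * np.cos(2 * np.pi * 440.0 * t)     # synthetic output with residual error
ssnr = segmental_snr(clean, enhanced)                      # roughly 20 dB for this toy case
```

Averaging per frame rather than over the whole signal keeps loud passages from dominating the score.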
RESULTS - PESQ
RESULTS - STOI
More examples…
The End…
Thank you!
References
Loizou, Philipos C. Speech enhancement: theory and practice. CRC Press, 2013.
Boll, Steven. "Suppression of acoustic noise in speech using spectral subtraction." IEEE Transactions on Acoustics, Speech, and Signal Processing 27.2 (1979): 113-120.
Ephraim, Yariv, and David Malah. "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing 33.2 (1985): 443-445.
Huang, Po-Sen, et al. "Joint optimization of masks and deep recurrent neural networks for monaural source separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.12 (2015): 2136-2147.
Park, Se Rim, and Jinwon Lee. "A fully convolutional neural network for speech enhancement." arXiv preprint arXiv:1609.07132 (2016).
Mohammadiha, Nasser. Speech Enhancement Using Nonnegative Matrix Factorization and Hidden Markov Models. Diss. KTH Royal Institute of Technology, 2013.
Wang, Yuxuan, Arun Narayanan, and DeLiang Wang. "On training targets for supervised speech separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22.12 (2014): 1849-1858.
Lu, Xugang, et al. "Speech enhancement based on deep denoising autoencoder." Interspeech. 2013.