

Noise Robust Pitch Tracking by Subband Autocorrelation Classification (SAcC)

Byung Suk Lee & Dan Ellis • Columbia University / ICSI • {bsl,dpwe}@ee.columbia.edu

Summary: A neural net classifier is trained to identify the pitch of a frame of subband autocorrelation principal components. Accuracy is greatly improved for noisy, bandlimited speech that matches the training data.

Columbia University
Laboratory for the Recognition and Organization of Speech and Audio

for Interspeech 2012 • 2012-09-11 • [email protected]
MATLAB code available: http://labrosa.ee.columbia.edu/projects/SAcC/

Background

• “Classic” approaches to pitch tracking reveal time-domain waveform periodicity via autocorrelation or related techniques. An example is YIN [de Cheveigné & Kawahara 2002].

• Added interference corrupts the autocorrelation, but filtering into subbands can reveal certain bands with better local SNR. The contribution of each subband can also be adjusted in a later fusion stage to select sources. An example is [Wu, Wang & Brown 2003] (“the Wu algorithm”).

• This work began as an investigation into the peak-selection and integration stages of the Wu algorithm. We found that replacing both stages with a trained classifier offered large performance improvements, as well as the chance to train domain-specific pitch trackers.

Material

• We are specifically interested in pitch tracking for low-quality radio reception data (e.g., hand-held narrow-FM) from the RATS project.

• This data has both high noise/distortion and narrow bandwidth.

Filterbank

• 48 four-pole, two-zero ERB-scale cochlea filter approximations form auditory-like subbands.

Classifier

• The core of our system is a trained multi-layer perceptron (MLP) classifier.

• It takes k (= 10) PCA coefficients for each of s (= 48) subband autocorrelations and estimates posteriors over p (= 67) quantized pitch values.

• The MLP is trained discriminatively on noisy data with ground-truth pitches.

• Discriminative training virtually eliminates the octave/sub-octave errors seen in peak-based pitch tracking.
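The filterbank stage can be sketched as follows. The poster gives only the filter count (48) and the ERB scale; the frequency span below (100 Hz–8 kHz) and the use of the Glasberg & Moore ERB-rate formula are assumptions for illustration, and the actual 4-pole, 2-zero filter designs are not reproduced here.

```python
import numpy as np

def erb_center_freqs(n_bands=48, f_lo=100.0, f_hi=8000.0):
    """ERB-spaced center frequencies via the Glasberg & Moore ERB-rate scale.

    The poster specifies 48 ERB-scale filters; the 100 Hz - 8 kHz span
    is an assumption, not taken from the poster.
    """
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    # equal steps on the ERB-rate scale, mapped back to Hz
    erbs = np.linspace(hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi), n_bands)
    return erb_rate_to_hz(erbs)

cfs = erb_center_freqs()
```

ERB spacing places more bands at low frequencies, where pitch harmonics are resolved, which is the auditory-motivated design choice behind the subband decomposition.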

Training Data

Evaluation

• The pitch classifier needs training data with ground truth. Performance improves when training data is more similar to test data.

• We used pitch-annotated data (Keele), then artificially filtered and added noise to resemble the target domain.

• We also generated pseudo-ground-truth for target domain data by using only frames where three independent pitch trackers agreed.
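The pseudo-ground-truth step can be sketched as below. The poster says only that frames were kept where three independent trackers agreed; the 5% relative tolerance and the use of the per-frame median as the agreed pitch are assumptions for illustration.

```python
import numpy as np

def agreement_labels(f0_a, f0_b, f0_c, tol=0.05):
    """Keep frames where three independent pitch trackers agree.

    tol (5% relative deviation from the median) is an assumed threshold.
    Returns the median pitch where all three trackers are voiced and
    agree, and NaN (rejected frame) elsewhere.
    """
    tracks = np.vstack([f0_a, f0_b, f0_c]).astype(float)
    med = np.median(tracks, axis=0)
    voiced = np.all(tracks > 0, axis=0)                       # all three voiced
    close = np.all(np.abs(tracks - med) <= tol * med, axis=0) # all near median
    ok = voiced & close
    out = np.full(tracks.shape[1], np.nan)
    out[ok] = med[ok]
    return out

# three hypothetical tracks: agree / disagree / partially unvoiced
labels = agreement_labels([100, 200, 0], [101, 150, 0], [99, 210, 120])
```

Only the first frame survives: the second fails the tolerance, the third is not voiced in all trackers. Training only on such consensus frames trades coverage for label reliability.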

[Figure: YIN block diagram — Sound → autocorrelation-based difference function → cumulative normalization and threshold → local peak selection & interpolation → pitch tracking using HMM* → Pitch (*not in standard YIN)]

[Figure: Wu algorithm block diagram — Sound → auditory filterbank → normalized autocorrelation → channel/peak selection → cross-channel integration → pitch tracking using HMM → Pitch]

GPE = Evv / Nvv
VE = (Evv + Evu) / Nv
UE = Euv / Nu
PTE = (VE + UE) / 2

(Nv, Nu: true-voiced / true-unvoiced frame counts; Nvv: frames voiced in both reference and output; Evv: gross pitch errors on mutually voiced frames; Evu: voiced frames labeled unvoiced; Euv: unvoiced frames labeled voiced.)
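The metrics above can be computed directly from a reference and an estimated track. This sketch encodes unvoiced frames as pitch 0; the 20% relative deviation counting as a gross pitch error is the conventional choice, assumed here rather than stated on the poster.

```python
import numpy as np

def pitch_metrics(ref, est, gross_tol=0.2):
    """GPE, VE, UE, PTE as defined above. Pitch value 0 = unvoiced.

    gross_tol: relative deviation counting as a gross pitch error
    (20% is the conventional choice; an assumption here).
    """
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    ref_v, est_v = ref > 0, est > 0
    both_v = ref_v & est_v
    n_vv = both_v.sum()                     # Nvv: voiced in both
    n_v, n_u = ref_v.sum(), (~ref_v).sum()  # Nv, Nu
    e_vv = (both_v & (np.abs(est - ref) > gross_tol * ref)).sum()  # gross errors
    e_vu = (ref_v & ~est_v).sum()           # voiced frames called unvoiced
    e_uv = (~ref_v & est_v).sum()           # unvoiced frames called voiced
    gpe = e_vv / n_vv
    ve = (e_vv + e_vu) / n_v
    ue = e_uv / n_u
    return gpe, ve, ue, (ve + ue) / 2

gpe, ve, ue, pte = pitch_metrics([100, 100, 100, 0, 0],
                                 [100, 150, 0,   0, 90])
```

In the toy example, one mutually voiced frame of two is a gross error (GPE = 0.5), two of three true-voiced frames are wrong (VE = 2/3), and one of two true-unvoiced frames is a false alarm (UE = 0.5).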

[Figure: PTE, GPE, VE, and UE (%) vs. SNR (dB) on FDA data with radio-band filtering (RBF) and pink noise, comparing YIN, Wu, get_f0, and SAcC]

[Figure: pseudo-ground-truth generation — pitch tracks from YIN, Wu, and SWIPE′ (frequency in Hz vs. time), with per-frame agreement/reject labels]

[Figure: subband autocorrelations for clean speech — subband index vs. lag (up to 400 samples)]

Autocorrelation

• Normalized autocorrelation out to 25 ms (400 samples @ 16 kHz) reveals periodic structure in each subband.

• The autocorrelation in each subband is highly constrained by the bandlimited input.

• Pitch information is distributed throughout each autocorrelation row; simple peak-picking ignores most of it.
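A normalized autocorrelation row can be sketched as below. The 1024-sample analysis window is an assumption (the poster gives only the 400-lag / 16 kHz extent); the energy normalization shown is one common choice, not necessarily the one used in SAcC.

```python
import numpy as np

def norm_autocorr(frame, max_lag=400):
    """Normalized autocorrelation r[lag] for lags 0..max_lag-1.

    Normalizing by the energies of the two shifted segments keeps r
    in [-1, 1]. The exact normalization in SAcC is not given on the
    poster; this is one common choice.
    """
    n = len(frame) - max_lag
    x0 = frame[:n]
    r = np.empty(max_lag)
    for lag in range(max_lag):
        xl = frame[lag:lag + n]
        denom = np.sqrt(np.dot(x0, x0) * np.dot(xl, xl))
        r[lag] = np.dot(x0, xl) / denom if denom > 0 else 0.0
    return r

fs = 16000
t = np.arange(1024) / fs
r = norm_autocorr(np.sin(2 * np.pi * 200 * t))  # 200 Hz -> 80-sample period
```

For the 200 Hz tone the row peaks at lag 80 (one period) and dips at lag 40 (half a period); for a bandlimited subband the whole 400-sample row, not just the peaks, carries pitch evidence.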

Principal Component Analysis

• The autocorrelations in a given subband always reflect its center frequency, i.e., they occupy a small part of the 400-dimensional space.

• We use per-subband PCA to reduce each band to 10 coefficients, which preserves almost all the variance.

• The PCA bases remain stable when learned from different signal conditions, so they are fixed.
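Fitting one fixed PCA basis per subband can be sketched with an SVD; the training array shape (frames × 48 subbands × 400 lags) follows the poster's dimensions, while the random training data below is purely illustrative.

```python
import numpy as np

def fit_subband_pca(ac, k=10):
    """Per-subband PCA bases from ac of shape (frames, subbands, lags).

    Returns means (subbands, lags) and bases (subbands, lags, k), so a
    frame's features are (ac_frame - mean) @ basis per subband, i.e.
    48 x 10 = 480 coefficients in the SAcC configuration.
    """
    n_frames, n_sub, n_lags = ac.shape
    means = ac.mean(axis=0)
    bases = np.empty((n_sub, n_lags, k))
    for s in range(n_sub):
        x = ac[:, s, :] - means[s]
        # right singular vectors = principal directions of this subband
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        bases[s] = vt[:k].T
    return means, bases

rng = np.random.default_rng(0)
ac = rng.standard_normal((200, 48, 400))        # illustrative training data
means, bases = fit_subband_pca(ac)
feats = np.einsum('fsl,slk->fsk', ac - means, bases)
```

Because each subband gets its own basis, the projection adapts to that band's characteristic oscillation while keeping the total feature count small.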

Hidden Markov Model Smoothing

• The MLP generates posterior probabilities over 66 pitch candidates plus a “no pitch” state for every 10 ms time frame.

• These are converted into a single pitch track via Viterbi decoding through an HMM with pitch states.

• Transition probabilities are set parametrically (pitch-invariant) and tuned empirically.
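The smoothing step can be sketched as a standard Viterbi pass. The Laplacian-shaped jump penalty, the flat voiced↔unvoiced switch score, and the constants are illustrative assumptions; the poster says only that transitions are parametric, pitch-invariant, and tuned empirically.

```python
import numpy as np

def viterbi_pitch(log_post, jump_scale=2.0, switch_logp=-6.0):
    """Viterbi smoothing of per-frame pitch posteriors.

    log_post: (frames, states) log posteriors; last state = "no pitch".
    Transition scores depend only on the pitch-index difference
    (pitch-invariant), with a flat penalty for voiced <-> unvoiced
    switches. Scores are unnormalized, which is adequate for Viterbi.
    """
    n_frames, n_states = log_post.shape
    unv = n_states - 1
    idx = np.arange(n_states - 1)
    trans = np.full((n_states, n_states), switch_logp)
    # Laplacian penalty on |delta pitch index| between voiced states
    trans[:unv, :unv] = -np.abs(idx[:, None] - idx[None, :]) / jump_scale
    trans[unv, unv] = 0.0  # remaining unvoiced costs nothing

    back = np.zeros((n_frames, n_states), dtype=int)
    score = log_post[0].copy()
    for t in range(1, n_frames):
        cand = score[:, None] + trans            # (from, to)
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_states)] + log_post[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# tiny example: 4 pitch states + unvoiced; frame-wise argmax would be [1, 3, 1]
post = np.array([[-5.0,  0.0, -5.0, -5.0, -9.0],
                 [-5.0, -0.5, -5.0,  0.0, -9.0],
                 [-5.0,  0.0, -5.0, -5.0, -9.0]])
path = viterbi_pitch(post)
```

The jump penalty makes the one-frame leap to a distant pitch state more expensive than the small posterior gain it offers, so the decoded track stays put — the same mechanism that suppresses isolated octave jumps.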

• We tested on the pitch-annotated FDA data with radio-band filtering and pink noise added at a range of SNRs.

• SAcC trained on Keele data (with similar corruption) substantially outperformed other pitch trackers.

• Later experiments show that SAcC can generalize to mismatched train/test scenarios.

• GPE, the most common pitch tracking metric, only considers frames where both ground truth and system report a pitch value. This rewards “punting” on difficult frames.

• We define VE as the error rate over all true-voiced frames, and UE over all true-unvoiced frames.

• We evaluate by PTE, the mean of VE and UE.

[Figure: radio-band filter (RBF) — gain (0 to −40 dB) vs. frequency, overlaid on a speech spectrogram (frequency in kHz vs. time in s)]

[Figure: pitch “scores” (pitch index vs. time) from YIN (1 − YIN similarity), Wu (likelihood), and SAcC (log MLP output)]

References

A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111(4):1917–1930, April 2002.

M. Wu, D.L. Wang, and G.J. Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Tr. Speech and Audio Proc., 11(3):229–241, May 2003.
