LabROSA Research Overview - Dan Ellis - 2014-06-12
1. Music 2. Environmental sound 3. Speech Enhancement
LabROSA Research Overview
Dan Ellis Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
LabROSA
• Getting information from sound
[Diagram: Machine Learning + Signal Processing → Information Extraction, applied to Speech, Music, and Environment; tasks: Recognition, Retrieval, Separation]
1. Music Audio Analysis
• Trained classifiers for low-level information: notes, chords, beats, section boundaries
• E.g. polyphonic transcription
  • feature-agnostic
  • needs training data
Poliner & Ellis ’06
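Poliner & Ellis '06 framed transcription as framewise classification, training one classifier per note on spectral features (they used SVMs). A minimal numpy sketch of the same idea using per-note logistic regressions instead; the data shapes, learning rate, and training loop are illustrative assumptions, not the original system:

```python
import numpy as np

def train_note_classifiers(X, Y, lr=0.5, n_iter=2000):
    """Framewise transcription as independent per-note binary classifiers.
    X: (n_frames, n_features) spectral features; Y: (n_frames, n_notes)
    0/1 note-on targets. Batch logistic regression, one output per note."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        z = np.clip(X @ W + b, -30.0, 30.0)
        P = 1.0 / (1.0 + np.exp(-z))    # per-frame note probabilities
        G = P - Y                       # gradient of the log-loss
        W -= lr * (X.T @ G) / len(X)
        b -= lr * G.mean(axis=0)
    return W, b

def predict_notes(X, W, b):
    return (X @ W + b) > 0.0            # boolean piano-roll
```

Because each note gets its own classifier, the scheme handles polyphony for free: several outputs can be active in the same frame.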
Million Song Dataset
• Industrial-scale database for music information research
• Many facets:
  • Echo Nest audio features + metadata
  • Echo Nest "taste profile" user-song-listen counts
  • Second Hand Song covers
  • musiXmatch lyric BoW
  • last.fm tags
• Now with audio? resolving artist / album / track / duration against what.cd
Bertin-Mahieux, McFee
MIDI-to-MSD
• Aligned MIDI-to-audio is a nice transcription
• Can we find matches in large databases?
Raffel
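One standard way to score a candidate MIDI/audio match is dynamic time warping over feature sequences (Raffel's actual system adds pruning and learned representations to make this feasible at database scale). A generic DTW sketch, where the two feature matrices are assumed to be precomputed (e.g. chroma from the synthesized MIDI and from the audio):

```python
import numpy as np

def dtw_cost(A, B):
    """Length-normalized dynamic time warping cost between two feature
    sequences A (n x d) and B (m x d); lower cost means a better match."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = D.shape
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j - 1],
                                            C[i - 1, j], C[i, j - 1])
    return C[n, m] / (n + m)
```

Ranking every MSD track by this cost against a query MIDI gives a brute-force matcher; the normalization by path length makes costs comparable across tracks of different durations.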
Singing ASR
• Speech recognition adapted to singing: needs aligned data
• Extensive work to line up scraped "a cappellas" with the full mix, including jumps!
McVicar
Block Structure RPCA
• RPCA separates vocals and background based on low-rank optimization
  • single trade-off parameter λ
  • adjust it based on higher-level musical features?
Papadopoulos
Table 1. Sound excerpts used for the evaluation and proportion of purely-instrumental segments (P.I.), in % of the whole excerpt duration.

  #     Name                                             % P.I.
  1     Beatles - Sgt Pepper's Lonely Hearts Club Band    49.3
  2     Beatles - With A Little Help From My Friends      13.5
  3     Beatles - She's Leaving Home                      24.6
  4     Beatles - A Day in The Life                       35.6
  5, 6  Puccini piece for soprano and piano               24.7
  7     Pink Noise Party - Their Shallow Singularity      42.1
  8     Bob Marley - Is This Love                         37.2
  9     Doobie Brothers - Long Train Running              65.6
  10    Marvin Gaye - Heard It Through The Grapevine      30.2
  11    The Eagles - Take It Easy                         35.5
  12    The Police - Message in a Bottle                  24.9
mixture is computed using a window length of 1024 samples with 75% overlap at a sampling rate of 11.5 kHz. No post-processing (such as masking) is added.

4.2. Results and Discussion
Fig. 2. Separation performance of the leading singing voice with the baseline method, for various values of λ, for the song Their Shallow Singularity.
Fig. 3. Separation performance for the background (left) and the singing voice (right) via, from top to bottom, the SDR, SIR, SAR and NSDR measures for each song. Constant λ = 1 (∗), adaptive λ = (1, 5) with prior ground-truth (•) and estimated (◦) voice activity location.
• Global separation results. As illustrated by Fig. 2, the quality of the separation with the baseline method [18] depends on the value of the regularization parameter. Moreover, the value that leads to the best separation quality differs from one music excerpt to another. Thus, when automatically processing a collection of music tracks, the choice of this value results from a trade-off. We report here results obtained with the typical choice λv = 1. In A-RPCA, this regularization parameter is further adapted to the music content based on prior music information. In all experiments, for a given constant value λv in the baseline method, setting λnv > λv in Eq. (7) improves the results.⁶ Results of the separation obtained with various configurations of the proposed model are described in Fig. 3. Using a musically-informed adaptive regularization parameter improves the separation results for both the background and the leading voice components. Note that the larger the proportion of purely-instrumental segments in a piece (see Tab. 1), the larger the improvement (see in particular pieces 1, 7, 8 and 9 in Fig. 2), which is consistent with the goal of the proposed method.

⁶ For lack of space, we do not report all of the experiments obtained with various values of λ.
There is however one drawback: improved SDR (better overall separation performance) and SIR (better capability of removing music interferences from the singing voice) with A-RPCA are obtained at the price of introducing more artifacts in the estimated voice (lower SARvoice). Listening tests reveal that in some segments processed by A-RPCA, as for instance segment [1 − 1.15]m in Fig. 4, one can hear some high-frequency isolated coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20].
• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. The decrease in the results mainly comes from background segments classified as vocal segments.
Fig. 4. Separated voice for various values of λ for the Pink Noise Party song Their Shallow Singularity. From top to bottom: clean voice, constant λ = 1, constant λ = 5, adaptive λ = (1, 5).
• Local separation results. It is interesting to note that using an adaptive regularization parameter in a unified analysis of the whole piece is different from separately analyzing vocal and purely-instrumental segments with different but constant values of λ. This is illustrated in the dashed rectangle areas of Fig. 4. Moreover, local results⁷ with the unified analysis show not only that the sparse components (singing voice) are limited in purely-instrumental segments, but also that the energy of the music background is better attenuated in the resynthesized voice in vocal segments (better local SIRvoice).
5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music including local variations in the music content. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece. We have focused on a simple criterion (voice activity information), but the method could be extended with other criteria (singer identification, vibrato saliency, etc.). The method could be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be improved with a more complex formulation of RPCA that incorporates additional constraints [20] or a learned dictionary [46].
⁷ Due to space constraints, local BSS-eval results are not reported.
Ordinal LDA Segmentation
• Low-rank decomposition of skewed self-similarity to identify repeats
• Learned weighting of multiple factors to segment
• Linear Discriminant Analysis between adjacent segments
McFee
2. Environmental Sound
• Extracting useful information from soundtracks
• e.g. TRECVID Multimedia Event Detection (MED)
  • "Making a Sandwich", "Getting a Vehicle Unstuck"
  • 100 examples, find matches in 100k videos
  • manual annotations for ~10 h
E009 Getting a Vehicle Unstuck
Foreground Event Recognition
• Transients = foreground events?
• Onset detector finds energy bursts (best SNR)
• PCA basis to represent each: 300 ms × auditory frequency
• "bag of transients"
Cotton, Ellis, Loui ’11
NMF Transient Features
• Decompose spectrograms into templates + activations
  • well-behaved gradient descent
  • 2D patches
  • sparsity control
  • computation time…
Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07; Cotton & Ellis '11
[Figure: spectrogram of the original mixture (freq/Hz vs. time/s, 0–10 s) decomposed as X = W · H into three basis spectra with activations: Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
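The X = W · H decomposition can be sketched with the Lee & Seung multiplicative updates (plain Euclidean version, without the 2-D patch or sparsity extensions cited on this slide). The toy rank-2 input is synthetic:

```python
import numpy as np

def nmf(X, n_components, n_iter=500, eps=1e-10):
    """Factor a nonnegative spectrogram X (freq x time) as X ~= W @ H
    using Lee & Seung multiplicative updates (Euclidean objective).
    Updates keep W and H nonnegative by construction."""
    rng = np.random.default_rng(0)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_components)) + eps   # spectral templates
    H = rng.random((n_components, n_time)) + eps   # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)       # update activations
        W *= (X @ H.T) / (W @ H @ H.T + eps)       # update templates
    return W, H

# toy check: an exactly rank-2 nonnegative "spectrogram"
rng = np.random.default_rng(1)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]) @ rng.random((2, 50))
W, H = nmf(X, n_components=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The multiplicative form is the "well-behaved gradient descent" of the bullet above: each update is a rescaled gradient step whose step size guarantees the objective never increases.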
Background Retrieval
• Classify soundtracks by statistics of ambience
• E.g. texture features:
  • subband distributions
  • envelope cross-correlations
[Diagram: Sound → automatic gain control → mel filterbank (18 chans) → envelope correlation → cross-band correlations (histogram), modulation energy via FFT into octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 × 6), and moments mean, var, skew, kurt (18 × 4) → 318-sample texture feature vector]
[Examples: texture features for clips 1159_10 "urban cheer clap" and 1062_60 "quiet dubbed speech music" (mel band vs. time/s, moments, modulation freq/Hz)]
McDermott et al. '09; Ellis, Zheng & McDermott '11
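A rough numpy sketch of the per-band moment and cross-band correlation parts of such a texture feature. The gain control, mel filterbank, and modulation-energy stages are omitted, and the 18-band envelope input is assumed to be precomputed:

```python
import numpy as np

def texture_stats(env):
    """Texture summary statistics over subband envelopes (bands x time):
    per-band mean/var/skew/kurtosis plus cross-band envelope
    correlations (upper triangle of the correlation matrix)."""
    mean = env.mean(axis=1)
    d = env - mean[:, None]
    var = env.var(axis=1)
    std = env.std(axis=1) + 1e-12
    skew = (d ** 3).mean(axis=1) / std ** 3    # third standardized moment
    kurt = (d ** 4).mean(axis=1) / std ** 4    # fourth standardized moment
    C = np.corrcoef(env)                        # bands x bands correlations
    iu = np.triu_indices_from(C, k=1)           # keep each band pair once
    return np.concatenate([mean, var, skew, kurt, C[iu]])
```

Because these are time-collapsed statistics, the feature vector has a fixed length regardless of clip duration, which is what makes it usable for soundtrack retrieval.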
Auditory Model Features
• Subband autocorrelation PCA
  • simplified version of the autocorrelogram
  • 10× faster than Lyon's original
• Captures fine time structure in multiple bands (information lost in MFCCs)
Lyon et al. 2010; Lee & Ellis 2012; Cotton & Ellis 2013
[Diagram: Sound → cochlea filterbank → short-time autocorrelation (delay line over frequency channels) → correlogram slice (freq × lag over time) → per-subband VQ → histogram → feature vector]
Subband Autocorrelation
• Autocorrelation stabilizes fine time structure
• 25 ms window, lags up to 25 ms
• calculated every 10 ms
• normalized to max (zero lag)
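The three bullets above translate directly into code. A sketch for a single subband, assuming the cochlea filterbank has already been applied to produce the input signal:

```python
import numpy as np

def short_time_autocorr(x, sr, win_ms=25, hop_ms=10):
    """Normalized short-time autocorrelation of one subband signal:
    25 ms windows, lags from 0 up to 25 ms, hop of 10 ms, and each
    frame normalized by its zero-lag (maximum) value."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    rows = []
    for start in range(0, len(x) - 2 * win + 1, hop):
        seg = x[start:start + 2 * win]
        # r[k] = sum_n seg[n+k] * seg[n]  for lags k = 0 .. win
        r = np.correlate(seg, seg[:win], mode='valid')
        rows.append(r / (r[0] + 1e-12))        # normalize to zero lag
    return np.array(rows)
```

For a periodic subband, the lag of the first strong off-zero peak tracks the period, which is exactly the fine time structure that magnitude-spectrum features like MFCCs discard.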
Retrieval Examples
• High precision for in-domain top hits
3. Speech Enhancement
• Noisy speech scenarios:
  • ambient recording (background noise)
  • communication channel (processing distortion)
[Figure: example spectrograms (freq/Hz vs. time/s) and channel level distributions (level/dB, incl. the 1500 Hz channel) for two noisy recordings: "CAR KIT - BP 101 43086 20111025 160708" and "HOME LAND - BP 101 84943 20111020 144955"]
RPCA Enhancement
• Decompose spectrogram into sparse + low-rank
• Sparse activation H of dictionary W
• ASR benefits:
\min_{H,L,S} \; \lambda_H \|H\|_1 + \lambda_L \|L\|_* + \lambda_S \|S\|_1 + I_+(H) \quad \text{s.t.} \quad Y = WH + L + S
            C     S     D     I
Orig        6.8  10.6  82.6   0.7
RPCA       10.8  36.5  52.7   0.5
wie+RPCA   10.4  40.1  49.5   2.1
(C/S/D/I: ASR word correct / substitution / deletion / insertion rates, %)
Chen, McFee & Ellis ’14
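For intuition, here is a minimal inexact-ALM sketch of the two-term core of this kind of objective (low-rank L plus sparse S), dropping the dictionary term WH and the nonnegativity indicator I+(H). The λ value and μ schedule follow the standard Candès et al. recipe, which is an assumption here, not the slide's method:

```python
import numpy as np

def rpca(Y, lam=None, n_iter=50):
    """Inexact augmented-Lagrangian sketch of plain Robust PCA:
    Y ~= L (low rank) + S (sparse), minimizing ||L||_* + lam * ||S||_1."""
    m, n = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(Y, 2)           # penalty on the constraint
    mu_max = mu * 1e7
    L = np.zeros_like(Y); S = np.zeros_like(Y); Z = np.zeros_like(Y)
    for _ in range(n_iter):
        # low-rank step: singular-value thresholding
        U, sig, Vt = np.linalg.svd(Y - S + Z / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse step: elementwise soft-thresholding
        R = Y - L + Z / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Z += mu * (Y - L - S)                  # dual (multiplier) update
        mu = min(mu * 1.5, mu_max)
    return L, S
```

Applied to a noisy-speech magnitude spectrogram, L absorbs the slowly varying background and S (or, in the slide's model, the dictionary activations WH) captures the speech-like foreground.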
Classification Pitch Tracker
• SAcC: MLP trained on noisy speech with ground-truth pitch track targets
• Large benefits for in-domain noisy speech
Lee & Ellis ’12
[Figure: pitch tracking error (PTE, %) vs. SNR (dB) on FDA with RBF and pink noise, comparing YIN, Wu, get_f0, and SAcC]
Pitch-Normalized Enhancement
• Use noise-robust pitch tracker for enhancement?
• Normalize voice pitch
• Fixed-pitch enhancement
• Reimpose pitch
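The "normalize voice pitch" step can be sketched as variable-rate resampling driven by the pitch track; reimposing the original pitch afterwards would use the reciprocal read rate. The function name, frame hop, and the 100 Hz target are illustrative assumptions:

```python
import numpy as np

def flatten_pitch(x, f0_track, hop, target_f0=100.0):
    """Variable-rate resampling that warps x so its time-varying pitch
    f0_track (one Hz value per `hop` input samples) becomes target_f0."""
    # interpolate the frame-rate f0 track to one value per input sample
    f0 = np.interp(np.arange(len(x), dtype=float),
                   np.arange(len(f0_track)) * float(hop), f0_track)
    # output frequency = f0(pos) * d(pos)/dn, so stepping the read
    # position by target_f0 / f0(pos) pins every frame at target_f0
    pos, p = [], 0.0
    while p < len(x) - 1:
        pos.append(p)
        p += target_f0 / f0[int(p)]
    return np.interp(np.asarray(pos), np.arange(len(x)), x)
```

With the pitch pinned to a constant, the enhancement stage can use a fixed harmonic comb instead of tracking a moving fundamental, which is the point of the normalize/enhance/reimpose chain.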
[Figure: pitchgrams (frequency 0–1000 Hz vs. time/s) through the processing chain: noisy signal, clean signal, smoothed pitch track (pitch / smoothed pvx), signal resampled to pitch = 100 Hz, fixed-pitch filtered output (Filtered − pvsmooth), and result resampled back to the original pitch; levels −40 to 20 dB]