TRANSCRIPT
-
Sound Analysis - Ellis 2008-09-05 p. /16
Research in Sound Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA

1. Personal and Consumer Audio
2. Musical Cover Song Detection
3. Binaural Source Separation
-
LabROSA Overview
[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Speech, Music, Environment; Recognition, Retrieval, Separation]
-
1. Personal Audio Archives
• Easy to record everything you hear
-
Browsing Interface
• Browsing / Diary interface
  links to other information (diary, email, photos)
  synchronize with note taking? (Stifelman & Arons)
  audio thumbnails
• Release Tools + “how to” for capture
[Figure: diary-style timeline of one day, 08:00 to 20:00 in half-hour steps, with segments labeled preschool, cafe, lecture, office, outdoor, group, lab]
-
Consumer Video
• Short video clips as the evolution of snapshots
  10-60 sec, one location, no editing
  browsing?
• More information for indexing...
  video + audio
  foreground + background
-
MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
  similar to speaker ID and music genre classification
• Full Covariance matrix of MFCCs maps the kinds of spectral shapes present
• Clip-to-clip distances for SVM classifier
  by KL or 2nd Gaussian model
[Figure: Video Soundtrack (VTS_04_0001 spectrogram, freq / kHz vs. time / sec, level / dB) → MFCC features (MFCC bin vs. time / sec) → MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension)]
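The clip-level representation above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes each clip is already available as a (frames × dimensions) array of MFCCs, fits one full-covariance Gaussian per clip, and compares clips with a symmetrized KL divergence. The function names `clip_gaussian`, `kl_gaussians` and `symmetric_kl` are hypothetical.

```python
import numpy as np

def clip_gaussian(features):
    """Summarize one clip by the mean and full covariance of its frames.

    features: (n_frames, n_dims) array of per-frame MFCCs (assumed input).
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)  # (n_dims, n_dims)
    return mean, cov

def kl_gaussians(m0, s0, m1, s1):
    """KL divergence KL(N(m0, s0) || N(m1, s1)) between two Gaussians."""
    d = len(m0)
    s1_inv = np.linalg.inv(s1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return 0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff
                  - d + logdet1 - logdet0)

def symmetric_kl(g0, g1):
    """Symmetrized KL, usable as a clip-to-clip distance."""
    return kl_gaussians(*g0, *g1) + kl_gaussians(*g1, *g0)
```

Every clip maps to the same fixed-size pair (mean, covariance) regardless of its duration, which is what lets clips of different lengths be compared directly.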
-
Audio-Only Results
• Wide range of results:
  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
[Figure: bar chart of AP (0 to 0.8) per concept (Animal, Baby, Beach, Birthday, Boat, Crowd, Group of 3+, Group of 2, Museum, Night, One person, Park, Picnic, Playground, Show, Sports, Sunset, Wedding, Dancing, Parade, Singing, Cheer, Music, Ski) for 1G + KL, 1G + Mah, GMM Hist. + pLSA, and Guess]
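For reference, AP (average precision) on a ranked list can be computed as below. This is a generic sketch of the standard definition, not necessarily the exact evaluation code used for these results; the function name `average_precision` is hypothetical.

```python
import numpy as np

def average_precision(labels_ranked):
    """Average precision of a ranked list.

    labels_ranked: 0/1 relevance labels, ordered by descending classifier score.
    Returns the mean of precision@k taken at the rank k of each positive item.
    """
    labels = np.asarray(labels_ranked, dtype=float)
    n_pos = labels.sum()
    if n_pos == 0:
        return 0.0
    hits = np.cumsum(labels)
    precision_at_k = hits / np.arange(1, len(labels) + 1)
    return float(np.sum(precision_at_k * labels) / n_pos)
```

With only one positive clip in a class, AP is just the reciprocal of that clip's rank, so a single rank change moves the score a lot. That is one way to read the large uncertainty on infrequent classes.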
-
How does it ‘feel’?
• Browser impressions: How wrong is wrong?
[Figure: top 8 hits for “Baby”]
-
2. Cover Song Detection: Chroma
• Chroma features map spectral energy into one canonical octave
  i.e. 12 semitone bins
• Can resynthesize as “Shepard Tones”
  all octaves at once
[Figure: Piano scale spectrogram (freq / kHz vs. time / sec); Piano chromatic scale IF chroma (chroma A-G vs. time / frames); 12 Shepard tone spectra (level / dB vs. freq / Hz); Shepard tone resynth (freq / kHz vs. time / sec)]
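The folding of spectral energy into 12 semitone bins can be sketched as follows. Note the assumptions: this maps plain FFT bin centers to pitch classes, whereas the figure refers to instantaneous-frequency (IF) chroma, which is more precise. The function name `chroma_from_frame` and the `fmin`/`fmax` limits are illustrative choices, not values from the slides.

```python
import numpy as np

def chroma_from_frame(frame, sr, fmin=55.0, fmax=4000.0):
    """Fold the power spectrum of one audio frame into 12 pitch classes.

    Index 0 is C and index 9 is A (MIDI note number modulo 12).
    """
    frame = np.asarray(frame, dtype=float)
    window = np.hanning(len(frame))
    power = np.abs(np.fft.rfft(frame * window)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    keep = (freqs >= fmin) & (freqs <= fmax)
    # Nearest MIDI note for each FFT bin, then discard the octave.
    midi = 69.0 + 12.0 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12

    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class, power[keep])  # accumulate across octaves
    return chroma / max(chroma.max(), 1e-12)
```

A 440 Hz tone and an 880 Hz tone both land in the same bin (A), which is the "one canonical octave" property described above.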
-
Beat-Synchronous Chroma Features
• Beat + chroma features / 30ms frames
  → average chroma within each beat
  compact; sufficient?
[Figure: mel spectrogram (freq / Mel vs. time / sec), onset strength, chroma bins vs. time / sec, and chroma bins vs. time / beats]
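Averaging chroma within each beat can be sketched as below, assuming a frame-level chroma matrix and a list of beat positions (as frame indices) are already available from some beat tracker. The function name `beat_sync_chroma` is hypothetical.

```python
import numpy as np

def beat_sync_chroma(chroma, beat_frames):
    """Average frame-level chroma between consecutive beats.

    chroma: (12, n_frames) array of frame-level chroma.
    beat_frames: sorted frame indices at which beats start.
    Returns a (12, n_beats) array; frames before the first beat are ignored,
    and the last beat extends to the end of the clip.
    """
    chroma = np.asarray(chroma, dtype=float)
    bounds = np.append(np.asarray(beat_frames, dtype=int), chroma.shape[1])
    segments = [chroma[:, start:end].mean(axis=1)
                for start, end in zip(bounds[:-1], bounds[1:])
                if end > start]
    return np.stack(segments, axis=1)
```

The result has one 12-dimensional column per beat, so two performances at different tempos end up on a comparable time axis.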
-
Cross-Correlation Matching
• Look inside global cross-correlation to find matching fragments...
xcorr = Σ_t Σ_f (C1(t, f)⋅C2(t, f)) - view along time
[Figure: beat-synchronous chroma (chroma A-G vs. time / beats) for Let It Be / Beatles (beats 11-441) and Let It Be / Nick Cave (beats 13-443), with the per-beat correlation between them (about -0.2 to 0.4) along time / beats]
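A minimal sketch of this matching step, under stated assumptions: both songs are given as beat-synchronous feature matrices of shape (12, n_beats), the features are roughly zero-mean so that unrelated alignments score near zero, and a small range of time lags is searched. The lag search and the function names are illustrative additions; the slide's formula shows only the sum at a fixed alignment.

```python
import numpy as np

def overlap(c1, c2, lag):
    """Columns of c1 and c2 that line up when c1[:, t + lag] is paired with c2[:, t]."""
    if lag >= 0:
        a, b = c1[:, lag:], c2
    else:
        a, b = c1, c2[:, -lag:]
    n = min(a.shape[1], b.shape[1])
    return a[:, :n], b[:, :n]

def xcorr_by_lag(c1, c2, max_lag):
    """Sum over time and chroma of C1 * C2, evaluated at each candidate lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    scores = np.array([np.sum(a * b)
                       for a, b in (overlap(c1, c2, lag) for lag in lags)])
    return lags, scores

def per_beat_contribution(c1, c2, lag):
    """Inner sum over chroma only: the correlation 'viewed along time'."""
    a, b = overlap(c1, c2, lag)
    return np.sum(a * b, axis=0)
```

Stretches of beats where `per_beat_contribution` stays high are candidate matching fragments, even when the whole-song score is modest.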
-
“The Meaning of Music”
The ultimate goal of this research...
• What does music evoke in a listener’s mind?
  i.e. “what does it all mean?” (metaphysics?)
  study with subjective experiments
  (then build detectors for specific responses ...?)
• What phenomena are denoted by “music”?
  i.e. delineate the “set of all music”
  (the ultimate music/nonmusic classifier?)
-
3. Binaural Source Separation
• 2 or 3 sources in reverberation
  assume just 2 ‘ears’
• Tasks:
  identify positions of sources (and number?)
  recover source signals
-
Spatial Estimation in Reverb
• Model interaural spectrum of each source
  as stationary level and time differences:
  converge via EM to a(ω), τ for each source
  mask is Pr(X(t,ω) dominated by source i)
Mandel & Ellis ’07
[Figure panels: Left, Right, ILD, IPD, Mask 1, Mask 2; 8.77 dB]
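The two interaural cues shown in the figure, ILD (level difference) and IPD (phase difference), can be computed per time-frequency cell as sketched below. This covers only the observation step; the EM fitting of per-source parameters and the resulting masks are not reproduced here. The simple `stft` helper, its frame sizes, and the function names are assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Basic Hann-windowed STFT; returns a (n_fft // 2 + 1, n_frames) complex array."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def interaural_cues(left, right, eps=1e-12):
    """Per-cell interaural level difference (dB) and phase difference (radians)."""
    L, R = stft(left), stft(right)
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))  # wrapped to (-pi, pi]
    return ild, ipd
```

For a single source without reverberation, a pure delay τ between the ears would appear as an IPD that grows linearly with frequency (wrapped to ±π), and a level difference would appear as a constant ILD offset per frequency.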
-
Spatial Estimation Results
• Modeling uncertainty improves results
  tradeoff between constraints & noisiness
[Figure: result panels labeled 2.45 dB, 9.12 dB, 0.22 dB, -2.72 dB, 12.35 dB, 8.77 dB]
-
Conclusions
• LabROSA
  information from sound ...
  ... via signal processing and machine learning
• Environmental Sound
• Music Audio
• Source Separation
• Speech, models, dolphins...