Environmental Sounds - Dan Ellis 2013-06-01
1. Environmental Sound Recognition
2. Foreground Events
3. Background Retrieval
4. Labels & Annotation
5. Future Directions
Recognizing and Classifying Environmental Sounds
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
1. What is hearing for?
• Hearing = getting information from sound
  predators/prey
  communication
• Environmental sound recognition is fundamental
http://news.bbc.co.uk/2/hi/science/nature/7130484.stm
Environmental Sound Perception
• What do people hear?
  sources, ambience
• Mixtures are the rule
[Figure: spectrogram of a "City" ambience (0-9 s, up to 4 kHz) with event annotations from ten listeners (S1-S10): car horns, crashes, squeals, door slams, acceleration, etc. Agreement on each event: Horn1 10/10, Crash 10/10, Horn2 5/10, Truck 7/10, Horn3 5/10, Squeal 6/10, Horn4 8/10, Horn5 10/10.]
Ellis 1996
Sound Scene Evaluations
• Evaluations are good for research
  help researchers, help funders
• A decade of evaluations:
Metrics: SNR, Frame Acc, Event Error Rate, mAP
[Figure: timeline of evaluations, 2003-2013: NIST Meeting Room and CHIL/CLEAR (meeting rooms, acoustic events); ADC/MIREX (music transcription); SASSEC/SiSEC (source separation); SSC/PASCAL and CHiME (speech separation); TRECVID MED and VideoCLEF/MediaEval (video soundtrack classification); Albayzin (segmentation); D-CASE.]
DCASE (2012-2013)
• 2012 IEEE AASP Challenge: "Detection and Classification of Acoustic Scenes and Events"
  Systems submitted Mar 2013
  Results at WASPAA, Oct 2013
  2 tasks...
• Task 1: Scene classification
  10 classes x 10 examples x 30 s
  - street, supermarket, restaurant, office, park ...
  evaluate on 100 examples (~1 hour total) by classification accuracy
[Figure: 30 s spectrograms (0-4 kHz) of example scenes: bus, busy street, office, open-air market, park, quiet street, restaurant, supermarket, tube, tube station.]
Giannoulis et al. 2013
DCASE Event Detection
• Task 2: Event detection ("office live")
  16 events x 20 training examples (~20 min total)
  knock, laugh, drawer, keys, phone ...
  evaluate on ~15 min (?)
  metrics: frame-level AEER & event-level precision-recall
TRECVID MED (2010-2013)
• "Multimedia Event Detection"
  e.g. MED2011: 15 events x 200 example videos (~60 s)
  - Making a sandwich, Birthday party, Parade, Flash mob
  evaluate by mean Average Precision over 10k-100k videos (200-2,000 hours)
  audio and video ...
  - participants have annotated ~1000 videos (> 10 h)
E009 Getting a Vehicle Unstuck
Over et al. 2011
Consumer Video Dataset
• Columbia Consumer Video (CCV)
  9,317 videos / 210 hours
  20 concepts based on consumer user study
  Labeled via Amazon Mechanical Turk
Y-G. Jiang et al. 2011
Environmental Sound Motivations
• Audio Lifelog Diarization
• Consumer Video Classification & Search
• Real-time hearing prosthesis app
• Robot environment sensitivity
• Understanding hearing
[Figure: audio lifelog diarization of two days (2004-09-13 and 2004-09-14, 09:00-18:00), segmented into episodes such as preschool, cafe, lecture, office, outdoor, group, lab, and meetings, with speaker labels (Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?).]
2. Foreground Event Recognition
• "Events" are what we hear / notice
• ASR approach?
  events = words? what are subwords?
  need labeled data
  but ... mature tools are great
Reyes-Gomez & Ellis 2003
[Figure: spectrogram of "dogBarks2" (1.15-2.25 s) with the decoded HMM state sequence (s2, s3, s4, s5, s7) aligned above it.]
Transient Features
• Transients = foreground events?
• Onset detector finds energy bursts
  best SNR
• PCA basis to represent each
  300 ms x auditory freq
• "bag of transients"
Cotton, Ellis, Loui ’11
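The transient-feature recipe above can be sketched in Python: flag energy bursts with a simple onset detector, cut a fixed-length time-frequency patch at each onset, and learn a PCA basis over the patch collection. This is an illustrative reconstruction, not the authors' code; the threshold, the 30-frame (~300 ms at a 10 ms hop) patch length, and the component count are placeholder choices.

```python
import numpy as np

def detect_onsets(envelope, threshold=2.0):
    """Flag frames where energy rises sharply above the typical frame-to-frame change."""
    diff = np.diff(envelope, prepend=envelope[0])
    med = np.median(np.abs(diff)) + 1e-9
    return np.where(diff > threshold * med)[0]

def transient_patches(spec, onsets, patch_frames=30):
    """Cut a fixed-length time-frequency patch starting at each onset, flattened."""
    patches = []
    for t in onsets:
        if t + patch_frames <= spec.shape[1]:
            patches.append(spec[:, t:t + patch_frames].ravel())
    return np.array(patches)

def pca_basis(patches, n_components=20):
    """Learn a PCA basis over the patch collection via SVD of the centered data."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:n_components]
```

A clip is then described as a "bag of transients": each patch projected onto the basis, with the clip summarized by statistics of those projections.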
Transient Features
• Results show a small benefit
  similar to MFCC baseline?
• Examine clusters
  looking for semantic consistency...
  link cluster to label
[Figure: per-class average precision on CCV for guess / MFCC ftrs / event ftrs / fusion, over classes from museum, picnic, wedding, and animal up through singing, cheer, crowd, music, and group of 3+.]
NMF Transient Features
• Decompose spectrograms into templates + activations
  well-behaved gradient descent algorithm
  2D patches
  sparsity control
  computation time...
Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07
X = W · H
[Figure: original mixture spectrogram (0-10 s, ~442-5478 Hz) decomposed into three NMF bases with L1/L2 sparsity settings.]
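The decomposition X = W · H can be illustrated with plain multiplicative-update NMF (Lee & Seung's Euclidean rule). The slide's 2-D patches and sparsity control are omitted here, so treat this as a minimal sketch of the factorization rather than the system described.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0):
    """Factor a nonnegative spectrogram X (freq x time) into spectral
    templates W (freq x rank) and activations H (rank x time) using
    multiplicative updates for the Euclidean cost ||X - WH||^2."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + 1e-3
    H = rng.random((rank, T)) + 1e-3
    for _ in range(n_iter):
        # Each update is a ratio of nonnegative terms, so W and H stay >= 0
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

Because the updates only rescale entries, nonnegativity is preserved automatically, which is what makes the templates interpretable as spectral patches.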
NMF Transient Features
• Learn 20 patches from CLEAR Meeting Room events
• Compare to MFCC-HMM detector
Cotton & Ellis ’11
[Figure: the 20 learned NMF patches (0.5 s each, ~442-3474 Hz); Acoustic Event Error Rate (AEER%) vs. SNR (10-30 dB) for MFCC, NMF, and combined systems.]
• NMF more noise-robust
  combines well ...
Why Are Events Hard?
• Events are short
  target sounds may occupy only a few % of time
• Events are varied
  what is the vocabulary? what are the prototypes?
  source & channel variability
• Critical information is in fine-time structure
  onset transient etc.
  poor match to classic frame-spectral-envelope features
3. Background Retrieval
• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance)
  = "bag of features"
• Classify by e.g. Mahalanobis distance + SVM
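A minimal sketch of the bag-of-features clip description: per-frame MFCCs summarized by their mean and covariance. With 20 cepstra, the mean plus the covariance's upper triangle gives a 230-dimensional vector (matching the "mfccs-230" features of a later slide). MFCC extraction itself is assumed to happen upstream.

```python
import numpy as np

def clip_features(mfcc_frames):
    """Summarize a clip's frame-level MFCCs (n_frames x n_ceps) by their
    mean and covariance -- the 'bag of features' clip description."""
    mu = mfcc_frames.mean(axis=0)
    cov = np.cov(mfcc_frames, rowvar=False)
    # The covariance is symmetric, so keep only its upper triangle
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mu, cov[iu]])
```

These fixed-length clip vectors can then be fed to any standard classifier (the slide suggests Mahalanobis distance or an SVM).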
[Figure: pipeline from a video soundtrack spectrogram (clip VTS_04_0001) to per-frame MFCC features to the clip's MFCC covariance matrix.]
K. Lee & Ellis 2010
Retrieval Evaluation
• Rank large test set by match to category
• Precision-Recall
[Figure: CCV precision-recall curves (mfcc+sbpca): MusicPerformance 0.69, Birthday 0.49, Graduation 0.46, IceSkating 0.42, Swimming 0.37, Dog 0.27, WeddingReception 0.18; per-class AP matrix over the 20 CCV classes plus RAND (mean AP = 0.345).]
• mean Average Precision
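Average Precision, the per-class retrieval metric behind these curves, is the mean of the precision measured at each relevant item after ranking the test set by classifier score; mAP averages it over classes. A minimal implementation:

```python
import numpy as np

def average_precision(scores, labels):
    """AP: rank items by descending score, then average the precision
    observed at the rank of each positive item (labels are 0/1)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                      # positives seen so far
    precisions = hits / np.arange(1, len(labels) + 1)
    return precisions[labels == 1].mean()
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] rank the positives at 1 and 3, giving AP = (1/1 + 2/3) / 2 ≈ 0.833.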
Retrieval Examples
• High precision for top hits (in-domain)
Sound Texture Features
• Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
• Subband distributions & env x-corrs
  Mahalanobis distance ...
McDermott et al. '09; Ellis, Zheng, McDermott '11
[Figure: texture feature pipeline: mel filterbank (18 chans) → automatic gain control → envelope moments (mean, var, skew, kurt: 18 x 4), modulation energy (FFT, octave bins 0.5-16 Hz: 18 x 6), and cross-band envelope correlations (318 samples); example texture features for an "urban cheer clap" clip and a "quiet dubbed speech music" clip.]
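A sketch of the envelope statistics in the pipeline above: per-band moments (mean, variance, skew, kurtosis) plus cross-band envelope correlations. Modulation energy is omitted for brevity, and the subband envelopes are assumed to be computed upstream (e.g. from an 18-channel mel filterbank with gain control).

```python
import numpy as np

def texture_stats(envelopes):
    """Texture statistics of subband envelopes (n_bands x n_frames):
    four moments per band plus the upper triangle of the cross-band
    envelope correlation matrix."""
    mu = envelopes.mean(axis=1)
    var = envelopes.var(axis=1)
    z = (envelopes - mu[:, None]) / (np.sqrt(var)[:, None] + 1e-9)
    skew = (z ** 3).mean(axis=1)
    kurt = (z ** 4).mean(axis=1)
    xcorr = np.corrcoef(envelopes)               # cross-band correlations
    iu = np.triu_indices(xcorr.shape[0], k=1)    # drop self-correlations
    return np.concatenate([mu, var, skew, kurt, xcorr[iu]])
```

With 18 bands this gives 18 × 4 moment values plus 18 × 17 / 2 = 153 correlations, a 225-dimensional texture vector per clip.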
Sound Texture Features
• Test on MED 2010 development data
  10 audio-oriented manual labels
[Figure: normalized texture-feature means and std devs (moments M V S K, modulation spectra ms, cross-band correlations cbc) for the labels rural, urban, quiet, noisy, other, dubbed, speech, music, cheer, clap; AP by class for feature subsets M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, MVSK+ms+cbc and for mfcc, txtr, mfcc+txtr.]
• Per-class stats
  relate dimensions to classes?
• Perform ~ same as MFCCs
  covariance ~ texture?
Auditory Model Features
• Subband Autocorrelation PCA (SBPCA)
  Simplified version of Lyon et al. system
  10x faster (RT x 5 → RT / 2)
• Captures fine time structure in multiple bands
  .. missing in MFCC features
[Figure: SBPCA pipeline: sound → cochlea filterbank → short-time autocorrelation (delay line) → correlogram slice (frequency channels x lag) → per-subband VQ → histogram → feature vector.]
Lyon et al. 2010; Cotton & Ellis 2013
Subband Autocorrelation
• Autocorrelation stabilizes fine time structure
  25 ms window, lags up to 25 ms
  calculated every 10 ms
  normalized to max (zero lag)
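Those parameters translate directly into code: per subband, a short-time autocorrelation over 25 ms windows every 10 ms, normalized by the zero-lag value. A sketch (the cochlear filterbank that produces each subband signal is assumed to exist upstream):

```python
import numpy as np

def subband_autocorr(subband, sr, win_ms=25, hop_ms=10):
    """Short-time autocorrelation of one subband: 25 ms windows every
    10 ms, lags 0..win-1, each frame normalized by its zero-lag value."""
    win = int(sr * win_ms // 1000)
    hop = int(sr * hop_ms // 1000)
    frames = []
    for start in range(0, len(subband) - win + 1, hop):
        seg = subband[start:start + win]
        # ac[l] = sum_n seg[n] * seg[n + l]  (biased autocorrelation)
        ac = np.array([np.dot(seg[:win - l], seg[l:]) for l in range(win)])
        frames.append(ac / (ac[0] + 1e-9))
    return np.array(frames)
```

For a periodic subband signal the normalized frames show a peak at the lag of the period, which is the fine time structure the PCA stage then summarizes.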
Auditory Model Feature Results
• SAI and SBPCA close to MFCC baseline
• Fusing MFCC and SBPCA improves mAP by 15% rel
  mAP: 0.35 → 0.40
• Calculation time
  MFCC: 6 hours
  SAI: 1087 hours
  SBPCA: 110 hours
[Figure: per-class Average Precision on the 20 CCV classes plus mAP for MFCC, SAI, SBPCA, and their fusions (MFCC+SBPCA, MFCC+SAI, SAI+SBPCA, all three).]
What is Being Recognized?
• Soundtracks represented by global features
  MFCC covariance, codebook histograms
  What are the critical parts of the sound?
[Figure: spectrogram of a 70 s "Dog" video (k5tiYqzvuoc) with the 20 CCV classifiers' responses plotted over time.]
Semantic Audio Features
• Train classifiers on related labeled data
  defines a new "semantic" feature space
• Use for target classifier, or combo
[Figure: soundtrack → MFCCs → pre-trained SVMs → semantic features → segment statistics → event classifiers; mAP comparison of mfccs-230, sema100mfcc, sbpcas-4000, sema100sbpca, and sema100sum.]
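The semantic-feature idea reduces to very little code: run a clip's low-level features through a bank of pre-trained classifiers and use the vector of their scores as the new representation, one dimension per known category. The classifiers below are stand-in callables, not the actual pre-trained SVMs from the pipeline.

```python
import numpy as np

def semantic_features(clip_feats, pretrained_clfs):
    """Map a low-level clip feature vector through a bank of pre-trained
    classifiers; the vector of their scores is the 'semantic' feature."""
    return np.array([clf(clip_feats) for clf in pretrained_clfs])
```

The appeal is that the semantic space is low-dimensional and its axes are interpretable, so a target-category classifier trained on it (or fused with the raw features) needs fewer labeled examples.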
4. Labels & Annotation
• "Semantic Features" are a promising approach
  but we need good coverage...
  how to learn more categories?
• Annotation is expensive
  fine time annotation > 10x real-time
  a few hours are available
• What to label?
  generic vs. task-specific
Burger et al. ’12
animal singing clatter anim_bird music_sing rustle anim_cat music scratch anim_ghoat knock hammer anim_horse thud washboard human_noise clap applause laugh click whistle scream bang squeak child beep tone mumble engine_quiet sirene speech engine_light water speech_ne power_tool micro_blow radio engine_heavy wind white_noise cheer other_creak crowd
BBC Audio Semantic Classes
• BBC Sound Effects Library
  2238 tracks (60 h), short descriptions
• Use top 45 keywords
[Examples: SFX001-04-01 Wood Fire Inside Stove 5:07; SFX001-05-01 City Skyline 9:46; SFX001-06-01 High Street With Traffic, Footsteps; SFX001-07-01 Car Wash Automatic, Wash Phase Inside; SFX001-08-01 Motor Cycle Yamaha RD 350.]
[Keyword counts: 290 footsteps, 267 on, 240 animals, 197 ambience, 193 interior, ..., 59 men, 59 general, 55 switch, 53 starts, 53 crowds, ...]
[Figure: per-keyword Average Precision matrix (mean = 0.678) over the 45 BBC keywords (stairs, pavement, walking, women, running, warfare, wooden, footsteps, country, wood, men, horses, emergency, construction, futuristic, ambience, babies, sports, steam, cat, vocals, siren, animals, rhythms, animal, birds, rural, electronic, horse, aircraft, speech, cars, car, boat, transport, urban, train, crowd, electric, traffic, street, children, household, crowds, voices) plus RAND.]
• Added as "semantic units"
  some redundancy visible in mutual APs
BBC Audio Semantic Classes
• Limited semantic correspondence
[Figure: spectrogram of SFX051-05 (ambience crowds, ~250 s) with classifier responses for all 45 keywords. Track description: "Ambience, London Covent Garden Exterior, Crowds In The Piazza On A Busy Saturday Afternoon, Occasional Busker's Music In The Distance".]
Label Temporal Refinement
• Audio ground truth at coarse time resolution
  better-focused labels give better classifiers?
  but little information in very short time frames
• Train classifiers on shorter (2 sec) segments?
  Initial labels apply to whole clip
  Relabel based on most likely segments in clip
  Retrain classifier
[Figure: iteration 0 trains an SVM from whole-clip tags (e.g. "Parade" on videos 0EwtL2yr0y8 and 03FF1Sz53YU); its per-segment posteriors relabel the time segments, and iteration 1 retrains the SVM on the refined labels.]
K Lee, Ellis, Loui ’10
Label Temporal Refinement
• Refining labels is "Multiple Instance Learning"
  "Positive" clips have at least one +ve frame
  "Negative" clips are all -ve
• Refine based on previous classifier's scores
  threshold from CDFs of +ve and -ve frames
  mAP improves ~10% after a few iterations
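One relabeling pass of the refinement loop can be sketched as follows: segments in positive clips keep their label only if they score above a threshold derived from the score distribution, while negative clips stay entirely negative. The slide derives the threshold from the CDFs of positive and negative frames; a quantile of the positive-clip scores stands in for that here.

```python
import numpy as np

def refine_labels(seg_scores, clip_labels, pos_quantile=0.5):
    """One multiple-instance relabeling pass. seg_scores is a list of
    per-segment score arrays (one per clip); clip_labels are 0/1 clip
    tags. Positive clips keep at least one positive segment."""
    pos_scores = np.concatenate(
        [s for s, y in zip(seg_scores, clip_labels) if y == 1])
    thr = np.quantile(pos_scores, pos_quantile)
    new_labels = []
    for s, y in zip(seg_scores, clip_labels):
        if y == 1:
            lab = (s >= thr).astype(int)
            if lab.sum() == 0:            # MIL constraint: >= 1 positive
                lab[np.argmax(s)] = 1
        else:
            lab = np.zeros_like(s, dtype=int)  # negatives are all -ve
        new_labels.append(lab)
    return new_labels
```

Retraining the segment classifier on these refined labels and iterating is the loop the previous slide illustrates.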
5. Future: Tasks & Metrics
• Environmental sound recognition: what is it good for?
  media content description ("Recounting")
  environmental awareness
• What are the right ways to evaluate?
  task-specific metrics: AEER, F-measure
  downstream tasks: WER, mAP
  real applications: archive search, aware devices
Labels & Annotations
• Training data: quality vs. quantity
  quality costs:
  - DCASE ~ 0.3 h
  - TRECVID MED ~ 10 h
  quantity always wins
• Opportunistic labeling
  e.g. Sound Effects library, subtitles ...
  need refinement strategies
• Existing annotations indicate interest
Susanne Burger CMU
Source Separation
• Separated sources make event detection easy
  "separate then recognize" paradigm
  integrated solution more powerful...
• Environmental Source Separation is ill-defined
  relevant "sources" are listener-defined
  environment description addresses this
  Environment recognition for source separation
[Diagram: "separate then recognize" (mix → identify target energy → t-f masking + resynthesis → ASR with source knowledge and speech models) vs. an integrated solution (mix → find best words model directly, using speech models).]
Summary
• (Machine) Listening: getting useful information from sound
• Foreground event recognition
  ... by focusing on peak energy patches
• Background sound retrieval
  ... from long-time statistics
• Data, Labels, and Task
  ... what are the sources of interest?
References 1/2
• Samer Abdallah, Mark Plumbley, "Polyphonic music transcription by non-negative sparse coding of power spectra," ISMIR, 2004, pp. 10-14.
• Susanne Burger, Qin Jin, Peter F. Schulam, Florian Metze, “Noisemes: Manual annotation of environmental noise in audio streams,” Technical Report LTI-12-017, CMU, 2012.
• Courtenay Cotton, Dan Ellis, & Alex Loui, “Soundtrack classification by transient events,” IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, “Spectral vs. Spectro-Temporal Features for Acoustic Event Classification,” IEEE WASPAA, 2011, 69-72.
• Courtenay Cotton and Dan Ellis, “Subband Autocorrelation Features for Video Soundtrack Classification,” IEEE ICASSP, Vancouver, May 2013, 8663-8666.
• Dan Ellis, “Prediction-driven computational auditory scene analysis.” Ph.D. thesis, MIT Dept of EECS, 1996.
• Dan Ellis, Xiaohong Zheng, Josh McDermott, “Classifying soundtracks with audio texture features,” IEEE ICASSP, Prague, May 2011.
• Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange and Mark Plumbley, “Detection and Classification of Acoustic Scenes and Events,” Technical Report EECSRR-13-01, QMUL School of EE and CS, 2013.
• Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Dan Ellis, Alex Loui, “Consumer video understanding: A benchmark database and an evaluation of human and machine performance,” ACM ICMR, Apr. 2011, p. 29.
References 2/2
• Keansub Lee, Dan Ellis, Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, Dallas, Apr 2010, pp. 2278-2281.
• Keansub Lee & Dan Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE TASLP. 18(6): 1406-1416, Aug. 2010.
• R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, pp. 2390–2416, Sept. 2010.
• Josh McDermott, Andrew Oxenham, Eero Simoncelli, “Sound texture synthesis via filter statistics,” IEEE WASPAA, 2009, 297-300.
• Paul Over, George Awad, Jon Fiscus, Brian Antonishek, Martial Michel, Alan Smeaton, Wessel Kraaij, Georges Quénot, "TRECVID 2010 - An overview of the goals, tasks, data, evaluation mechanisms, and metrics," Technical Report, NIST, 2011.
• Manuel Reyes-Gomez & Dan Ellis, “Selection, Parameter Estimation, and Discriminative Training of Hidden Markov Models for General Audio Modeling,” IEEE ICME, Baltimore, 2003, I-73-76.
• Paris Smaragdis & Judith Brown, “Non-negative matrix factorization for polyphonic music transcription,” IEEE WASPAA, 2003, p. 177-180.
• Tuomas Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE TASLP 15(3): 1066-1074, 2007.
AcknowledgmentSupported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.