Environmental Sounds - Dan Ellis 2013-06-01
1. Environmental Sound Recognition
2. Foreground Events
3. Background Retrieval
4. Labels & Annotation
5. Future Directions
Recognizing and Classifying Environmental Sounds
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/
1. What is hearing for?
• Hearing = getting information from sound
  predators/prey
  communication
• Environmental sound recognition is fundamental
http://news.bbc.co.uk/2/hi/science/nature/7130484.stm
Environmental Sound Perception
• What do people hear?
  sources, ambience
• Mixtures are the rule
[Figure: spectrogram of a "City" ambience (0-9 s, up to 4 kHz) with event annotations from ten listeners (S1-S10): car horns, crashes, squeals, door slams, acceleration, etc. Agreement on each event: Horn1 10/10, Crash 10/10, Horn2 5/10, Truck 7/10, Horn3 5/10, Squeal 6/10, Horn4 8/10, Horn5 10/10.]
Ellis 1996
Sound Scene Evaluations
• Evaluations are good for research
  help researchers, help funders
• A decade of evaluations:
Metrics: SNR, Frame Acc, Event Error Rate, mAP
[Figure: timeline of evaluations, 2003-2013: NIST Meeting Room and CHIL/CLEAR (meeting rooms, acoustic events); ADC/MIREX (music transcription); SASSEC/SiSEC (source separation); SSC/PASCAL and CHiME (speech separation); TRECVID MED and VideoCLEF/MediaEval (video soundtrack classification); Albayzin (segmentation); D-CASE.]
DCASE (2012-2013)
• 2012 IEEE AASP Challenge: "Detection and Classification of Acoustic Scenes and Events"
  Systems submitted Mar 2013
  Results at WASPAA, Oct 2013
  2 tasks...
• Task 1: Scene classification
  10 classes x 10 examples x 30 s
  - street, supermarket, restaurant, office, park ...
  evaluate on 100 examples (~1 hour total) by classification accuracy
[Figure: 30 s spectrograms (0-4 kHz) of example scenes: bus, busy street, office, open-air market, park, quiet street, restaurant, supermarket, tube, tube station.]
Giannoulis et al. 2013
DCASE Event Detection
• Task 2: Event detection ("office live")
  16 events x 20 training examples (~20 min total)
  knock, laugh, drawer, keys, phone ...
  evaluate on ~15 min (?)
  metrics: frame-level AEER & event-level precision-recall
TRECVID MED (2010-2013)
• "Multimedia Event Detection"
  e.g. MED2011: 15 events x 200 example videos (~60 s)
  - Making a sandwich, Birthday party, Parade, Flash mob
  evaluate by mean Average Precision over 10k-100k videos (200-2,000 hours)
  audio and video ...
  - participants have annotated ~1000 videos (> 10 h)
E009 Getting a Vehicle Unstuck
Over et al. 2011
Consumer Video Dataset
• Columbia Consumer Video (CCV)
  9,317 videos / 210 hours
  20 concepts based on consumer user study
  Labeled via Amazon Mechanical Turk
Y-G. Jiang et al. 2011
Environmental Sound Motivations
• Audio Lifelog Diarization
• Consumer Video Classification & Search
• Real-time hearing prosthesis app
• Robot environment sensitivity
• Understanding hearing
[Figure: audio lifelog diarization of two days (2004-09-13 and 2004-09-14, 09:00-18:00), segmented into episodes such as preschool, cafe, lecture, office, outdoor, group, lab, and meetings, with speaker labels (Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?).]
2. Foreground Event Recognition
• "Events" are what we hear / notice
• ASR approach?
  events = words? what are subwords?
  need labeled data
  but ... mature tools are great
Reyes-Gomez & Ellis 2003
[Figure: spectrogram of "dogBarks2" (1.15-2.25 s) with the decoded HMM state sequence (s2, s3, s4, s5, s7) aligned above it.]
Transient Features
• Transients = foreground events?
• Onset detector finds energy bursts
  best SNR
• PCA basis to represent each
  300 ms x auditory freq
• "bag of transients"
Cotton, Ellis, Loui ’11
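The transient-feature recipe above can be sketched in Python: flag energy bursts with a simple onset detector, cut a fixed-length time-frequency patch at each onset, and learn a PCA basis over the patch collection. This is an illustrative reconstruction, not the authors' code; the threshold, the 30-frame (~300 ms at a 10 ms hop) patch length, and the component count are placeholder choices.

```python
import numpy as np

def detect_onsets(envelope, threshold=2.0):
    """Flag frames where energy rises sharply above the typical frame-to-frame change."""
    diff = np.diff(envelope, prepend=envelope[0])
    med = np.median(np.abs(diff)) + 1e-9
    return np.where(diff > threshold * med)[0]

def transient_patches(spec, onsets, patch_frames=30):
    """Cut a fixed-length time-frequency patch starting at each onset, flattened."""
    patches = []
    for t in onsets:
        if t + patch_frames <= spec.shape[1]:
            patches.append(spec[:, t:t + patch_frames].ravel())
    return np.array(patches)

def pca_basis(patches, n_components=20):
    """Learn a PCA basis over the patch collection via SVD of the centered data."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:n_components]
```

A clip is then described as a "bag of transients": each patch projected onto the basis, with the clip summarized by statistics of those projections.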
Transient Features
• Results show a small benefit
  similar to MFCC baseline?
• Examine clusters
  looking for semantic consistency...
  link cluster to label
[Figure: per-class average precision on CCV for guess / MFCC ftrs / event ftrs / fusion, over classes from museum, picnic, wedding, and animal up through singing, cheer, crowd, music, and group of 3+.]
NMF Transient Features
• Decompose spectrograms into templates + activations
  well-behaved gradient descent algorithm
  2D patches
  sparsity control
  computation time...
Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07
X = W · H
[Figure: original mixture spectrogram (0-10 s, ~442-5478 Hz) decomposed into three NMF bases with L1/L2 sparsity settings.]
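The decomposition X = W · H can be illustrated with plain multiplicative-update NMF (Lee & Seung's Euclidean rule). The slide's 2-D patches and sparsity control are omitted here, so treat this as a minimal sketch of the factorization rather than the system described.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0):
    """Factor a nonnegative spectrogram X (freq x time) into spectral
    templates W (freq x rank) and activations H (rank x time) using
    multiplicative updates for the Euclidean cost ||X - WH||^2."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + 1e-3
    H = rng.random((rank, T)) + 1e-3
    for _ in range(n_iter):
        # Each update is a ratio of nonnegative terms, so W and H stay >= 0
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

Because the updates only rescale entries, nonnegativity is preserved automatically, which is what makes the templates interpretable as spectral patches.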
NMF Transient Features
• Learn 20 patches from CLEAR Meeting Room events
• Compare to MFCC-HMM detector
Cotton & Ellis ’11
[Figure: the 20 learned NMF patches (0.5 s each, ~442-3474 Hz); Acoustic Event Error Rate (AEER%) vs. SNR (10-30 dB) for MFCC, NMF, and combined systems.]
• NMF more noise-robust
  combines well ...
Why Are Events Hard?
• Events are short
  target sounds may occupy only a few % of time
• Events are varied
  what is the vocabulary? what are the prototypes?
  source & channel variability
• Critical information is in fine-time structure
  onset transient etc.
  poor match to classic frame-spectral-envelope features
3. Background Retrieval
• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance)
  = "bag of features"
• Classify by e.g. Mahalanobis distance + SVM
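A minimal sketch of the bag-of-features clip description: per-frame MFCCs summarized by their mean and covariance. With 20 cepstra, the mean plus the covariance's upper triangle gives a 230-dimensional vector (matching the "mfccs-230" features of a later slide). MFCC extraction itself is assumed to happen upstream.

```python
import numpy as np

def clip_features(mfcc_frames):
    """Summarize a clip's frame-level MFCCs (n_frames x n_ceps) by their
    mean and covariance -- the 'bag of features' clip description."""
    mu = mfcc_frames.mean(axis=0)
    cov = np.cov(mfcc_frames, rowvar=False)
    # The covariance is symmetric, so keep only its upper triangle
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mu, cov[iu]])
```

These fixed-length clip vectors can then be fed to any standard classifier (the slide suggests Mahalanobis distance or an SVM).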
[Figure: pipeline from a video soundtrack spectrogram (clip VTS_04_0001) to per-frame MFCC features to the clip's MFCC covariance matrix.]
K. Lee & Ellis 2010
Retrieval Evaluation
• Rank large test set by match to category
• Precision-Recall
[Figure: CCV precision-recall curves (mfcc+sbpca): MusicPerformance 0.69, Birthday 0.49, Graduation 0.46, IceSkating 0.42, Swimming 0.37, Dog 0.27, WeddingReception 0.18; per-class AP matrix over the 20 CCV classes plus RAND (mean AP = 0.345).]
• mean Average Precision
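Average Precision, the per-class retrieval metric behind these curves, is the mean of the precision measured at each relevant item after ranking the test set by classifier score; mAP averages it over classes. A minimal implementation:

```python
import numpy as np

def average_precision(scores, labels):
    """AP: rank items by descending score, then average the precision
    observed at the rank of each positive item (labels are 0/1)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                      # positives seen so far
    precisions = hits / np.arange(1, len(labels) + 1)
    return precisions[labels == 1].mean()
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] rank the positives at 1 and 3, giving AP = (1/1 + 2/3) / 2 ≈ 0.833.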
Retrieval Examples
• High precision for top hits (in-domain)
Sound Texture Features
• Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
• Subband distributions & env x-corrs
  Mahalanobis distance ...
McDermott et al. '09; Ellis, Zheng, McDermott '11
[Figure: texture feature pipeline: mel filterbank (18 chans) → automatic gain control → envelope moments (mean, var, skew, kurt: 18 x 4), modulation energy (FFT, octave bins 0.5-16 Hz: 18 x 6), and cross-band envelope correlations (318 samples); example texture features for an "urban cheer clap" clip and a "quiet dubbed speech music" clip.]
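A sketch of the envelope statistics in the pipeline above: per-band moments (mean, variance, skew, kurtosis) plus cross-band envelope correlations. Modulation energy is omitted for brevity, and the subband envelopes are assumed to be computed upstream (e.g. from an 18-channel mel filterbank with gain control).

```python
import numpy as np

def texture_stats(envelopes):
    """Texture statistics of subband envelopes (n_bands x n_frames):
    four moments per band plus the upper triangle of the cross-band
    envelope correlation matrix."""
    mu = envelopes.mean(axis=1)
    var = envelopes.var(axis=1)
    z = (envelopes - mu[:, None]) / (np.sqrt(var)[:, None] + 1e-9)
    skew = (z ** 3).mean(axis=1)
    kurt = (z ** 4).mean(axis=1)
    xcorr = np.corrcoef(envelopes)               # cross-band correlations
    iu = np.triu_indices(xcorr.shape[0], k=1)    # drop self-correlations
    return np.concatenate([mu, var, skew, kurt, xcorr[iu]])
```

With 18 bands this gives 18 × 4 moment values plus 18 × 17 / 2 = 153 correlations, a 225-dimensional texture vector per clip.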
Sound Texture Features
• Test on MED 2010 development data
  10 audio-oriented manual labels
[Figure: normalized texture-feature means and std devs (moments M V S K, modulation spectra ms, cross-band correlations cbc) for the labels rural, urban, quiet, noisy, other, dubbed, speech, music, cheer, clap; AP by class for feature subsets M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, MVSK+ms+cbc and for mfcc, txtr, mfcc+txtr.]
• Per-class stats
  relate dimensions to classes?
• Perform ~ same as MFCCs
  covariance ~ texture?
Auditory Model Features
• Subband Autocorrelation PCA (SBPCA)
  Simplified version of Lyon et al. system
  10x faster (RT x 5 → RT / 2)
• Captures fine time structure in multiple bands
  .. missing in MFCC features
[Figure: SBPCA pipeline: sound → cochlea filterbank → short-time autocorrelation (delay line) → correlogram slice (frequency channels x lag) → per-subband VQ → histogram → feature vector.]
Lyon et al. 2010; Cotton & Ellis 2013
Subband Autocorrelation
• Autocorrelation stabilizes fine time structure
  25 ms window, lags up to 25 ms
  calculated every 10 ms
  normalized to max (zero lag)
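Those parameters translate directly into code: per subband, a short-time autocorrelation over 25 ms windows every 10 ms, normalized by the zero-lag value. A sketch (the cochlear filterbank that produces each subband signal is assumed to exist upstream):

```python
import numpy as np

def subband_autocorr(subband, sr, win_ms=25, hop_ms=10):
    """Short-time autocorrelation of one subband: 25 ms windows every
    10 ms, lags 0..win-1, each frame normalized by its zero-lag value."""
    win = int(sr * win_ms // 1000)
    hop = int(sr * hop_ms // 1000)
    frames = []
    for start in range(0, len(subband) - win + 1, hop):
        seg = subband[start:start + win]
        # ac[l] = sum_n seg[n] * seg[n + l]  (biased autocorrelation)
        ac = np.array([np.dot(seg[:win - l], seg[l:]) for l in range(win)])
        frames.append(ac / (ac[0] + 1e-9))
    return np.array(frames)
```

For a periodic subband signal the normalized frames show a peak at the lag of the period, which is the fine time structure the PCA stage then summarizes.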
Auditory Model Feature Results
• SAI and SBPCA close to MFCC baseline
• Fusing MFCC and SBPCA improves mAP by 15% rel
  mAP: 0.35 → 0.40
• Calculation time
  MFCC: 6 hours
  SAI: 1087 hours
  SBPCA: 110 hours
[Figure: per-class Average Precision on the 20 CCV classes plus mAP for MFCC, SAI, SBPCA, and their fusions (MFCC+SBPCA, MFCC+SAI, SAI+SBPCA, all three).]
What is Being Recognized?
• Soundtracks represented by global features
  MFCC covariance, codebook histograms
  What are the critical parts of the sound?
[Figure: spectrogram of a 70 s "Dog" video (k5tiYqzvuoc) with the 20 CCV classifiers' responses plotted over time.]
Semantic Audio Features
• Train classifiers on related labeled data
  defines a new "semantic" feature space
• Use for target classifier, or combo
[Figure: soundtrack → MFCCs → pre-trained SVMs → semantic features → segment statistics → event classifiers; mAP comparison of mfccs-230, sema100mfcc, sbpcas-4000, sema100sbpca, and sema100sum.]
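The semantic-feature idea reduces to very little code: run a clip's low-level features through a bank of pre-trained classifiers and use the vector of their scores as the new representation, one dimension per known category. The classifiers below are stand-in callables, not the actual pre-trained SVMs from the pipeline.

```python
import numpy as np

def semantic_features(clip_feats, pretrained_clfs):
    """Map a low-level clip feature vector through a bank of pre-trained
    classifiers; the vector of their scores is the 'semantic' feature."""
    return np.array([clf(clip_feats) for clf in pretrained_clfs])
```

The appeal is that the semantic space is low-dimensional and its axes are interpretable, so a target-category classifier trained on it (or fused with the raw features) needs fewer labeled examples.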
4. Labels & Annotation
• "Semantic Features" are a promising approach
  but we need good coverage...
  how to learn more categories?
• Annotation is expensive
  fine time annotation > 10x real-time
  a few hours are available
• What to label?
  generic vs. task-specific
Burger et al. ’12
animal singing clatter anim_bird music_sing rustle anim_cat music scratch anim_ghoat knock hammer anim_horse thud washboard human_noise clap applause laugh click whistle scream bang squeak child beep tone mumble engine_quiet sirene speech engine_light water speech_ne power_tool micro_blow radio engine_heavy wind white_noise cheer other_creak crowd
BBC Audio Semantic Classes
• BBC Sound Effects Library
  2238 tracks (60 h), short descriptions
• Use top 45 keywords
[Examples: SFX001-04-01 Wood Fire Inside Stove 5:07; SFX001-05-01 City Skyline 9:46; SFX001-06-01 High Street With Traffic, Footsteps; SFX001-07-01 Car Wash Automatic, Wash Phase Inside; SFX001-08-01 Motor Cycle Yamaha RD 350.]
[Keyword counts: 290 footsteps, 267 on, 240 animals, 197 ambience, 193 interior, ..., 59 men, 59 general, 55 switch, 53 starts, 53 crowds, ...]
[Figure: per-keyword Average Precision matrix (mean = 0.678) over the 45 BBC keywords (stairs, pavement, walking, women, running, warfare, wooden, footsteps, country, wood, men, horses, emergency, construction, futuristic, ambience, babies, sports, steam, cat, vocals, siren, animals, rhythms, animal, birds, rural, electronic, horse, aircraft, speech, cars, car, boat, transport, urban, train, crowd, electric, traffic, street, children, household, crowds, voices) plus RAND.]
• Added as "semantic units"
  some redundancy visible in mutual APs
BBC Audio Semantic Classes
• Limited semantic correspondence
[Figure: spectrogram of SFX051-05 (ambience crowds, ~250 s) with classifier responses for all 45 keywords. Track description: "Ambience, London Covent Garden Exterior, Crowds In The Piazza On A Busy Saturday Afternoon, Occasional Busker's Music In The Distance".]
Label Temporal Refinement
• Audio ground truth at coarse time resolution
  better-focused labels give better classifiers?
  but little information in very short time frames
• Train classifiers on shorter (2 sec) segments?
  Initial labels apply to whole clip
  Relabel based on most likely segments in clip
  Retrain classifier
[Figure: iteration 0 trains an SVM from whole-clip tags (e.g. "Parade" on videos 0EwtL2yr0y8 and 03FF1Sz53YU); its per-segment posteriors relabel the time segments, and iteration 1 retrains the SVM on the refined labels.]
K Lee, Ellis, Loui ’10
Label Temporal Refinement
• Refining labels is "Multiple Instance Learning"
  "Positive" clips have at least one +ve frame
  "Negative" clips are all -ve
• Refine based on previous classifier's scores
  threshold from CDFs of +ve and -ve frames
  mAP improves ~10% after a few iterations
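One relabeling pass of the refinement loop can be sketched as follows: segments in positive clips keep their label only if they score above a threshold derived from the score distribution, while negative clips stay entirely negative. The slide derives the threshold from the CDFs of positive and negative frames; a quantile of the positive-clip scores stands in for that here.

```python
import numpy as np

def refine_labels(seg_scores, clip_labels, pos_quantile=0.5):
    """One multiple-instance relabeling pass. seg_scores is a list of
    per-segment score arrays (one per clip); clip_labels are 0/1 clip
    tags. Positive clips keep at least one positive segment."""
    pos_scores = np.concatenate(
        [s for s, y in zip(seg_scores, clip_labels) if y == 1])
    thr = np.quantile(pos_scores, pos_quantile)
    new_labels = []
    for s, y in zip(seg_scores, clip_labels):
        if y == 1:
            lab = (s >= thr).astype(int)
            if lab.sum() == 0:            # MIL constraint: >= 1 positive
                lab[np.argmax(s)] = 1
        else:
            lab = np.zeros_like(s, dtype=int)  # negatives are all -ve
        new_labels.append(lab)
    return new_labels
```

Retraining the segment classifier on these refined labels and iterating is the loop the previous slide illustrates.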
5. Future: Tasks & Metrics
• Environmental sound recognition: what is it good for?
  media content description ("Recounting")
  environmental awareness
• What are the right ways to evaluate?
  task-specific metrics: AEER, F-measure
  downstream tasks: WER, mAP
  real applications: archive search, aware devices
Labels & Annotations
• Training data: quality vs. quantity
  quality costs:
  - DCASE ~ 0.3 h
  - TRECVID MED ~ 10 h
  quantity always wins
• Opportunistic labeling
  e.g. Sound Effects library, subtitles ...
  need refinement strategies
• Existing annotations indicate interest
Susanne Burger CMU
Source Separation
• Separated sources make event detection easy
  "separate then recognize" paradigm
  integrated solution more powerful...
• Environmental Source Separation is ill-defined
  relevant "sources" are listener-defined
  environment description addresses this
  Environment recognition for source separation
[Diagram: "separate then recognize" (mix → identify target energy → t-f masking + resynthesis → ASR with source knowledge and speech models) vs. an integrated solution (mix → find best words model directly, using speech models).]
Summary
• (Machine) Listening: getting useful information from sound
• Foreground event recognition
  ... by focusing on peak energy patches
• Background sound retrieval
  ... from long-time statistics
• Data, Labels, and Task
  ... what are the sources of interest?
References 1/2
• Samer Abdallah, Mark Plumbley, "Polyphonic music transcription by non-negative sparse coding of power spectra," ISMIR, 2004, pp. 10-14.
• Susanne Burger, Qin Jin, Peter F. Schulam, Florian Metze, “Noisemes: Manual annotation of environmental noise in audio streams,” Technical Report LTI-12-017, CMU, 2012.
• Courtenay Cotton, Dan Ellis, & Alex Loui, “Soundtrack classification by transient events,” IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, “Spectral vs. Spectro-Temporal Features for Acoustic Event Classification,” IEEE WASPAA, 2011, 69-72.
• Courtenay Cotton and Dan Ellis, “Subband Autocorrelation Features for Video Soundtrack Classification,” IEEE ICASSP, Vancouver, May 2013, 8663-8666.
• Dan Ellis, “Prediction-driven computational auditory scene analysis.” Ph.D. thesis, MIT Dept of EECS, 1996.
• Dan Ellis, Xiaohong Zheng, Josh McDermott, “Classifying soundtracks with audio texture features,” IEEE ICASSP, Prague, May 2011.
• Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange and Mark Plumbley, “Detection and Classification of Acoustic Scenes and Events,” Technical Report EECSRR-13-01, QMUL School of EE and CS, 2013.
• Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Dan Ellis, Alex Loui, “Consumer video understanding: A benchmark database and an evaluation of human and machine performance,” ACM ICMR, Apr. 2011, p. 29.
References 2/2
• Keansub Lee, Dan Ellis, Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, Dallas, Apr 2010, pp. 2278-2281.
• Keansub Lee & Dan Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE TASLP. 18(6): 1406-1416, Aug. 2010.
• R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, pp. 2390–2416, Sept. 2010.
• Josh McDermott, Andrew Oxenham, Eero Simoncelli, “Sound texture synthesis via filter statistics,” IEEE WASPAA, 2009, 297-300.
• Paul Over, George Awad, Jon Fiscus, Brian Antonishek, Martial Michel, Alan Smeaton, Wessel Kraaij, Georges Quénot, "TRECVID 2010 - An overview of the goals, tasks, data, evaluation mechanisms, and metrics," Technical Report, NIST, 2011.
• Manuel Reyes-Gomez & Dan Ellis, “Selection, Parameter Estimation, and Discriminative Training of Hidden Markov Models for General Audio Modeling,” IEEE ICME, Baltimore, 2003, I-73-76.
• Paris Smaragdis & Judith Brown, “Non-negative matrix factorization for polyphonic music transcription,” IEEE WASPAA, 2003, p. 177-180.
• Tuomas Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE TASLP 15(3): 1066-1074, 2007.
AcknowledgmentSupported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.