mpeg7-1998feb

8/14/2019 MPEG7-1998feb

1/20

Audio Indexing - Dan Ellis 1998feb04 - 1

Automatic audio analysis

for content description & indexing

Dan Ellis

International Computer Science Institute, Berkeley CA

Outline

Auditory Scene Analysis (ASA)

Computational ASA (CASA)

Prediction-driven CASA

Speech recognition & sound mixtures

Implications for content analysis

1

2

3

4

5

8/14/2019 MPEG7-1998feb

2/20


200

400

1000

2000

4000

f/Hzcity22

0 1 2 3 4 5 6 7 8 9time/s

70

60

50

40

dB

Auditory Scene Analysis

The organization of complex sound scenes

according to their inferred sources

Sounds rarely occur in isolation

- organization required for useful information

Human audition is very effective

- unexpectedly difficult to model

Correct analysis defined by goal

- source shows independence, continuity

ecological constraints enable organization

1
http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/

8/14/2019 MPEG7-1998feb

3/20


Psychology of ASA

Extensive experimental research

- organization of simple pieces

(sinusoids & white noise)

- streaming, pitch perception, double vowels

Auditory Scene Analysis [Bregman 1990]

grouping rules

- common onset/offset/modulation,

harmonicity, spatial location

Debated... (Darwin, Carlyon, Moore, Remez)

(from

Darwin 1996)

8/14/2019 MPEG7-1998feb

4/20


Computational Auditory Scene Analysis

(CASA)

Automatic sound organization?

- convert an undifferentiated signal into a

description in terms of different sources

Translate psych. rules into programs?

- representations to reveal common onset,

harmonicity ...

Motivations & Applications

- its a puzzle: new processing principles?

- real-world interactive systems (speech, robots)

- hearing prostheses (enhancement, description)

- advanced processing (remixing)

- multimedia indexing...

2

200

400

1000

2000

4000

f/Hzcity22

0 1 2 3 4 5 6 7 8 9time/s

70

60

50

40

dB

car noise

horndoor crash

horn

yell

8/14/2019 MPEG7-1998feb

5/20


CASA survey

Early work on co-channel speech

- listeners benefit from pitch difference

- algorithms for separating periodicities

Utterance-sized signals need more

- cannot predict number of signals (0, 1, 2 ...)

- birth/death processes

Ultimately, more constraints needed

- nonperiodic signals- masked cues

- ambiguous signals

8/14/2019 MPEG7-1998feb

6/20


0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300

400

600

1000

1500

2000

3000

frq/Hzbrn1h.aif

0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300

400

600

1000

1500

2000

3000

frq/Hzbrn1h.fi.aif

CASA1: Periodic pieces

Weintraub 1985

- separate male & female voices

- find periodicities in each frequency channel by

auto-coincidence

- number of voices is hidden state

Cooke & Brown (1991-3)

- divide time-frequency plane into elements

- apply grouping rules to form sources

- pull single periodic target out of noise
http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnmix.aif/

8/14/2019 MPEG7-1998feb

7/20


CASA2: Hypothesis systems

Okuno et al. (1994-)

- tracers follow each harmonic + noise agent

- residue-driven: account for whole signal

Klassner 1996

- search for a combination of templates

- high-level hypotheses permit front-end tuning

Ellis 1996

- model for events perceived in dense scenes

- prediction-driven: observations - hypotheses

(a)

TIME

1.0 2.0 3.0 4.0 sec

Buzzer-Alarm

Glass-Clink

Phone-RingSiren-Chirp

420 Hz460 Hz500 Hz

1475 Hz

2230 Hz

2540 Hz

3760 Hz

2350 Hz

1675 Hz

950 Hz

(b)

1.0 2.0 3.0 4.0 sec

TIME

8/14/2019 MPEG7-1998feb

8/20


CASA3: Other approaches

Blind source separation (Bell & Sejnowski)

- find exact separation parameters by maximizing

statistic e.g. signal independence

HMM decomposition (RK Moore)

- recover combined source states directly

Neural models (Malsburg, Wang & Brown)

- avoid implausible AI methods (search, lists)

- oscillators substitute for iteration?

8/14/2019 MPEG7-1998feb

9/20


input

mixture

signalfeatures

discreteobjects

Front end Objectformation Groupingrules Sourcegroups

inputmixture

signalfeatures

predictionerrors

hypotheses

predictedfeatures

Front end Compare& reconcile

Hypothesismanagement

Predict& combine

Periodiccomponents

Noisecomponents

Prediction-driven CASA

Perception is not direct

but a searchfor plausible hypotheses

Data-driven...

vs. Prediction-driven

Motivations

- detect non-tonal events (noise & clicks)- support restoration illusions...

hooks for high-level knowledge

+ complete explanation, multiple hypotheses,

resynthesis

3

8/14/2019 MPEG7-1998feb

10/20


1000

2000

4000

f/Hzptshort

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4time/s

Analyzing the continuity illusion

Interrupted tone heard as continuous

- .. if the interruption could be a masker

Data-driven just sees gaps

Prediction-driven can accommodate

- special case or general principle?
http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/

8/14/2019 MPEG7-1998feb

11/20


Phonemic Restoration (Warren 1970)

Another illusion instance

Inference relies on high-level semantics

Incorporating knowledge into models?

1.2 1.3 1.4 1.5 1.6 1.7 time/s0

500

1000

1500

2000

2500

3000

3500frq/Hz

nsoffee.aif
http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/

8/14/2019 MPEG7-1998feb

12/20


Subjective ground-truth in mixtures?

Listening tests collect perceived events:

Consistent answers:

S3crash (notcar)

S3closeup car

S31sthorn

S1slam

S4horn1

S4crash

S5Honk

S5Trash can

S5Acceleration

S6double horn

S1honk,honk

S6slam

S6dopplerhorn

S6acceleration

S7gunshot

S7hornS7horn2

S7horn3

S8carhorns

S8carhorns

S8large objectcrash

S8truck engine

S1rev up/passing

S9horn 2

S9doorSlam?

S9horn 5

S10carhorn

S10doorslamming

S10wheels on road

S2firstdouble horn

S2crash

S2horn during crash

S2truck accelerating

200

400

1000

2000

4000

f/HzCity

0 1 2 3 4 5 6 7 8 9

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

8/14/2019 MPEG7-1998feb

13/20


PDCASA example:

City-street ambience

Problems

- error allocation - rating hypotheses

- source hierarchy - resynthesis

70

60

50

40

dB

200400

100020004000

f/HzNoise1

200400

10002000

4000

f/HzNoise2,Click1

200400

10002000

4000

f/HzCity

0 1 2 3 4 5 6 7 8 9

50100

200400

1000

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)

Squeal (6/10)

Horn4 (8/10)Horn5 (10/10)

0 1 2 3 4 5 6 7 8 9time/s

200400

10002000

4000

f/Hz Wefts14

50100

200400

1000

Weft5 Wefts6,7 Weft8 Wefts912
http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://cityresyn.aif/http://cityorig.aif/

8/14/2019 MPEG7-1998feb

14/20


Speech recognition

& sound mixtures

Conventional speech recognition:

- signal assumed entirely speech

- find valid labelling by discrete labels

- class models from training data Some problems:

- need to ignore lexically-irrelevant variation

(microphone, voice pitch etc.)

- compact feature space everything speech-like

Very fragile to nonspeech, background

- scene-analysis methods very attractive...

4

Feature

extraction

Phoneme

classifier

HMM

decodersignal wordslow-dim.features

phonemeprobabilities

8/14/2019 MPEG7-1998feb

15/20


CASA for speech recognition

Data-driven: CASA as preprocessor- problems with holes (but: Cooke, Okuno)

- doesnt exploit knowledge of speech structure

Prediction-driven: speech as component

- same reconciliation of speech hypotheses

- need to express predictions in signal domain

inputmixture

Front end Compare& reconcile

Hypothesismanagement

Predict& combine

Periodiccomponents

Noisecomponents

Speechcomponents

8/14/2019 MPEG7-1998feb

16/20


Example of speech & nonspeech

Problems:

- undoing classification & normalization

- finding a starting hypothesis

- granularity of integration

5

10

15

f/Bark(a) Clap (clap8kenv.pf)

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

(b) Speech plus clap (223clenv.pf)

h# w n ay n tcl t uw f ay ah s ay ow h# v s eh v ah n h#

h# n ay ntcl

t uw f ay v ow h# s eh v ax n

nine two five oh seven

(d) Reconstruction from labels alone (223clrenvG.pf)

(e) Slowlyvarying portion of original (223clenvg.pf)

(f) Predicted speech element ( = (d)+(e) ) (223clrenv.pf)

(g) Click5 from nonspeech analysis (justclick.pf)

(h) Spurious elements from nonspeech analysis (nonclicks.pf)

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

20

40

60

dB

(c) Recognizer output
http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl.aif/

8/14/2019 MPEG7-1998feb

17/20


Implications for content analysis:

Using CASA to index soundtracks

What are the objects in a soundtrack?

- subjective definition need auditory model

Segmentation vs. classification

- low-level cues locate events

- higher-level learned knowledge to give

semantic label (footstep, crash)

... AI complete?

But: hard to separate

- illusion phenomena suggest auditory

organizationdepends on interpretation

5

200

400

1000

2000

4000

f/Hzcity22

0 1 2 3 4 5 6 7 8 9time/s

70

60

50

40

dB

car noise

horndoor crash

horn

yell

8/14/2019 MPEG7-1998feb

18/20


Using speech recognition for indexing

Active research area:Access to news broadcast databases

- e.g. Informedia (CMU), ThisL (BBC+...)

- use LV-CSR to transcribe,

then text-retrieval to find- 30-40% word error rate, still works OK

Several systems at NIST TREC workshop

Tricks to ignore nonspeech/poor speech

8/14/2019 MPEG7-1998feb

19/20


Open issues in automatic indexing

How to do ASA?

Explanation/description hierarchy

- PDCASA: generic primitives

+ constraining hierarchy

- subjective & task-dependent

Classification

- connecting subjective & objective properties

finding subjective invariants, prominence- representation of sound-object classes

Resynthesis?

- a good description should be adequate

- provided in PDCASA, but low quality- requires good knowledge-based constraints

8/14/2019 MPEG7-1998feb

20/20


Conclusions

Auditory organization is required in real

environments

We dont know how listeners do it!

- plenty of modeling interest Prediction-reconciliation can account for

illusions

- use knowledge when signal is inadequate

- important in a wider range of circumstances?

Speech recognizers are a good source of

knowledge

Automatic indexing implies synthetic listener

- need to solve a lot of modeling issues

6

mpeg7-1998feb

Documents