mpeg7-1998feb
TRANSCRIPT
-
8/14/2019 MPEG7-1998feb
1/20
Audio Indexing - Dan Ellis 1998feb04 - 1
Automatic audio analysis
for content description & indexing
Dan Ellis
International Computer Science Institute, Berkeley CA
Outline
Auditory Scene Analysis (ASA)
Computational ASA (CASA)
Prediction-driven CASA
Speech recognition & sound mixtures
Implications for content analysis
1
2
3
4
5
-
8/14/2019 MPEG7-1998feb
2/20
Audio Indexing - Dan Ellis 1998feb04 - 2
200
400
1000
2000
4000
f/Hzcity22
0 1 2 3 4 5 6 7 8 9time/s
70
60
50
40
dB
Auditory Scene Analysis
The organization of complex sound scenes
according to their inferred sources
Sounds rarely occur in isolation
- organization required for useful information
Human audition is very effective
- unexpectedly difficult to model
Correct analysis defined by goal
- source shows independence, continuity
ecological constraints enable organization
1
http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/ -
8/14/2019 MPEG7-1998feb
3/20
Audio Indexing - Dan Ellis 1998feb04 - 3
Psychology of ASA
Extensive experimental research
- organization of simple pieces
(sinusoids & white noise)
- streaming, pitch perception, double vowels
Auditory Scene Analysis [Bregman 1990]
grouping rules
- common onset/offset/modulation,
harmonicity, spatial location
Debated... (Darwin, Carlyon, Moore, Remez)
(from
Darwin 1996)
-
8/14/2019 MPEG7-1998feb
4/20
Audio Indexing - Dan Ellis 1998feb04 - 4
Computational Auditory Scene Analysis
(CASA)
Automatic sound organization?
- convert an undifferentiated signal into a
description in terms of different sources
Translate psych. rules into programs?
- representations to reveal common onset,
harmonicity ...
Motivations & Applications
- its a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...
2
200
400
1000
2000
4000
f/Hzcity22
0 1 2 3 4 5 6 7 8 9time/s
70
60
50
40
dB
car noise
horndoor crash
horn
yell
-
8/14/2019 MPEG7-1998feb
5/20
Audio Indexing - Dan Ellis 1998feb04 - 5
CASA survey
Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2 ...)
- birth/death processes
Ultimately, more constraints needed
- nonperiodic signals- masked cues
- ambiguous signals
-
8/14/2019 MPEG7-1998feb
6/20
Audio Indexing - Dan Ellis 1998feb04 - 6
0.2 0.4 0.6 0.8 1.0 time/s
100
150200
300
400
600
1000
1500
2000
3000
frq/Hzbrn1h.aif
0.2 0.4 0.6 0.8 1.0 time/s
100
150200
300
400
600
1000
1500
2000
3000
frq/Hzbrn1h.fi.aif
CASA1: Periodic pieces
Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by
auto-coincidence
- number of voices is hidden state
Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull single periodic target out of noise
http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnmix.aif/ -
8/14/2019 MPEG7-1998feb
7/20
Audio Indexing - Dan Ellis 1998feb04 - 7
CASA2: Hypothesis systems
Okuno et al. (1994-)
- tracers follow each harmonic + noise agent
- residue-driven: account for whole signal
Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: observations - hypotheses
(a)
TIME
1.0 2.0 3.0 4.0 sec
Buzzer-Alarm
Glass-Clink
Phone-RingSiren-Chirp
420 Hz460 Hz500 Hz
1475 Hz
2230 Hz
2540 Hz
3760 Hz
2350 Hz
1675 Hz
950 Hz
(b)
1.0 2.0 3.0 4.0 sec
TIME
-
8/14/2019 MPEG7-1998feb
8/20
Audio Indexing - Dan Ellis 1998feb04 - 8
CASA3: Other approaches
Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing
statistic e.g. signal independence
HMM decomposition (RK Moore)
- recover combined source states directly
Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
-
8/14/2019 MPEG7-1998feb
9/20
Audio Indexing - Dan Ellis 1998feb04 - 9
input
mixture
signalfeatures
discreteobjects
Front end Objectformation Groupingrules Sourcegroups
inputmixture
signalfeatures
predictionerrors
hypotheses
predictedfeatures
Front end Compare& reconcile
Hypothesismanagement
Predict& combine
Periodiccomponents
Noisecomponents
Prediction-driven CASA
Perception is not direct
but a searchfor plausible hypotheses
Data-driven...
vs. Prediction-driven
Motivations
- detect non-tonal events (noise & clicks)- support restoration illusions...
hooks for high-level knowledge
+ complete explanation, multiple hypotheses,
resynthesis
3
-
8/14/2019 MPEG7-1998feb
10/20
Audio Indexing - Dan Ellis 1998feb04 - 10
1000
2000
4000
f/Hzptshort
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4time/s
Analyzing the continuity illusion
Interrupted tone heard as continuous
- .. if the interruption could be a masker
Data-driven just sees gaps
Prediction-driven can accommodate
- special case or general principle?
http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/ -
8/14/2019 MPEG7-1998feb
11/20
Audio Indexing - Dan Ellis 1998feb04 - 11
Phonemic Restoration (Warren 1970)
Another illusion instance
Inference relies on high-level semantics
Incorporating knowledge into models?
1.2 1.3 1.4 1.5 1.6 1.7 time/s0
500
1000
1500
2000
2500
3000
3500frq/Hz
nsoffee.aif
http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/ -
8/14/2019 MPEG7-1998feb
12/20
Audio Indexing - Dan Ellis 1998feb04 - 12
Subjective ground-truth in mixtures?
Listening tests collect perceived events:
Consistent answers:
S3crash (notcar)
S3closeup car
S31sthorn
S1slam
S4horn1
S4crash
S5Honk
S5Trash can
S5Acceleration
S6double horn
S1honk,honk
S6slam
S6dopplerhorn
S6acceleration
S7gunshot
S7hornS7horn2
S7horn3
S8carhorns
S8carhorns
S8large objectcrash
S8truck engine
S1rev up/passing
S9horn 2
S9doorSlam?
S9horn 5
S10carhorn
S10doorslamming
S10wheels on road
S2firstdouble horn
S2crash
S2horn during crash
S2truck accelerating
200
400
1000
2000
4000
f/HzCity
0 1 2 3 4 5 6 7 8 9
Horn1 (10/10)
Crash (10/10)
Horn2 (5/10)
Truck (7/10)
-
8/14/2019 MPEG7-1998feb
13/20
Audio Indexing - Dan Ellis 1998feb04 - 13
PDCASA example:
City-street ambience
Problems
- error allocation - rating hypotheses
- source hierarchy - resynthesis
70
60
50
40
dB
200400
100020004000
f/HzNoise1
200400
10002000
4000
f/HzNoise2,Click1
200400
10002000
4000
f/HzCity
0 1 2 3 4 5 6 7 8 9
50100
200400
1000
Horn1 (10/10)
Crash (10/10)
Horn2 (5/10)
Truck (7/10)
Horn3 (5/10)
Squeal (6/10)
Horn4 (8/10)Horn5 (10/10)
0 1 2 3 4 5 6 7 8 9time/s
200400
10002000
4000
f/Hz Wefts14
50100
200400
1000
Weft5 Wefts6,7 Weft8 Wefts912
http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://cityresyn.aif/http://cityorig.aif/ -
8/14/2019 MPEG7-1998feb
14/20
Audio Indexing - Dan Ellis 1998feb04 - 14
Speech recognition
& sound mixtures
Conventional speech recognition:
- signal assumed entirely speech
- find valid labelling by discrete labels
- class models from training data Some problems:
- need to ignore lexically-irrelevant variation
(microphone, voice pitch etc.)
- compact feature space everything speech-like
Very fragile to nonspeech, background
- scene-analysis methods very attractive...
4
Feature
extraction
Phoneme
classifier
HMM
decodersignal wordslow-dim.features
phonemeprobabilities
-
8/14/2019 MPEG7-1998feb
15/20
Audio Indexing - Dan Ellis 1998feb04 - 15
CASA for speech recognition
Data-driven: CASA as preprocessor- problems with holes (but: Cooke, Okuno)
- doesnt exploit knowledge of speech structure
Prediction-driven: speech as component
- same reconciliation of speech hypotheses
- need to express predictions in signal domain
inputmixture
Front end Compare& reconcile
Hypothesismanagement
Predict& combine
Periodiccomponents
Noisecomponents
Speechcomponents
-
8/14/2019 MPEG7-1998feb
16/20
Audio Indexing - Dan Ellis 1998feb04 - 16
Example of speech & nonspeech
Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
5
10
15
f/Bark(a) Clap (clap8kenv.pf)
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
(b) Speech plus clap (223clenv.pf)
h# w n ay n tcl t uw f ay ah s ay ow h# v s eh v ah n h#
h# n ay ntcl
t uw f ay v ow h# s eh v ax n
nine two five oh seven
(d) Reconstruction from labels alone (223clrenvG.pf)
(e) Slowlyvarying portion of original (223clenvg.pf)
(f) Predicted speech element ( = (d)+(e) ) (223clrenv.pf)
(g) Click5 from nonspeech analysis (justclick.pf)
(h) Spurious elements from nonspeech analysis (nonclicks.pf)
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
20
40
60
dB
(c) Recognizer output
http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl.aif/ -
8/14/2019 MPEG7-1998feb
17/20
Audio Indexing - Dan Ellis 1998feb04 - 17
Implications for content analysis:
Using CASA to index soundtracks
What are the objects in a soundtrack?
- subjective definition need auditory model
Segmentation vs. classification
- low-level cues locate events
- higher-level learned knowledge to give
semantic label (footstep, crash)
... AI complete?
But: hard to separate
- illusion phenomena suggest auditory
organizationdepends on interpretation
5
200
400
1000
2000
4000
f/Hzcity22
0 1 2 3 4 5 6 7 8 9time/s
70
60
50
40
dB
car noise
horndoor crash
horn
yell
-
8/14/2019 MPEG7-1998feb
18/20
Audio Indexing - Dan Ellis 1998feb04 - 18
Using speech recognition for indexing
Active research area:Access to news broadcast databases
- e.g. Informedia (CMU), ThisL (BBC+...)
- use LV-CSR to transcribe,
then text-retrieval to find- 30-40% word error rate, still works OK
Several systems at NIST TREC workshop
Tricks to ignore nonspeech/poor speech
-
8/14/2019 MPEG7-1998feb
19/20
Audio Indexing - Dan Ellis 1998feb04 - 19
Open issues in automatic indexing
How to do ASA?
Explanation/description hierarchy
- PDCASA: generic primitives
+ constraining hierarchy
- subjective & task-dependent
Classification
- connecting subjective & objective properties
finding subjective invariants, prominence- representation of sound-object classes
Resynthesis?
- a good description should be adequate
- provided in PDCASA, but low quality- requires good knowledge-based constraints
-
8/14/2019 MPEG7-1998feb
20/20
Audio Indexing - Dan Ellis 1998feb04 - 20
Conclusions
Auditory organization is required in real
environments
We dont know how listeners do it!
- plenty of modeling interest Prediction-reconciliation can account for
illusions
- use knowledge when signal is inadequate
- important in a wider range of circumstances?
Speech recognizers are a good source of
knowledge
Automatic indexing implies synthetic listener
- need to solve a lot of modeling issues
6