mpeg7-1998feb

Upload: mcastilho

Post on 30-May-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/14/2019 MPEG7-1998feb

    1/20

    Audio Indexing - Dan Ellis 1998feb04 - 1

    Automatic audio analysis

    for content description & indexing

    Dan Ellis

    International Computer Science Institute, Berkeley CA

    Outline

    Auditory Scene Analysis (ASA)

    Computational ASA (CASA)

    Prediction-driven CASA

    Speech recognition & sound mixtures

    Implications for content analysis

    1

    2

    3

    4

    5

  • 8/14/2019 MPEG7-1998feb

    2/20

    Audio Indexing - Dan Ellis 1998feb04 - 2

    200

    400

    1000

    2000

    4000

    f/Hzcity22

    0 1 2 3 4 5 6 7 8 9time/s

    70

    60

    50

    40

    dB

    Auditory Scene Analysis

    The organization of complex sound scenes

    according to their inferred sources

    Sounds rarely occur in isolation

    - organization required for useful information

    Human audition is very effective

    - unexpectedly difficult to model

    Correct analysis defined by goal

    - source shows independence, continuity

    ecological constraints enable organization

    1

    http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/
  • 8/14/2019 MPEG7-1998feb

    3/20

    Audio Indexing - Dan Ellis 1998feb04 - 3

    Psychology of ASA

    Extensive experimental research

    - organization of simple pieces

    (sinusoids & white noise)

    - streaming, pitch perception, double vowels

    Auditory Scene Analysis [Bregman 1990]

    grouping rules

    - common onset/offset/modulation,

    harmonicity, spatial location

    Debated... (Darwin, Carlyon, Moore, Remez)

    (from

    Darwin 1996)

  • 8/14/2019 MPEG7-1998feb

    4/20

    Audio Indexing - Dan Ellis 1998feb04 - 4

    Computational Auditory Scene Analysis

    (CASA)

    Automatic sound organization?

    - convert an undifferentiated signal into a

    description in terms of different sources

    Translate psych. rules into programs?

    - representations to reveal common onset,

    harmonicity ...

    Motivations & Applications

    - its a puzzle: new processing principles?

    - real-world interactive systems (speech, robots)

    - hearing prostheses (enhancement, description)

    - advanced processing (remixing)

    - multimedia indexing...

    2

    200

    400

    1000

    2000

    4000

    f/Hzcity22

    0 1 2 3 4 5 6 7 8 9time/s

    70

    60

    50

    40

    dB

    car noise

    horndoor crash

    horn

    yell

  • 8/14/2019 MPEG7-1998feb

    5/20

    Audio Indexing - Dan Ellis 1998feb04 - 5

    CASA survey

    Early work on co-channel speech

    - listeners benefit from pitch difference

    - algorithms for separating periodicities

    Utterance-sized signals need more

    - cannot predict number of signals (0, 1, 2 ...)

    - birth/death processes

    Ultimately, more constraints needed

    - nonperiodic signals- masked cues

    - ambiguous signals

  • 8/14/2019 MPEG7-1998feb

    6/20

    Audio Indexing - Dan Ellis 1998feb04 - 6

    0.2 0.4 0.6 0.8 1.0 time/s

    100

    150200

    300

    400

    600

    1000

    1500

    2000

    3000

    frq/Hzbrn1h.aif

    0.2 0.4 0.6 0.8 1.0 time/s

    100

    150200

    300

    400

    600

    1000

    1500

    2000

    3000

    frq/Hzbrn1h.fi.aif

    CASA1: Periodic pieces

    Weintraub 1985

    - separate male & female voices

    - find periodicities in each frequency channel by

    auto-coincidence

    - number of voices is hidden state

    Cooke & Brown (1991-3)

    - divide time-frequency plane into elements

    - apply grouping rules to form sources

    - pull single periodic target out of noise

    http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnmix.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnfilt.aif/http://brnmix.aif/
  • 8/14/2019 MPEG7-1998feb

    7/20

    Audio Indexing - Dan Ellis 1998feb04 - 7

    CASA2: Hypothesis systems

    Okuno et al. (1994-)

    - tracers follow each harmonic + noise agent

    - residue-driven: account for whole signal

    Klassner 1996

    - search for a combination of templates

    - high-level hypotheses permit front-end tuning

    Ellis 1996

    - model for events perceived in dense scenes

    - prediction-driven: observations - hypotheses

    (a)

    TIME

    1.0 2.0 3.0 4.0 sec

    Buzzer-Alarm

    Glass-Clink

    Phone-RingSiren-Chirp

    420 Hz460 Hz500 Hz

    1475 Hz

    2230 Hz

    2540 Hz

    3760 Hz

    2350 Hz

    1675 Hz

    950 Hz

    (b)

    1.0 2.0 3.0 4.0 sec

    TIME

  • 8/14/2019 MPEG7-1998feb

    8/20

    Audio Indexing - Dan Ellis 1998feb04 - 8

    CASA3: Other approaches

    Blind source separation (Bell & Sejnowski)

    - find exact separation parameters by maximizing

    statistic e.g. signal independence

    HMM decomposition (RK Moore)

    - recover combined source states directly

    Neural models (Malsburg, Wang & Brown)

    - avoid implausible AI methods (search, lists)

    - oscillators substitute for iteration?

  • 8/14/2019 MPEG7-1998feb

    9/20

    Audio Indexing - Dan Ellis 1998feb04 - 9

    input

    mixture

    signalfeatures

    discreteobjects

    Front end Objectformation Groupingrules Sourcegroups

    inputmixture

    signalfeatures

    predictionerrors

    hypotheses

    predictedfeatures

    Front end Compare& reconcile

    Hypothesismanagement

    Predict& combine

    Periodiccomponents

    Noisecomponents

    Prediction-driven CASA

    Perception is not direct

    but a searchfor plausible hypotheses

    Data-driven...

    vs. Prediction-driven

    Motivations

    - detect non-tonal events (noise & clicks)- support restoration illusions...

    hooks for high-level knowledge

    + complete explanation, multiple hypotheses,

    resynthesis

    3

  • 8/14/2019 MPEG7-1998feb

    10/20

    Audio Indexing - Dan Ellis 1998feb04 - 10

    1000

    2000

    4000

    f/Hzptshort

    0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4time/s

    Analyzing the continuity illusion

    Interrupted tone heard as continuous

    - .. if the interruption could be a masker

    Data-driven just sees gaps

    Prediction-driven can accommodate

    - special case or general principle?

    http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/http://ptvshort.aif/
  • 8/14/2019 MPEG7-1998feb

    11/20

    Audio Indexing - Dan Ellis 1998feb04 - 11

    Phonemic Restoration (Warren 1970)

    Another illusion instance

    Inference relies on high-level semantics

    Incorporating knowledge into models?

    1.2 1.3 1.4 1.5 1.6 1.7 time/s0

    500

    1000

    1500

    2000

    2500

    3000

    3500frq/Hz

    nsoffee.aif

    http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/http://nsoffee.aif/
  • 8/14/2019 MPEG7-1998feb

    12/20

    Audio Indexing - Dan Ellis 1998feb04 - 12

    Subjective ground-truth in mixtures?

    Listening tests collect perceived events:

    Consistent answers:

    S3crash (notcar)

    S3closeup car

    S31sthorn

    S1slam

    S4horn1

    S4crash

    S5Honk

    S5Trash can

    S5Acceleration

    S6double horn

    S1honk,honk

    S6slam

    S6dopplerhorn

    S6acceleration

    S7gunshot

    S7hornS7horn2

    S7horn3

    S8carhorns

    S8carhorns

    S8large objectcrash

    S8truck engine

    S1rev up/passing

    S9horn 2

    S9doorSlam?

    S9horn 5

    S10carhorn

    S10doorslamming

    S10wheels on road

    S2firstdouble horn

    S2crash

    S2horn during crash

    S2truck accelerating

    200

    400

    1000

    2000

    4000

    f/HzCity

    0 1 2 3 4 5 6 7 8 9

    Horn1 (10/10)

    Crash (10/10)

    Horn2 (5/10)

    Truck (7/10)

  • 8/14/2019 MPEG7-1998feb

    13/20

    Audio Indexing - Dan Ellis 1998feb04 - 13

    PDCASA example:

    City-street ambience

    Problems

    - error allocation - rating hypotheses

    - source hierarchy - resynthesis

    70

    60

    50

    40

    dB

    200400

    100020004000

    f/HzNoise1

    200400

    10002000

    4000

    f/HzNoise2,Click1

    200400

    10002000

    4000

    f/HzCity

    0 1 2 3 4 5 6 7 8 9

    50100

    200400

    1000

    Horn1 (10/10)

    Crash (10/10)

    Horn2 (5/10)

    Truck (7/10)

    Horn3 (5/10)

    Squeal (6/10)

    Horn4 (8/10)Horn5 (10/10)

    0 1 2 3 4 5 6 7 8 9time/s

    200400

    10002000

    4000

    f/Hz Wefts14

    50100

    200400

    1000

    Weft5 Wefts6,7 Weft8 Wefts912

    http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityorig.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://cityresyn.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://citywefts.aif/http://cityresyn.aif/http://cityorig.aif/
  • 8/14/2019 MPEG7-1998feb

    14/20

    Audio Indexing - Dan Ellis 1998feb04 - 14

    Speech recognition

    & sound mixtures

    Conventional speech recognition:

    - signal assumed entirely speech

    - find valid labelling by discrete labels

    - class models from training data Some problems:

    - need to ignore lexically-irrelevant variation

    (microphone, voice pitch etc.)

    - compact feature space everything speech-like

    Very fragile to nonspeech, background

    - scene-analysis methods very attractive...

    4

    Feature

    extraction

    Phoneme

    classifier

    HMM

    decodersignal wordslow-dim.features

    phonemeprobabilities

  • 8/14/2019 MPEG7-1998feb

    15/20

    Audio Indexing - Dan Ellis 1998feb04 - 15

    CASA for speech recognition

    Data-driven: CASA as preprocessor- problems with holes (but: Cooke, Okuno)

    - doesnt exploit knowledge of speech structure

    Prediction-driven: speech as component

    - same reconciliation of speech hypotheses

    - need to express predictions in signal domain

    inputmixture

    Front end Compare& reconcile

    Hypothesismanagement

    Predict& combine

    Periodiccomponents

    Noisecomponents

    Speechcomponents

  • 8/14/2019 MPEG7-1998feb

    16/20

    Audio Indexing - Dan Ellis 1998feb04 - 16

    Example of speech & nonspeech

    Problems:

    - undoing classification & normalization

    - finding a starting hypothesis

    - granularity of integration

    5

    10

    15

    f/Bark(a) Clap (clap8kenv.pf)

    0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

    (b) Speech plus clap (223clenv.pf)

    h# w n ay n tcl t uw f ay ah s ay ow h# v s eh v ah n h#

    h# n ay ntcl

    t uw f ay v ow h# s eh v ax n

    nine two five oh seven

    (d) Reconstruction from labels alone (223clrenvG.pf)

    (e) Slowlyvarying portion of original (223clenvg.pf)

    (f) Predicted speech element ( = (d)+(e) ) (223clrenv.pf)

    (g) Click5 from nonspeech analysis (justclick.pf)

    (h) Spurious elements from nonspeech analysis (nonclicks.pf)

    0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

    20

    40

    60

    dB

    (c) Recognizer output

    http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl-renv.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl.aif/http://223cl-renv.aif/http://223cl.aif/
  • 8/14/2019 MPEG7-1998feb

    17/20

    Audio Indexing - Dan Ellis 1998feb04 - 17

    Implications for content analysis:

    Using CASA to index soundtracks

    What are the objects in a soundtrack?

    - subjective definition need auditory model

    Segmentation vs. classification

    - low-level cues locate events

    - higher-level learned knowledge to give

    semantic label (footstep, crash)

    ... AI complete?

    But: hard to separate

    - illusion phenomena suggest auditory

    organizationdepends on interpretation

    5

    200

    400

    1000

    2000

    4000

    f/Hzcity22

    0 1 2 3 4 5 6 7 8 9time/s

    70

    60

    50

    40

    dB

    car noise

    horndoor crash

    horn

    yell

  • 8/14/2019 MPEG7-1998feb

    18/20

    Audio Indexing - Dan Ellis 1998feb04 - 18

    Using speech recognition for indexing

    Active research area:Access to news broadcast databases

    - e.g. Informedia (CMU), ThisL (BBC+...)

    - use LV-CSR to transcribe,

    then text-retrieval to find- 30-40% word error rate, still works OK

    Several systems at NIST TREC workshop

    Tricks to ignore nonspeech/poor speech

  • 8/14/2019 MPEG7-1998feb

    19/20

    Audio Indexing - Dan Ellis 1998feb04 - 19

    Open issues in automatic indexing

    How to do ASA?

    Explanation/description hierarchy

    - PDCASA: generic primitives

    + constraining hierarchy

    - subjective & task-dependent

    Classification

    - connecting subjective & objective properties

    finding subjective invariants, prominence- representation of sound-object classes

    Resynthesis?

    - a good description should be adequate

    - provided in PDCASA, but low quality- requires good knowledge-based constraints

  • 8/14/2019 MPEG7-1998feb

    20/20

    Audio Indexing - Dan Ellis 1998feb04 - 20

    Conclusions

    Auditory organization is required in real

    environments

    We dont know how listeners do it!

    - plenty of modeling interest Prediction-reconciliation can account for

    illusions

    - use knowledge when signal is inadequate

    - important in a wider range of circumstances?

    Speech recognizers are a good source of

    knowledge

    Automatic indexing implies synthetic listener

    - need to solve a lot of modeling issues

    6