for speech, sound & multimediadpwe/research/talks/soundcba-2000-04.pdf · speech terms asr +...

25
Content-based analysis for sound - Dan Ellis 2000-04-04 - 1 Content-based analysis and indexing for speech, sound & multimedia Dan Ellis International Computer Science Institute, Berkeley CA <[email protected]> Outline About content-based indexing Related work An overview of the project Some specific pieces Future plans 1 2 3 4 5

Upload: others

Post on 30-Oct-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 1

Content-based analysis and indexingfor speech, sound & multimedia

CA

Content-based analysis for sound - Dan Ellis 2000-

Dan EllisInternational Computer Science Institute, Berkeley

<[email protected]>

Outline

About content-based indexing

Related work

An overview of the project

Some specific pieces

Future plans

1

2

3

4

5

Page 2: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 2

About content-based indexing

rchives

1

?

User

Content-based analysis for sound - Dan Ellis 2000-

• Problem: Automating search in large a

• “Information retrieval” (IR)

• E.g.:- searching the web- searching broadcast archives- automatic monitoring...

Vast archive

Searchengine

Page 3: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 3

Varieties of Information Retrieval (IR)

res + video)

space”)

tions

ty metrics

ty metrics

rity

ping

Content-based analysis for sound - Dan Ellis 2000-

• Many different search situations:

- plus combinations (e.g. sound mixtu

Archive Queries Technology

Text terms Text IR (tf • idf, “term

Speech terms ASR + Text IR

Multimedia terms Text IR on annota

Images, video

examples/sketches

Global image similari

Soundexamples/categories

Global sound similari

Sound mixtures

examples object-based simila

terms term-to-feature map

Page 4: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 4

Content analysis of sound mixtures

jects

nce

time/s

frq/Hz

Content-based analysis for sound - Dan Ellis 2000-

• Use local features to find individual ob

• Objects must mirror subjective experie

0 2 4 6 8 10 120

2000

4000

Voice (evil)

Stab

Rumble Strings

Choir

Voice (pleasant)

Voice (evil)

Stab

Rumble Strings

Choir

Voice (pleasant)

InformationRetrieval

Analysis

Query

Page 5: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 5

Outline

Content-based analysis for sound - Dan Ellis 2000-

About content-based retrieval

Related work- Text-based IR- Spoken document retrieval- Image and video retrieval- Multimedia systems- Sound effects indexing- MPEG7 ‘metadata’

An overview of the project

Some specific pieces

Future plans

1

2

3

4

5

Page 6: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 6

Text-based IR

quency

rchable

eal terms

2

Content-based analysis for sound - Dan Ellis 2000-

• e.g. Web search engines

• Metric:term frequency • inverse document fre- emphasizes unusual words- distances in Euclidean ‘term space’

• Decomposition of documents into seaatoms is (almost) trivial- words are easily isolated, close to id- some problems, hence stemming

Page 7: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 7

Spoken document retrieval

ings:n

ting

0.5

0.4

put!

Content-based analysis for sound - Dan Ellis 2000-

• Information retrieval for speech recordConvert to text with speech recognitio- e.g. Thisl project (news broadcasts)

• Speech recognition errors not the limifactor- TREC-98 results: average precision

• Output should be original audio- best not to show the recognizer out

Page 8: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 8

Image and video retrieval

(IBM 1995)

quency?

Content-based analysis for sound - Dan Ellis 2000-

• e.g. Query By Image Content (QBIC) - templates, color, texture

• VideoQ (Columbia 1999)- sketching for images and video- color, shape, size, position, motion

• Image ‘objects’?- analog of terms in text- acquired by unsupervised clustering- object frequency • inverse image fre

Page 9: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

04-04 - 9

Multimedia systems

• Informedia (CMU, 1996-)

+ IR

s shows) sources

Content-based analysis for sound - Dan Ellis 2000-

- ASR + video cuts + OCR of screen

• AT&T multilevel structuring- exploit knowledge of genre (TV new- multiple special-purpose information

Page 10: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 10

Sound effects indexing

imensionsixtures

Content-based analysis for sound - Dan Ellis 2000-0

• Muscle Fish “SoundFisher”- browser for sound-effects archives- define multiple ‘perceptual’ feature d- no attempt to separate objects in m

Page 11: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 11

MPEG-7 ‘Metadata’

pression

search

metadata

:scription

tegories

nt

Content-based analysis for sound - Dan Ellis 2000-0

• MPEG is known for audio/video comstandards; also developing standards for use in and indexing

• MPEG-7 will be a standard format for Well-defined categories for content de

• Mostly just framework, some actual ca

• How to derive descriptors from conteis not specified

Page 12: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 12

Outline

Content-based analysis for sound - Dan Ellis 2000-0

About content-based retrieval

Related work

An overview of the project- Boundaries & structuring- Query forms- Summarization- Evaluation

Some specific pieces

Future plans

1

2

3

4

5

Page 13: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 13

Audio-video content-based retrieval:

tion

3

alogic

results

symbolic

IRengine

Query

Content-based analysis for sound - Dan Ellis 2000-0

System overview

• Fusion of audio + video (+...?) informa

• Different query forms

audio

video

Audio-videodata an

Speechprocessing

Nonspeechanalysis

Videofeatures

A-Vmatch

Boundaries &structuring

Indexingterms

Summarization

Page 14: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 14

Boundaries and structuring

change

s

n , sports)

nts

Content-based analysis for sound - Dan Ellis 2000-0

• Multimedia documents lack structure

• Changes relatively easy to detect- if we don’t have to characterize the

• Audio and video are complementary

• Boundaries define structure e.g. storie

• May be able to identify genre based ostructure pattern (TV, news, interviews- notice repetition of particular segme

(title sequences, commercials etc.)

Page 15: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 15

Forms of query

tures?ies

ments

h are salient

Content-based analysis for sound - Dan Ellis 2000-0

• Traditional term-based- mapping of terms to audio/video fea- ... plus all the usual lexical ambiguit- literal vs. thematic terms

• Similarity e.g. by example- easy once you have initial hits/docu- but: which aspects of the example?

• User-provided example e.g. a ‘sketch’- better idea of which parts of a sketc

and which to ignore- audio sketches?- spoken words?

Page 16: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 16

Summarization of results

ience &

ues

Content-based analysis for sound - Dan Ellis 2000-0

• Multimedia ‘hits’ are hard to present- multi-media → many aspects- some are intrinsically temporal

• Video presentation- salient stills/story board- sped-up video

• Spoken content- textual summarization based on sal

recognizer confidence- audio selection based on prosodic c

• Audio content- choosing ‘distinctive’ events- visual representation?- timescale modification?

Page 17: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 17

Evaluation

lly is

queries

lly ask?

Content-based analysis for sound - Dan Ellis 2000-0

• Multimedia IR is an emerging field- no consensus on what the task rea- no common evaluation metrics

• Evaluation is critical- sanity check on progress- affords ‘fundability’

• How to do it?- quantitative tests e.g. datasets and - qualitative user evaluation

• Prototype demos- de rigueur...- also provide input to design:

what kind of queries will people rea

Page 18: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 18

Outline

Content-based analysis for sound - Dan Ellis 2000-0

About content-based retrieval

Related work

An overview of the project

Some specific pieces- Object-based audio analysis- Speech recognition for retrieval- Music processing- Machine learning of terms

Future plans

1

2

3

4

5

Page 19: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 19

Object-based audio analysis:lysis

quency

set &c.)

..

4

1

1.5 2 2.5

Content-based analysis for sound - Dan Ellis 2000-0

Computational Auditory Scene Ana

• Deconstructing sound mixtures

- representation of energy in time-fre- formation of atomic elements- grouping by common properties (on

• Ambiguous/noisy sounds need more.- top-down constraints- multiple alternative hypotheses

freq

/ H

z

1 1.5 2 2.50

1000

2000

3000

4000

5000

time / sec1 1.5 2 2.5

0

1000

2000

3000

4000

5000

10

1000

2000

3000

4000

5000

Page 20: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 20

Retrieval of sound objects

res:

:

d subsets

Results

Seach/comparison

Results

Content-based analysis for sound - Dan Ellis 2000-0

• Muscle Fish system uses global featu

• Mixtures → need elements & objects

- features are calculated over groupe

Segmentfeatureanalysis

Sound segmentdatabase

Segmentfeatureanalysis

Seach/comparison

Query example

Feature vectors

Genericelementanalysis

Continuous audioarchive

Genericelementanalysis

Query example

Element representations

Objectformation

(Objectformation)

Word-to-classmapping

Objects + properties

Properties aloneSymbolic query

rushing water

Page 21: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 21

Speech recognition for retrieval

otheses

Content-based analysis for sound - Dan Ellis 2000-0

• Words are not enough;Confidence-tagged alternate word hyp

• Other useful information:- speaker change detection- speaker characterization- phrasing & timing- prosodic features

• Integration with other analyses- segmentation for adaptation- nonspeech events to ignore- video-derived information?

Page 22: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 22

Music processing

se

on

Content-based analysis for sound - Dan Ellis 2000-0

• Music is a highly-structured special ca

• Need to detect it at the least

• Algorithms to extract special informati- melody, harmony, rhythm- instrument identification- genre classification

• Body of existing research...

Page 23: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 23

Machine learning of patterns & terms

d training )? patterns

allel:seaning’ in

very

mples

ype tion etc.)apping

Content-based analysis for sound - Dan Ellis 2000-0

• What can you do with a large unlabeleset (e.g. multimedia clips from the web- bootstrap learning: look for common- have to learn generalizations in par

e.g. self-organizing maps, EM HMM- post-filtering by humans may find ‘m

clusters

• Associated text annotations provide asmall amount of labeling- .. but for a very large number of exa

– sufficient to obtain purchase?- maximize label utility through NLP-t

operations (expansion, disambigua- goal is automatic term-to-feature m

for term-based content queries

Page 24: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 24

Outline

Content-based analysis for sound - Dan Ellis 2000-0

About content-based retrieval

Background technologies

An overview of the project

Some specific pieces

Future plans

1

2

3

4

5

Page 25: for speech, sound & multimediadpwe/research/talks/SoundCBA-2000-04.pdf · Speech terms ASR + Text IR Multimedia terms Text IR on annotations Images, video examples/ sketches Global

4-04 - 25

Future plans

ith Zakhor)

tures

sis

5

Content-based analysis for sound - Dan Ellis 2000-0

• Obtain funding:- Thisl follow-on with the EU?- NSF: sound IR, also audio-video (w- other sources?

• Choose a task and an archive- multimedia clips on the web- existing archives e.g. taped UCB lec- speech/broadcast archives- meeting recorder

• Begin developing features- computational auditory scene analy- .. need to apply to large corpora

• Online demo ASAP?- to help clarify the problem