for speech, sound & multimediadpwe/research/talks/soundcba-2000-04.pdf · speech terms asr +...
TRANSCRIPT
04-04 - 1
Content-based analysis and indexingfor speech, sound & multimedia
CA
Content-based analysis for sound - Dan Ellis 2000-
Dan EllisInternational Computer Science Institute, Berkeley
Outline
About content-based indexing
Related work
An overview of the project
Some specific pieces
Future plans
1
2
3
4
5
04-04 - 2
About content-based indexing
rchives
1
?
User
Content-based analysis for sound - Dan Ellis 2000-
• Problem: Automating search in large a
• “Information retrieval” (IR)
• E.g.:- searching the web- searching broadcast archives- automatic monitoring...
Vast archive
Searchengine
04-04 - 3
Varieties of Information Retrieval (IR)
res + video)
space”)
tions
ty metrics
ty metrics
rity
ping
Content-based analysis for sound - Dan Ellis 2000-
• Many different search situations:
- plus combinations (e.g. sound mixtu
Archive Queries Technology
Text terms Text IR (tf • idf, “term
Speech terms ASR + Text IR
Multimedia terms Text IR on annota
Images, video
examples/sketches
Global image similari
Soundexamples/categories
Global sound similari
Sound mixtures
examples object-based simila
terms term-to-feature map
04-04 - 4
Content analysis of sound mixtures
jects
nce
time/s
frq/Hz
Content-based analysis for sound - Dan Ellis 2000-
• Use local features to find individual ob
• Objects must mirror subjective experie
0 2 4 6 8 10 120
2000
4000
Voice (evil)
Stab
Rumble Strings
Choir
Voice (pleasant)
Voice (evil)
Stab
Rumble Strings
Choir
Voice (pleasant)
InformationRetrieval
Analysis
Query
04-04 - 5
Outline
Content-based analysis for sound - Dan Ellis 2000-
About content-based retrieval
Related work- Text-based IR- Spoken document retrieval- Image and video retrieval- Multimedia systems- Sound effects indexing- MPEG7 ‘metadata’
An overview of the project
Some specific pieces
Future plans
1
2
3
4
5
04-04 - 6
Text-based IR
quency
rchable
eal terms
2
Content-based analysis for sound - Dan Ellis 2000-
• e.g. Web search engines
• Metric:term frequency • inverse document fre- emphasizes unusual words- distances in Euclidean ‘term space’
• Decomposition of documents into seaatoms is (almost) trivial- words are easily isolated, close to id- some problems, hence stemming
04-04 - 7
Spoken document retrieval
ings:n
ting
0.5
→
0.4
put!
Content-based analysis for sound - Dan Ellis 2000-
• Information retrieval for speech recordConvert to text with speech recognitio- e.g. Thisl project (news broadcasts)
• Speech recognition errors not the limifactor- TREC-98 results: average precision
• Output should be original audio- best not to show the recognizer out
04-04 - 8
Image and video retrieval
(IBM 1995)
quency?
Content-based analysis for sound - Dan Ellis 2000-
• e.g. Query By Image Content (QBIC) - templates, color, texture
• VideoQ (Columbia 1999)- sketching for images and video- color, shape, size, position, motion
• Image ‘objects’?- analog of terms in text- acquired by unsupervised clustering- object frequency • inverse image fre
04-04 - 9
Multimedia systems
• Informedia (CMU, 1996-)
+ IR
s shows) sources
Content-based analysis for sound - Dan Ellis 2000-
- ASR + video cuts + OCR of screen
• AT&T multilevel structuring- exploit knowledge of genre (TV new- multiple special-purpose information
4-04 - 10
Sound effects indexing
imensionsixtures
Content-based analysis for sound - Dan Ellis 2000-0
• Muscle Fish “SoundFisher”- browser for sound-effects archives- define multiple ‘perceptual’ feature d- no attempt to separate objects in m
4-04 - 11
MPEG-7 ‘Metadata’
pression
search
metadata
:scription
tegories
nt
Content-based analysis for sound - Dan Ellis 2000-0
• MPEG is known for audio/video comstandards; also developing standards for use in and indexing
• MPEG-7 will be a standard format for Well-defined categories for content de
• Mostly just framework, some actual ca
• How to derive descriptors from conteis not specified
4-04 - 12
Outline
Content-based analysis for sound - Dan Ellis 2000-0
About content-based retrieval
Related work
An overview of the project- Boundaries & structuring- Query forms- Summarization- Evaluation
Some specific pieces
Future plans
1
2
3
4
5
4-04 - 13
Audio-video content-based retrieval:
tion
3
alogic
results
symbolic
IRengine
Query
Content-based analysis for sound - Dan Ellis 2000-0
System overview
• Fusion of audio + video (+...?) informa
• Different query forms
audio
video
Audio-videodata an
Speechprocessing
Nonspeechanalysis
Videofeatures
A-Vmatch
Boundaries &structuring
Indexingterms
Summarization
4-04 - 14
Boundaries and structuring
change
s
n , sports)
nts
Content-based analysis for sound - Dan Ellis 2000-0
• Multimedia documents lack structure
• Changes relatively easy to detect- if we don’t have to characterize the
• Audio and video are complementary
• Boundaries define structure e.g. storie
• May be able to identify genre based ostructure pattern (TV, news, interviews- notice repetition of particular segme
(title sequences, commercials etc.)
4-04 - 15
Forms of query
tures?ies
ments
h are salient
Content-based analysis for sound - Dan Ellis 2000-0
• Traditional term-based- mapping of terms to audio/video fea- ... plus all the usual lexical ambiguit- literal vs. thematic terms
• Similarity e.g. by example- easy once you have initial hits/docu- but: which aspects of the example?
• User-provided example e.g. a ‘sketch’- better idea of which parts of a sketc
and which to ignore- audio sketches?- spoken words?
4-04 - 16
Summarization of results
ience &
ues
Content-based analysis for sound - Dan Ellis 2000-0
• Multimedia ‘hits’ are hard to present- multi-media → many aspects- some are intrinsically temporal
• Video presentation- salient stills/story board- sped-up video
• Spoken content- textual summarization based on sal
recognizer confidence- audio selection based on prosodic c
• Audio content- choosing ‘distinctive’ events- visual representation?- timescale modification?
4-04 - 17
Evaluation
lly is
queries
lly ask?
Content-based analysis for sound - Dan Ellis 2000-0
• Multimedia IR is an emerging field- no consensus on what the task rea- no common evaluation metrics
• Evaluation is critical- sanity check on progress- affords ‘fundability’
• How to do it?- quantitative tests e.g. datasets and - qualitative user evaluation
• Prototype demos- de rigueur...- also provide input to design:
what kind of queries will people rea
4-04 - 18
Outline
Content-based analysis for sound - Dan Ellis 2000-0
About content-based retrieval
Related work
An overview of the project
Some specific pieces- Object-based audio analysis- Speech recognition for retrieval- Music processing- Machine learning of terms
Future plans
1
2
3
4
5
4-04 - 19
Object-based audio analysis:lysis
quency
set &c.)
..
4
1
1.5 2 2.5
Content-based analysis for sound - Dan Ellis 2000-0
Computational Auditory Scene Ana
• Deconstructing sound mixtures
- representation of energy in time-fre- formation of atomic elements- grouping by common properties (on
• Ambiguous/noisy sounds need more.- top-down constraints- multiple alternative hypotheses
freq
/ H
z
1 1.5 2 2.50
1000
2000
3000
4000
5000
time / sec1 1.5 2 2.5
0
1000
2000
3000
4000
5000
10
1000
2000
3000
4000
5000
4-04 - 20
Retrieval of sound objects
res:
:
d subsets
Results
Seach/comparison
Results
Content-based analysis for sound - Dan Ellis 2000-0
• Muscle Fish system uses global featu
• Mixtures → need elements & objects
- features are calculated over groupe
Segmentfeatureanalysis
Sound segmentdatabase
Segmentfeatureanalysis
Seach/comparison
Query example
Feature vectors
Genericelementanalysis
Continuous audioarchive
Genericelementanalysis
Query example
Element representations
Objectformation
(Objectformation)
Word-to-classmapping
Objects + properties
Properties aloneSymbolic query
rushing water
4-04 - 21
Speech recognition for retrieval
otheses
Content-based analysis for sound - Dan Ellis 2000-0
• Words are not enough;Confidence-tagged alternate word hyp
• Other useful information:- speaker change detection- speaker characterization- phrasing & timing- prosodic features
• Integration with other analyses- segmentation for adaptation- nonspeech events to ignore- video-derived information?
4-04 - 22
Music processing
se
on
Content-based analysis for sound - Dan Ellis 2000-0
• Music is a highly-structured special ca
• Need to detect it at the least
• Algorithms to extract special informati- melody, harmony, rhythm- instrument identification- genre classification
• Body of existing research...
4-04 - 23
Machine learning of patterns & terms
d training )? patterns
allel:seaning’ in
very
mples
ype tion etc.)apping
Content-based analysis for sound - Dan Ellis 2000-0
• What can you do with a large unlabeleset (e.g. multimedia clips from the web- bootstrap learning: look for common- have to learn generalizations in par
e.g. self-organizing maps, EM HMM- post-filtering by humans may find ‘m
clusters
• Associated text annotations provide asmall amount of labeling- .. but for a very large number of exa
– sufficient to obtain purchase?- maximize label utility through NLP-t
operations (expansion, disambigua- goal is automatic term-to-feature m
for term-based content queries
4-04 - 24
Outline
Content-based analysis for sound - Dan Ellis 2000-0
About content-based retrieval
Background technologies
An overview of the project
Some specific pieces
Future plans
1
2
3
4
5
4-04 - 25
Future plans
ith Zakhor)
tures
sis
5
Content-based analysis for sound - Dan Ellis 2000-0
• Obtain funding:- Thisl follow-on with the EU?- NSF: sound IR, also audio-video (w- other sources?
• Choose a task and an archive- multimedia clips on the web- existing archives e.g. taped UCB lec- speech/broadcast archives- meeting recorder
• Begin developing features- computational auditory scene analy- .. need to apply to large corpora
• Online demo ASAP?- to help clarify the problem