Outline
• Multimedia Content Description Interface (MPEG-7)
• Video content features
• Spoken content features
• Multimedia indexing and retrieval
• Multimedia summary and filtering
• Other applications
MPEG-7 Overview
• Large amounts of digital content are available
• It is easy to create, digitize, and distribute audio-visual content
• Family album syndrome
  – Need to organize, index, and retrieve
• Information overload
  – Need filtering
• MPEG-7 objective
  – Provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions.
  – Help users identify, retrieve, or filter audio-visual information.
Potential Applications of MPEG-7
• Summary
  – Generation of a multimedia program guide or content summary
  – Generation of content descriptions of A/V archives to allow seamless exchange among content creators, aggregators, and consumers
• Filtering
  – Filter and transform multimedia streams in resource-limited environments by matching user preferences, available resources, and content descriptions
• Retrieval
  – Recall music using samples of tunes
  – Recall pictures using sketches of shape, color, movement, or a description of the scenario
• Recommendation
  – Recommend program materials by matching user preferences (profile) to program content
• Indexing
  – Create a family photo or video library index
Content descriptions
• Descriptors (Ds)
  – MPEG-7 contains standardized descriptors for audio, visual, and generic content.
  – It standardizes how content features are characterized, but not how they are extracted.
  – Different levels of syntactic and semantic description are available.
• Description Schemes (DSs)
  – Specify the structure of, and relations among, different A/V descriptors
• Description Definition Language (DDL)
  – Standardized language based on XML (Extensible Markup Language) for defining new Ds and DSs, and for extending or modifying existing Ds and DSs
Visual Color Descriptors
• Color space: HSV (hue-saturation-value)
  – Scalable color descriptor (SCD): color histogram (256 uniform bins) of an image in HSV space, encoded by a Haar transform
• Color layout descriptor
  – Spatial distribution of color in an arbitrarily shaped region
• Dominant color descriptor (DCD)
  – Colors are clustered first
• Color structure descriptor (CSD)
  – Slide an 8x8 window over the image and count the window positions in which a particular color appears
• Group of Frames/Group of Pictures color descriptor
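The scalable color descriptor idea above (an HSV histogram followed by a Haar transform) can be sketched roughly as follows. This is a minimal illustration, not the normative MPEG-7 extraction: the 16 x 4 x 4 split of the 256 bins across H, S, and V, the function names, and the single unnormalized Haar step are all assumptions made for the example.

```python
import colorsys

def hsv_histogram(pixels, h_bins=16, s_bins=4, v_bins=4):
    """Count RGB pixels (0-255 per channel) into uniform HSV bins.

    The 16 x 4 x 4 layout is an assumed way of reaching 256 bins.
    """
    hist = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a value of exactly 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist

def haar_step(values):
    """One unnormalized Haar step: pairwise sums (coarse) and differences (detail)."""
    sums = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    diffs = [values[i] - values[i + 1] for i in range(0, len(values), 2)]
    return sums, diffs

# Hypothetical four-pixel "image": two reds, one blue, one green.
pixels = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (0, 255, 0)]
hist = hsv_histogram(pixels)
sums, diffs = haar_step(hist)
```

Applying `haar_step` repeatedly to `sums` gives progressively coarser histograms, which is what makes such a representation scalable: a decoder can keep only as many coefficients as it needs.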
Visual Texture Descriptor
• Texture browsing D.
  – Regularity
    • 0: irregular; 3: periodic
  – Directionality
    • Up to 2 directions
    • 1-6 in 30° increments
  – Coarseness
    • 0: fine; 3: coarse
• Edge histogram D.
  – 16 sub-images
  – 5 (edge direction) bins per sub-image
• Homogeneous texture D. (HTD)
  – Divide frequency space into 30 bins (5 radial, 6 angular)
  – 2D Gabor filter bank applied to each bin
  – Energy and energy deviation computed in each bin to form the descriptor
Visual Shape Descriptor
• 3D shape D.: shape spectrum
  – Histogram (100 bins, 12 bits/bin) of a shape index, computed over the 3D surface
  – Each shape index measures local convexity
• Region-based D.: ART (angular radial transform)
  – Shape analysis based on moments
  – ART basis:
    $V_{nm}(\rho, \theta) = \exp(jm\theta)\, R_n(\rho)$
    $R_n(\rho) = 2\cos(\pi n \rho)$ for $n \neq 0$; $R_n(\rho) = 1$ for $n = 0$
• Contour-based shape descriptor: curvature scale space (CSS)
  – N points per curve, successively smoothed by [0.25 0.5 0.25] until the curve becomes convex
  – The curvature at each point forms the curvature function at that scale
  – Peaks at each scale are used as features
• 2D/3D descriptors
  – Use multiple 2D descriptors to describe a 3D shape
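The smoothing loop behind the curvature scale space descriptor can be sketched as below. This only illustrates repeated filtering of a closed contour with the [0.25 0.5 0.25] kernel until it is convex. It does not track curvature zero crossings or extract the CSS peaks, and the function names and the convexity test (consistent sign of cross products at consecutive vertices) are assumptions made for the example.

```python
def smooth(points):
    """One pass of the [0.25, 0.5, 0.25] kernel over a closed contour."""
    n = len(points)
    return [
        (
            0.25 * points[i - 1][0] + 0.5 * points[i][0] + 0.25 * points[(i + 1) % n][0],
            0.25 * points[i - 1][1] + 0.5 * points[i][1] + 0.25 * points[(i + 1) % n][1],
        )
        for i in range(n)
    ]

def is_convex(points, eps=1e-12):
    """True if every turn along the closed contour has the same direction."""
    n = len(points)
    signs = set()
    for i in range(n):
        ax, ay = points[i - 1]
        bx, by = points[i]
        cx, cy = points[(i + 1) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if abs(cross) > eps:
            signs.add(cross > 0)
    return len(signs) <= 1

def smooth_until_convex(points, max_passes=1000):
    """Return (number of passes, final contour)."""
    passes = 0
    while not is_convex(points) and passes < max_passes:
        points = smooth(points)
        passes += 1
    return passes, points

# Hypothetical dart-shaped contour with one concave vertex at (2, 1).
dart = [(0.0, 0.0), (2.0, 1.0), (4.0, 0.0), (2.0, 4.0)]
passes, final_contour = smooth_until_convex(dart)
```

The number of passes needed is a rough indication of how deep the concavities are, which is the intuition behind using scale as a shape feature.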
Visual Motion Descriptor
• Motion activity D.
  – Intensity
  – Direction of activity
  – Spatial distribution of activity
  – Temporal distribution of activity
• Camera motion
  – Panning
  – Booming (lift up)
  – Tracking
  – Tilting
  – Zooming
  – Rolling (around image center)
  – Dollying (backward)
• Warping (w.r.t. mosaic)
• Motion trajectory
[Figure: motion descriptors by source. Video segment: camera motion, motion activity. Mosaic: warping parameters. Motion region: trajectory, parametric motion.]
MPEG-7 Audio Content Descriptors
• 4 classes of audio signals
  – Pure music
  – Pure speech
  – Pure sound effect
  – Arbitrary sound track
• Audio descriptors
  – Silence Ds: silence type
  – Sound effect Ds:
    • Audio spectrum
    • Sound effect features
  – Spoken content Ds:
    • Speaker type
    • Link type
    • Extraction info type
    • Confusion info type
  – Timbre Ds:
    • Instrument
    • Harmonic instrument
    • Percussive instrument
  – Melody contour Ds:
    • Contour
    • Meter
    • Beat
Spoken content description
Goal: Support robust retrieval from potentially erroneous decodings produced by an automatic speech recognition (ASR) system.
• Spoken content header
  – Word lexicon (vocabulary)
  – Phone lexicon:
    • IPA (International Phonetic Association alphabet)
    • SAMPA (Speech Assessment Methods Phonetic Alphabet)
  – Phone confusion statistics
  – Speaker
• Spoken content lattice (word or phone)
  – Lattice node
  – Word and phone links
[Figure: spoken content extraction. Speech waveform -> audio processing -> ASR -> lattice -> MPEG-7 encoder -> header and lattice. Example lattice links: BORE (P=0.6), IS (P=0.7), HIS (P=0.3).]
Use of Content Features
• Multimedia information retrieval
  – Create searchable archives of A/V materials, e.g. albums, digital libraries
  – Real-world examples:
    • Call routing
    • Technical support
    • On-line manual
    • Shopping
    • Multimedia on demand
• Filtering
  – Automated email sorter
  – Personalized information portal
• Enhance low-level signal processing
  – Coding and transcoding
  – Post-processing
Content-based Retrieval
[Figure: content-based retrieval block diagram. Query module: user, interactive query formation, feature extraction. Retrieval module: feature comparison, browsing and feedback, output. Input module: multimedia data, feature extraction, image database, feature database.]
Multimedia CBR System Design Issues
• Requirement analysis
  – How are the multimedia materials to be used?
  – This determines what set of features is needed.
• Archiving
  – How should individual objects be stored? At what granularity?
• Indexing (query) and retrieving
  – With multi-dimensional indices, what is an effective and efficient retrieval method?
  – What is a suitable, perceptually consistent similarity measure?
• User interface
  – Modality: text, spoken language, or others?
  – Interactive or batch? Will dialogue be available?
Multimedia Archiving
• Facts:
  – Content is often in compressed format and needs large storage space
  – The content index also occupies storage space
• Issues
  – Granularity must match the underlying file system
  – Logical versus physical segmentation
  – File allocation on the file system must support multiple-stream access and low latency
Indexing and Retrieving
• Index
  – A very high-dimensional binary vector
  – Encoding of content features
  – Text-based content can be represented with term vectors
  – A/V content features can be either Boolean vectors or term vectors
• Retrieval
  – Retrieval is a pattern classification problem
  – Use the index vector as the feature vector
  – Classify each object as relevant or irrelevant to a query vector (template)
  – A perceptually consistent similarity measure is essential
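Viewing retrieval as classification against a query template can be sketched as follows. The slides do not name a similarity measure; cosine similarity and the 0.5 threshold are assumptions chosen for the illustration, as are the function names and the toy index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two index vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, index, threshold=0.5):
    """Label each object relevant (score >= threshold) and rank the relevant ones."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    relevant = [(name, score) for name, score in scored if score >= threshold]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)

# Hypothetical 8-dimensional binary index vectors.
index = {
    "doc_a": [0, 1, 1, 0, 0, 1, 1, 0],
    "doc_b": [1, 0, 0, 0, 1, 0, 0, 1],
}
query = [0, 1, 0, 0, 0, 0, 1, 0]
results = retrieve(query, index)
```

Raising the threshold trades completeness for specificity, so the choice of threshold and of similarity measure directly shapes retrieval quality.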
Term Vector Query
• Each document is represented by a specific term vector
• A term is a keyword or a phrase
• A term vector is a vector of terms; each dimension of the vector corresponds to a term
• Dimension of a term vector = total number of distinct terms
• Example:
  Set of terms = [tree, cake, happy, cry, mother, father, big, small]
  Documents = “Father gives me a big cake. I am so happy”, “Mother planted a small tree”
  Term vectors: [0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]
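The example above can be reproduced with a few lines of code. This is a minimal sketch: the tokenization (lowercasing and keeping alphabetic runs) and the function name `term_vector` are assumptions, since the slides do not say how words are extracted.

```python
import re

terms = ["tree", "cake", "happy", "cry", "mother", "father", "big", "small"]
documents = [
    "Father gives me a big cake. I am so happy",
    "Mother planted a small tree",
]

def term_vector(text, terms):
    """Binary term vector: 1 if the term occurs in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in words else 0 for term in terms]

vectors = [term_vector(doc, terms) for doc in documents]
```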
Inverse Term Frequency Vector
• A probabilistic term vector representation
  – Relative term frequency (within a document):
    tf(t,d) = count of term t in document d / number of terms in document d
  – Inverse document frequency:
    df(t) = total number of documents / number of documents containing t
  – Weighted term frequency:
    dt = tf(t,d) · log[df(t)]
  – Inverse document frequency term vector: D = [d1, d2, …]
ITF Vector Example
Document 1: The weather is great these days.
Document 2: These are great ideas
Document 3: You look great
Eliminate: The, is, these, are, you
Term     tf(t,1)  tf(t,2)  tf(t,3)  df(t)  D1    D2    D3
weather  1/6      0        0        3      0.08  0.00  0.00
great    1/6      1/4      1/3      1      0.00  0.00  0.00
day      1/6      0        0        3      0.08  0.00  0.00
idea     0        1/4      0        3      0.00  0.12  0.00
look     0        0        1/3      3      0.00  0.00  0.16
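The weights in the table above can be recomputed with a short script. This is a sketch under stated assumptions: the logarithm is taken to be base 10 (which matches the tabulated values), tf is divided by the full document length including eliminated words (as the 1/6, 1/4, 1/3 entries suggest), and the plural stripping in `normalize` is a crude stand-in for whatever stemming maps "days" to "day".

```python
import math

docs = [
    "The weather is great these days",
    "These are great ideas",
    "You look great",
]
stop_words = {"the", "is", "these", "are", "you"}

def normalize(word):
    """Crude plural stripping (assumption): 'days' -> 'day', 'ideas' -> 'idea'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def tokenize(doc):
    return doc.lower().split()

def term_counts(doc):
    """Counts of kept terms, after eliminating stop words and normalizing."""
    counts = {}
    for word in tokenize(doc):
        if word not in stop_words:
            term = normalize(word)
            counts[term] = counts.get(term, 0) + 1
    return counts

counts = [term_counts(doc) for doc in docs]
vocabulary = sorted({term for c in counts for term in c})

# df(t) as defined in the slides: total documents / documents containing t.
df = {t: len(docs) / sum(1 for c in counts if t in c) for t in vocabulary}

# Weighted term frequency: tf(t, d) * log10(df(t)).
weights = [
    {t: (c.get(t, 0) / len(tokenize(doc))) * math.log10(df[t]) for t in vocabulary}
    for doc, c in zip(docs, counts)
]
```

Note that "great" occurs in every document, so df = 1 and its weight is zero everywhere: a term shared by all documents does not help to tell them apart.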
Human Computer Interface
• Command: voice, gesture, push button/key, expression, eye
• Data: sensation (visual, audio, pressure, smell); virtual environment
• HCI is a match-maker: matching the needs of humans and computers
Basic HCI Design Principles
• Consistency: Same command means the same thing
• Intuition: Metaphor that is familiar to the user
• Adaptability: Adapt to user’s skill, style
• Economy: Use minimum effort to achieve a goal
• Non-intrusive: Do not decide for user without asking
• Structure: Present only relevant information to user in a simple manner.
User Models
• User profiles:
  – Categorize users using features relevant to tasks
  – Static features: age, sex, etc.
  – Dynamic features: activity logs, etc.
  – Derived features: skill levels, preferences, etc.
• Use of profiles for HCI
  – Adaptation: customize HCI for different categories of users
  – Better understanding of users' needs
Principles of Dialogue Design
• Feedback: Always acknowledge user’s input
• Status: Always inform users where they are in the system
• Escape: Provide a graceful way to exit halfway.
• Minimal Work: Minimize amount of input user must provide
• Default: Provide default values to minimize work
• Help: Context sensitive help
• Undo: Allow users to correct unintentional mistakes
• Consistency:
Contingency table for evaluating retrieval
• The document retrieval problem is a hypothesis testing problem:
  H0: di is relevant to q (r=1)
  H1: di is irrelevant to q (r=0)
• Type I error (Pe1 = P{r=0|H0}): relevant but not retrieved.
• Type II error (Pe2 = P{r=1|H1}): irrelevant but retrieved.
Performance Evaluation
• Precision-recall curve
  – P(recision) = w/(w+y) is a measure of the specificity of the result
  – R(ecall) = w/(w+x) is an indicator of the completeness of the result
• Operating curve
  – Pe1 = x/(w+x) = 1 - R
  – Pe2 = y/(y+z) = F(allout)
• Expected search length = average number of documents that need to be examined to retrieve a given number of relevant documents
• Subjective criteria
             Retrieved   Not retrieved
Relevant     w           x
Irrelevant   y           z
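The measures defined from the contingency table can be computed directly from the four counts. The sketch below uses made-up counts, and the function name and dictionary keys are assumptions for illustration. It does not guard against empty rows or columns (division by zero).

```python
def retrieval_metrics(w, x, y, z):
    """Measures from the contingency table.

    w: relevant and retrieved      x: relevant, not retrieved
    y: irrelevant but retrieved    z: irrelevant, not retrieved
    """
    precision = w / (w + y)
    recall = w / (w + x)
    fallout = y / (y + z)
    return {
        "precision": precision,
        "recall": recall,
        "pe1": x / (w + x),   # Type I error, equals 1 - recall
        "pe2": fallout,       # Type II error, equals fallout
    }

# Hypothetical counts for a collection of 1000 documents.
metrics = retrieval_metrics(w=30, x=10, y=20, z=940)
```

With these counts, precision is 0.6 and recall is 0.75: most retrieved documents are relevant, but a quarter of the relevant ones are missed.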
Example: MetaSEEk
• MetaSEEk: a meta-search engine
  – Purpose: retrieving images
  – Method: select and interface with multiple on-line image search engines
  – Search principle: track the performance of the different search engines and their search options for different classes of queries
A. B. Benitez, M. Beigi, and S.-F. Chang, Using Relevance Feedback in Content-Based Image Metasearch, IEEE Internet Computing, Vol. 2, No. 4, pp. 59-69, July/August 1998
Basic idea of MetaSEEk
• Classify the user queries into different clusters by their visual content.
• Rank the different search engines according to their performance for the different classes of user queries
• Select the search engines and search options according to their rank for the specific query cluster
• Display the search results to the user.
• Modify the performance scores according to user feedback.
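The rank, select, and update loop described above can be sketched as a small score table keyed by query cluster and search engine. This is a hypothetical illustration and not the MetaSEEk implementation: the class name, the engine and cluster names, and the +1/-1 update rule are all assumptions.

```python
from collections import defaultdict

class EngineRanker:
    """Keeps a performance score per (query cluster, search engine) pair."""

    def __init__(self, engines):
        self.engines = list(engines)
        self.scores = defaultdict(float)

    def rank(self, cluster):
        """Engines ordered from best to worst score for this query cluster."""
        return sorted(
            self.engines,
            key=lambda engine: self.scores[(cluster, engine)],
            reverse=True,
        )

    def feedback(self, cluster, engine, liked):
        """Raise the score after positive user feedback, lower it otherwise."""
        self.scores[(cluster, engine)] += 1.0 if liked else -1.0

ranker = EngineRanker(["engine_a", "engine_b", "engine_c"])
ranker.feedback("sunsets", "engine_b", liked=True)
ranker.feedback("sunsets", "engine_a", liked=False)
ranking = ranker.rank("sunsets")
```

Because scores are kept per cluster, feedback on one kind of query does not change the ranking used for visually different queries.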
Content-Based Visual Query (1)
• Advantage
  – Ease of creating, capturing, and collecting digital imagery
• Approaches
  – Extract significant features (color, texture, shape, structure)
  – Organize feature vectors
  – Compute the closeness of the feature vectors
  – Retrieve matched or most similar images
Content-Based Visual Query (2): Improve Efficiency
• Keyword-based search
  – Match images with particular subjects and narrow down the search scope
• Clustering
  – Classify images into various categories based on their contents
• Indexing
  – Applied to the image feature vectors to support efficient access to the database
Cluster the visual data
• K-means algorithm
  – Simplicity
  – Reduced computation
• Tamura algorithm (for texture)
• For color, feature vectors are calculated using the color histogram
• Using Euclidean distance
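Clustering color histograms with k-means and Euclidean distance can be sketched in plain Python as below. This is a minimal illustration: the three-bin histograms are made up, and initializing the centroids from the first k vectors is an assumption chosen to keep the example deterministic.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20):
    """Basic k-means; centroids start at the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical normalized 3-bin color histograms (red, green, blue shares).
histograms = [
    [0.90, 0.05, 0.05],  # mostly red
    [0.10, 0.10, 0.80],  # mostly blue
    [0.80, 0.10, 0.10],  # mostly red
    [0.05, 0.05, 0.90],  # mostly blue
]
labels, centroids = kmeans(histograms, k=2)
```

Once images are grouped this way, a query only needs to be compared against the members of the nearest cluster, which is where the reduced computation comes from.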
Multimedia summary and filtering
• Summary
  – Text: email reading
  – Image: caption generation
  – Video: highlights, storyboard
• Issues:
  – Segmentation
  – Clustering of segments
  – Labeling clusters
  – Associating with syntactic and semantic labels
• Filtering
  – Same as retrieval: filter out irrelevant objects based on a given criterion (query)
  – Often needs to be performed based on content features
    • E.g. filtering traffic accidents or law violations from traffic monitoring videos
Content based Coding and Post-processing
• Different coding decisions based on low-level content features
  – Coding mode (inter/intra selection)
  – Motion estimation
• Object-based coding
  – Encoding different regions (VOP) separately
  – Using different coders for different types of regions
• Multiple abstraction layer coding
  – An analysis/synthesis approach
  – Synthesize low-level content from a higher-level abstraction
    • E.g. texture synthesis
• Content-based post-processing
  – Identify content types and then synthesize low-level content