
Page 1: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts

Rong Yan

IBM T. J. Watson Research Center
Hawthorne, NY 10532 USA
Email: [email protected]

Page 2: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


The growth in video search has potential to benefit both enterprise and consumer segments across the world

Growth in Online Video (U.S.): Video Streams Served and Online Video Advertising Spending

[Chart: Video Streams Served (B) and Advertising Spending ($B), 2006-2010]

Sources: eMarketer Research, Veronis Suhler Stevenson Research, AccuStream Market Research

Page 3: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Though there are numerous video-search options, none has yet proven reliable and accurate

Google Video: "Basketball"
– Scope of results: does not broadly search the Web; does not search inside video; cannot distinguish matches actually showing basketball
– Favors the Google silo: YouTube videos are prominent

YouTube: "Basketball"
– Scope of results: similar to Google Video
– Favors its own silo
– Video quality is mixed: user-generated and user-provided video

SearchVideo (AOL): "Basketball"
– Scope of results: top matches all related to the Imus comments; again limited by the inability to detect basketball scenes
– Favors its own silo: preference for AOL and AOL partner content

SearchVideo (Blinkx): "Basketball"
– Scope of results: 214,000 matches related to "basketball"; no way to limit results to relevant scenes showing basketball games
– Favors its own silo: results limited to partner content

Issues of Current Video Search Systems
• Based on text metadata and/or manual tags, which are unavailable for much video content
• Unable to search inside video clips, since metadata is typically attached at the clip level

Page 4: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Concept-based Video Search

Exciting new direction

– Visual indexing with semantic concept detection

– (Semi-)automatically produce frame-level indexing based on statistical learning techniques

– Search by text keywords without text metadata

(Courtesy of Lyndon Kennedy)

Page 5: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Concept-based Video Search

Page 6: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Thousands of video concepts are required to produce good performance for concept-based video retrieval

Need ~3,000 video concepts to achieve performance comparable to web text retrieval

– Extrapolated from search results on 3 standard large-scale video collections

– Concept detection accuracy and combination strategies are calibrated against state-of-the-art results

– Details: [Hauptmann, Yan and Lin]

Page 7: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Challenges: Efficient (and Effective) Approaches to Detect Thousands of Video Concepts Are Yet to Be Developed

Case study: TRECVID’05-’07

– 39 video concepts are defined

– A baseline SVM classifier takes ~7 days to learn on 100,000 frames for 39 concepts using a 2.16GHz Dual-Core CPU

– It takes ~3.5 days to generate predictions on 1 million testing frames for 39 concepts

– Need ~30 machines to process 100 frames per second for all 39 concepts (one machine sustains about 1,000,000 frames / 3.5 days ≈ 3.3 frames per second)

[Diagram: the 39 TRECVID concepts, grouped by category]
– Program: Weather, Entertainment, Sports
– Location: Office, Meeting, Studio, Outdoor, Road, Sky, Snow, Urban, Water, Mountain, Desert
– People: Crowd, Face, Person
– Roles: G. Leader, C. Leader, Police, Prisoner, Military
– Objects: Flag-US, Animal, Screen, Vehicle, Airplane, Car, Boat, Bus, Truck, Building, Plants, Court
– Activities: People Walking, Marching
– Events: Explosion/Fire, Natural Disaster
– Graphics: Maps, Chart

Page 8: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


New Approaches for a Wide Spectrum of Video Concepts

[Diagram: spectrum of concepts ordered by learnability, from domain-independent to domain-dependent to out-of-domain]

Domain-Independent Concepts
– Concepts that can be learned across multiple domains
– Size: tens. Examples: Sky, Urban, Night
– Approach: automatic, via model-shared subspace boosting

Domain-Dependent Concepts
– Concepts that can be learned on some specific domains
– Size: hundreds. Examples: Anchor, Basketball
– Approach: semi-automatic, via cross-domain concept adaptation

Out-of-Domain Concepts
– Concepts that are difficult to learn from low-level features
– Size: thousands. Examples: Paris, Grandma
– Approach: semi-manual, via learning-based hybrid annotation

Page 9: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Roadmap

Motivation and Challenges: Why Efficiency?

(Automatic) Model-shared Subspace Boosting [KDD’07]

(Semi-automatic) Cross-domain Concept Adaptation [MM’07]

(Semi-manual) Learning-based Hybrid Annotation [Submitted]

Conclusions

Page 10: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Prior Art on Automatic Concept Detection

Standard multi-label classification [City U., 07] [IBM, 07] [Tsinghua, 07]

– Need to learn an independent classifier for every possible label using all the data examples and the entire feature space.

Other image annotation methods [Snoek et al., 05] [Torralba et al, 04]

– No mechanism to reduce redundancy among labels other than exploiting the multi-label relations.

Multi-task learning [Ando and Zhang, 05] [Caruana, 97] [Zhang et al., 05]

– Treat each label as a single task and learn the tasks in an iterative process

– Require complex and inefficient inference to estimate the task parameters

Page 11: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Related Work: Random Subspace Bagging

Improve computational efficiency by removing the redundancy in both data space and feature space

1. For each concept, select a number of bags of training examples, where each bag is randomly sampled from training data as well as feature space

2. Learn a base model on each bag of training examples using arbitrary learning algorithms

3. Add them into a composite classifier

Also known as asymmetric bagging with random subspaces; with decision trees as base models, this is the random forest. A sketch follows the diagram below.

[Diagram: training-examples × features matrix; base models M1, M2 are learned on random row/column subsamples and added to the composite classifier]

Page 12: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Missing Pieces for Random Subspace Bagging

RSBag learns classifiers for each concept separately, and thus cannot reduce information redundancy across multiple concepts

– It is possible to share and re-use some base models for different concepts

[Diagram: base models M1 and M2 learned for Label 1 (Car) and Label 2 (Road); a base model such as M2 could be shared and re-used by both labels]

Page 13: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Model-shared subspace boosting [with J. Tesic and J. Smith]

Model-shared subspace boosting (MSSBoost) iteratively finds the most useful subspace base models, shares them across concepts, and combines them into a composite classifier for each concept.

– MSSBoost follows the formulation of LogitBoost [Friedman et al., 1998]

– The base models are learned from bootstrapped data samples in randomly selected feature subspaces, and can be trained with any learning algorithm

– The classifier for each concept is an ensemble of multiple base models

– The base models are shared across multiple concepts, so that the same base models can be re-used in different decision functions.

Page 14: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


MSSBoost Algorithm: Overview

Step 1 (model initialization)

– Initialize a number of base models for each label, where each base model is learned on that label using a random feature subspace and bootstrapped data

Step 2 (iterative update)

1. Search the model pool for the optimal base model and its weight by minimizing a "joint logistic loss function" over all the concepts

2. Update the classifier of every concept by sharing and combining the selected model

3. Replace the selected model in the pool with a new subspace model learned on the same concept

[Diagram: base models, grouped by feature subspace F1, F2, F3, are shared across labels L1, L2, L3 to form the composite classifiers]

At iteration $t$, the shared base model $h_t$ and its per-label weights $\alpha_{tl}$ minimize the joint logistic loss over all concepts:

$$\min_{h_t, \alpha_t} \sum_{l} \sum_{i} \log\Big(1 + \exp\big(-y_{il}\,(F_l^t(x_i) + \alpha_{tl}\, h_t(x_i))\big)\Big)$$

and every concept's classifier is updated as

$$F_l^{t+1}(x) = F_l^t(x) + \alpha_{tl}\, h_t(x), \quad l = 1, \ldots, L$$
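A minimal sketch of the iterative update (Step 2), under stated assumptions: `pool` holds pre-trained subspace base models exposing an sklearn-style `decision_function`, labels are in {-1, +1}, the per-label weights come from a coarse line search rather than the actual algorithm's Newton step, and the pool-refresh step (2.3) is omitted for brevity.

```python
import numpy as np

def mssboost(X, Y, pool, n_rounds=20, alphas=np.linspace(0.0, 1.0, 21)):
    """Greedy model sharing: each round picks the pooled base model (and
    per-label weights) minimizing the joint logistic loss, then shares it
    across every label's ensemble. Y has shape (n_samples, n_labels)."""
    n, L = Y.shape
    F = np.zeros((n, L))                    # composite decision scores F_l(x_i)
    ensembles = [[] for _ in range(L)]      # (model, weight) pairs per label
    for _ in range(n_rounds):
        best = None
        for h in pool:
            hx = h.decision_function(X)     # base-model scores, shape (n,)
            w, loss = np.zeros(L), 0.0
            for l in range(L):              # best scalar weight per label
                cand = [np.logaddexp(0.0, -Y[:, l] * (F[:, l] + a * hx)).sum()
                        for a in alphas]    # log(1 + exp(-y (F + a h)))
                k = int(np.argmin(cand))
                w[l], loss = alphas[k], loss + cand[k]
            if best is None or loss < best[0]:
                best = (loss, h, hx, w)
        _, h, hx, w = best
        F += np.outer(hx, w)                # update every label's classifier
        for l in range(L):
            if w[l] > 0:                    # weight 0 means "do not share"
                ensembles[l].append((h, w[l]))
    return ensembles
```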

Page 15: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Experiments

Two large-scale image/video collections
– TRECVID'05 sub-collection: 6,525 keyframes with 39 concepts
– Consumer collection: 8,390 images with 33 concepts

Low-level visual features
– 166-dimensional color correlogram
– 81-dimensional color moments
– 96-dimensional co-occurrence texture

RBF-kernel support vector machines as base models for MSSBoost

75% / 25% training-testing split

Page 16: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Concept Detection Performance

MSSBoost outperforms the baseline SVMs using a small number of base models (100 for 39 labels) with a small data/feature sampling ratio (~0.1)

MSSBoost consistently outperforms RSBag and non-sharing boosting (NSBoost)
– e.g., the number of models needed to reach 90% of the baseline MAP is only 60% of that required by RSBag / NSBoost

[Charts: detection performance vs. number of base models on the TRECVID and Consumer collections]

Page 17: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Concept Detection Efficiency

MSSBoost vs. baseline SVMs (at the same classification performance)

– 60-fold / 170-fold speedup on training and 20-fold / 25-fold speedup on testing (TREC / Photo)

[Charts: training and prediction time (sec.), baseline vs. MSSBoost]

Training time
– TREC: baseline 5898 sec vs. MSSBoost 94 sec
– Photo: baseline 5158 sec vs. MSSBoost 31 sec

Page 18: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Roadmap

Motivation and Challenges: Why Efficiency?

(Automatic) Model-shared Subspace Boosting [KDD’07]

– Automatically exploit information redundancy across the concepts

(Semi-automatic) Cross-domain Concept Adaptation [MM’07]

(Semi-manual) Learning-based Hybrid Annotation [Submitted]

Conclusions

Page 19: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Cross-domain Concept Detection

Adapt concept classifiers from one domain to other domains

– Domains can be genres, data sources, programs, e.g., “CNN”, “CCTV”

– Adapt from auxiliary dataset(s) to a target dataset

Adaptation is more critical for video (than text)

– Bigger semantic gap, e.g., “tennis”

– More sensitive to domain change

– e.g., average precision of “anchor” drops from 0.9 on TREC’04 to 0.5 on TREC’05

Page 20: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Prior Art on Cross-Domain Detection

Data-level adaptation [Wu et al., 04] [Liao et al., 05] [Dai et al., 07]

– Combine auxiliary and target data for training a new classifier

– Computationally expensive due to the large amount of training data

Parametric-level adaptation [Marx et al., 05] [Raina et al., 06] [Zhang et al., 06]

– Use the model parameters of auxiliary data as prior distribution

– Models must be parametric and of the same type

Incremental Learning [Syed et al., 99] [Cauwenberghs and Poggio, 00]

– Continuously update models with subsets of data

– Assume the same underlying distribution, i.e., no domain changes

Sample bias correction, concept drift, speaker adaptation...

Page 21: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Function-level Adaptation [with J. Yang and A. Hauptmann]

Function-level adaptation: modifies the decision function of old models

– Flexibility: the auxiliary classifier can be a "black-box" classifier of any type

– Efficiency: the auxiliary data is NOT involved in training

– Applicability: works even when the auxiliary data is not accessible

[Diagram: adapted classifier = auxiliary classifier + delta function; the auxiliary classifier is trained on the auxiliary data, while the delta function is learned from the target data]

Page 22: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Learning “Delta Function”: Risk Minimization

General framework: regularized empirical loss minimization

(1) classification errors on the target data, measured by any loss function L(y, f(x)), and

(2) the complexity (norm) of Δf(x), which equals the distance between the auxiliary and adapted classifiers in function space.
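In other words, the framework seeks min over Δf of Σ_i L(y_i, f_aux(x_i) + Δf(x_i)) + λ‖Δf‖². Below is a minimal sketch of this idea (my own, not the paper's Adaptive SVM): the delta function is a linear model trained by gradient descent, with logistic loss standing in for L, and only the delta's norm is penalized, so the solution stays close to the auxiliary boundary. The names `f_aux`, `lam`, `lr`, and `epochs` are assumptions for illustration.

```python
import numpy as np

def adapt(f_aux, X, y, lam=0.1, lr=0.1, epochs=300):
    """Function-level adaptation: learn a linear delta on the labeled target
    set; f_aux is any black-box scorer, y holds labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    aux = f_aux(X)                          # fixed auxiliary scores (one-time cost)
    for _ in range(epochs):
        margin = y * (aux + X @ w + b)      # margins of the adapted classifier
        g = -y / (1.0 + np.exp(margin))     # d/d(score) of log(1 + e^(-margin))
        w -= lr * (X.T @ g / n + 2.0 * lam * w)   # penalize only the delta's norm
        b -= lr * g.mean()
    return lambda Z: f_aux(Z) + Z @ w + b   # adapted classifier = aux + delta
```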

Page 23: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Illustration of function-level adaptation

Intuition: seek the new classification boundary that (1) is close to the original boundary and (2) can correctly classify the labeled examples

– Cost factor C to determine the contribution of auxiliary classifiers

[Illustration: auxiliary data vs. target data; the adapted boundary stays close to the original auxiliary boundary while correctly classifying the labeled target examples]

Page 24: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Adaptive SVMs

Adaptive SVMs: a special case of adaptation with the hinge loss function

A quadratic programming (QP) problem, solved by a modified sequential minimal optimization (SMO) algorithm

Training cost: similar to standard SVMs, apart from the one-time cost of computing the auxiliary predictions

Adapted classifier: f(x) = f_aux(x) + Δf(x), with the delta function learned from the labeled target examples

Page 25: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Experiments

TREC Video Retrieval Evaluation (TRECVID) 2005

– 74,523 video shots, 39 labels, 13 programs from 6 channels

– Adapt concepts learned from one program to another program

Name / Training data / Algorithm
– Our approach: Adapted classifier (Adapt) / Target-Prog (labeled) / Adaptive SVMs
– Baseline: Auxiliary classifier (Aux) / Aux-Prog / SVMs
– Baseline: Target classifier (Target) / Target-Prog (labeled) / SVMs
– Competing: Aggregation classifier (Aggr) / Aux-Prog + Target-Prog ("early fusion") / SVMs
– Competing: Ensemble classifier (Ensemble) / Aux-Prog + Target-Prog ("late fusion") / SVMs

Page 26: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Cross-Domain Detection Performance

Average Precision: Adapt > Aggr ≈ Ensemble > Aux > Target

– Using knowledge from the auxiliary data almost always helps in this setting

More classification results in the paper [MM’07]

Page 27: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Cross-Domain Detection Efficiency

Total training time for 39 concepts and 13 programs

Training cost: Target = Ensemble < Adapt << Aggr

Adaptive SVMs achieve good tradeoff between concept detection effectiveness and efficiency

Page 28: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Roadmap

Motivation and Challenges: Why Efficiency?

(Automatic) Model-shared Subspace Boosting [KDD’07]

– Automatically exploit information redundancy across the concepts

(Semi-automatic) Cross-domain Concept Adaptation [MM’07]

– Function-level adaptation with high efficiency and flexibility

(Semi-manual) Learning-based Hybrid Annotation [Submitted]

Conclusions

Page 29: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Manual Concept Annotation

Limitations of automatic annotation

– Needs to have sufficient training data

– Sometimes hard to learn from low-level visual features

Popularity of manual annotation

– High annotation quality and social bookmarking functionality

– Labor-expensive and time-consuming

– “Vocabulary mismatch” problem

[Screenshot: Flickr results for the tag "Book"]

How about speeding up manual annotation? Let users drive, but have computers suggest the right words / images / interface to annotate.

Page 30: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Related Work on Efficient Manual Annotation

Active learning: maximize automatic annotation accuracy with a minimal amount of manual annotation effort

– Aims to optimize learning performance rather than annotation time

– Most images are annotated automatically (and thus inaccurately), and the learning performance depends heavily on the underlying low-level features

– Users are asked to annotate the most "ambiguous" images, leading to a poor user experience

Leveraging other modalities: e.g., speech recognition, semantic network, time/location

– Require support from other information sources

Page 31: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts

© 2007 IBM Corporation31 04/20/23

Challenges and Proposed Work

Challenges on investigating manual annotation

– No formal time models exist for manual annotation

– Studying it requires large-scale user studies, which make the annotation process time-consuming and introduce high user variance

– Without such models, there is no guidance for developing better manual annotation approaches

Proposed work

– Formal time models for two annotation approaches: tagging / browsing

– A much more efficient annotation approach based on these models

Page 32: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Manual Annotation (I) : Tagging

Allow users to associate a single image at a time with one or more keywords; this is the most widely used manual annotation approach

Advantages

– Freely choose arbitrary keywords to annotate

– Only need to annotate relevant keywords

Disadvantages

– “Vocabulary mismatch” problem

– Inefficient to design and type keywords

Suitable for annotating rare keywords

Page 33: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Formal Time Model for Tagging

Annotation time for one image:
– Factors: number of keywords K, time for the kth keyword t'_fk, setup time for a new image t'_s

Total expected annotation time for an image collection

– Assumption: the expected time to annotate the kth keyword is a constant t_f

User study on TRECVID’05 development data

– manually tag 100 images using 303 keywords

– If the model is correct, a linear fit should be found in the results

– The annotation results fit the model very well: t_f = 6.8 sec, t_s = 5.6 sec

$$T = \sum_{k=1}^{K} t'_{fk} + t'_s$$

$$E[T_{\mathrm{total}}] = \sum_{l}\Big(E\big[\textstyle\sum_k t'_{fk}\big] + E[t'_s]\Big) = \sum_{l}\big(K_l\, t_f + t_s\big)$$

Page 34: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Manual Annotation (II) : Browsing

Allow users to associate multiple images with a single word at the same time

Advantages

– Efficient to annotate each pair of images / words

– No “vocabulary mismatch”

Disadvantages

– Need to judge both relevant and irrelevant pairs

– Start with controlled vocabulary

Suitable for annotating frequent keywords

Page 35: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Formal Time Model for Browsing

Annotation time for all images w.r.t. a keyword:

– Factors: number of relevant images L_k; annotation time for a relevant (irrelevant) image t'_p (t'_n)

Total expected annotation time for an image collection
– Assumption: the expected time to annotate a relevant (irrelevant) image is a constant t_p (t_n)

User study on TRECVID’05 development data

– Three users manually browsed images in 15 minutes (for 25 keywords)

– A linear fit should be found in the results

– The annotation results fit the model for all users; on average, t_p = 1.4 sec, t_n = 0.2 sec

$$T_k = \sum_{l=1}^{L_k} t'_{pl} + \sum_{l=1}^{L - L_k} t'_{nl}$$

$$E[T_{\mathrm{total}}] = \sum_{k}\Big(E\big[\textstyle\sum_l t'_{pl}\big] + E\big[\textstyle\sum_l t'_{nl}\big]\Big) = \sum_{k}\big(L_k\, t_p + (L - L_k)\, t_n\big)$$

where L is the total number of images in the collection.
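The two models make the tagging/browsing trade-off computable. A small worked example (mine, using the constants measured in the user studies above): per keyword, tagging costs roughly L_k·t_f while browsing costs L_k·t_p + (L − L_k)·t_n, so browsing wins once more than t_n / (t_f − t_p + t_n) ≈ 3.6% of the images are relevant. This is consistent with tagging suiting rare keywords and browsing suiting frequent ones.

```python
# Constants from the user studies above (seconds).
T_F, T_S = 6.8, 5.6   # tagging: per keyword, setup per image
T_P, T_N = 1.4, 0.2   # browsing: per relevant / irrelevant image

def tagging_cost(n_rel):
    """Type the keyword on each relevant image (per-image setup time is
    shared by all of that image's keywords and is ignored here)."""
    return n_rel * T_F

def browsing_cost(n_rel, n_images):
    """Judge every image in the collection against this one keyword."""
    return n_rel * T_P + (n_images - n_rel) * T_N

def cheaper_interface(n_rel, n_images):
    return "browse" if browsing_cost(n_rel, n_images) < tagging_cost(n_rel) else "tag"

# Keywords above ~3.6% frequency are cheaper to browse:
for freq in (0.01, 0.05, 0.20):
    print(f"{freq:.0%}: {cheaper_interface(int(freq * 10_000), 10_000)}")
```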

Page 36: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Learning-based Hybrid Annotation [with A. P. Natsev and M. Campbell]

Combine both tagging and browsing interfaces to optimize the annotation time for manually annotating the image/video collections

– Formally model the annotation time as functions of word frequency, time per word, and annotation interfaces

– Learn the visual patterns of the existing annotations on the fly

– Automatically suggest the right images, keyword, and annotation interface (tagging vs. browsing) to the users to minimize overall annotation time

– Combine the advantages of both tagging and browsing

Page 37: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


An Illustrative Example for Hybrid Annotation

Users start annotation process from the tagging interface

– No limitation on the keywords

(Automatically) switch to the browsing interface to annotate a set of selected images

– Predicted as relevant to a given keyword with high confidence

– Spend much less time to annotate images without re-typing the same keyword

Switch to the tagging interface when necessary

[Screenshots: the tagging and browsing interfaces]
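A sketch of how such a driver loop might look (entirely hypothetical: the interface callbacks, the prototype-similarity scorer, and the CONF / MIN_BATCH thresholds are mine, not the system's): the user tags freely, and whenever enough unseen images score above a confidence threshold for some keyword, the tool batches them into a browsing pass so the keyword need not be re-typed.

```python
import numpy as np

CONF, MIN_BATCH = 0.8, 20   # hypothetical switching thresholds

def hybrid_annotate(ids, feats, tag_fn, browse_fn):
    """ids: image ids in visiting order; feats: id -> unit-norm feature
    vector; tag_fn(img) -> keywords the user types for img;
    browse_fn(word, batch) -> the subset of batch the user marks relevant."""
    positives = {}              # keyword -> ids confirmed relevant
    judged = set()              # (keyword, id) pairs already browsed
    for img in ids:
        for w in tag_fn(img):                         # tagging interface
            positives.setdefault(w, set()).add(img)
        for w, pos in positives.items():
            proto = np.mean([feats[i] for i in pos], axis=0)
            proto /= np.linalg.norm(proto) + 1e-9     # visual prototype of w
            batch = [i for i in ids
                     if i not in pos and (w, i) not in judged
                     and feats[i] @ proto > CONF]     # high-confidence candidates
            if len(batch) >= MIN_BATCH:               # cheaper to browse than re-type
                positives[w].update(browse_fn(w, batch))   # browsing interface
                judged.update((w, i) for i in batch)
    return positives
```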

Page 38: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Simulation Results

Results on two large-scale collections: TRECVID and Corel

– More accurate than automatic annotation (hybrid annotations are human-verified, i.e., 100% accurate)

– More efficient than tagging / browsing annotation (2-fold speedup)

– More effective than tagging / browsing in a given amount of time

[Charts: annotation results vs. time on the TRECVID and Corel collections]

Page 39: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Empirical Results

A user spent 1 hour annotating 10 TRECVID videos using each of tagging, browsing, and hybrid annotation

– The proposed time models correctly estimate the true annotation time

– Hybrid annotation provides much better annotation results

Estimated Annotation Time
– Method / Estimated / True
– Tag / 3649 s / 3600 s
– Browse / 3603 s / 3608 s
– Hybrid / 3478 s / 3601 s

[Chart: empirical annotation performance]

Page 40: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Conclusions: Efficient Approaches for Learning Large-Scale Video Concepts

Automatic: Model-shared Subspace Boosting

– Automatically exploit information redundancy across concepts

– Orders-of-magnitude speedups in both the training and testing processes

Semi-automatic: Cross-domain Concept Adaptation

– Function-level adaptation with high efficiency and flexibility

– Fast cross-domain model updates with a limited amount of training data

Semi-manual: Learning-based Hybrid Annotation

– Optimize overall annotation time using formal annotation time models

– Significantly faster than simple tagging or browsing, while keeping annotations accurate

Page 41: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Thank You!

Page 42: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Backup

Page 43: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Properties of MSSBoost

Adaptive Newton methods to compute the combination weights αt by minimizing the joint logistic loss function [Proposition 1]

The learning process is guaranteed to converge after a limited number of steps under some general conditions [Theorem 3]

Computational complexity can be considerably reduced by using small sampling ratios, and sharing base models across labels

– 100 base models w. data/feature sampling ratio 20% for 40 concepts

– Achieve a 50-fold speedup for training, a 10-fold speedup for testing

$$\alpha_{tl} = \big(H^{\top} W_l H\big)^{-1} H^{\top} Z_l$$

where $H = (h_t(x_1), \ldots, h_t(x_N))^{\top}$, $p_{il} = \dfrac{e^{F_l^t(x_i)}}{1 + e^{F_l^t(x_i)}}$, $z_{il} = y_{il} - p_{il}$, $W_l = \mathrm{diag}\big(p_{1l}(1 - p_{1l}), \ldots, p_{Nl}(1 - p_{Nl})\big)$, and $Z_l = (z_{1l}, \ldots, z_{Nl})^{\top}$.
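For a single label and a scalar weight, the step above reduces to a one-line ratio. A minimal sketch (assuming labels y in {0, 1}, so that z = y − p is the residual against the probability estimate):

```python
import numpy as np

def newton_weight(h, F_l, y_l):
    """One Newton step for the combination weight of base model h on label l.
    h: base-model scores h_t(x_i); F_l: current composite scores F_l^t(x_i);
    y_l: labels in {0, 1}. Implements (H^T W H)^{-1} H^T Z for scalar alpha."""
    p = 1.0 / (1.0 + np.exp(-F_l))     # p_il, current probability estimates
    w = p * (1.0 - p)                  # Newton weights, the diagonal of W_l
    z = y_l - p                        # residuals z_il
    return float(h @ z) / (float(h @ (w * h)) + 1e-12)
```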

Page 44: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Problem Formulation

Adapt classifiers trained on auxiliary datasets to a target dataset

– Assumption 1: target data follows a different but related distribution

– Assumption 2: limited target examples are additionally collected

[Diagram: a classifier trained on the auxiliary data is biased when applied to the target data; a new classifier trained only on the limited target examples has high variance; the adapted classifier balances this bias-variance tradeoff]

Page 45: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Example: Synthetic Data Examples

2-D data examples

1000 data points with 3 labels and the optimal decision boundary

Page 46: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Example: Results of Random Subspace Bagging

Random Subspace Bagging

– Base model: decision stump (1-level tree)

– 8 base models

RS-Bag cannot model the decision boundary well with such a small number of base models

Page 47: From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts


Example: Results of MSSBoost

Model-Shared Subspace Boosting

– 8 base models

MSSBoost can model the decision boundary much better than its non-shared counterpart with the same number of base models

In other words, it can improve classification efficiency without hurting performance