

Learning and Understanding from Multimodal Signals

December 23, 2005

Ching-Yung Lin

Exploratory Stream Processing Systems

IBM T. J. Watson Research Center


Outline – Learning and Understanding from Multimodal Signals

[Figure: protocol stack of multimodal streams (ip, http, ntp, udp, tcp, ftp, rtp, rtsp, video, audio)]

• Motivation

• Understanding Multimodality Sensing Signals

• Learning from Multimodality Information

• Mining Large-Scale Multimodality Streams

• Conclusions


A picture is worth 1000 words!? Which 1000? Speech recognition has been active for 30 years… How about video? Can a machine describe what it sees?


A picture’s worth one thousand words…

A lovely couple hand-in-hand walking together!!


A picture is worth seven words

By Prof. Elliott, Dept. of Communications, Cal State Univ. Fullerton

Sometimes the picture and the words match perfectly.


A picture’s worth one thousand words…

Therefore, sometimes it can be used for implication!!although sometimes it lies…


Motivation

• Understanding Multimodality Sensing Signals
  – Recognize generic visual, audio, text and behavior information
• Learning from Multimodality Information
  – Utilize recognition results
    • Cross-Modality Learning
    • Integrated Learning
• Mining Large-Scale Multimodality Streams
  – Monitor information streams


The Angle of My Views – Research Path

• In NTU (under Prof. Pei) ('91-'93):
  – Image/Video Pattern Recognition and Compression
• In Columbia U. (under Prof. Anastassiou and Prof. Chang) ('96-'00):
  – Multimedia Security
• In IBM Research (with Dr. Smith, Dr. Tseng, Dr. Naphade, Dr. Natsev, Dr. Iyengar, Dr. Nock, Dr. Chalapathy, etc.) ('00-'04):
  – Semantic Recognition of Video Content
  – NIST TREC Video Retrieval Benchmarking
• In IBM Research, U. of Washington and Columbia U. ('04-'05):
  – Multimodality Signal Learning, Classification and Filtering
    • Imperfect Learning, Autonomous Learning
    • Large-Scale Stream Information Filtering
    • Modeling Human Behavior and Social Dynamics
    • Developing Smart Mobile/Wearable Multimedia Sensors


My Social Network – Current Collaborators

[Figure: ego-centric network of current collaborators spanning IBM, Columbia Univ., and Univ. of Washington: Victor Sutan, Jason Cardillo, Dr. Lisa Amini, Dr. Oliver Verscheure, Dr. Anshul Sehgal, Dr. Upendra Chaudhari, Dr. Xiaohui Gu, Navneet Panda (UCSB), Xiaodan Song, Ya-Ti Peng, Prof. Ming-Ting Sun, Dr. Belle Tseng]


Video Semantic Concept Detection & Mining

• Multimedia Concept Detection:
  – Objects:
    • Visual objects: Tree, Person, Hands, …
    • Audio objects: Music, Speech, Sound, …
  – Scenes:
    • Background: Building, Outdoors, Sky
  – Relationships:
    • The (temporal, spatial) relationships between objects & scenes
  – Activities:
    • Holding Hand in Hand, Looking for Stars

A picture is worth 1000 words – which one thousand?


Video Semantic Concept Detection

• Design ontology: Video => Text
• Validation metrics: concept model accuracy, concept model efficiency, concept coverage


Supervised Learning for Building Generic Concept Detectors

[Figure: pipeline from a training video corpus through shot segmentation, region segmentation, semi-manual annotation against a lexicon, feature extraction, and classifier/detector building]

• Regions:
  – Object (motion, camera registration)
  – Background (5 large regions per shot)
• Features:
  – Color: color histograms (72 dim, 512 dim), auto-correlograms (72 dim)
  – Structure & shape: edge orientation histogram (32 dim), Dudani moment invariants (6 dim), aspect ratio of bounding box (1 dim)
  – Texture: co-occurrence texture (48 dim), coarseness (1 dim), contrast (1 dim), directionality (1 dim), wavelet (12 dim)
  – Motion: motion vector histogram (6 dim)
• Supervised learning, classification, and fusion:
  – Support Vector Machines (SVM)
  – Ensemble fusion
  – Other fusions (hierarchical, SVM, Multinet, etc.)
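
This recipe maps naturally onto off-the-shelf tools. Below is a minimal sketch, assuming scikit-learn and random placeholder data in place of the real per-shot descriptors; it trains one SVM per feature type and averages their confidences, in the spirit of the ensemble fusion named above.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy stand-ins for two per-shot descriptors (e.g., a 72-dim color
    # histogram and a 48-dim co-occurrence texture), with binary labels.
    n_shots = 200
    color = rng.random((n_shots, 72))
    texture = rng.random((n_shots, 48))
    labels = rng.integers(0, 2, n_shots)

    # One SVM per feature type ("uni-models"); probability outputs for fusion.
    svm_color = SVC(kernel="rbf", probability=True).fit(color, labels)
    svm_texture = SVC(kernel="rbf", probability=True).fit(texture, labels)

    # Ensemble fusion: average the per-model confidences (simple late fusion).
    fused = (svm_color.predict_proba(color)[:, 1]
             + svm_texture.predict_proba(texture)[:, 1]) / 2
    ranked = np.argsort(-fused)  # shots ranked by fused concept confidence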


Supervised Learning on Automatic Speech Recognition Results of Video

• Generate documents by collecting the words occurring symmetrically around the center of a shot (±2 surrounding shots)
• Map the documents to the annotations of the shots for a particular concept
• Remove stop-words and stem
• Calculate information gain (IG) [Yang and Pedersen, ICML 1997] for each term to select the keywords
• Order the keywords by decreasing information gain
• Observations:
  – Some of the words in these two keyword lists are the same
  – Some keywords generated by supervised learning represent a context relationship instead of a lexical or semantic relationship:
    • "Africa" has a high information gain for Animal
    • "Airplane" and "Iraq"

Example of Topic-Related Keywords (selected by information gain):
  Weather news: rain, temperature, weather, forecast, shower, storm, thunderstorm, snow, lake, southeast, warm, meteorologist
  Building: house, president, school, damage, tornado, court, city, destroy, town, police, residence, building
  Animal: animal, Africa, park, safari, nation, land, wild, gorilla, wildlife, elephant, extinct, breed
  Airplane: plane, fly, ground, military, land, weapon, pilot, generator, airplane, hospital, war, Iraq
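
A small sketch of the information-gain computation from Yang and Pedersen that the slide cites, for a binary concept label over ASR "documents"; the data here is a toy stand-in.

    import math

    def entropy(pos, total):
        # Binary entropy of the label distribution among `total` documents.
        if total == 0 or pos in (0, total):
            return 0.0
        p = pos / total
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def information_gain(docs, labels, term):
        """docs: list of token sets; labels: 1 if the shot has the concept."""
        n = len(docs)
        with_t = [l for d, l in zip(docs, labels) if term in d]
        without_t = [l for d, l in zip(docs, labels) if term not in d]
        h_c = entropy(sum(labels), n)
        h_cond = (len(with_t) / n) * entropy(sum(with_t), len(with_t)) \
               + (len(without_t) / n) * entropy(sum(without_t), len(without_t))
        return h_c - h_cond

    # Toy example: ASR words around shots annotated "Weather news".
    docs = [{"rain", "forecast"}, {"court", "city"}, {"storm", "rain"}, {"school"}]
    labels = [1, 0, 1, 0]
    vocab = set().union(*docs)
    keywords = sorted(vocab, key=lambda t: -information_gain(docs, labels, t))
    print(keywords[:5])  # terms in decreasing order of information gain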


IBM Research Video Concept Detection Systems (by the IBM TRECVID 2003 team)

[Figure: system diagram. Annotation and data preparation: 128 hours of MPEG-1 video with collaborative annotation. Feature extraction (CH, WT, EH, MI, CT, MV, TAM, CLG, AUD, CC) with region extraction and filtering (VW, MLP, SD/A). Uni-models, multi-modality fusion, and multi-model fusion stages (MLP, DMF17, DMF64, MN, ONT, EF, EF2, V1, V2; BOU, BOF, BOBO) feed post-processing and yield the best uni-model, best multi-modal, and best-of-the-best ensemble runs]


Performance Metric – Precision-Recall Curve and Average Precision

• Example: find shots from behind the pitcher in a baseball game while the batter is visible
• Average Precision:

  $AP = \frac{1}{N_{GT}} \sum_i P@N_i$

  where the $N_i$ are the ranks of all relevant retrieved shots and $N_{GT}$ is the number of ground-truth relevant shots.

[Figure: Venn diagram of the retrieved set vs. the real set; precision-recall curve with the average precision]
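
The AP formula above translates directly into a few lines; a sketch assuming a binary relevance list in retrieval order:

    def average_precision(ranked_relevance, n_ground_truth):
        """ranked_relevance: list of 0/1 relevance flags in retrieval order."""
        hits, ap = 0, 0.0
        for rank, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                ap += hits / rank  # precision at the rank of this hit
        return ap / n_ground_truth

    # Example: 3 of 4 relevant shots retrieved, at ranks 1, 3, and 6.
    print(average_precision([1, 0, 1, 0, 0, 1], n_ground_truth=4))  # ~0.542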


NIST TREC Video Benchmark

• Tasks: Shot Boundary Detection (2001-2005), Semantic Video Retrieval Query (2001-2005), Semantic Concept Detection (2002-2005), Story Boundary (2003-2004), Camera Motion (2005), Exploratory BBC (2005)
• Corpus:
  – 2001: 14 hours from NASA and BBC
  – 2002: 74 hours from Internet Movie Archive
  – 2003, 2004: 192 hours from CNN, ABC, etc.
  – 2005: 170 hours from LBC (Arabic), CCTV, NTDTV (Chinese), CNN, NBC
• Semantic concept detector examples: "Text-overlay", "Outdoors", "People"
• Video retrieval topic examples:
  – Topic 2: scenes that depict a lunar vehicle traveling on the moon
  – Topic 13: speaker talking in front of the US flag
  – Topic 48: examples of overhead zooming-in views of canyons in the Western US


Reference – Comparison of Precision at Top 100 and Average Precisions

[Figure: bar charts of precision at top 100 and official NIST average precision values, Best IBM vs. Best Non-IBM, for the concepts Outdoors, NS_Face, People, Building, Road, Vegetation, Animal, Female_Speech, Car/Truck/Bus, Aircraft, NS_Monologue, Non-Studio_Setting, Sports_Event, Weather, Zoom_In, Physical_Violence, Madeleine_Albright, and their mean]


Evaluation of the TREC 2005 vs. TREC 2004 videos

[Figure: per-concept average precision of the CDS2004 and CDS2005 systems for Person, Face, Studio, Outdoor, Weather, Entertainment, Sky, Animal, Computer_TV-Screen, Maps, Crowd, Sports, Waterscape_Waterfront, Road, Car, Military, Government-Leader, Snow, Vegetation, Urban, People-Marching, Building, Mountain, Walking_Running, Natural-Disaster, Meeting, Text, Explosion_Fire, Boat_Ship, Charts, Court, Office, Desert, Police_Security, Truck, Airplane, Flag-US, Prisoner, Bus, Corporate-Leader, and the mean average precision]


Demo – Novel Semantic Concept Filters

• http://www.research.ibm.com/VideoDIG
• E.g.: [screenshots of the demo interface]


Some Key Lessons Learnt

• Visual information and speech information are roughly equally important.
• The more detectors, the better.
• Scalability is a big issue.


A solution to the scalability issues in training…

• Autonomous Learning of Video Concepts through Imperfect Training Labels:
  – Develop theories and algorithms for supervised concept learning from imperfect annotations: imperfect learning
  – Develop methodologies to obtain imperfect annotations: learning from cross-modality information or web links
  – Develop algorithms and systems to generate concept models: a novel generalized Multiple-Instance Learning algorithm with Uncertain Labeling Density

[Figure: autonomous concept learning = cross-modality training + imperfect learning]


There are scalability problems in the supervised video semantic annotation framework

• For training:
  – Tremendous human effort required: extensive human labeling is needed to obtain ground truth for training. E.g., 111 researchers from 23 groups annotated 460K semantic labels on 62 hours of video in 2003.
  – The concept ontology is pre-defined: we cannot train more basic concepts if they are not annotated. For instance, there were 133 concepts in the 2003 ontology.

[Figure: framework diagram. Video repository with MPEG-7 annotation; visual segmentation; visual, audio, and natural-language features; visual, audio, and speech models; concept models built by multimodal fusion (features => models => fusion)]


Also – To Err is Human

• 10 development videos were annotated by two annotators in the Video Concept Annotation Forum 2003.

[Figure: scatter plot of agreement between Annotation 1 and Annotation 2 (both axes 0 to 1) for the videos 0319_ABC, 0319_CNN, 0328_CNN, 0330_ABC, 0330_CNN, 0329_CSPAN, 0408_CSPAN, 0613_CSPAN, 1214_CSPAN, 0614_CSPAN]


Can concept models be learned from imperfect labeling?

Example: the effect of imperfect labeling on classifiers (left to right: perfect labeling, imperfect labeling, resulting misclassification area)


Imperfect Learning: theoretical feasibility

• Imperfect learning can be modeled as the problem of noisy training samples in supervised learning.
• Learnability of concept classifiers can be determined by the probably approximately correct (PAC) learnability theorem.
• Given a set of "fixed type" classifiers, PAC learnability identifies a minimum bound on the number of training samples required for a fixed performance requirement.
• If there is noise in the training samples, the above minimum bound can be modified to reflect this situation.
• The ratio of required samples is independent of the required classifier performance.
• Observation: practical simulations using SVM training and detection also verify this theorem.


PAC-identifiable

• PAC stands for probably approximately correct. Roughly, a class of concepts C (defined over an input space with examples of size N) is PAC-learnable by a learning algorithm L if, for arbitrarily small δ and ε, for all concepts c in C, and for all distributions D over the input space, there is a 1-δ probability that the hypothesis h selected from space H by learning algorithm L is approximately correct (has error less than ε):

  $\Pr_D\left(\Pr_X\left(h(x) \neq c(x)\right) \geq \varepsilon\right) \leq \delta$

• Based on PAC learnability, assume we have m independent examples. Then, for a given hypothesis with error at least ε, the probability that it classifies all m examples correctly is at most $(1-\varepsilon)^m$, which we want to be less than δ. Since $(1-x) \leq e^{-x}$ for any $0 \leq x < 1$, we then have:

  $m \geq \frac{1}{\varepsilon}\ln\frac{1}{\delta}$


Sample Size vs. VC Dimension

• Theorem 2: Let C be a nontrivial, well-behaved concept class with VC dimension d, where d < ∞. Then for 0 < ε < 1, any consistent function A: S_C → C is a learning function for C provided

  $m \geq \max\left(\frac{4}{\varepsilon}\log_2\frac{2}{\delta},\; \frac{8d}{\varepsilon}\log_2\frac{13}{\varepsilon}\right)$

and, for 0 < ε < 1/2, m has to be larger than or equal to the lower bound

  $m \geq \max\left(\frac{1-\varepsilon}{\varepsilon}\ln\frac{1}{\delta},\; d\left(1 - 2(\varepsilon(1-\delta) + \delta)\right)\right)$

For any m smaller than the lower bound, no function A: S_C → H, for any hypothesis space H, is a learning function for C. The sample space of C, denoted S_C, is the set of all m-samples over all c in C.

• Example: the VC dimension of an SVM satisfies

  $d \leq \min(R^2\Lambda^2 + 1,\; n + 1)$
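
To make the bounds concrete, here is a small sketch that simply evaluates the two expressions above for given ε, δ, and VC dimension d; the numbers are illustrative only.

    import math

    def pac_sufficient(eps, delta, d):
        # Upper bound: samples sufficient for any consistent learner.
        return max((4 / eps) * math.log2(2 / delta),
                   (8 * d / eps) * math.log2(13 / eps))

    def pac_necessary(eps, delta, d):
        # Lower bound: samples any learning function needs.
        return max(((1 - eps) / eps) * math.log(1 / delta),
                   d * (1 - 2 * (eps * (1 - delta) + delta)))

    print(pac_sufficient(0.1, 0.05, d=10))  # sufficient sample count
    print(pac_necessary(0.1, 0.05, d=10))   # necessary sample count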


Noisy Samples

• Theorem 4: Let η < 1/2 be the rate of classification noise and N the number of rules in the class C. Assume 0 < ε, η < 1/2. Then the number of examples, m, required is at least

  $\max\left(\frac{\ln(2N/\delta)}{\ln\frac{1}{1-\varepsilon(1-2\eta)}},\; \log_2 N\left(1 - 2(\varepsilon(1-\delta)+\delta)\right)\right)$

and at most

  $\frac{\ln(N/\delta)}{\varepsilon\left(1 - \exp\left(-\frac{1}{2}(1-2\eta)^2\right)\right)}$

• $r_\eta = \left(1 - \exp\left(-\frac{1}{2}(1-2\eta)^2\right)\right)^{-1}$ is the ratio of the required noisy training samples vs. the noise-free training samples.


Training samples required when learning from noisy examples

• Ratio of the training samples required to achieve PAC learnability under noisy vs. noise-free sampling environments. This ratio is consistent across different error bounds and VC dimensions of PAC-learnable hypotheses.
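
A short sketch evaluating the ratio $r_\eta$ as reconstructed above; it only plugs noise rates into that formula, showing that the ratio grows with η and depends on nothing else.

    import math

    def noisy_sample_ratio(eta):
        # r_eta from the slide: noisy vs. noise-free required sample count.
        return 1.0 / (1.0 - math.exp(-0.5 * (1 - 2 * eta) ** 2))

    for eta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.45):
        print(f"eta={eta:.2f}  r={noisy_sample_ratio(eta):.2f}")
    # The ratio depends only on eta, not on epsilon, delta, or the VC
    # dimension, matching the slide's consistency claim.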


Experiments – example:

• We simulated annotation noise by randomly changing positive examples in the manual annotations to negatives.
• Because perfect annotation is not available, accuracy is shown as a ratio relative to the manual annotations in [10].
• In this figure, we see that model accuracy is not significantly affected by small amounts of noise.
• A similar drop is observed across the training examples at around 60%-70% annotation accuracy (i.e., 30%-40% missing annotations).

[Figure: relative model accuracy as a function of annotation accuracy]
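
The simulation described above is easy to reproduce in spirit; the sketch below, on synthetic data with scikit-learn (an assumption, not the paper's setup), flips a fraction of positive labels to negative and reports test accuracy at each noise level.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    for miss in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        y_noisy = y_tr.copy()
        pos = np.flatnonzero(y_noisy == 1)
        flip = rng.choice(pos, size=int(miss * len(pos)), replace=False)
        y_noisy[flip] = 0  # simulate missing positive annotations
        acc = SVC().fit(X_tr, y_noisy).score(X_te, y_te)
        print(f"missing {miss:.0%} of positives -> accuracy {acc:.3f}")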


Challenges to Realize Autonomous Learning

• When there is no human supervision:
  – Associate the images/videos with semantic labels
    – Our solution: imperfect labeling by unsupervised text-based search
  – Find the "right" object and reduce the influence of mislabeling
    – Our solution: uncertainty pruning by Generalized Multiple Instance Learning (GMIL) + Uncertainty Labeling Density


Autonomous Learning Scheme

[Figure: pipeline. Input: a named concept. Cross-modality analysis over dictionaries, the Internet, and a database yields imperfect labeling as bags of candidate images/regions; uncertainty pruning (GMIL + ULD) and relevance modeling (SVR) produce the output: semantic visual models and instances with relevance scores]

• Uncertainty pruning: Generalized Multiple Instance Learning (GMIL) + Uncertainty Labeling Density (ULD)
• Relevance modeling: Support Vector Regression (SVR)
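
The GMIL + ULD algorithm itself is not spelled out on the slide; the sketch below shows only the generic multiple-instance intuition behind uncertainty pruning, using a diverse-density-style noisy-or score over positively labeled bags. All names and data are illustrative, not the paper's method.

    import numpy as np

    def instance_score(bags, point, sigma=1.0):
        """Noisy-or score of a candidate concept `point` against positive bags."""
        score = 1.0
        for bag in bags:  # bag: array of instance feature vectors
            d2 = ((bag - point) ** 2).sum(axis=1)
            p_inst = np.exp(-d2 / sigma ** 2)   # instance matches the concept
            score *= 1 - np.prod(1 - p_inst)    # bag matches if any instance does
        return score

    rng = np.random.default_rng(0)
    bags = [rng.random((5, 3)) for _ in range(10)]  # 10 bags, 5 regions each
    candidates = rng.random((50, 3))
    scores = np.array([instance_score(bags, c) for c in candidates])
    kept = candidates[scores > np.percentile(scores, 80)]  # prune uncertain ones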


Imperfect Labeling

Goal: leverage cross-modality correlation to associate the image/video with labels.

"First – let's look at the national weather forecast… Unseasonably warm weather expected today…"


Other Issues: Speech and Text-based Topic Detection

• Unsupervised learning from WordNet. Example for the query concept "Weather News":
  weather, weather condition, atmospheric condition => cold weather => fair weather, sunshine => hot weather
  The expanded terms (weather condition, atmospheric condition, cold weather, fair weather, weather forecast, warm weather) drive speech-based retrieval.

Example of Topic-Related Keywords (WordNet expansion):
  Weather news: weather, atmospheric, cold weather, freeze, frost, sunshine, temper, scorcher, sultry, thaw, warm, downfall, hail, rain
  Building: building, edifice, walk-up, butchery, apart build, tenement, architecture, call center, sanctuary, bathhouse
  Animal: animal, beast, brute, critter, darter, peeper, creature, fauna, microorganism, poikilotherm
  Airplane: airplane, aeroplane, plane, airline, airbus, twin-aisle airplane, amphibian, biplane, passenger, fighter, aircraft
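
A minimal sketch of this kind of WordNet expansion, assuming NLTK's WordNet interface (run nltk.download("wordnet") beforehand); it gathers synonyms and hyponym lemmas for a query noun.

    from nltk.corpus import wordnet as wn

    def expand(query):
        terms = set()
        for synset in wn.synsets(query, pos=wn.NOUN):
            # Synonyms of the query term itself.
            terms.update(l.name().replace("_", " ") for l in synset.lemmas())
            # More specific terms, e.g. "cold weather", "fair weather".
            for hypo in synset.hyponyms():
                terms.update(l.name().replace("_", " ") for l in hypo.lemmas())
        return terms

    print(sorted(expand("weather")))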


Imperfect Labeling (cont.)

• For video sequences:
  – Query expansion using WordNet
  – Rank video shots based on the Okapi algorithm
  – Ranking => imperfect labels
• For web images:
  – Extract labels from an existing search engine (i.e., Google Image Search)

[Figure: mean average precision per concept (Airplane, Animal, Building, Face, Madeleine Albright, Nature Vegetation, Outdoors, People, Physical Violence, Road, Sport Event, Weather News) for HJN vs. WordNet labeling on the TREC-2003 video benchmark. Our imperfect labeling is comparable with supervised learning (HJN).]
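
A compact sketch of Okapi BM25 shot ranking as named above, with the common k1/b defaults; the shot "documents" here are toy ASR token lists.

    import math
    from collections import Counter

    def bm25_rank(shots, query_terms, k1=1.2, b=0.75):
        """shots: {shot_id: list of ASR tokens}; returns ids by BM25 score."""
        N = len(shots)
        avgdl = sum(len(toks) for toks in shots.values()) / N
        df = Counter(t for toks in shots.values() for t in set(toks))
        scores = {}
        for sid, toks in shots.items():
            tf = Counter(toks)
            s = 0.0
            for t in query_terms:
                if t not in tf:
                    continue
                idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
                s += idf * tf[t] * (k1 + 1) / (
                    tf[t] + k1 * (1 - b + b * len(toks) / avgdl))
            scores[sid] = s
        return sorted(scores, key=scores.get, reverse=True)

    shots = {"shot1": ["storm", "rain", "warm"], "shot2": ["court", "city"]}
    print(bm25_rank(shots, ["rain", "forecast"]))  # ranking => imperfect labels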


Experimental Results

[Figure: mean average precision per concept (Airplane, Animal, Building, Face, Madeleine Albright, Nature Vegetation, Outdoors, People, Physical Violence, Road, Sport Event, Weather News) for supervised (SVM) vs. autonomous learning. Our autonomous learning is comparable with supervised learning (SVM).]

• Training data: "ConceptTraining" in the NIST Video TREC 2003 corpus (78 video sequences with 28,055 keyframes)
• Testing data: "ConceptFusion1" (13 news videos with 5,037 keyframes)


Experimental results

Dataset: Google Image Search results. Average precision:

                        Madeleine Albright | Hillary Clinton | Newt Gingrich | Bill Clinton
  GMIL-ULD              0.8899             | 0.6107          | 0.5339        | 0.7546
  Google Image Search   0.8683             | 0.5467          | 0.4100        | 0.6250


What is the "large-scale" we are considering?

• 10 Gbit/s continuous feed coming into the system
• Types of data: speech, text, moving images, still images, coded application data, machine-to-machine binary communication
• System mechanisms:
  – Telephony: 9.6 Gbit/s (including VoIP)
  – Internet:
    • Email: 250 Mbit/s (about 500 pieces per second)
    • Dynamic web pages: 50 Mbit/s
    • Instant messaging: 200 Kbit/s
    • Static web pages: 100 Kbit/s
    • Transactional data: TBD
  – TV: 40 Mb/s (equivalent to about 10 stations)
  – Radio: 2 Mb/s (equivalent to about 20 stations)


Semantic MM Routing and Filtering

[Figure: dataflow graph. Inputs arrive at 200-500 MB/s and are demultiplexed by protocol (ip, tcp, udp, http, ftp, ntp, rtp, rtsp, video/audio sessions); interest routing on keywords/ids feeds packet content analysis (~100 MB/s per PE), advanced content analysis (10 MB/s), and interest filtering, which emits the interested MM streams]
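
A toy sketch of the interest-routing stage: cheap keyword matching on demultiplexed text payloads before any expensive content analysis. The data structures and names are illustrative, not the deployed system's.

    def interest_router(packets, interests):
        """packets: iterable of (stream_id, text); interests: {user: keyword set}."""
        for stream_id, text in packets:
            words = set(text.lower().split())
            for user, keywords in interests.items():
                if words & keywords:  # cheap filter before deep analysis
                    yield user, stream_id, text

    packets = [("rtp:42", "storm warning issued"), ("http:7", "stock report")]
    interests = {"analyst": {"storm", "earthquake"}, "trader": {"stock"}}
    for user, sid, text in interest_router(packets, interests):
        print(user, "<-", sid, ":", text)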


Activities Monitoring through Public Radio Signals

• Objective:
  – Early understanding and monitoring of what is happening
• Technical challenges:
  – Monitoring many noisy channels simultaneously
  – Understanding content
  – Understanding speakers
  – Understanding the relationships among speakers
  – How information flows and how to organize it


Methods for Speaker Identification

[Figure: graphical model with observations u, sender and receiver nodes r, and feature nodes f]


Novel RadioDIG Algorithm – Building Dynamic Social Networks / Topics / Speaker Identification Simultaneously

[Figure: graphical model with observations u; sender S and receiver r nodes; feature nodes f; feature-sender and feature-receiver distributions ς with prior δ over plate A]


Novel RadioDIG Algorithm – Building Dynamic Social Networks / Topics / Speaker Identification Simultaneously

[Figure: full graphical model. Observations: words w and features u, f. Latent: topic z, sender S, receiver r. Sender-topic distributions θ with prior α (plates N, S); topic-word distributions φ with prior β (T topics); feature-sender distributions ς with prior δ (plate A)]

• Dynamic Bayesian networks
• Inference based on MCMC / Gibbs sampling
• Based on Dirichlet allocation
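
RadioDIG's joint sender/receiver/feature model is only sketched graphically here, but the inference machinery named by the bullets is standard. Below is a minimal collapsed Gibbs sampler for plain Latent Dirichlet Allocation, as an illustration of that machinery, not of the RadioDIG model itself.

    import numpy as np

    def lda_gibbs(docs, V, T, iters=200, alpha=0.1, beta=0.01):
        """docs: lists of word ids in [0, V); returns topic-word counts."""
        rng = np.random.default_rng(0)
        z = [rng.integers(0, T, len(d)) for d in docs]  # topic of each token
        ndt = np.zeros((len(docs), T))  # doc-topic counts
        ntw = np.zeros((T, V))          # topic-word counts
        nt = np.zeros(T)                # topic totals
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                ndt[d, z[d][i]] += 1; ntw[z[d][i], w] += 1; nt[z[d][i]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]  # remove token's current assignment
                    ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                    # Collapsed conditional: p(z=t) ~ (ndt+a)(ntw+b)/(nt+Vb)
                    p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                    t = rng.choice(T, p=p / p.sum())
                    z[d][i] = t
                    ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
        return ntw + beta  # unnormalized topic-word distributions

    docs = [[0, 1, 2, 1], [3, 4, 3], [0, 2, 4]]  # toy word-id documents
    print(lda_gibbs(docs, V=5, T=2))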


Human – a complex multimodality subject/object

• "Human and Social Dynamics (HSD)" is identified as one of the five NSF key priority areas:
  – Nanoscale Science and Engineering
  – Biocomplexity in the Environment
  – Human and Social Dynamics
  – Mathematical Sciences
  – Cyberinfrastructure
  (http://www.nsf.gov/news/priority_areas/)


Great and Extensive Potential Impacts on Social Computing

• A deeper understanding of people's routines and interactions will have significant impacts on…
  – Designing public spaces and office environments; developing computer collaboration, recommendation, and assistance tools
  – Providing personalized and dynamic life-pattern logs and services/guidance
  – Better anticipating human and social changes (e.g., causes & responses)
  – Improving human interactions and collaborations
  – Predicting information flow and spreading information efficiently
  – Helping risk assessment and decision-making


Low-cost Multimodality Sensors for Sleep Situation Inference / Logging

• Understand human night-time activity: sleep
• What we have done:
  – Using visual, audio, heartbeat, and infrared sensors to monitor a person's sleep patterns
  – Measurement of sleep quality
  – Logging and retrieval of sleep situations
• What we are going to do:
  – Early detection and long-term monitoring of sleep-related diseases


Modeling Dynamic Human Behavior and Properties

– Establish personal CommunityNet
– Establish personal ExpertiseNet
– Establish personal InterestNet

[Figure: graphical Content-Time-Relation model with observations u, w; latent topic z; sender S, receiver r, time t; priors α, β, γ over distributions θ, φ, ϕ (plates N, D, T)]


Smart Semantic Video Cameras

• Objectives: build smart video sensors that execute real-time visual reasoning, for autonomous machine cognition.

[Figure: distributed smart sensors (Sensors 1..N) feeding server-side concept-detection processing elements (PE1: 9.2.63.66:1220, PE2: 9.2.63.67, PE3: 9.2.63.68, PE4: 9.2.63.66:1235, PE5: 9.2.63.66:1240, …, PE100: 9.2.63.66) that host detectors such as Face, Outdoors, Indoors, Female, Male, Airplane, Chair, Clock. Block diagram of a smart sensor: MPEG-1/2 encoding, GOP extraction, feature extraction, event extraction, and metadata output, with data rates dropping from 1.5 Mbps through 320 Kbps, 22.4 Kbps, and 2.8 Kbps down to 600 bps; control modules are driven by resource constraints and user interests, and feed display and information aggregation modules]


Will this become possible??

Smart Semantic Video Cameras


Questions?

• Thanks for attending!!
• http://www.research.ibm.com/people/c/cylin
• [email protected]