
Distributed Video Data Fusion, Analysis, and Mining for Video Surveillance Applications*

Edward Chang (2) and Yuan-Fang Wang (1)

(1) Department of Computer Science

(2) Department of Electrical and Computer Engineering

University of California, Santa Barbara, CA 93106

*Supported in part by NSF Career, ITR, IDM, and Infrastructure grants, and a gift from Proximex Corp.

Problem Statement

Video surveillance with
- Multiple cameras
- Mobile, wireless networks
- Online data processing
- Intelligent, computer-assisted content analysis

Focus of current work
- Event sensing for detection, representation, and recognition of motion events
- Sensor network management for bandwidth and power resource conservation

Potential Applications and Needs

Applications
- Emergency search and rescue in natural disasters
- Deterrence of cross-border illegal activities
- Reconnaissance and intelligence gathering in digital battlefields

Needs
- Rapid deployment, dynamic configuration, and continuous operations
- Robust and real-time data fusion and analysis
- Intelligent event modeling and recognition

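Validation Scenario

[Figure: cameras 1 to m, each with its own coordinate frame (x_i, y_i, z_i), observe a target in the world frame (X, Y, Z). World trajectory: \mathbf{P}(t) = (X(t), Y(t), Z(t))^T. Image trajectory in camera 1: \mathbf{p}_1(t) = (x_1(t), y_1(t))^T. Slave stations connect to the master station over the Internet.]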

Research and Development Framework

Event detection
- Far-field coordination and update
- Near-field sensor data fusion

Event representation
- Hierarchical: multiple levels of detail
- Invariant: insensitive to incidental changes

Event recognition
- Temporally correlated event signature
- Imbalanced training set

Event Detection: Near-field Sensor Data Fusion

Sensing coordination and intelligent data fusion

Two-level hierarchy of Kalman filters

Bottom level (feed forward)
- Summarize trajectories in local state vectors
- Merge state vectors from multiple cameras through registration parameters

Top level (feed backward)
- Fill in missing or occluded trajectory pieces
- Camera pose and frame rate control

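[Figure: two-level Kalman filter hierarchy. Slave stations 0, ..., i, ..., m-1 each take measurements z^{(i)}(t) and maintain a local state vector x^{(i)} formed from the image-plane position terms p^{(i)}. The master fusion station maintains the global state X formed from the world position terms P. Registration transforms T^{(i)} map between each camera's local state and the world state in both directions (real to world and world to real). The stations communicate over the Internet.]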
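The bottom-level merge step can be illustrated with a small sketch. It assumes, purely for illustration, that each camera's registration is a 3x3 planar homography to a common ground plane and that the per-camera estimates are independent with scalar variances, so they can be combined by inverse-variance weighting. The names `to_world` and `fuse` are hypothetical, and this is a simplified stand-in for the Kalman-filter fusion on the slide, not the authors' implementation.

```python
import numpy as np

def to_world(T, x_local):
    """Map a local 2D point to world coordinates with a 3x3 homogeneous
    registration matrix T (assumed planar homography)."""
    h = T @ np.append(x_local, 1.0)
    return h[:2] / h[2]

def fuse(estimates, variances):
    """Inverse-variance weighted fusion of independent world-frame estimates.

    Returns the fused position and the fused (scalar) variance."""
    est = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    fused = (w[:, None] * est).sum(axis=0) / w.sum()
    return fused, 1.0 / w.sum()

# Two cameras see the same target; camera 2 is offset by 10 units in X.
T1 = np.eye(3)
T2 = np.array([[1.0, 0.0, 10.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
world_1 = to_world(T1, (1.0, 2.0))
world_2 = to_world(T2, (-8.0, 2.0))
fused, fused_var = fuse([world_1, world_2], [1.0, 1.0])
```

With equal variances the fused position is the midpoint of the two world estimates, and the fused variance is half of either input, which is why adding cameras tightens the estimate.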

Event Detection: Far-field Coordination and Update

Minimizing bandwidth and power consumption under pre-specified accuracy constraints

Dual Kalman filters
- Update necessary only when predictions diverge
- Cache dynamic algorithms instead of static data
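A minimal sketch of the update-on-divergence idea follows. It replaces the Kalman filters with a shared constant-velocity predictor (an assumption made to keep the example short): the sensor and the server run the same model, and the sensor transmits a measurement only when the shared prediction misses it by more than a tolerance. The function name and the scalar measurements are hypothetical.

```python
def dual_prediction_updates(measurements, tolerance):
    """Simulate a sensor and a server that mirror the same predictor.

    Returns (sent, estimates): the time steps at which the sensor had to
    transmit, and the server-side estimate at every time step."""
    sent, estimates = [], []
    pos, vel = None, 0.0  # shared model state, identical on both sides
    for t, z in enumerate(measurements):
        if pos is None:
            pos, vel = z, 0.0          # first sample is always transmitted
            sent.append(t)
        else:
            predicted = pos + vel
            if abs(z - predicted) > tolerance:
                vel = z - pos          # re-estimate velocity from the new sample
                pos = z                # resynchronize both models
                sent.append(t)
            else:
                pos = predicted        # no traffic: both sides keep predicting
        estimates.append(pos)
    return sent, estimates

# Target moves at constant speed, then stops.
track = [0, 1, 2, 3, 3, 3, 3]
sent, estimates = dual_prediction_updates(track, tolerance=0.5)
```

In this run only three of the seven samples are transmitted (the start, the onset of motion, and the stop), while the server-side estimate never deviates from the measurement by more than the tolerance.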

Event Representation

Hierarchical
- Multiple levels of description
  - Syntactic level
  - Semantic level

Invariant
- Descriptors unaffected by incidental changes of environmental factors and camera pose

Consequences
- Able to perform both "intra-class" and "inter-class" recognition
- Recognize syntactic similarity (the same trajectory) and semantic similarity (the same type of trajectory)

Event Representation: Syntactic Level

Normalization against
- Viewpoint (affine or perspective)
- Speed

To derive an invariant signature
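One common way to normalize a trajectory against speed is to resample it at equal steps of arc length, so that the same path traversed quickly or slowly yields the same point sequence. The sketch below shows only that step; it does not address viewpoint normalization, and the function name and the choice of linear interpolation are assumptions rather than the method used in the talk.

```python
import numpy as np

def resample_by_arc_length(points, n):
    """Resample a 2D polyline to n points equally spaced along its length."""
    pts = np.asarray(points, dtype=float)
    seg = np.hypot(*np.diff(pts, axis=0).T)          # length of each segment
    s = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)
    x = np.interp(targets, s, pts[:, 0])
    y = np.interp(targets, s, pts[:, 1])
    return np.column_stack([x, y])

# The same straight path sampled by a slow-then-fast and a fast-then-slow mover.
slow_then_fast = [(0, 0), (1, 0), (4, 0)]
fast_then_slow = [(0, 0), (2, 0), (3, 0), (4, 0)]
sig_a = resample_by_arc_length(slow_then_fast, 5)
sig_b = resample_by_arc_length(fast_then_slow, 5)
```

Both inputs produce the same five-point signature, which is the property a speed-invariant descriptor needs. Repeated identical samples (a stopped target) give zero-length segments and would need to be dropped first.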

Event Representation: Semantic Level

- Segmentation based on acceleration
- Segment characterization
- Markov chain representation

[Figure: decision tree for segment characterization. Each node is a yes/no test on the velocity V, the vector P, or a time derivative (whether V is zero, whether it is constant, the relative direction of P and V, and the signs of d/dt terms). Leaf labels: Start, Stopped, Constant velocity, Slow down, Quick start, Quick accelerate, Emergency stop, Left/Right turn, Left/Right half turn, Left/Right inward turn, Left/Right outward turn, and Left/Right spiral; each turn type also appears with acceleration (w. acc) and with deceleration (w. decel).]
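To make the idea of characterizing a segment from its kinematics concrete, here is a much-reduced sketch. It assumes 2D velocity and acceleration vectors in a frame with x to the right and y up, and it uses the sign of the z-component of V x A to tell left from right turns. The tests, the threshold `eps`, and the reduced label set are illustrative assumptions and do not reproduce the full decision tree in the figure.

```python
import math

def label_segment(v, a, eps=1e-6):
    """Label a trajectory segment from its 2D velocity v and acceleration a."""
    speed = math.hypot(*v)
    if speed < eps:
        return "stopped" if math.hypot(*a) < eps else "start"
    # Acceleration component along the direction of motion.
    tangential = (v[0] * a[0] + v[1] * a[1]) / speed
    # z-component of V x A divided by speed: positive means turning left
    # (counterclockwise) when x points right and y points up.
    normal = (v[0] * a[1] - v[1] * a[0]) / speed
    if abs(normal) < eps:
        if abs(tangential) < eps:
            return "constant velocity"
        return "speed up" if tangential > 0 else "slow down"
    turn = "left turn" if normal > 0 else "right turn"
    if abs(tangential) < eps:
        return turn
    return turn + (" w. acc" if tangential > 0 else " w. decel")

labels = [
    label_segment((1, 0), (0, 0)),    # straight, steady
    label_segment((1, 0), (0, 1)),    # pure sideways acceleration
    label_segment((1, 0), (1, -1)),   # speeding up while veering right
]
```

In image coordinates, where y usually points down, the left/right sense flips, so the sign test would need to be reversed.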

Event Representation: Semantic Level (cont.)

[Figure: Markov chain over segment labels. States shown: Constant velocity, Speed up, Slow down, Left half turn, Left outward spiral, and Left inward spiral; the half turn and the spirals also appear with acceleration (w. acc) and with deceleration (w. decel).]
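A Markov chain over segment labels can be estimated by counting label-to-label transitions in an observed sequence and normalizing each row. The following sketch shows that estimate; the function name and the toy label sequence are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def transition_probabilities(labels):
    """Estimate first-order Markov transition probabilities from a label sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(labels, labels[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

sequence = ["Constant velocity", "Slow down", "Constant velocity", "Speed up"]
chain = transition_probabilities(sequence)
```

Here "Constant velocity" is followed once by "Slow down" and once by "Speed up", so each transition gets probability 0.5, while "Slow down" always leads back to "Constant velocity".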

Event Recognition: Sequence Data Learning

- Similarity measurement is difficult
  - Sequence data with temporal correlation may not have a vector space representation
  - However, kernel methods (e.g., SVM) are still applicable: they need only a feature space representation, not a vector space representation
- Use a DP algorithm for the feature space distance metric
- Use hierarchical kernel recognition and fusion
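As an illustration of a distance computed by dynamic programming (reading "DP" as dynamic programming is an assumption), the sketch below computes the edit distance between two label sequences and turns it into a similarity score. Edit distance and the exponential form are stand-ins chosen for brevity, not the metric or kernel from the talk, and `exp(-gamma * d)` built on edit distance is not guaranteed to be a positive semidefinite kernel.

```python
import math

def edit_distance(a, b):
    """Edit (Levenshtein) distance between two sequences, by dynamic programming."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def sequence_similarity(a, b, gamma=0.5):
    """Similarity in (0, 1]; equals 1.0 only for identical sequences."""
    return math.exp(-gamma * edit_distance(a, b))

walk = ["Constant velocity", "Slow down", "Left turn", "Speed up"]
detour = ["Constant velocity", "Left turn", "Speed up"]
d = edit_distance(walk, detour)
```

The two label sequences differ by one missing segment, so their distance is 1. Sequences of different lengths are compared directly, which is the reason a DP alignment is attractive when no fixed-length vector representation exists.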

Event Recognition: Imbalanced Data Set

- Negative samples significantly outnumber positive samples
- The Bayesian risk of a false negative significantly outweighs that of a false positive
- Adaptive conformal mapping at the decision boundary

Event Recognition: Statistical Modeling

- HMMs are expensive to build
- Not all behaviors are structured (e.g., loitering)
- It may not be necessary to understand individual activities before recognizing interactions
- Distinguish interaction patterns
  - Following
  - Following-and-gaining
  - Stalking

Experimental Results: Syntactic Matching

Experimental Results: Semantic Indexing

Experimental Results: Biased Learning

[Figure: sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); axis labels: threshold, penalty.]
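The two ratios TP/(TP+FN) and TN/(TN+FP) are the standard definitions of sensitivity and specificity. A short sketch of how they respond to the decision threshold, using made-up scores and labels:

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (TP/(TP+FN), TN/(TN+FP)) for a score threshold.

    A sample is predicted positive when its score is >= threshold."""
    tp = fn = tn = fp = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Illustrative data only: two positives among five samples.
scores = [0.9, 0.4, 0.6, 0.1, 0.2]
labels = [1, 1, 0, 0, 0]
strict = sensitivity_specificity(scores, labels, threshold=0.5)
lenient = sensitivity_specificity(scores, labels, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 recovers the missed positive (sensitivity rises from 0.5 to 1.0) without costing any specificity on this toy data. When false negatives are the expensive error, that is the direction in which a biased learner moves the boundary.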

Experimental Results: Statistical Learning

Results

Relevant Publications

Many details are omitted
- Sensor registration (spatial and temporal)
- Object tracking (Kalman and multi-state)
- Power management and routing

1. L. Jiao, G. Wu, Y. Wu, E. Y. Chang, and Y. F. Wang, "The Anatomy of A Multi-Camera Video Surveillance System," to appear in the ACM Multimedia System Journal.

2. K. Wu, J. Long, D. Han, and Y. F. Wang, "Human Activity Detection and Recognition for Video Surveillance," Proceedings of IEEE International Conference on Multimedia Computing and Systems, 2004.

3. E. Chang and Y. F. Wang, "Toward Building a Robust and Intelligent Video Surveillance System: A Case Study," (invited paper) Proceedings of the IEEE Multimedia and Expo Conference, Taipei, Taiwan, 2004.

4. R. Rangaswami, Z. Dimitrijevic, K. Kakligian, E. Chang, and Y. F. Wang, "The SfinX Video Surveillance System," Proceedings of the IEEE Multimedia and Expo Conference, Taipei, Taiwan, 2004.

5. G. Wu, Y. Wu, L. Jiao, Y. F. Wang, and E. Y. Chang, "Multi-camera Spatio-temporal Fusion and Biased Sequence-data Learning for Security Surveillance," Proceedings of ACM Multimedia Conference, Berkeley, CA, 2003.

6. K. Wu, J. Long, D. Han, and Y. F. Wang, "Real-Time Multi-person Tracking in Video Surveillance," Proceedings of the Pacific Rim Multimedia Conference, Singapore, 2003.

7. Y. Wu, L. Jiao, G. Wu, E. Chang, and Y. F. Wang, "Invariant Feature Extraction and Biased Statistical Inference for Video Surveillance," Proceedings of the IEEE International Conference on Advanced Video and Signal-based Surveillance, Miami, FL, 2003.

Focus of This Seminar

Video-based face tracking, modeling and recognition

Human activity and interaction analysis

Video-Based Face Tracking & Recognition

Image-based
- Image normalization
- Feature selection
- Face recognition

Video-based
- Face region detection
- Tracking
- Face modeling and recognition

Difficulties

Quality of video is low
- Large illumination and pose variation, occlusion

Face images are small
- Compared to still image-based systems

Model construction and fitting
- Generic vs. person-specific
- 2D vs. 3D

Proposed Approach: Resolution Enhancement

Exploit multiple image frames and spatial coherency
- Single-camera super-resolution (digital zoom)
- Multi-camera (master-slave) face region detection and zooming (optical zoom)
- Need feature appearance (PCA + LDA) and geometrical relations

General Framework: Visual Servoing

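- A feedback control mechanism
- Reference and real signals are computed from images

[Figure: feedback loop. Feature detection on each new image yields the real signal; subtracting it from the reference signal gives the error signal, which passes through J^{-1} to produce the control signal for camera control. An external disturbance acts on the loop before the new image is captured.]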
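The loop can be sketched numerically under a strong simplifying assumption: the image feature s depends linearly on the camera parameters q through a known, invertible Jacobian J (s = J q). Each iteration measures the real signal, forms the error against the reference, and applies a control step through J^{-1}. The function name, the gain, and the linear camera model are all illustrative.

```python
import numpy as np

def servo(J, q0, s_ref, gain=0.5, steps=50):
    """Drive camera parameters q until the image feature J @ q reaches s_ref."""
    q = np.asarray(q0, dtype=float)
    for _ in range(steps):
        s_real = J @ q                                # real signal from the "image"
        error = s_ref - s_real                        # error signal
        q = q + gain * np.linalg.solve(J, error)      # control signal via J^{-1}
    return q

J = np.array([[2.0, 0.5],
              [0.0, 3.0]])
s_ref = np.array([4.0, -6.0])
q_final = servo(J, q0=[0.0, 0.0], s_ref=s_ref)
```

With this linear model the error shrinks by a factor of (1 - gain) per step, so a gain of 0.5 halves it each iteration. A real camera has a state-dependent Jacobian and disturbances, which is why the loop has to keep measuring instead of applying a single open-loop correction.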

Master-Slave Combo Setup

[Figure: world, master-camera, and slave-camera coordinate frames, each with axes (X, Y, Z).]

p_{master} = T_{world,master} P_{world}

p_{slave} = T_{world,slave} P_{world} = T_{world,slave} T_{world,master}^{-1} p_{master}

T_{world,slave} is a function of four slave-camera parameters, among them f and z.

Master: Anatomy-Guided Face Modeling

- Face region localization based on anatomy
- Face region detection based on skin color segmentation
- Face region modeling based on ellipse fitting
- Face region tracking using a mean-shift tracker

[Figure: master camera and world coordinate frames (X, Y, Z), related by T_{world,master}.]

Slave: Master-Guided Zooming

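[Figure: world, master, and slave coordinate frames (X, Y, Z). The slave image point p_{slave} is obtained through T_{world,master} and T_{world,slave}, whose parameters include f and z.]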

What’s Next?

- View-based recognition
- Frontal-view detection
- Multi-frame evidence aggregation
- 3D model (?)

Single Camera Super resolution

Multiple, spatially coherent frames are modeled as down-sampled, low-resolution (LR) images of the original high-resolution (HR) image

Mathematically

I_{LR}^{(k)} = D^{(k)} B^{(k)} T^{(k)} I_{HR} + n^{(k)}

I_{LR}^{(k)} = [I^{(k)}_{11}, I^{(k)}_{12}, \ldots, I^{(k)}_{1n}, \ldots, I^{(k)}_{m1}, I^{(k)}_{m2}, \ldots, I^{(k)}_{mn}]^T

I_{HR} = [I_{11}, I_{12}, \ldots, I_{1,cn}, \ldots, I_{cm,1}, I_{cm,2}, \ldots, I_{cm,cn}]^T

Three components:
- Spatial registration function (T)
- Blurring function (B)
- Down-sampling function (D); c: down-sampling factor
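The forward model (register, blur, down-sample, add noise) can be simulated in a few lines. This sketch makes several simplifying assumptions: registration is an integer, circular translation (`np.roll`), blurring is a separable Gaussian with edge padding, and down-sampling keeps every c-th pixel. The function names and default values are hypothetical. It only generates LR frames from a known HR image; it does not solve the inverse problem.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur (B) that preserves the image shape."""
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, rows)

def observe(hr, shift=(0, 0), c=2, sigma=1.0, noise_std=0.0, rng=None):
    """Generate one LR frame: down-sample(blur(translate(hr))) + noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    shifted = np.roll(hr, shift, axis=(0, 1))           # T: integer translation
    blurred = blur(shifted, sigma)                      # B: Gaussian blur
    lr = blurred[::c, ::c]                              # D: keep every c-th pixel
    return lr + rng.normal(0.0, noise_std, lr.shape)    # n: additive noise

hr = np.ones((8, 8))
lr = observe(hr, shift=(1, 0), c=2)
```

An 8x8 HR image with c = 2 yields a 4x4 LR frame. A super-resolution solver would stack many such frames with different shifts and invert the resulting linear system.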

Spatial Registration Function

- Modeled as an affine transform
- Captures translation, rotation, and zooming
- In reality, only translational motion has been successfully demonstrated

T_k = \begin{bmatrix} a_x & b_x & c_x \\ a_y & b_y & c_y \end{bmatrix}
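An affine registration has six parameters, which can be arranged as a 2x3 matrix acting on homogeneous pixel coordinates. The sketch below builds such a matrix from a translation, a rotation angle, and a uniform zoom factor and applies it to points. That parameterization and the function names are assumptions for illustration, since a general affine map also allows shear and non-uniform scaling.

```python
import numpy as np

def affine_matrix(tx=0.0, ty=0.0, angle=0.0, zoom=1.0):
    """2x3 affine matrix for rotation by angle (radians), uniform zoom, then translation."""
    c, s = zoom * np.cos(angle), zoom * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

def apply_affine(T, points):
    """Apply a 2x3 affine matrix to an array of (x, y) points."""
    pts = np.asarray(points, dtype=float)
    return pts @ T[:, :2].T + T[:, 2]

rotated = apply_affine(affine_matrix(angle=np.pi / 2), [(1.0, 0.0)])
moved = apply_affine(affine_matrix(tx=2.0, ty=3.0), [(1.0, 1.0)])
zoomed = apply_affine(affine_matrix(zoom=2.0), [(1.0, 1.0)])
```

A quarter-turn rotation sends (1, 0) to (0, 1), a pure translation adds the offset, and a zoom of 2 doubles both coordinates.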

Blurring Function

Modeled as a Gaussian kernel

Caveats:
- The point spread (blurring) function may not be known and is wavelength dependent
- The diffraction effect induces ripples and is better modeled with Bessel functions

Numerical Solution

- Large system of equations
  - Requires preconditioning
- Not sure that it will work in the real world
- Simpler mechanisms (e.g., bilinear interpolation) exist, with inferior performance
- Optical zoom instead of digital zoom

Schedule
- 9/29: overview
- 10/6: Dan: face recognition overview
- 10/13: no meeting (research travel)
- 10/20: Dr. Kang
- 10/27:
- 11/3:
- 11/10:
- 11/17:
- 11/24:

- Video-based face modeling and recognition
  - Super resolution: multiple images, space-time
- Human activity/interaction analysis
