lecture 3 video syntax analysis - national chung cheng

Wei-Ta Chu

2010/9/30

Video Syntax Analysis

Multimedia Content Analysis, CSIE, CCU

Types of Shot Change


Abrupt change (hard cut)
  A cut occurs within a single frame, when the camera is stopped and restarted.

Gradual transition
  Fade-in: gradual increase in intensity, starting from a black frame
  Fade-out: gradual decrease in intensity, ending in a black frame
  Dissolve: transition from the end of one clip to the beginning of another
  Wipe: one image is replaced by another, with a distinct edge that forms a shape
  …

Examples of Shot Changes



Li and Lee. “Effective detection of various wipe transitions” IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, no. 6, pp. 663-673, 2007.

Figure: examples of a cut, a dissolve, and a wipe.

Examples of Fade



Cernekova, et al., “Information theory-based shot cut/fade detection and video summarization” IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82-91, 2006.

Figure: examples of a fade-out and a fade-in.

Different Types of Wipe

Li and Lee. “Effective detection of various wipe transitions” IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, no. 6, pp. 663-673, 2007.

Video example: http://en.wikipedia.org/wiki/Wipe_%28transition%29

Detection Process


Video -> Extract features -> Calculate similarity -> Decide boundaries -> Shot 1 | Shot 2 | Shot 3 | Shot 4

Features


Pixel difference
Statistical difference
Histograms
Compression differences
Edge
Motion

Pixel Difference


Count the number of pixels that change in value by more than some threshold.

May be sensitive to camera motion.

1. Pair-wise comparison


Compare the corresponding pixels in two frames.

Problem: sensitive to camera movement, e.g. camera panning
Improvement: smooth each frame with a 3x3 window before comparison

Zhang, et al., “Automatic partitioning of full-motion video” Multimedia Systems Journal, vol. 1, pp. 10-28, 1993.
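A minimal sketch of pair-wise comparison with 3x3 smoothing. The function names, the per-pixel threshold of 20, and the replicated borders are illustrative assumptions, not values from the cited paper.

```python
import numpy as np

def smooth3x3(frame):
    """Average each pixel with its 3x3 neighbourhood (borders replicated)."""
    h, w = frame.shape
    padded = np.pad(frame.astype(float), 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def changed_fraction(prev, curr, pixel_thresh=20, smooth=True):
    """Fraction of pixels whose intensity changes by more than pixel_thresh."""
    if smooth:
        prev, curr = smooth3x3(prev), smooth3x3(curr)
    diff = np.abs(prev.astype(float) - curr.astype(float))
    return float(np.mean(diff > pixel_thresh))

# Toy grey-level frames: an unchanged frame versus a complete content change.
black = np.zeros((8, 8), dtype=np.uint8)
light = np.full((8, 8), 200, dtype=np.uint8)
same = changed_fraction(black, black)
cut = changed_fraction(black, light)
```

A shot change would be declared when the returned fraction exceeds a second, frame-level threshold, which has to be tuned per data set.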

2. Histogram Comparison


Less sensitive to object motion, since it ignores the spatial changes in a frame.

Hi(j): the histogram value for the ith frame, where j is one of the G grey levels.
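The slide's formula did not survive extraction. The sketch below assumes the usual bin-wise form, the sum over j of |Hi(j) - Hi+1(j)|, with G = 64 grey levels chosen only for illustration.

```python
import numpy as np

def gray_histogram(frame, levels=64):
    """H(j): number of pixels of an 8-bit frame falling in grey level j."""
    hist, _ = np.histogram(frame, bins=levels, range=(0, 256))
    return hist

def histogram_difference(frame_a, frame_b, levels=64):
    """Sum over j of |H_a(j) - H_b(j)|."""
    ha = gray_histogram(frame_a, levels)
    hb = gray_histogram(frame_b, levels)
    return int(np.abs(ha - hb).sum())

# A frame and its mirror image have identical histograms, so the
# difference is zero even though every pixel has moved.
frame = (np.arange(64, dtype=np.uint8) * 4).reshape(8, 8)
mirrored = frame[:, ::-1]
dark = np.zeros((8, 8), dtype=np.uint8)
d_mirror = histogram_difference(frame, mirrored)
d_dark = histogram_difference(frame, dark)
```

The mirrored frame illustrates why the measure ignores spatial changes: only the grey-level counts matter.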

2. Histogram Comparison: Example

Example video sequence

The intensity histogram of the first three frames

2. Histogram Comparison


Color histogram difference

pi(r,g,b) is the number of pixels of color (r,g,b) in frame Ii of N pixels. Each color component is discretized to 2^B different values.

3. Likelihood Ratio


Compare corresponding regions (blocks) in two successive frames based on the second-order statistical characteristics of their intensity values.

A camera break is declared whenever the number of sample areas whose likelihood ratio exceeds the threshold is sufficiently large.

This raises the tolerance to slow and small object motion from frame to frame.

mi: mean intensity value of a given region in frame i
Si: variance of the intensity values of that region in frame i
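The ratio itself is missing from the extracted slide. The sketch below assumes the commonly quoted form, [(Si + Si+1)/2 + ((mi - mi+1)/2)^2]^2 / (Si * Si+1). The block size, both thresholds, and the small eps guard against zero variance are illustrative assumptions.

```python
import numpy as np

def likelihood_ratio(block_a, block_b, eps=1e-6):
    """Likelihood ratio of two co-located blocks from their means and variances."""
    m_a, m_b = block_a.mean(), block_b.mean()
    s_a, s_b = block_a.var(), block_b.var()
    numerator = ((s_a + s_b) / 2.0 + ((m_a - m_b) / 2.0) ** 2) ** 2
    return numerator / (s_a * s_b + eps)   # eps avoids division by zero

def is_camera_break(frame_a, frame_b, block=8, ratio_thresh=3.0, count_frac=0.5):
    """Declare a break when enough blocks exceed the ratio threshold."""
    h, w = frame_a.shape
    changed = 0
    total = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = frame_a[y:y + block, x:x + block].astype(float)
            b = frame_b[y:y + block, x:x + block].astype(float)
            total += 1
            if likelihood_ratio(a, b) > ratio_thresh:
                changed += 1
    return changed > count_frac * total

rng = np.random.default_rng(0)
textured = rng.integers(0, 50, size=(16, 16))
brighter = textured + 150            # same texture, large shift in mean intensity
unchanged = is_camera_break(textured, textured)
shifted = is_camera_break(textured, brighter)
```

For identical blocks the ratio is close to 1, which is why thresholds slightly above 1 tolerate small changes.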

4. Edge Change Ratio


Zabih, et al., “A feature-based algorithm for detecting and classifying scene breaks” Proc. of ACM Multimedia, pp. 189-200, 1995.

4. Edge Change Ratio


4. Edge Change Ratio

Edge change ratio
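The defining formula is not in the extracted text. The sketch assumes the commonly stated form ECR = max(Xin / sigma_n, Xout / sigma_n-1), where Xin counts edge pixels that enter (farther than a radius r from any edge of the previous frame), Xout counts edge pixels that exit, and sigma is the number of edge pixels in a frame. The inputs are boolean edge maps; the square dilation and r = 1 are illustrative choices.

```python
import numpy as np

def dilate(edges, radius=1):
    """Binary dilation with a (2r+1)x(2r+1) square, built from shifted copies."""
    h, w = edges.shape
    padded = np.pad(edges, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_change_ratio(edges_prev, edges_curr, radius=1):
    """max(entering / sigma_curr, exiting / sigma_prev) for boolean edge maps."""
    sigma_prev = int(edges_prev.sum())
    sigma_curr = int(edges_curr.sum())
    if sigma_prev == 0 or sigma_curr == 0:
        # No edges in one frame: treat as no change only if both are empty.
        return 0.0 if sigma_prev == sigma_curr else 1.0
    entering = int((edges_curr & ~dilate(edges_prev, radius)).sum())
    exiting = int((edges_prev & ~dilate(edges_curr, radius)).sum())
    return max(entering / sigma_curr, exiting / sigma_prev)

def vertical_line(column, size=8):
    """Toy edge map containing one vertical edge."""
    edges = np.zeros((size, size), dtype=bool)
    edges[:, column] = True
    return edges

small_motion = edge_change_ratio(vertical_line(2), vertical_line(3))
new_content = edge_change_ratio(vertical_line(2), vertical_line(6))
```

The dilation radius is what makes the measure tolerant of small object or camera motion.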

5. Motion Vectors

Use the direction of motion prediction as a cue for shot change detection.

Pei, et al., “Scene-effect detection and insertion MPEG encoding scheme for video browsing and error concealment” IEEE Trans. on Multimedia, vol. 7, no. 4, pp. 606-614, 2005.

5. Motion Vectors


Use motion vector information to filter out false positives.

Zhang, et al., “Automatic partitioning of full-motion video” Multimedia Systems Journal, vol. 1, pp. 10-28, 1993.

6. Differences in DCT domain


Discrete Cosine Transform (DCT) coefficients
  1. Select a subset of blocks
  2. Select a subset of the DCT coefficients of these blocks
  3. Concatenate the selected coefficients of the selected blocks into a vector
  4. Calculate the similarity of two coefficient vectors

Arman, et al., “Image processing on encoded video sequences” Multimedia Systems Journal, vol. 1, no. 5, pp. 211-219, 1994.
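The four steps above can be sketched as follows. In compressed video the coefficients would be read from the bitstream; here they are recomputed from pixels with an orthonormal DCT-II matrix. The chosen blocks, coefficients, and the normalized inner product are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C; C @ block @ C.T is the 2D DCT of a block."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * k * np.pi / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct_feature(frame, blocks, coeffs, n=8):
    """Steps 1-3: selected coefficients of selected blocks, as one vector."""
    c = dct_matrix(n)
    values = []
    for by, bx in blocks:
        block = frame[by * n:(by + 1) * n, bx * n:(bx + 1) * n].astype(float)
        transformed = c @ block @ c.T
        values.extend(transformed[u, v] for u, v in coeffs)
    return np.array(values)

def dct_difference(frame_a, frame_b, blocks, coeffs):
    """Step 4: 1 - normalized inner product of the two feature vectors."""
    va = dct_feature(frame_a, blocks, coeffs)
    vb = dct_feature(frame_b, blocks, coeffs)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    if denom == 0:
        return 0.0 if np.array_equal(va, vb) else 1.0
    return 1.0 - abs(float(va @ vb)) / denom

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(16, 16))
other = rng.integers(0, 256, size=(16, 16))
blocks = [(0, 0), (1, 1)]              # block row, block column
coeffs = [(0, 0), (0, 1), (1, 0)]      # low-frequency coefficients
d_same = dct_difference(frame, frame, blocks, coeffs)
d_other = dct_difference(frame, other, blocks, coeffs)
```

The zero-norm branch is a guard added for this sketch; an all-black frame has a zero feature vector, so the normalized product is undefined there.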

Gradual Transition Detection



Cuts or abrupt change

Gradual transition

1. Twin-Comparison Approach



Zhang, et al., “Automatic partitioning of full-motion video” Multimedia Systems Journal, vol.1, pp. 10-28, 1993.
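A simplified sketch of the twin-comparison idea, assuming two thresholds: a high one for cuts and a low one that marks the possible start of a gradual transition. For brevity this version sums consecutive frame differences while they stay above the low threshold; the cited paper compares against the candidate start frame instead. The function name, thresholds, and data are hypothetical.

```python
def twin_comparison(diffs, t_high, t_low):
    """diffs[i] is the difference between frames i and i+1.

    Returns ("cut", i) and ("gradual", start, end) tuples.
    """
    results = []
    start = None      # index where a candidate gradual transition began
    accumulated = 0.0
    for i, d in enumerate(diffs):
        if d >= t_high:
            results.append(("cut", i))
            start, accumulated = None, 0.0
        elif d >= t_low:
            if start is None:
                start, accumulated = i, 0.0
            accumulated += d
        else:
            # The candidate ends; accept it if enough change accumulated.
            if start is not None and accumulated >= t_high:
                results.append(("gradual", start, i - 1))
            start, accumulated = None, 0.0
    if start is not None and accumulated >= t_high:
        results.append(("gradual", start, len(diffs) - 1))
    return results

diffs = [1, 1, 50, 1, 1, 8, 9, 10, 9, 8, 1, 1]
found = twin_comparison(diffs, t_high=30, t_low=5)
```

No single difference in positions 5 to 9 reaches the high threshold, but their accumulated value does, which is what separates a gradual transition from noise.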


Lienhart, R., “Comparison of automatic shot boundary detection algorithms” Proc. of SPIE Storage and Retrieval for Image and Video Databases VII, vol. 3656, pp. 290-301, 1999.

3. Characterizing a Wipe Transition



Evaluation


Precision: the percentage of retrieved items that are desired items.

Recall: the percentage of desired items that are retrieved.

Precision = (# correctly retrieved items) / (# all retrieved items)
          = (# correctly retrieved items) / (# correctly retrieved items + # falsely retrieved items)

Recall = (# correctly retrieved items) / (# all relevant items)
       = (# correctly retrieved items) / (# correctly retrieved items + # items that are not retrieved)
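A small worked example of the two definitions, assuming detections and ground truth are given as frame indices of shot boundaries and that a detection counts only on an exact match. The frame numbers are made up for illustration.

```python
def precision_recall(detected, relevant):
    """Precision and recall of detected boundaries against ground truth."""
    detected, relevant = set(detected), set(relevant)
    correct = len(detected & relevant)      # correctly retrieved items
    false_alarms = len(detected - relevant) # falsely retrieved items
    missed = len(relevant - detected)       # items that are not retrieved
    precision = correct / (correct + false_alarms) if detected else 0.0
    recall = correct / (correct + missed) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(
    detected=[120, 300, 451, 700],
    relevant=[120, 300, 500, 700, 910],
)
```

Here three of four detections are correct (precision 0.75) and three of five true boundaries are found (recall 0.6).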

Evaluation–Other Terms


Miss: # items that are not retrieved
True positive (TP): # correctly retrieved items
False positive (FP): # falsely retrieved items
True negative (TN): # correctly rejected items
False negative (FN): # items that are not retrieved

                     Actual positive   Actual negative
Predicted positive   TP                FP
Predicted negative   FN                TN

Evaluation


Detected (retrieved) items = TP + FP
Relevant (ground truth) items = TP + FN
TN: items in neither set

Relationship between Precision & Recall



Precision-Recall (PR) curve

Relationship between True Positive and False Positive


Receiver Operating Characteristic (ROC) curve

Using PR or ROC Curves?


ROC curves can present an overly optimistic view of an algorithm’s performance if there is a large skew in the class distribution.

When the number of negative examples greatly exceeds the number of positive examples, a large change in the number of false positives leads to only a small change in the false positive rate.

Precision compares false positives to true positives rather than to true negatives, and so better captures the algorithm’s performance.

Davis, et al., “The relationship between precision-recall and ROC curves” Proc. of International Conference on Machine Learning, pp. 233-240, 2006.
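A numeric illustration of the skew argument. The counts are hypothetical: 2,500 true boundaries among 420,000 candidate frames, with every non-boundary frame treated as a negative.

```python
def rates(tp, fp, positives, negatives):
    """Recall, false positive rate and precision from raw counts."""
    return {
        "recall": tp / positives,
        "fpr": fp / negatives,
        "precision": tp / (tp + fp),
    }

positives, negatives = 2_500, 417_500
few_fp = rates(tp=2_000, fp=500, positives=positives, negatives=negatives)
many_fp = rates(tp=2_000, fp=2_500, positives=positives, negatives=negatives)

# Five times as many false alarms barely move the ROC point ...
fpr_change = many_fp["fpr"] - few_fp["fpr"]
# ... but nearly halve the precision.
precision_change = few_fp["precision"] - many_fp["precision"]
```

With these counts the false positive rate rises by less than half a percentage point, while precision falls from 0.80 to about 0.44.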

Comparison of Shot Boundary Detection Techniques

Methods
  Histograms, region histograms, running histograms, motion-compensated pixel differences, DCT coefficient differences

Evaluation data

Video type    # Frames   Cuts   Gradual transitions
TV            133204     831    42
News          81595      293    99
Movie         142507     564    95
Commercial    51733      755    254
Misc.         10706      64     16
Total         419745     2507   506

Methods Compared


Histogram
  64-bin gray-level histogram difference, single threshold
Region (block) histogram
  16 blocks, 64-bin gray-scale histograms, a difference threshold for each block, and a count threshold for changed blocks
Running histogram (twin method)
  64-bin gray-scale histogram for each frame, twin thresholds
  Compute motion vectors; if there is excessive motion, reject gradual changes
Motion-compensated pixel difference
  12 blocks per frame, one motion vector for each block
  Compute average residual errors; if larger than a high threshold, a cut is detected
  Use cumulative errors to detect gradual changes (similar to above)
  Use motion vectors to reject false gradual changes
DCT difference
  Concatenate 15 coefficients at the same locations from different blocks to form a vector
  Compute (1 - inner product of the two vectors from consecutive frames)

PR Curve for TV program



PR Curve for News program



PR Curve for Movie Videos



PR Curve for Commercials



PR Curve for All Data



PR Curve for All Data–Cut Only



Observations


The histogram-based method is consistent
  It produced the first or second best precision
  It is simple and straightforward
The region algorithm seems to be the best where recall is not the highest priority
The running algorithm seems to be the best where recall is important
  Motion vectors help to reduce false positives
DCT is the worst
  Large number of false positives in black frames

References


J.S. Boreczky, et al., "Comparison of video shot boundary detection techniques" Proc. of SPIE Conference on Storage and Retrieval for Image and Video Databases, vol. 2670, 1996. (must read)

R. Lienhart, "Comparison of automatic shot boundary detection algorithms" Proc. of SPIE Storage and Retrieval for Image and Video Databases VII, vol. 3656, pp. 290-301, 1999.

J. Yuan, et al., "A formal study of shot boundary detection" IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, no. 2, pp. 168-186, 2007.

A. Hanjalic, "Shot-boundary detection: unraveled or resolved?" IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 90-105, 2002.

Edge

An edge is a set of connected pixels that lie on the boundary between two regions.

Chapter 10 of “Digital Image Processing” by R.C. Gonzalez and R.E. Woods, Prentice Hall, 2nd edition, 2001

Edge



Gradient Operators

Roberts cross-gradient operators:

Prewitt operators:

Sobel operators:
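The operator masks were images on the slide and are not in the extracted text. As an illustration, the sketch below uses the standard 3x3 Sobel masks and the |Gx| + |Gy| approximation of the gradient magnitude. The names SOBEL_DX and SOBEL_DY and the replicated borders are choices made for this example.

```python
import numpy as np

# Standard Sobel masks: derivative across columns and across rows.
SOBEL_DX = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]], dtype=float)
SOBEL_DY = SOBEL_DX.T

def filter3x3(image, kernel):
    """Correlate an image with a 3x3 mask (borders replicated)."""
    h, w = image.shape
    padded = np.pad(image.astype(float), 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_magnitude(image):
    """Approximate gradient magnitude as |Gx| + |Gy|."""
    gx = filter3x3(image, SOBEL_DX)
    gy = filter3x3(image, SOBEL_DY)
    return np.abs(gx) + np.abs(gy)

# Vertical step edge: dark left half, bright right half.
step = np.zeros((6, 6))
step[:, 3:] = 100
magnitude = sobel_magnitude(step)
```

The response is non-zero only in the two columns next to the step; thresholding the magnitude gives a binary edge map.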

Edge Examples



Edge Examples–after smoothing



Edge Examples



Canny Edge Detectors

Step 1: the image is smoothed by Gaussian convolution
Step 2: a 2D first derivative operator is applied to the smoothed image
Step 3: non-maximal suppression
  Edges give rise to ridges in the gradient magnitude image. The algorithm tracks along the top of these ridges and sets to zero all pixels that are not actually on the ridge.

http://homepages.inf.ed.ac.uk/rbf/HIPR2/canny.htm

Very Brief Introduction of Discrete Cosine Transform


Spatial Frequency and DCT



Definition of DCT



2D DCT



1D DCT

DCT Basis



DCT Basis



Example



Example



Example



Example



Discrete Cosine Transform


DCT converts a block of pixels into a block of transform coefficients, which represent the spatial frequency.

Each coefficient is a weight applied to an appropriate basis function.

Any gray-scale 8x8 pixel block can be fully represented by a weighted sum of these 64 basis functions.

Figure: the 64 basis functions, arranged by increasing horizontal frequency and increasing vertical frequency, starting from the “DC” basis function.
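The weighted-sum statement can be checked numerically. The sketch assumes the orthonormal DCT-II, in which basis function (u, v) is the outer product of rows u and v of the transform matrix. The random block is only test data.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; row k holds the k-th 1D basis vector."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * k * np.pi / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix()
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)

# Forward 2D DCT: one coefficient (weight) per basis function.
coefficients = C @ block @ C.T

# Rebuild the block as a weighted sum of the 64 basis functions.
rebuilt = np.zeros((8, 8))
for u in range(8):
    for v in range(8):
        basis = np.outer(C[u], C[v])
        rebuilt += coefficients[u, v] * basis

dc = coefficients[0, 0]   # weight of the constant "DC" basis function
```

With this normalization the DC coefficient equals 8 times the mean intensity of the block; other DCT conventions scale it differently.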

Intra-Frame Encoding (JPEG Compression)


Scene Transition Graph


Yeung, et al. “Segmentation of video by clustering and graph analysis” Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94-109, 1998.

Observations


Shots in a scene are often repetitive. We are able to classify shots by grouping shots of similar visual contents.

Often, a scene is made up of temporally adjacent shots indicating their interrelationships.

Similarity of Video Shots



D(.,.) measures the dissimilarity between two image frames.

Similarity of Video Shots

Dissimilarity based on color histogram intersection

Dissimilarity based on luminance projection

Yeung and Liu, “Efficient matching and clustering of video shots” Proc. of IEEE International Conference on Image Processing, vol. 1, pp. 338-341, 1995.
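The shot-level formula is missing from the extracted slides. The sketch below assumes that the dissimilarity of two shots is the minimum frame dissimilarity D(.,.) over all pairs of their representative frames, with D taken as 1 minus the normalized histogram intersection. It uses grey-level histograms with 16 bins for brevity, although the slide refers to color histograms.

```python
import numpy as np

def intersection_dissimilarity(frame_a, frame_b, bins=16):
    """D(a, b) = 1 - sum_j min(h_a(j), h_b(j)) for normalized histograms."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return 1.0 - float(np.minimum(ha, hb).sum())

def shot_dissimilarity(frames_a, frames_b, frame_dissim=intersection_dissimilarity):
    """Smallest frame dissimilarity over all pairs of representative frames."""
    return min(frame_dissim(f, g) for f in frames_a for g in frames_b)

dark = np.full((4, 4), 10, dtype=np.uint8)
grey = np.full((4, 4), 120, dtype=np.uint8)
bright = np.full((4, 4), 240, dtype=np.uint8)

shot_a = [dark, grey]       # representative frames of each shot
shot_b = [grey, bright]
shot_c = [bright]
d_ab = shot_dissimilarity(shot_a, shot_b)   # the shots share a grey frame
d_ac = shot_dissimilarity(shot_a, shot_c)   # nothing in common
```

Taking the minimum means two shots are considered similar as soon as any pair of their representative frames matches.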

Representative Image Set for a Video Shot

Selection of the representative set is achieved by nonlinear temporal sampling.

Representative Image Set for a Video Shot

Only 2 to 5% of frames are needed in comparison to achieve good matching results.

In addition to temporal subsampling, spatial subsampling can also be used to improve matching efficiency.

Clustering of Video Shots


Shots in the same cluster are similar.
Any shot outside of a cluster must have a dissimilarity to the cluster greater than the dissimilarity between any two shots in the cluster.

Ci: the ith cluster



Dissimilarity between two clusters:
  Use the shot pair, with one shot from each of the two clusters, that has the largest dissimilarity value.
  The dissimilarity between two clusters should be updated at each iteration.
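The largest-pair rule corresponds to complete-link agglomerative clustering. A minimal sketch, assuming a precomputed shot dissimilarity matrix and a stopping threshold delta; the function name and the toy values are hypothetical.

```python
def cluster_shots(dissim, delta):
    """Merge the closest pair of clusters until none is closer than delta.

    dissim is a symmetric matrix (list of lists) of shot dissimilarities.
    """
    clusters = [[i] for i in range(len(dissim))]

    def cluster_dissim(a, b):
        # Complete link: the most dissimilar pair of shots across clusters.
        return max(dissim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Cluster dissimilarities are recomputed at every iteration.
        d, ia, ib = min(
            (cluster_dissim(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if d >= delta:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return [sorted(c) for c in clusters]

# Shots 0 and 1 look alike, as do shots 2 and 3.
dissim = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
clusters = cluster_shots(dissim, delta=0.5)
```

A smaller delta stops merging earlier and therefore yields more clusters.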




Time-Constrained Clustering

Two shots that are far apart in time, even if they share similar visual contents, potentially represent different contents or occur in different scenes.

Temporal distance between two shots: the distance, in number of frames, from the end of the earlier shot to the beginning of the later one.
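One way to apply the time constraint, sketched under the assumption that the visual dissimilarity is replaced by infinity whenever the temporal distance exceeds a window T. Shots are given as (start frame, end frame) pairs; the window of 600 frames (20 s at an assumed 30 frames per second) is illustrative.

```python
import math

def temporal_distance(shot_a, shot_b):
    """Frames from the end of the earlier shot to the start of the later one."""
    earlier, later = sorted([shot_a, shot_b])
    return max(0, later[0] - earlier[1])

def time_constrained_dissimilarity(visual_dissim, shot_a, shot_b, window):
    """Visual dissimilarity, or infinity if the shots are too far apart in time."""
    if temporal_distance(shot_a, shot_b) > window:
        return math.inf
    return visual_dissim

b1 = (0, 99)          # (start frame, end frame)
b2 = (100, 199)
b3 = (1000, 1100)
near = time_constrained_dissimilarity(0.1, b1, b2, window=600)
far = time_constrained_dissimilarity(0.1, b1, b3, window=600)
```

Combined with largest-pair (complete-link) clustering, a single infinite pair is enough to keep two clusters from ever merging.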

Scene Transition Graph


A scene transition graph (STG) is a directed graph G = (V, E, F)
  V: each node represents a cluster of shots
  E: a directed edge is drawn from node U to node W if there is a shot represented by node U that immediately precedes a shot represented by node W
  F: a mapping that partitions the set of shots into clusters

An STG compactly represents the structure of the shots and the temporal flow of the story for many video programs.
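A minimal sketch of building V and E from the mapping F, assuming F is given as the list of cluster labels of the shots in temporal order. Self-loops (consecutive shots in the same cluster) are omitted here as a simplification. The shot sequence B1, A1, B2, A2, B3, A3, B4, C1, D1 and the labels are chosen for illustration.

```python
def build_stg(cluster_of_shot):
    """Nodes and directed edges of an STG.

    cluster_of_shot[i] is the cluster label of the i-th shot in temporal order.
    """
    nodes = set(cluster_of_shot)
    edges = set()
    for u, w in zip(cluster_of_shot, cluster_of_shot[1:]):
        if u != w:                     # self-loops omitted in this sketch
            edges.add((u, w))
    return nodes, edges

# Clusters: B12 = {B1,B2}, A = {A1,A2,A3}, B34 = {B3,B4}, C = {C1}, D = {D1}.
labels = ["B12", "A", "B12", "A", "B34", "A", "B34", "C", "D"]
nodes, edges = build_stg(labels)
```

Nine shots collapse into five nodes and six directed edges, which is the compactness the definition refers to.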

Example of STG



3 scenes of 9 shots

Sample clustering results

Scene transition graph

Cut Edges


An edge is a cut edge if, when it is removed, the graph is split into two disconnected graphs.

Each partitioned STG Gi represents the interactions of shots in a story unit.
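A brute-force sketch of finding cut edges and the resulting story units. It assumes connectivity is judged on the underlying undirected graph, so a pair of opposite directed edges between two nodes is never a cut edge. The example graph uses hypothetical labels: B12 = {B1,B2}, A = {A1,A2,A3}, B34 = {B3,B4}, C = {C1}, D = {D1}.

```python
def components(nodes, edges):
    """Connected components, ignoring edge direction."""
    adjacent = {n: set() for n in nodes}
    for u, w in edges:
        adjacent[u].add(w)
        adjacent[w].add(u)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adjacent[n] - comp)
        seen |= comp
        result.append(comp)
    return result

def cut_edges(nodes, edges):
    """Edges whose removal increases the number of components."""
    base = len(components(nodes, edges))
    return {e for e in edges if len(components(nodes, edges - {e})) > base}

def story_units(nodes, edges):
    """Components that remain after all cut edges are removed."""
    return components(nodes, edges - cut_edges(nodes, edges))

nodes = {"A", "B12", "B34", "C", "D"}
edges = {("B12", "A"), ("A", "B12"), ("A", "B34"), ("B34", "A"),
         ("B34", "C"), ("C", "D")}
cuts = cut_edges(nodes, edges)
units = story_units(nodes, edges)
```

For this graph the cut edges are B34 -> C and C -> D, leaving three story units: {A, B12, B34}, {C} and {D}.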

STG After Time Confining and Cut Edges Finding

Framework

1. Shot segmentation
2. Time-constrained clustering
3. Building of scene transition graph
4. Scene segmentation

Influences of Parameters


Without knowledge of how long each individual scene lasts, T cannot be approximated well.
  If T is too large, shots from different scenes are clustered together.
  If T is too small, shots in the same scene may be separated into different scenes.

It is less detrimental to have several story units represent one scene than to have one story unit represent several scenes.

Influences of Time Constraints

T = 20s. dt(B1,B3) > T

Clustering results are {B1,B2},{A1,A2,A3},{B3,B4},{C1},{D1}

Story unit results are {B1,A1,B2,A2,B3,A3,B4},{C1},{D1}

STG nodes: {B1,B2}, {A1,A2,A3}, {B3,B4}, {C1}, {D1}

The {Bi} shots are not grouped into one cluster because there is at least one pair of shots, one from each cluster, that has a temporal distance dt > T*.

Influences of Time Constraints

T = 20s.

Clustering results are {B1},{B2,B3},{A1,A2,A3},{B4},{C1},{D1}

Story unit results are {B1},{A1,B2,A2,B3,A3},{B4},{C1},{D1}

STG nodes: {B1}, {A1,A2,A3}, {B2,B3}, {B4}, {C1}, {D1}

Refined Analysis


Make the time window more elastic
  Compute the duration of each story unit and adjust

Given a story unit, examine the next story unit by relaxing the temporal window and reclustering the shots in these two units.
  If there exists at least one new cluster that contains shots from both units, the two story units are merged into one.

Refined Analysis



Example



{B1,B2},{A1,A2,A3},{B3,B4},{C1},{D1}

{B1},{B2,B3},{A1,A2,A3},{B4},{C1},{D1}

Results


STG constructed from the sitcom “Friends”. There are 35575 frames, each at a spatial resolution of 320x240. There are 313 shots.

Results


Time-constrained clustering of video shots is able to identify individual story units.

The resulting STG permits rapid nonlinear browsing of long video programs.

Variations of Clustering Parameters

Smaller delta values result in more clusters and thus more story units. Users often prefer over-segmentation to under-segmentation.

Refining the Segmentation Results


The first two story units in Scene 1 are merged into one. The number of story units in Scene 6 is reduced from 4 to 2.

Conclusion


Analysis based on time-constrained clustering and scene transition graph analysis has contributed to the extraction of story units.

The building of story structure provides nonlinear access to video contents.

Identification, integration, and application of domain-dependent and semantic features tend to improve segmentation accuracy.
