Watch, Listen & Learn: Co-training on Captioned Images and Videos
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
Their endeavour:
• Recognize natural scene categories from captioned images
• Recognize human actions in sports videos accompanied by commentary
How do they go about it?
• Their model learns to classify images and videos from labelled and unlabelled multi-modal examples.
• A semi-supervised approach.
• Use the image or video content together with its textual annotation (captions or commentary) to learn scene and action categories.
Features:
• Visual features
  o Static image features
  o Motion descriptors from videos
• Textual features
Histogram of Oriented Gradients (HOG)
• Divide the image into small connected regions (cells).
• Compile a histogram of gradient directions or edge orientations for each cell (this is the descriptor).
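To make the descriptor concrete, here is a minimal HOG sketch using scikit-image; the parameter values (9 orientation bins, 8x8-pixel cells) and the input file name are illustrative, not taken from the paper.

```python
# Minimal HOG sketch with scikit-image (parameters are illustrative).
from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("scene.jpg"))  # hypothetical input file

# Divide the image into cells and histogram the gradient orientations in
# each cell; the concatenated histograms form the descriptor.
descriptor = hog(
    image,
    orientations=9,          # gradient-direction bins
    pixels_per_cell=(8, 8),  # cell size
    cells_per_block=(3, 3),  # local normalization blocks
    feature_vector=True,
)
print(descriptor.shape)
```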
Gabor Filter
• Convolution of a sinusoid with a Gaussian
• Used for texture detection
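A minimal Gabor-filtering sketch, again with scikit-image; the filter frequency and input file are illustrative.

```python
# Gabor texture-filtering sketch (frequency chosen for illustration only).
from skimage import io, color
from skimage.filters import gabor

image = color.rgb2gray(io.imread("scene.jpg"))  # hypothetical input file

# Convolve the image with a Gabor kernel (a sinusoid modulated by a
# Gaussian); the response magnitude characterizes local texture.
real, imag = gabor(image, frequency=0.3)
texture_energy = (real ** 2 + imag ** 2).mean()
print(texture_energy)
```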
LAB color space
• L is lightness; a and b are the color-opponent dimensions
• Includes all perceivable colors (its gamut exceeds that of RGB and CMYK)
• Device independent
• Designed to approximate human vision
Static Image Features
• Each image is broken into a 4x6 grid of 24 regions.
• For each region, record the mean, standard deviation and skewness of the per-channel RGB and Lab color pixel values, plus a texture feature computed with Gabor filters.
• This gives a 30-dimensional feature vector for each of the 24 regions.
• The descriptors of all regions (an N x 30 matrix) are clustered with k-means, and each region of each image is assigned to the closest cluster centroid.
Bag of visual words
The final bag of visual words for each image consists of:
• A vector of k values: the ith element is the number of regions in the image that belong to the ith cluster.
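A minimal sketch of the clustering and bag-of-visual-words steps with scikit-learn, assuming the 30-dimensional region descriptors have already been computed (the data below is a placeholder); the vocabulary size k is illustrative. The same cluster-then-histogram step applies to the motion descriptors described below.

```python
# Bag-of-visual-words sketch (placeholder data; vocabulary size k is
# illustrative).
import numpy as np
from sklearn.cluster import KMeans

# Placeholder: 100 images, each with 24 regions of 30-dim descriptors.
region_descriptors = [np.random.rand(24, 30) for _ in range(100)]

k = 25  # hypothetical vocabulary size
all_regions = np.vstack(region_descriptors)      # N x 30 matrix
kmeans = KMeans(n_clusters=k, random_state=0).fit(all_regions)

def bag_of_words(image_regions):
    """ith entry = number of this image's regions assigned to cluster i."""
    assignments = kmeans.predict(image_regions)  # closest centroid per region
    return np.bincount(assignments, minlength=k)

print(bag_of_words(region_descriptors[0]))
```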
Motion descriptors (Laptev spatio-temporal motion descriptors)
• At each feature point, the surrounding patch is divided into 3x3x2 spatio-temporal blocks.
• A 4-bin HOG descriptor is calculated for each block, giving a 72-element feature vector (3 x 3 x 2 x 4 = 72).
• Motion descriptors from all videos in the training set (an N x 72 matrix) are clustered to form a vocabulary.
• A video clip is represented as a histogram over this vocabulary.
Textual features
• Image captions or transcribed video commentary
• Preprocess: remove stop words and stem the remaining words.
• The frequencies of the resulting word stems comprise the feature set.
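A minimal sketch of this preprocessing; the presentation does not name specific tools, so NLTK's stop-word list and Porter stemmer are stand-ins.

```python
# Caption/commentary preprocessing sketch (NLTK is a stand-in; the
# presentation does not specify tooling).
# Requires: nltk.download("stopwords")
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def text_features(caption):
    """Remove stop words, stem the rest, and count stem frequencies."""
    tokens = caption.lower().split()
    return Counter(stemmer.stem(t) for t in tokens if t not in stop)

print(text_features("Ibex eating in the nature"))
```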
Co-training
What is it? A semi-supervised learning paradigm that exploits two mutually independent views.
Independent views in our case:
• Text classifier
• Visual classifier
[Diagram: initially labelled instances, each with a text view and a visual view carrying a +/- label, feeding a text classifier and a visual classifier]
Supervised learning
Classifiers are used to learn from the labeled instances.
[Diagram: the text classifier is trained on the +/- labelled text views and the visual classifier on the +/- labelled visual views of the initially labelled instances]
Co-train
The classifiers trained in the previous step are used to label the unlabelled instances.
[Diagram: unlabelled instances, each with a text view and a visual view, presented to the text classifier and the visual classifier]
Supervised learning
Classify the most confident instances.
[Diagram: partially labelled instances; the text and visual classifiers pick out the instances they are most confident about]
Supervised learning
Label all views in the instances.
[Diagram: classifier-labelled instances in which both the text view and the visual view now carry the predicted +/- label]
Retrain classifier
Retrain based on the new labels.
[Diagram: the text and visual classifiers are retrained on the newly labelled +/- views]
Classify a new instance
[Diagram: a new instance's text view and visual view are classified by the two trained classifiers]
Ready for co-training
• Input: a set of labelled and unlabelled examples, each with two sets of features (one for each view).
• System: two classifiers whose predictions can be combined to classify new test instances.
• [Diagram: input/output flow, with the system repeatedly classifying the most confident instances]
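Putting the walkthrough together, here is a compact sketch of the co-training loop; the data layout, confidence rule, and helper names are illustrative (the paper uses SVM base classifiers with per-view confidence thresholds and a small batch size, as described under Methodology).

```python
# Co-training loop sketch. Helper structure, thresholds, and batch size
# are illustrative; the paper uses SVM base classifiers for each view.
import numpy as np

def co_train(X_text, X_visual, y, X_text_u, X_visual_u,
             text_clf, visual_clf, batch_size=5, threshold=0.9):
    """Labelled views (X_text, X_visual, y) plus unlabelled views."""
    X_text, X_visual, y = list(X_text), list(X_visual), list(y)
    X_text_u, X_visual_u = list(X_text_u), list(X_visual_u)

    while X_text_u:
        # Supervised step: train each view's classifier on the labelled pool.
        text_clf.fit(np.array(X_text), np.array(y))
        visual_clf.fit(np.array(X_visual), np.array(y))

        # Each classifier scores the unlabelled instances in its own view.
        p_t = text_clf.predict_proba(np.array(X_text_u))
        p_v = visual_clf.predict_proba(np.array(X_visual_u))
        conf = np.maximum(p_t.max(axis=1), p_v.max(axis=1))

        # Pick the most confident batch; stop if nothing clears the threshold.
        order = np.argsort(conf)[::-1][:batch_size]
        picked = [i for i in order if conf[i] >= threshold]
        if not picked:
            break

        # Label *both* views with the more confident view's prediction and
        # move the instance into the labelled pool; then retrain next pass.
        for i in sorted(picked, reverse=True):
            if p_t[i].max() >= p_v[i].max():
                label = text_clf.classes_[p_t[i].argmax()]
            else:
                label = visual_clf.classes_[p_v[i].argmax()]
            X_text.append(X_text_u.pop(i))
            X_visual.append(X_visual_u.pop(i))
            y.append(label)
    return text_clf, visual_clf
```

With scikit-learn, text_clf and visual_clf could be, for example, SVC(probability=True) instances, loosely mirroring the SMO base classifier described later under Methodology.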
Early and Late Fusion
Early Fusion
[Diagram: the visual features and textual features are concatenated into a fused vector; a single classifier is trained on it, and the combined result is used for labeling test instances]
Late Fusion
[Diagram: the visual features feed a visual classifier and the textual features feed a text classifier; the combined result of the two classifiers is used for labeling test instances]
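A minimal sketch contrasting the two fusion schemes, using placeholder data; the classifier choice and the probability-averaging rule for late fusion are illustrative.

```python
# Early vs. late fusion sketch (placeholder data; the classifier choice
# and the probability-averaging rule for late fusion are illustrative).
import numpy as np
from sklearn.svm import SVC

X_visual = np.random.rand(20, 30)  # placeholder visual features
X_text = np.random.rand(20, 50)    # placeholder textual features
y = np.arange(20) % 2              # placeholder binary labels

# Early fusion: concatenate the views into one fused vector per instance
# and train a single classifier on it.
early_clf = SVC(probability=True).fit(np.hstack([X_visual, X_text]), y)

# Late fusion: one classifier per view; their outputs are combined.
visual_clf = SVC(probability=True).fit(X_visual, y)
text_clf = SVC(probability=True).fit(X_text, y)

def late_fusion_predict(xv, xt):
    """Average the per-view class probabilities, then take the argmax."""
    p = (visual_clf.predict_proba(xv) + text_clf.predict_proba(xt)) / 2
    return visual_clf.classes_[p.argmax(axis=1)]
```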
Semi-supervised EM and Transductive SVMs
Semi-supervised Expectation Maximization
• Learns a probabilistic classifier from the labeled training data
• Performs EM iterations:
• E-step: uses the currently trained classifier to probabilistically label the unlabeled training examples
• M-step: retrains the classifier on the union of the labeled data and the probabilistically labeled unsupervised examples
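A compact sketch of the semi-supervised EM loop; the naive Bayes base classifier and fixed iteration count are illustrative stand-ins for any probabilistic classifier.

```python
# Semi-supervised EM sketch (GaussianNB and the iteration count are
# illustrative; any probabilistic classifier fits the scheme).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def semi_supervised_em(X_l, y_l, X_u, n_iter=10):
    clf = GaussianNB().fit(X_l, y_l)  # initial supervised model
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled examples.
        probs = clf.predict_proba(X_u)
        # M-step: retrain on labeled + probabilistically labeled data;
        # each unlabeled point appears once per class, weighted by P(c|x).
        X_all = np.vstack([X_l] + [X_u] * probs.shape[1])
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c)
                                        for c in clf.classes_])
        w_all = np.concatenate([np.ones(len(X_l))] +
                               [probs[:, i] for i in range(probs.shape[1])])
        clf = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```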
Transductive SVMs
• A method for improving the generalization accuracy of SVMs by using unlabeled data
• Finds the labeling of the test examples that results in the maximum-margin hyperplane separating the positive and negative examples of both the training and the test data
• Transductive SVMs are typically designed to improve performance on the test data by exploiting its availability during training.
• They are also used directly in the semi-supervised setting, where unlabeled data from the same distribution as the test data is available during training.
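For concreteness, the standard transductive SVM objective (Joachims' soft-margin form) makes the "maximum margin over both training and test data" idea explicit; the presentation does not spell out this formulation, so it is included here as background.

```latex
% Transductive SVM objective (Joachims-style soft-margin form):
% jointly choose the hyperplane (w, b) and the test labels y^*_j.
\min_{\mathbf{w},\, b,\, y^*_1,\dots,y^*_k,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^*}
  \;\tfrac{1}{2}\lVert \mathbf{w} \rVert^2
  + C \sum_{i=1}^{n} \xi_i
  + C^* \sum_{j=1}^{k} \xi^*_j
\quad \text{s.t.} \quad
  y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,
\qquad
  y^*_j(\mathbf{w}\cdot\mathbf{x}^*_j + b) \ge 1 - \xi^*_j .
```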
Methodology
• For co-training, a Support Vector Machine is the base classifier for both the image and text views.
• The Weka implementation of sequential minimal optimization (SMO) is used.
• SMO is an algorithm for efficiently solving the optimization problem that arises during the training of SVMs.
• Parameters used in SMO: RBF kernel (γ = 0.01); batch size: 5.
Confidence thresholds:
• Image/video view: 0.65 (static images), 0.6 (videos)
• Text view: 0.98 (static images), 0.9 (videos)
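As a rough scikit-learn analogue of that SMO configuration (a stand-in only; the authors used Weka):

```python
# Rough scikit-learn stand-in for the Weka SMO configuration (the
# authors used Weka; this is only an analogue).
from sklearn.svm import SVC

# RBF kernel with gamma = 0.01; probability estimates enabled so the
# per-view confidence thresholds above can be applied during co-training.
base_clf = SVC(kernel="rbf", gamma=0.01, probability=True)
```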
Methodology (Continued)
• Ten iterations of ten-fold cross validation are performed to get smoother, more reliable results.
• The test set is disjoint from both the labeled and unlabeled training data.
• Learning curves are used for evaluating accuracy.
Learning Curves
• A learning curve shows at a glance the initial difficulty of learning something and, to an extent, how much there is to learn after initial familiarity.
• The curves are generated by labeling some fraction of the training data at each point and using the remainder as unlabeled training data.
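A sketch of how such learning-curve points can be generated; the data, fractions, and plain-SVM learner here are synthetic stand-ins for the paper's setup.

```python
# Learning-curve sketch: at each point, label a fraction of the training
# data and measure accuracy on a disjoint test set (data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
curve = []
for frac in [0.05, 0.1, 0.2, 0.4, 0.8, 1.0]:
    n = max(2, int(frac * len(X_tr)))
    clf = SVC(kernel="rbf", gamma=0.01).fit(X_tr[:n], y_tr[:n])
    # (In the paper, the remaining training data would be used unlabeled
    # by the semi-supervised learners; this plain SVC ignores it.)
    curve.append((frac, clf.score(X_te, y_te)))
print(curve)
```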
Results
Classifying captioned static images:
• Image dataset: the Israel dataset
• Images have short text captions
• Two classes: 'Desert' and 'Trees'
Examples of images
[Example images with their captions]
DESERT: "Ibex in Judean Desert", "Ibex Eating In The Nature"
TREES: "Bedouin Leads His Donkey That Carries Load Of Straw", "Entrance To Mikveh Israel Agricultural School"
[Learning-curve plots: co-training vs. supervised classifiers; co-training vs. semi-supervised EM; co-training vs. transductive SVM]
Results (contd.)
Recognizing actions from commented videos:
• Video clips of soccer and ice skating
• Videos resized to 240x360 resolution and then divided manually into short clips
• Clip length varies from 20 to 120 frames
• Four categories: kicking, dribbling, spinning, dancing
Examples of videos
[Example video frames shown on two slides]
[Learning-curve plots: co-training vs. supervised learning on the commented video dataset; co-training vs. supervised learning when text commentary is not available]
Limitations of the approach
• The data set used is small and requires only binary classification.
• Only images that have explicit captions are used.
QUESTIONS?
THANK YOU
Joydeep Sinha, Anuhya Koripella, Akshitha Muthireddy