Improved Video Categorization from Text Metadata and User Comments

ACM SIGIR 2011: Research and Development in Information Retrieval

Katja Filippova, Keith B. Hall

Presenter: Viraja Sameera Bandhakavi

Contributions

• Analyze sources of text information such as title, description, and comments, and show that they provide valuable indications of the topic

• Show that a text-based classifier trained on the imperfect predictions of a weakly supervised video content-based classifier is not redundant

• Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently

Research questions not answered by related work

• Can a classifier learn from imperfect predictions of a weakly supervised classifier? Is the accuracy comparable to the original one? Can a combination of two classifiers outperform either one?

• Do the video and text based classifiers capture different semantics?

• How useful is user provided text metadata? Which source is the most helpful?

• Can reliable predictions be made from user comments? Can they improve the performance of the classifier?

Methodology

• Builds on top of the predictions of Video2Text

• Video2Text:

– Requires no labeled data other than video metadata

– Clusters similar videos and generates a text label for each cluster

– The resulting label set is larger and better suited for categorization of video content on YouTube

Video2Text

• Starts from a set of weak labels based on the video metadata

• Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)

• Every concept is associated with a binary classifier trained from a large set of audio and video signals

• Positive instances: videos that mention the concept in the metadata

• Negative instances: videos that don’t mention the concept in the metadata
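
The weak labeling step above can be sketched as follows. This is a minimal illustration, not the Video2Text implementation: the function name `weak_label`, the dictionary layout of `videos`, and whitespace tokenization are all assumptions made for the example.

```python
def weak_label(videos, concept):
    """Split videos into positive and negative instances for one concept.

    `videos` maps a video id to its metadata text (hypothetical layout).
    A video is positive when the concept (a unigram or bigram) occurs
    as a token sequence in its metadata, negative otherwise.
    """
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    positives, negatives = [], []
    for video_id, metadata in videos.items():
        tokens = metadata.lower().split()
        # All contiguous n-grams of the metadata, n = length of the concept.
        ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
        if concept_tokens in ngrams:
            positives.append(video_id)
        else:
            negatives.append(video_id)
    return positives, negatives


videos = {
    "v1": "Xbox 360 gameplay review",
    "v2": "How to bake bread",
    "v3": "Top ten xbox games",
}
pos, neg = weak_label(videos, "xbox")
pos_bigram, neg_bigram = weak_label(videos, "xbox 360")
```

The labels are "weak" because a video that shows the concept but never mentions it in its metadata still lands in the negative set.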

Procedure

• A binary classifier is trained for every concept in the vocabulary

– Accuracy is assessed on a portion of a validation dataset

– Each iteration uses a subset of unseen videos from the validation set

– The classifier and concept are retained if precision and recall are above a threshold (0.7 in this paper)

• The remaining classifiers are used to update the feature vectors of all videos

• Repeated until the vocabulary size doesn’t change much or the maximum number of iterations is reached

• Finer grained concepts are learned from concepts added in the previous iteration

• Labels related to news, sports, film, etc. are grouped together, resulting in the final set of 75 two-level categories
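
The retention rule in the procedure (keep a concept only if both precision and recall clear the threshold) can be illustrated with a small sketch. The helper names and the `validation_results` layout are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "above a threshold".

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def retain_concepts(validation_results, threshold=0.7):
    """Keep concepts whose classifier exceeds the threshold on both metrics."""
    kept = []
    for concept, (predicted, actual) in validation_results.items():
        precision, recall = precision_recall(predicted, actual)
        if precision > threshold and recall > threshold:
            kept.append(concept)
    return kept


# Toy validation outcomes: (classifier output, weak label) per video.
validation_results = {
    "soccer": ([1, 1, 1, 0, 0], [1, 1, 1, 0, 0]),  # precision 1.0, recall 1.0
    "music": ([1, 1, 0, 0, 0], [1, 0, 1, 1, 0]),   # precision 0.5, recall 1/3
}
kept = retain_concepts(validation_results)
```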

Categorization with Video2Text

• Use Video2Text to assign two-level categories to videos

• Total number of binary classifiers (hence labels) limited to 75

• Output of Video2Text represented as a list of strings: (v_i, c_j, s_ij)

Distributed MaxEnt

• Approach automatically generates training examples for the category classifier

• Uses the conditional maximum entropy optimization criterion to train the classifiers

• Results in a conditional probability model over the classes given the YouTube videos.
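
For reference, a conditional maximum entropy model is usually written in the log-linear form below. The notation is an assumption rather than taken from the slides: v is a video, c a category, f_k are feature functions over (video, category) pairs, and λ_k are the learned weights.

```latex
p(c \mid v) = \frac{\exp\left(\sum_k \lambda_k f_k(v, c)\right)}
                   {\sum_{c'} \exp\left(\sum_k \lambda_k f_k(v, c')\right)}
```

Training chooses the weights λ_k that maximize the conditional log-likelihood of the automatically generated training examples.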

Data and Models

• Text models differ in the text sources from which the features are extracted: title, description, comments, etc.

• Features used are all token based

• Infrequent tokens are filtered out to reduce the feature space

• Token frequencies are calculated over 150K videos

• Every unique token is counted once per video

• A token frequency threshold of 10 is used

• Tokens are prefixed with the first letter of the source where they were found, e.g. T:xbox, D:xbox, U:xbox, C:xbox, etc.
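
The feature extraction described above can be sketched roughly as follows. The function names, the dictionary layout of a video, and whitespace tokenization are assumptions for illustration; whether the threshold of 10 is inclusive is also an assumption.

```python
from collections import Counter


def prefixed_tokens(video):
    """Return the set of source-prefixed tokens for one video.

    `video` maps a one-letter source prefix (e.g. "T" for title) to raw
    text. Returning a set means each unique token counts once per video.
    """
    return {
        f"{prefix}:{token}"
        for prefix, text in video.items()
        for token in text.lower().split()
    }


def build_vocabulary(videos, min_videos=10):
    """Keep tokens that occur in at least `min_videos` videos (assumed inclusive)."""
    counts = Counter()
    for video in videos:
        counts.update(prefixed_tokens(video))
    return {token for token, n in counts.items() if n >= min_videos}


# Ten identical toy videos plus one video with rare tokens.
videos = [{"T": "xbox xbox review", "C": "great xbox video"} for _ in range(10)]
videos.append({"T": "rare title", "D": "xbox"})
vocab = build_vocabulary(videos)
```

Because of the prefix, "xbox" in a title and "xbox" in a description are separate features, so the model can weight each source differently.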

Combined Classifier

• Used to see whether the combination of the two views, video based and text based, is beneficial

• A simple meta classifier is used, which ranks the video categories based on the predictions of the two classifiers

• Video-based predictions are converted to a probability distribution

• The distributions from the video-based prediction and from MaxEnt (the maximum entropy classifier) are multiplied

• This approach proved to be effective

• Idea: each classifier has veto power

• The final prediction for each video is the category with the highest product score
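
A minimal sketch of the product combination is shown below. The slides do not say how video scores are converted to a distribution, so normalizing by the sum is an assumption, as is treating a category missing from one classifier as probability zero. All names are hypothetical.

```python
def to_distribution(scores):
    """Normalize non-negative scores so they sum to 1 (assumed conversion)."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {category: s / total for category, s in scores.items()}


def combine(video_scores, text_probs):
    """Multiply the two distributions and return the top category and all products."""
    video_probs = to_distribution(video_scores)
    categories = set(video_probs) | set(text_probs)
    products = {
        c: video_probs.get(c, 0.0) * text_probs.get(c, 0.0) for c in categories
    }
    best = max(products, key=products.get)
    return best, products


video_scores = {"sports": 0.9, "news": 0.6}
text_probs = {"sports": 0.2, "news": 0.7, "music": 0.1}
best, products = combine(video_scores, text_probs)
```

Here the video model prefers "sports" while the text model strongly prefers "news", and the product favors "news". A category that either classifier scores near zero cannot win, which is the veto effect.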

Experiments: Evaluation of Text Models

• Training data set contains 100K videos that received a high-scoring prediction

• Correct prediction: score of at least 0.85 from Video2Text

• The text-based prediction must be in the set of video-assigned categories

• Evaluation was done on two sets of videos:

– Videos with at least one comment

– Videos with at least 10 comments
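
One reading of the correctness criterion above: a text-based prediction counts as correct when it matches a category that Video2Text assigned to the same video with a score of at least 0.85. The sketch below follows that reading; the function name and the (category, score) pair layout are assumptions.

```python
def is_correct(text_prediction, video_predictions, min_score=0.85):
    """Check a text-based prediction against high-scoring video-based ones.

    `video_predictions` is a list of (category, score) pairs for one video.
    """
    accepted = {category for category, score in video_predictions if score >= min_score}
    return text_prediction in accepted


video_predictions = [("sports/soccer", 0.91), ("news/politics", 0.40)]
hit = is_correct("sports/soccer", video_predictions)
miss = is_correct("news/politics", video_predictions)
```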

Experiments: Evaluation of Text Models, Contd.

• The best model is TDU+YT+C for both sets

• This model is used for comparison against the Video2Text model with human raters

• This model is also used in the combination model

Experiments with Human Raters

• A total of 750 videos are extracted equally from the 15 YouTube categories

• A human rater rates each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0)

• Every pair received ratings from 3 human raters

• The three ratings are summed, normalized (by dividing by 9), and rounded off to get the resultant score
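
The score computation above amounts to simple arithmetic, sketched here. The slides do not state the rounding precision, so the `ndigits` parameter (two decimal places by default) is an assumption, as is the function name.

```python
def rater_score(ratings, ndigits=2):
    """Sum three ratings in 0..3, divide by the maximum total of 9, and round."""
    if len(ratings) != 3 or any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("expected three ratings, each between 0 and 3")
    return round(sum(ratings) / 9, ndigits)


all_correct = rater_score((3, 3, 3))   # 9 / 9
mixed = rater_score((3, 2, 1))         # 6 / 9
off_topic = rater_score((0, 0, 0))     # 0 / 9
```

For example, ratings of 3, 2, and 1 sum to 6, giving a normalized score of about 0.67.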

Experiments with Human Raters Contd…

• Score of at least 0.5 – correct category

• Text based model performs significantly better than video model

• Combination model improved accuracy

• Accuracy of all models increases with number of comments

Conclusion

• Text-based approach for assigning categories to videos

• Competitive classifier trained on high-scoring predictions made by a weakly supervised classifier (video features)

• Text and video models provide complementary views on the data

• Simple combination model outperforms each model on its own

• Accurate predictions from user comments

• Reasons for impact of comments:

– Substitute for a proper title

– Disambiguate the category

– Help correct wrong predictions

• Future work: Investigate usefulness of user comments for other tasks
