Drinking from a Firehose: Large-Scale Search in Consumer-Produced Videos
Dr. Gerald Friedland, Director, Audio and Multimedia Research, [email protected]


Page 1:

Drinking from a Firehose: Large-Scale Search in Consumer-Produced Videos

Dr. Gerald Friedland, Director, Audio and Multimedia Research, [email protected]

Page 2:

What is ICSI?

• International Computer Science Institute
• Focused on fundamental, open Computer Science research
• Nonprofit, private institute affiliated with UC Berkeley
• Inaugurated in 1988 based on a German government initiative to exchange academic personnel (students, postdocs, faculty) and improve German computer science

Page 3:

Audio and Multimedia Research at ICSI

3

Two main themes:
• Social media retrieval
  – Concept Search
  – Location Estimation
• Privacy

Page 4:

Multimedia in the Internet is Growing

4

Page 5:

Consumer-Produced Videos are Growing

5

• YouTube claims 65k–100k video uploads per day, or 48–72 hours of video every minute

• Youku (Chinese YouTube) claims 80k video uploads per day

• Facebook claims 415k video uploads per day!

Page 6:

Why do we care?

• Consumer-Produced Multimedia allows empirical studies at never-before-seen scale in various research disciplines as well as industrial endeavors.

• Recent buzzword: BIGDATA

6

Page 7:

7

Results: Google

Page 8:

Example: Audio Concept Retrieval

8

Annotated audio concepts (from the figure): Ball sound, Male voice (near), Child's voice (distant), Child's whoop (distant), Room tone

Cameron learns to catch (http://www.youtube.com/watch?v=o6QXcP3Xvus)

Page 9:

An Experiment

Listen!

• Which city was this recorded in? Pick one of: Amsterdam, Bangkok, Barcelona, Beijing, Berlin, Cairo, CapeTown, Chicago, Dallas, Denver, Duesseldorf, Fukuoka, Houston, London, Los Angeles, Lower Hutt, Melbourne, Moscow, New Delhi, New York, Orlando, Paris, Phoenix, Prague, Puerto Rico, Rio de Janeiro, Rome, San Francisco, Seattle, Seoul, Siem Reap, Sydney, Taipei, Tel Aviv, Tokyo, Washington DC, Zuerich

• Solution: Tokyo, highest confidence score!

Page 10:

Problem

• Videos need to be searchable beyond keyword tags. Ideally we want to be able to search for any existing mental concept.

• How do we make large-scale concept search possible?

10

Page 11:

Task: Video Concept Search

11

• Given:
  – a set of concepts, each described by a set of example videos

• Wanted:
  – a machine representation (i.e., a model) that allows the detection of videos that belong to one of the pre-defined concepts
  – a classification as to which concept a detected video belongs

Page 12:

NIST TrecVID MED

12

• Videos all “consumer produced”, typically 1–5 minutes long

• Given 30 concepts (2012), 10 for training, 20 for eval

• About 100 sample videos per concept
• Testset 2012: 150k videos, open set
• Error Metric: DET curve (a sketch of the metric follows below)
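The DET curve used as the error metric plots miss rate against false-alarm rate while sweeping the detection threshold. Below is a minimal sketch of how those two rates can be computed from per-video detection scores; the scores and labels are random placeholders, not MED data.

```python
import numpy as np

def det_points(scores, labels):
    """Return (false-alarm rate, miss rate) pairs over all score thresholds."""
    order = np.argsort(-scores)            # accept videos in order of confidence
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives if we accept the top k
    fp = np.cumsum(1 - labels)             # false alarms if we accept the top k
    miss = 1.0 - tp / labels.sum()         # positives we fail to return
    fa = fp / (1 - labels).sum()           # negatives we wrongly return
    return fa, miss

# Toy operating point: best miss rate at a false-alarm rate of at most 6%.
rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = np.zeros(1000, dtype=int)
labels[:50] = 1                            # pretend 50 of 1000 videos are positives
fa, miss = det_points(scores, labels)
print("miss at FA <= 6%:", miss[fa <= 0.06].min())
```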

Page 13:

2011 Concepts

13

Event  Category             Train  DevTest
E001   Board Tricks         160    111
E002   Feeding Animal       160    111
E003   Landing a Fish       122    86
E004   Wedding              128    88
E005   Woodworking          142    100
E006   Birthday Party       173    0
E007   Changing Tire        110    0
E008   Flash Mob            173    0
E009   Vehicle Unstuck      131    0
E010   Grooming animal      136    0
E011   Make a Sandwich      124    0
E012   Parade               134    0
E013   Parkour              108    0
E014   Repairing Appliance  123    0
E015   Sewing               116    0
Other  Random other         N/A    3755

Page 14:

2012 New Concepts

14

Event Category

E021 Attempting a Bike Trick

E022 Cleaning an Appliance

E023 Dog Show

E024 Giving Directions to a Location

E025 Marriage Proposal

E026 Renovating a Home

E027 Rock Climbing

E028 Town Hall Meeting

E029 Winning a Race without a Vehicle

E030 Working on a Metal Crafts Project

Page 15:

Sample Video 1: “Board Tricks”

15

Page 16:

Sample Video 2: “Board Tricks”

16

Page 17:

Test Video: “Board Tricks”

17

NOT A POSITIVE!

Page 18:

Prior Work: TrecVid MED 2010

18

Page 19:

Related Work: TrecVid MED 2010

19

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subhabrata Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang: Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, Proceedings of TrecVid 2010, Gaithersburg, MD, December 2010.

Page 20:

General Observations

20

• Classifier Ensembles problematic:
  – Which classifiers to build?
  – Training data?
  – Annotation?
  – Idea doesn't scale... or does it?

Alexander Hauptmann, Rong Yan, and Wei-Hao Lin: “How many high-level concepts will fill the semantic gap in news video retrieval?”, in Proceedings of the 6th ACM international conference on Image and Video retrieval, CIVR ’07, pages 627–634, New York, NY, USA, 2007. ACM.

Page 21:

General Observations

21

Hauptmann et al.: How many classifiers are needed for a good retrieval system?

Page 22:

General Observations

22

• Speech Recognition (incl. keyword spotting) mostly infeasible (50% of videos contain no speech, speech is in arbitrary languages, quality varies)

• Acoustic event detection: same issues as visual object recognition

• Music indicative of a concept, but not with high confidence

• Single-Model Approach problematic:
  – Too noisy
  – Not explicable

Page 23:

What is a Concept?

23

Back to Philosophy: A concept is created by abstracting, drawing away, or removing the uncommon characteristics from several particular ideas. The remaining common characteristic is that which is similar to all of the different individual ideas. (John Locke, 1632-1704)

Page 24:

Percepts

24

Another hint from philosophy: Concepts without percepts are empty; percepts without concepts are blind. (Immanuel Kant, 1724-1804)

Page 25:

Percepts

25

Definition: an impression of an object obtained by use of the senses. (Merriam-Webster's)

• Well re-discovered in robotics, btw...

Page 26:

Idea

26

Based on the given example videos for each concept, let the computer figure out the events/objects that are inherently and exclusively observable in a concept.

Page 27:

ICSI’s Audio Contribution: Idea

27

• Extract “audible units”, aka percepts.

• Determine which percepts are uncommon across concepts but common to the same concept.

• Detect video concepts by detecting common percepts.

Page 28:

Conceptual System Overview

28

Block diagram: Audio Signal → Percepts Extraction → Percepts Weighing → Classification → Concept (test), with Concept (train) providing the training input.

Page 29:

Percepts Extraction

29

• Based on ICSI Speaker Diarization System, but:
  – Speech/Speaker-specific components removed
  – Tuned for generic event diarization
  – Super-fast implementation from UC Berkeley's PARLAB
• Separate processing for music/speech/non-speech

Page 30:

Percepts Extraction

30

Diarization loop (block diagram): Initialization → (Re-)Training → (Re-)Alignment → "Merge two clusters?"; if yes, merge and repeat; if no, end.

• Start with too many clusters (initialized randomly)
• Purify clusters by comparing and merging similar clusters
• Resegment and repeat until no more merging needed

Page 31:

Percepts Extraction

31

• High number of initial segments
• Features: MFCC19+D+DD+MSG
• Minimum segment length: 30 ms
• Train Model(A,B) from Segments A,B belonging to Model(A) and Model(B) and compare using BIC:


…down”. The former incorporates agglomerative hierarchical clustering, in which an initially large number of clusters are gradually merged to increase (or decrease) some chosen metric, using a stopping criterion to determine when to discontinue the merging process. The latter uses divisive clustering, which starts with a small number of initial clusters (sometimes one) and performs splits according to a metric, using a stopping criterion to determine when to stop splitting. In both cases, the goal is to achieve an optimum number of clusters (ideally corresponding to the number of speakers), and also to determine the start and end points for each segment in each cluster.

The most common measure of goodness for a particular choice of merging two segments (or splitting a single one into two) is the so-called Bayesian Information Criterion (BIC) [5]. The quantity to be compared between a single segment and two separate segments is

\log p(X \mid \Theta) - \tfrac{1}{2}\,\lambda K \log N \qquad (42.2)

where X is the sequence of speech features in the segment (such as mel cepstra), Θ are the parameters of the statistical model for the segment, K is the number of parameters for the model, N is the number of speech feature vectors (e.g., frames) in the segment, and λ is an optimization parameter, which ideally should be 1.0, but in reality is tuned for good results based on experience. Even if the number of parameters in each comparison are constrained to be the same, there are still some hyperparameters to be chosen. For instance, the number of initial segments (for the bottom-up approach) and the initial number of Gaussians per model can both be critical choices. The type of initialization used for the first segmentation can also be critical.

Note that the first term is simply the log likelihood of the segment, while the second term accounts for complexity. Without the second term, the optimum value would simply be one that had the largest number of segments (and parameters), since this would best fit the data. However, the presence of the tuning term λ is a bothersome limitation, since its proper setting depends on the availability of a comparable development set. In [1], it was proposed that the number of parameters be kept constant between the models that would be compared (e.g., between ascribing two segments to a single model or to two different ones). This would mean that the second term would be irrelevant. In other words, as long as the number of parameters are kept the same, models can be compared based on log likelihood alone and still fulfill the BIC.

Most commonly, the statistical model used to represent each cluster is a Gaussian Mixture Model (GMM). As described in earlier chapters, this is a weighted sum of Gaussian distributions, where most commonly each Gaussian is parameterized by a mean vector and a diagonal-only covariance matrix. The underlying model for the entire speech sample is typically a Hidden Markov Model, where each state corresponds to a cluster (represented by a GMM), and so the actual segmentation for each iteration of the segmentation is determined by a Viterbi realignment. Since speaker turns usually don't occur every frame, a minimum duration constraint is enforced for each speech segment (a typical value is 2.5 seconds) to make sure speakers are clustered, not phones or other units. The following …
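The excerpt's observation that matched parameter counts make the BIC penalty cancel can be illustrated with a small sketch (this is not the ICSI implementation; the feature dimensionality, component counts, and data are made up): merge two clusters only if one GMM trained on their combined frames explains the data at least as well as two separately trained GMMs with the same total number of parameters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def total_log_likelihood(frames, n_components):
    """Fit a diagonal-covariance GMM and return the total log-likelihood."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(frames)
    return gmm.score(frames) * len(frames)     # score() returns the per-frame mean

def should_merge(frames_a, frames_b, n_components=5):
    """BIC merge test with matched parameter counts, so the penalty terms cancel."""
    separate = (total_log_likelihood(frames_a, n_components) +
                total_log_likelihood(frames_b, n_components))
    combined = total_log_likelihood(np.vstack([frames_a, frames_b]), 2 * n_components)
    return combined >= separate

# Toy usage with random "MFCC-like" 19-dimensional frames from two segments.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 19))
b = rng.normal(size=(200, 19)) + 3.0           # shifted, i.e. acoustically different
print("merge the two clusters?", should_merge(a, b))
```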

Page 32:

Percepts Extraction: Technical Questions

32

• Which Features?
• Which Models?
• How to Compare Models?
• Minimum Segment Length?
• etc...

Page 33:

Percepts Dictionary

33

• Percepts extraction works on a video-by-video basis
• Use clustering to unify percepts across videos in one concept and build prototype percepts
• Represent videos by supervectors of prototype percepts = “words” (see the sketch below)
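A rough sketch of the dictionary step under stated assumptions: each diarized segment is summarized by one fixed-length vector (an illustrative simplification), K-means over all segments of a concept yields 300 prototype percepts, and each video then becomes a histogram over those prototypes, i.e. its supervector of “words”.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(segment_vectors, n_words=300):
    """Cluster per-segment vectors of one concept into prototype percepts."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(segment_vectors)

def video_to_word_counts(dictionary, segment_vectors):
    """Represent one video as a histogram of prototype-percept ("word") counts."""
    words = dictionary.predict(segment_vectors)
    return np.bincount(words, minlength=dictionary.n_clusters)

# Toy usage: 5000 segment vectors pooled from one concept's training videos,
# then a new video with 40 segments.
rng = np.random.default_rng(0)
dictionary = build_dictionary(rng.normal(size=(5000, 19)))
histogram = video_to_word_counts(dictionary, rng.normal(size=(40, 19)))
print(histogram.shape)   # (300,)
```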

Page 34:

Questions...

34

• How many unique Percepts define a Concept?

• What's the occurrence frequency of the unique Percepts per Concept?

• What's the cross-Concept ambiguity of the Percepts?

• How indicative are the most frequent Percepts of a Concept?

Page 35:

Properties of “Words”

35

• Sometimes the same “word” describes multiple percepts (homonym)

• Sometimes the same percept is described by different “words” (synonym)

• Sometimes multiple “words” are needed to describe one percept => Problem?

Page 36:

Distribution of “Words”

36

Histogram of top-300 “words”.

Long-Tailed Distribution

Page 37:

Empirical Observations

37

•Hauptmann et al. assumed Zipf distribution of Concepts in Alexander Hauptmann, Rong Yan, and Wei-Hao Lin. How many high-level concepts will fill the semantic gap in news video retrieval? In Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR ’07, pages 627–634, New York, NY, USA, 2007. ACM.

•Lu & Hanjalic found evidence for Zipf distribution of audio features in Lie Lu and A. Hanjalic. Audio keywords discovery for text-like audio content analysis and retrieval. Multimedia, IEEE Transactions on, 10(1):74–85, jan. 2008.

Page 38:

Empirical Observations

38

• Chaudhuri et al. found a Zipf distribution after segmenting TRECVid videos in Sourish Chaudhuri, Mark Harvilla, and Bhiksha Raj. Unsupervised learning of acoustic unit descriptors for audio content representation and classification. In INTERSPEECH, pages 2265–2268, 2011.

• Huang et al. used a method based on Zipf distributions for TrecVID MED11 in Po-Sen Huang, Robert Mertens, Ajay Divakaran, Gerald Friedland, and Mark Hasegawa-Johnson. How to Put it into Words – Using Random Forests to Extract Symbol Level Descriptions from Audio Content for Concept Detection. In Proceedings of IEEE ICASSP, pages AASP–P8.10, March 2012.

Page 39:

Empirical Observations

39

•Mertens et al. documented a Zipf distribution after diarizing TRECVid videos in Robert Mertens, Po-Sen Huang, Gerald Friedland, and Ajay Divakaran. On the applicability of speaker diarization to audio indexing of non-speech and mixed non-speech/speech video soundtracks. International Journal of Multimedia Data Engineering and Management, to appear.

Page 40:

TF/IDF on Supervectors

40

• Zipfian distribution already observed before by Alex Hauptmann

• A Zipfian distribution allows us to treat the supervector representation of percepts as “words” in a document.

• Use TF/IDF for percept weighting
• Back to TREC!

Page 41:

Recap: TF/IDF

41

• TF(ci, Dk) is the frequency of “word” ci in concept Dk
• P(ci = cj | cj ∈ Dk) is the probability that “word” ci equals cj in concept Dk
• |D| is the total number of concepts
• P(ci ∈ Dk) is the probability of “word” ci appearing in concept Dk
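Since the slide's own formula did not survive the transcript, the sketch below assumes the standard form tf(ci, Dk) · log(|D| / |{Dk : ci ∈ Dk}|), applied to percept-“word” counts per concept:

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weights for a (n_concepts, n_words) matrix of percept-"word" counts."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # frequency within each concept
    df = (counts > 0).sum(axis=0)                                   # concepts in which each word occurs
    idf = np.log(counts.shape[0] / np.maximum(df, 1))               # rare-across-concepts words weigh more
    return tf * idf

# Toy usage: 15 training concepts, a 300-word percept dictionary.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(15, 300))
weights = tfidf(counts)
top20 = np.argsort(-weights, axis=1)[:, :20]   # the 20 most indicative "words" per concept
print(weights.shape, top20.shape)
```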

Page 42:

SVM Classifier

42

• SVM takes 300 vectors of (supervector, TF-IDF value) pairs as input

• Use Intersection Kernel (IKSVM)
• Training with 15 concepts (see the sketch below)
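A minimal sketch of an intersection-kernel SVM over TF-IDF-weighted percept vectors using scikit-learn's precomputed-kernel interface; the 300-dimensional vectors and 15 concept labels follow the slides, while the data itself is a random placeholder and this is not the exact classifier used in the evaluation.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    """Histogram intersection kernel: K[i, j] = sum_d min(X[i, d], Y[j, d])."""
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

# Toy data: 100 training videos with 300-dimensional TF-IDF percept vectors,
# each labeled with one of 15 concepts, plus 10 test videos.
rng = np.random.default_rng(0)
X_train = np.abs(rng.normal(size=(100, 300)))
y_train = rng.integers(0, 15, size=100)
X_test = np.abs(rng.normal(size=(10, 300)))

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(intersection_kernel(X_train, X_train), y_train)      # train-vs-train Gram matrix
print(clf.predict(intersection_kernel(X_test, X_train)))     # test-vs-train Gram matrix
```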

Page 43:

System Overview

43

Framework: Multimedia Document → Percepts Extraction → Percepts Selection → Classification → Concept (test), with Concept (train) as the training input.

Realization: Audio Track → Diarization & K-Means → TFIDF → SVM → Concept (test), with Concept (train) as the training input.
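Putting the Realization row together, here is a compact end-to-end sketch under the same illustrative assumptions as the snippets above (per-segment feature vectors stand in for real diarization output, and a linear SVM stands in for the intersection-kernel SVM to keep the sketch short):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def audio_to_words(frames, dictionary):
    """Map one video's feature frames to a histogram over prototype percepts."""
    return np.bincount(dictionary.predict(frames), minlength=dictionary.n_clusters)

# "Training": build a 300-word dictionary from all training frames, histogram
# each video, weight by IDF, and train the classifier (toy data throughout).
rng = np.random.default_rng(0)
train_frames = [rng.normal(size=(rng.integers(50, 200), 19)) for _ in range(60)]
train_labels = rng.integers(0, 5, size=60)                  # 5 toy concepts
dictionary = KMeans(n_clusters=300, n_init=5, random_state=0).fit(np.vstack(train_frames))
H = np.array([audio_to_words(f, dictionary) for f in train_frames])
idf = np.log(len(H) / np.maximum((H > 0).sum(axis=0), 1))
clf = LinearSVC().fit(H * idf, train_labels)

# "Test": same mapping for a new video, then classify.
test_video = rng.normal(size=(120, 19))
print(clf.predict([audio_to_words(test_video, dictionary) * idf]))
```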

Page 44:

Processing Time

44


About 3 weeks (in practice) for 150k videos on the speech group's compute cluster.

(Per-stage timings from the slide's diagram: 30 sec; 3 min; 1 hour per 300 files; 1/2 hour per 300 files; 5 min per 300 files.)

Page 45:

Audio-Only Detection on MED-DEV11

45

Error at FA=6%: Miss = 58%

Page 46:

Audio-Only Detection on MED-DEV11

46

Error at FA=6%: Miss = 58%

Surpassed Year-1 ALADDIN goal audio-only! (goal: 75% miss at 6% FA)

Page 47:

Eval Result MED-11 (Complete System)

47

Surpassed Year-2 ALADDIN goal! (goal: 50% miss at 5% FA)

Page 48:

Let there be Zipf

48

• Let's assume the distribution of Percepts per Concept follows a ranking function:

f(k, s, N) = \frac{1/k^s}{\sum_{n=1}^{N} 1/n^s}

with k the rank (sorted from highest to lowest frequency), s = 1, and N the number of Percepts.

Page 49:

Observations

49

• It follows that the CDF is:

\mathrm{CDF}(k, s, N) = \frac{H_{k,s}}{H_{N,s}}, \qquad H_{n,m} = \sum_{k=1}^{n} \frac{1}{k^m}

with k the rank (sorted from highest to lowest frequency), s = 1, and N the number of Percepts.
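A quick numerical check of this CDF with s = 1 and N = 300 (the 300 percepts per concept used in the system): the resulting coverage values come out within a point or so of the “Predicted Hits” column of Table 1 shown later in the deck.

```python
def harmonic(n, m=1.0):
    """Generalized harmonic number H_{n,m} = sum_{k=1..n} 1/k^m."""
    return sum(1.0 / k**m for k in range(1, n + 1))

def zipf_cdf(k, s=1.0, N=300):
    """Expected coverage of the top-k percepts under a Zipf(s) distribution."""
    return harmonic(k, s) / harmonic(N, s)

for k in (1, 3, 5, 10, 20, 40):
    print(f"top-{k:2d}: {zipf_cdf(k):.0%}")   # approx. 16%, 29%, 36%, 47%, 57%, 68%
```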

Page 50:

A Note

50

• There is no need to commit to a certain language (e.g., human languages)

• Given the sparsity of concepts (as in TrecVID), committing to a certain language will most likely make things more complicated.

Page 51:

Implications

51

Using a well-defined distribution of Percepts to Concepts, one can
• Predict the observability of a certain Percept given a Concept
• Measure model purity, and thereby identify ambiguities and threshold noise
• Find the most representative and indicative Percepts

Page 52:

Properties of Zipfian “Words”

52

• The distribution allows us to distinguish key percepts from noise: a lot less data is better for training!

Top N  Actual Hits  Predicted Hits  Error  Ambiguity
1      17%          16%             1%     0%
3      35%          30%             5%     0%
5      46%          36%             10%    20%
10     56%          46%             10%    24%
20     84%          57%             27%    27%
40     99%          68%             31%    31%

Table 1: Predicted descriptiveness of the top-N percepts for different values of N in comparison to the measurements on TRECVID MED11, as well as observed model impurity in %. While it remains unclear how complete the models are, our theoretical prediction is roughly within the range of actual hits considering the ambiguity.

…plified supervectors, resulting in clusters that represent prototype clusters, which we define as being our percepts.

The entire video concept detection system based on the framework outlined above is shown in Figure 1 in comparison with the conceptual approach outlined in previous sections. The diarization and K-means step represents the percepts extraction. Each concept is represented by 300 percepts, which we used to perform the experiments described in the next section. The GMMs corresponding to the percepts are then used to detect the same percepts in the audio tracks of the test videos. This allows a direct mapping comparison between the percepts in the training and test set. The top-N percepts selection follows a TFIDF approach [6] and a Support Vector Machine (SVM) is used to perform the final classification. The SVM classification is described in detail in [3], however, without the preceding steps.

7. ANALYSIS OF RESULTS

As a first experiment, we want to verify that our framework actually can estimate the descriptiveness for the top-n percepts. We focused our experiments on a smaller set of videos containing the same 5 classes for train and test, with a total of 662 clips of training data and 492 clips of test data. As discussed in the previous Section, we extracted 300 percepts for each concept in the training set. After that, we extracted 300 percepts for each test video. In this experiment, we control for the class both in the test and training set in order to be able to match the percepts perfectly between test and training. We then wanted to know how many videos in the test set actually match the training percepts in their respective class, given a reduction to the top-n highest-frequency percepts. In other words, assuming the training videos in MED 11 as a complete set of percepts for each concept (perfect model), how well are the percepts represented in the test set. Table 1 shows the results of the experiment. We show the predicted descriptiveness of the top-n percepts for different values of n as calculated using Observation 2 and compare it to the empirically observed data. The measurements are obtained by counting the matching videos for all concepts vs. the non-matching. In other words, the top-1 column shows how many videos could in theory already be classified just based on the top-1 occurring percept. The ambiguity is determined by counting the number of homonymous percepts. As can be seen, the prediction error is pretty low and correlates with the ambiguity, which provides evidence for the validity of our framework. Please note, real-world audio percepts only approximate the Zipfian distribution and the models are not complete.

As a second experiment, we wanted to show that classification based on the top-n percepts will improve classification accuracy dramatically, therefore providing evidence for our main hypothesis: there is no data like less data. For this experiment, we selected only training videos that contain a) the top-20 percepts (as determined by TFIDF), b) 20 random percepts, and c) the low-20 percepts (as determined by TFIDF). We trained the SVM classification system overviewed in Section 6 (and detailed in [7, 3]) using the three options and measured the classification accuracy. Table 2 shows the results. Even though we only selected on the level of videos rather than percepts, and so not all ambiguities and noise have been filtered, the classification accuracy changes dramatically based on the selection of top percepts. Please note that the numbers are not comparable with related work because the system was not tuned in any way, as the only goal was to prove our point. Also note that the results are based exclusively on audio and averaged over all 5 classes.

Error        Baseline  Top 20  Low 20
False Alarm  6%        6%      6%
Miss         72%       66%     79%
EER          31%       31%     35%

Table 2: Change of classification error using FA/Miss as defined by TRECVID MED and using Equal-Error-Rate for no percepts selection, top-20 frequent percepts, and low-20 frequent percepts. Even though our method does not yet account for ambiguity, less data is better than all data.

8. CONCLUSION AND FUTURE WORK

We presented a theoretical framework that allows one to reason quantitatively and qualitatively about video concept detection even when multimedia documents have no obvious structure. Our framework is media independent and our supporting experiments were conducted on a publicly available large set of consumer-produced videos. Future work includes extending the framework and learning more from the natural language community and the original TREC evaluations (that later resulted in TRECVID). Our next steps will also include building an actual system that will utilize the predictive power of our framework in different modalities by eliminating noisy percepts. A limitation of the approach is that percepts do not necessarily overlap with human percepts; therefore, introspection and analysis of concept detection results by humans requires an intermittent step: a translator from machine-generated percept-concept mappings, i.e. languages, to human languages.

Acknowledgements

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

9. REFERENCES

[1] Sourish Chaudhuri, Mark Harvilla, and Bhiksha Raj. Unsupervised learning of acoustic unit descriptors for audio content representation and classification. In INTERSPEECH, pages 2265–2268, 2011.

[2] Alexander Hauptmann, Rong Yan, and Wei-Hao Lin. How many high-level concepts will fill the semantic gap in news video retrieval? In Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR '07, pages 627–634, New York, NY, USA, 2007. ACM.

[3] Po-Sen Huang, Robert Mertens, Ajay Divakaran, Gerald Friedland, and Mark Hasegawa-Johnson. How to Put it into Words – Using Random Forests to Extract Symbol Level Descriptions from Audio Content for Concept Detection. In Proceedings of IEEE ICASSP, pages AASP–P8.10, March 2012.

Page 53:

Properties of Zipfian “Words”

53

• Distribution allows prediction of “completeness” of training data

Page 54:

Visualization of Zipfian Percepts

54

• Top-1 percepts very representative of concept.

Page 55:

Conclusions

55

• Video = Audio + Video, and audio has a lot of information that can be processed fast.

• Traditional paradigms need to be rethought for big data processing, rather than just scaled up.

Page 56:

Future Work

56

• Many knobs to tune...
• Reduce ambiguities in percepts extraction
• Exploit the temporal dimension better (“sentences”, “paragraphs”?)
• Exploit multimodality early
• Add a supervised step when training videos are too incomplete

Page 57:

Yahoo/Livermore/ICSI Dataset

• Raw Data
  – 99M images
  – 1M videos
  – title, description, tags, uploader, social graph, time of upload
  – about 10%: geo-tagged
• Features
  – Audio: MFCC, pitch, energy
  – Visual: GIST, sparse SIFT, HoG, Tempura, color histogram, motion

57
www.yli-corpus.org

Page 58:

58

Yahoo/Livermore/ICSI Dataset

Page 59:

Thank You!

59

Questions?