TRANSCRIPT
Translating Related Words to Videos and Back through Latent
Topics
Pradipto Das, Rohini K. Srihari and Jason J. Corso
SUNY Buffalo
WSDM 2013, Rome, Italy
WiSDoM is beyond words
Master Yoda, how do I find wisdom from so many things happening
around us?
Go to the center of the data and find your wisdom you will
Big data problem – lots of data around us, but which of it is meaningful?
We need statistics from the data that meaningfully encode multiple views.
What do the centers look like?
parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground
lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic
floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck
make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody
Be careful what people do with their sandwiches! Interviews indoors can be tough!
tutorial: man explains how to make lobster rolls from scratch
One guy is making sandwich outdoors
Man performs parkour in various locations
montage of guys free running up a tree and through the
woods
interview with parkour contestants
Kid does parkour around the park
Footage of a group performing parkour outdoors
A family holds a strange burger assembly and wrapping contest at Christmas
The actual ground-truth synopses overlaid
Back to conventional wisdom: Translation
S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies,” Current Biology, Vol. 21(19), 2011
F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," In Frontiers in Human Neuroscience, Vol. 5(72), 2011
Topic Model (LDA) Regression
Training Testing
There is some model that captures the correspondence of the blood flow patterns in the brain to the world being observed
Given a slightly different pattern, we are able to translate it into a lingual description via concepts present in our vocabulary
Three basic assumptions of Machine Learning are satisfied: 1) there is a pattern, 2) we do not know the target function, 3) there is data to learn from
Giving back to the community: driverless cars are already helping the visually impaired to drive around. It will be good to enable visually impaired drivers to hear the scenery in front – in a non-invasive way!
Do we speak all that we see?
Multiple Human Summaries (max 10 words, i.e. imposing a length constraint):
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Centers of attention (topics)
1. There is a guy climbing on a rock-climbing wall.
Hand holding climbing surface
How many rocks?
The sketch in the board
Wrist-watch
What’s there in the back?
Color of the floor
Dress of the climber
Not so important!
2. A man is bouldering at an indoor rock climbing gym.
Empty slots
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Summaries point toward information needs!
From patterns to topics to sentences
A young man climbs an artificial rock wall indoors
Adjective modifier (What kind of wall?)
Direct Object
Direct Subject
Adverb modifier (climbing where?)
Major Topic: Rock climbing. Sub-topics: artificial rock wall, indoor rock climbing gym
Spoken Language is complex – structured according to various grammars and dependent on active topics
Different paraphrases describe the same visual input
Object detection models
Expensive frame-wise manual annotation efforts by drawing bounding boxes
Difficulties: camera shakes, camera motion, zooming
Careful consideration is needed of which objects/concepts to annotate
Focus on object/concept detection is noisy for videos in-the-wild
Does not answer which objects/concepts are important for summary generation
Man with microphone
Climbing person
Annotations for training object/concept models
Trained Models
Translating across modalities: learning latent translation spaces, a.k.a. topics
A young man is climbing an artificial rock wall indoors
Human Synopsis
Mixed membership of latent topics
Some topics capture observations that co-occur commonly
Other topics allow for discrimination
Different topics can be responsible for different modalities
No annotations needed – only need clip level
summary
Translating across modalities: using learnt translation spaces for prediction

Text translation – topics are marginalized out to permute the vocabulary for predictions:

p(w_v \mid \{w_{d,o}\}_{o=1}^{O}, \{w_{d,h}\}_{h=1}^{H}) \propto \sum_{o=1}^{O} \sum_{i=1}^{K} p(w_{d,o} \mid i)\, p(w_v \mid i) + \sum_{h=1}^{H} \sum_{i=1}^{K} p(w_{d,h} \mid i)\, p(w_v \mid i)

The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
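The prediction step – marginalizing topics to score and permute the text vocabulary – can be sketched with toy parameters. All names and values below are made up for illustration; they are not the paper's learnt model.

```python
import numpy as np

# Hypothetical toy sizes: K topics, a V-word text vocabulary.
K, V = 4, 6
rng = np.random.default_rng(0)

# p(w_v | topic i): topic-to-text-vocabulary distributions, one row per topic.
beta = rng.dirichlet(np.ones(V), size=K)        # shape (K, V)

# Per-topic responsibilities inferred for one test video: one set over its
# discrete video "words", one over its real-valued (e.g. GIST) features.
resp_discrete = rng.dirichlet(np.ones(K))       # sums to 1 over the K topics
resp_real = rng.dirichlet(np.ones(K))

# Marginalize out the topics: each vocabulary word is scored by summing
# (topic responsibility) x (topic-word probability) across both modalities.
scores = resp_discrete @ beta + resp_real @ beta   # shape (V,)
ranking = np.argsort(scores)[::-1]                 # permuted text vocabulary
```

The top-ranked entries of `ranking` would be emitted as the predicted keywords for the clip.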
The three factors in the prediction:
– Responsibility of topic i over real-valued observations
– Responsibility of topic i over discrete video features
– Probability of learnt topic i explaining words in the text vocabulary
Wisdom of the young padawans
Color Histogram: 512-bin RGB color histograms are computed on densely sampled frames; large deviations in the extremities of the color spectrum are discarded
HOG3D (Histogram of Oriented Gradients in 3D): effective action-recognition features for videos [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008]
OB (Object Bank): high-level semantic representation of images from low-level features [L-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]
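The first of these features is simple enough to sketch directly: a 512-bin joint RGB histogram uses 8 quantization levels per channel (8^3 = 512). The frame below is synthetic, purely for illustration; the actual pipeline computes such histograms on densely sampled video frames.

```python
import numpy as np

# A synthetic 120x160 RGB "frame" with byte-valued pixels (illustration only).
frame = np.random.default_rng(1).integers(0, 256, size=(120, 160, 3))

levels = frame // 32                                  # quantize 0..255 into 8 levels
joint = levels[..., 0] * 64 + levels[..., 1] * 8 + levels[..., 2]  # joint bin 0..511
hist = np.bincount(joint.ravel(), minlength=512).astype(float)
hist /= hist.sum()                                    # normalize to a distribution
```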
Wisdom of the young padawans
Scenes from images belonging to different topics and sub-topics:
Topic: Town hall meeting
– The video is about a man answering a question from the podium by using a microphone
– Two camera men film a cop taking a camera from a woman sitting in a group
Topic: Rock climbing
– A man climbs a boulder outdoors with a friend spotting
– A young man climbs an artificial rock wall indoors
Wisdom of the young padawans
Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001]
Eight perceptual dimensions capture most of the 3D structures of real-world scenes: naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity
GIST in general terms: an energy space that pervades the arrangements of objects; it does not really care about the specificity of the objects; it helps us summarize an image even after it has disappeared from our sight
Yoda’s wisdom
For my ally is the Force. Its energy surrounds us and binds us.
Luminous beings are we, not this crude matter.
It will be nice to have the Force as a “feature”!
Datasets: NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets
The training set is organized into 15 event categories: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Woodworking project, 6) Birthday party, 7) Changing a vehicle tire, 8) Flash mob gathering, 9) Getting a vehicle unstuck, 10) Grooming an animal, 11) Making a sandwich, 12) Parade, 13) Parkour, 14) Repairing an appliance, 15) Working on a sewing project
Each video has its own high-level summary – varying from 2 to 40 words, with 10 words on average
2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set
Dev-T summaries are only used as reference summaries for evaluation with up to 10 predicted keywords
The summarization perspective
Events: Skateboarding, Feeding animals, Landing fishes, Wedding ceremony, Woodworking project
Multimedia Topic Model – permute event-specific vocabularies
Sub-events (e.g. skateboarding, snowboarding, surfing) and multiple sets of documents (sets of frames in videos) → bag-of-keywords multi-document summaries
Multiple sentences (groups of segments in frames) → natural language multi-document summaries
Why event specific vocabularies?

Model: One school of thought
Actual synopsis: man feeds fish bread
Predicted words (top 10): fish jump bread fishing skateboard pole machine car dog cat

Model: Another school of thought
Actual synopsis: man feeds fish bread
Predicted words (top 10): bread shampoo sit condiment place fill plate jump pole fishing

Intuitively, multiple objects and actions are shared, and many different words across events get associated semantically
Prediction quality degenerates rapidly!
Previously
Words forming
other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
[P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011]
Afterwards
Words forming
other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
[P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and Back through Latent Topics,” WSDM, Rome, Italy, 2013]
The family of multimedia topic models
• Corr-MMGLDA: if a single topic generates a scene, the same topic generates all text in the document – a considerable strongpoint, but a drawback for summary generation if this is not the case
• MMGLDA: more diffuse translation of both visual and textual patterns through the latent translation spaces – intuitively it aids frequency-based summarization
Key is to use an asymmetric Dirichlet prior
Document specific topic proportions
Synopses words
GIST features
Visual “words”
Indicator variables
Topic Parameters for explaining latent structure within observation ensembles
Corr- MMGLDA
MMGLDA
Topic modeling performance
In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods – clearly NOT a measure of keyword summary generation power
For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling
This allows only the responsible topic-Gaussians to contribute to the likelihood
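That responsibility computation can be sketched for one real-valued feature vector (e.g. a GIST descriptor), assuming toy diagonal-covariance topic-Gaussians; every parameter below is invented for the illustration, not a learnt value.

```python
import numpy as np

# Toy setup: K topic-Gaussians over D-dimensional real-valued features.
rng = np.random.default_rng(2)
K, D = 3, 4
mu = rng.normal(size=(K, D))        # per-topic means
var = np.full((K, D), 0.5)          # per-topic diagonal covariances
pi = np.full(K, 1.0 / K)            # topic proportions

x = rng.normal(size=D)              # one observed feature vector

# log N(x | mu_i, diag(var_i)) for each topic, kept in log space for stability.
log_pdf = -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=1)
log_w = np.log(pi) + log_pdf
resp = np.exp(log_w - log_w.max())
resp /= resp.sum()                  # posterior responsibilities, sum to 1
```

Only topics with non-negligible entries in `resp` contribute appreciably to the likelihood of this observation, which is the effect described above.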
Test ELBOs on events 1-5 in the Dev-T set
Measuring held-out log likelihoods on both videos and associated human summaries
Prediction ELBOs on events 1-5 in the Dev-T set
Measuring held-out log likelihoods on just videos in absence of the text
Translating Related Words to Videos

                          1         2         3         4         5         6         7         8         9        10
Corr-MMGLDA-α:         0.445936  0.451391  0.462443  0.397392  0.374922  0.573839  0.425912  0.375423  0.38186   0.189047
MMGLDA-α:              0.414354  0.422954  0.427442  0.359592  0.353317  0.552872  0.39681   0.349695  0.345466  0.163971
Corr-MMGLDA log(α/|Λ|): 12.6479   61.7312   50.0512   58.7659   60.1194   104.628   28.2949   31.3856   18.9223    8.164
MMGLDA log(α/|Λ|):      12.498    61.4666   49.8858   58.643    59.9248   104.623   28.2264   31.2219   18.6953    8.1025
Corr-MMGLDA is able to capture more variance relative to MMGLDA
log(α/|Λ|) for Corr-MMGLDA is also slightly higher than that for MMGLDA
This can allow related but topically unique concepts to appear upfront
Related Words to Videos – Difficult Examples
measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop
sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread
Related Words to Videos – Difficult Examples
clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man
gear machine guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire
A few words is worth a thousand frames!
From MMGLDA
Event classification and summarization
A C-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy is easily achievable
Evaluation uses ROUGE-1. HEXTAC 2009 (100-word human references vs. 100-word manually extracted summaries): Average Recall 0.37916 (95% confidence interval 0.37187–0.38661), Average Precision 0.39142 (95% confidence interval 0.38342–0.39923)
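ROUGE-1 boils down to clipped unigram overlap between a reference and a candidate summary; a minimal sketch follows, with two hypothetical short summaries standing in for the dataset's synopses.

```python
from collections import Counter

def rouge1(reference, candidate):
    """ROUGE-1 recall/precision as clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Each candidate unigram counts at most as often as it appears in the reference.
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

# Hypothetical reference synopsis vs. predicted short summary.
r, p = rouge1("man climbs an artificial rock wall indoors",
              "man climbs a rock wall at an indoor gym")
```

Reported ROUGE scores average such per-summary values across the evaluation set, with confidence intervals from bootstrap resampling.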
- ROUGE-1 usually changes from dataset to dataset, but maxes out around 40–45% for 100-word system summaries
- If we can achieve 10% of this for 10-word summaries, we are doing pretty well!
- Caveat: the text multi-document summarization task is much more complex than this simpler task
Future Directions:
- Typically lots of features help in classification, but do we need all of them for better summary generation?
- Does better event classification performance always mean better summarization performance?
ROUGE-1 performance
MMLDA can show poor ELBO – a bit misleading, since it performs quite well on predicting summary-worthy keywords
Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MGLDA w.r.t. quantitative evaluation
Its summary worthiness of keywords is not good, but its topics are good
Different but related topics can model GIST features almost equally (strong overlap in the tails of the Gaussians)
MMGLDA produces better topics and higher ELBO
Its summary worthiness of keywords is almost the same as MMLDA for lower n
Future Directions
Need better initialization of priors governing parameters for real-valued data
[N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849–862, 2006]
Model usefulness and applications
• Applications:
– Label topics through document-level multimedia
– Movie recommendations through semantically related frames
– Video analysis: word prediction given video features
– Adword creation through semantics of multimedia (using transcripts only can be noisy)
– Semantic compression of videos
– Allowing the visually impaired to hear the world through text
Long list of acknowledgements
• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extraction, event classification evaluations and many fruitful discussions throughout this project
• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections
• Jack Gallant (UC Berkeley)
• Francisco Pereira (Siemens Corporate Research)
For allowing us to reuse some of their illustrations in this presentation
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.
We also thank the anonymous reviewers for their comments
Thanks!