TRANSCRIPT
Translating Related Words to Videos and Back through Latent
Topics
Pradipto Das, Rohini K. Srihari and Jason J. Corso
SUNY Buffalo
WSDM 2013, Rome, Italy
WiSDoM is beyond words
Master Yoda, how do I find wisdom from so many things happening
around us?
Go to the center of the data and find your wisdom you will
Big data problem – lots of data around us, but which of it is meaningful?
We need statistics from the data that meaningfully encode multiple views.
What do the centers look like?
parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground
lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic
floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck
make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody
Be careful what people do with their sandwiches! Interviews indoors can be tough!
tutorial: man explains how to make lobster rolls from scratch
One guy is making sandwich outdoors
Man performs parkour in various locations
montage of guys free running up a tree and through the
woods
interview with parkour contestants
Kid does parkour around the park
Footage of a group performing parkour outdoors
A family holds a strange burger assembly and wrapping contest at Christmas
The actual ground-truth synopses overlaid
Back to conventional wisdom: Translation
S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies,” Current Biology, Vol. 21(19), 2011
F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," In Frontiers in Human Neuroscience, Vol. 5(72), 2011
Topic Model (LDA) Regression
Training Testing
There is some model that captures the correspondence of the blood flow patterns in the brain to the world being observed
Given a slightly different pattern, we are able to translate it into a lingual description via concepts present in our vocabulary
Three basic assumptions of Machine Learning are satisfied: 1) there is a pattern, 2) we do not know the target function, 3) there is data to learn from
Giving back to the community: driverless cars are already helping the visually impaired to drive around. It will be good to enable visually impaired drivers to hear the scenery in front – in a non-invasive way!
Do we speak all that we see?
Multiple Human Summaries (max 10 words, i.e. imposing a length constraint):
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Centers of attention (topics)
1. There is a guy climbing on a rock-climbing wall.
Hand holding climbing surface
How many rocks?
The sketch in the board
Wrist-watch
What’s there in the back?
Color of the floor
Dress of the climber
Not so important!
2. A man is bouldering at an indoor rock climbing gym.
Empty slots
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Summaries point toward information needs!
From patterns to topics to sentences
A young man climbs an artificial rock wall indoors
Adjective modifier (What kind of wall?)
Direct Object
Direct Subject
Adverb modifier (climbing where?)
Major Topic: Rock climbing. Sub-topics: artificial rock wall, indoor rock climbing gym
Spoken Language is complex – structured according to various grammars and dependent on active topics
Different paraphrases describe the same visual input
Object detection models
Expensive frame-wise manual annotation efforts by drawing bounding boxes
Difficulties: camera shakes, camera motion, zooming
Careful consideration is needed of which objects/concepts to annotate
Focus on object/concept detection is noisy for videos in-the-wild
Does not answer which objects/concepts are important for summary generation
Man with microphone
Climbing person
Annotations for training object/concept models
Trained Models
Translating across modalities: learning latent translation spaces, a.k.a. topics
A young man is climbing an artificial rock wall indoors
Human Synopsis
Mixed membership of latent topics
Some topics capture observations that co-occur commonly
Other topics allow for discrimination
Different topics can be responsible for different modalities
No annotations needed – only need clip level
summary
Translating across modalities: using learnt translation spaces for prediction

Text translation – topics are marginalized out to permute the vocabulary for predictions:

p(w_v \mid \{w_{d,o}\}_{o=1}^{O}, \{w_{d,h}\}_{h=1}^{H}) \propto \sum_{o=1}^{O} \sum_{i=1}^{K} p(w_{d,o} \mid i)\, p(w_v \mid i) + \sum_{h=1}^{H} \sum_{i=1}^{K} p(w_{d,h} \mid i)\, p(w_v \mid i)

The lower the correlation among topics, the better the permutation
Sensitive to priors for real-valued data
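The prediction step – marginalizing topics to score and permute the text vocabulary – can be sketched with toy parameters. All names and values below are made up for illustration; they are not the paper's learnt model.

```python
import numpy as np

# Hypothetical toy sizes: K topics, a V-word text vocabulary.
K, V = 4, 6
rng = np.random.default_rng(0)

# p(w_v | topic i): topic-to-text-vocabulary distributions, one row per topic.
beta = rng.dirichlet(np.ones(V), size=K)        # shape (K, V)

# Per-topic responsibilities inferred for one test video: one set over its
# discrete video "words", one over its real-valued (e.g. GIST) features.
resp_discrete = rng.dirichlet(np.ones(K))       # sums to 1 over the K topics
resp_real = rng.dirichlet(np.ones(K))

# Marginalize out the topics: each vocabulary word is scored by summing
# (topic responsibility) x (topic-word probability) across both modalities.
scores = resp_discrete @ beta + resp_real @ beta   # shape (V,)
ranking = np.argsort(scores)[::-1]                 # permuted text vocabulary
```

The top-ranked entries of `ranking` would be emitted as the predicted keywords for the clip.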
The three factors in the prediction:
– Responsibility of topic i over real-valued observations
– Responsibility of topic i over discrete video features
– Probability of learnt topic i explaining words in the text vocabulary
Wisdom of the young padawans
Color Histogram: 512-bin RGB color histograms are computed on densely sampled frames; large deviations in the extremities of the color spectrum are discarded
HOG3D (Histogram of Oriented Gradients in 3D): effective action-recognition features for videos [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008]
OB (Object Bank): high-level semantic representation of images from low-level features [L-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]
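The first of these features is simple enough to sketch directly: a 512-bin joint RGB histogram uses 8 quantization levels per channel (8^3 = 512). The frame below is synthetic, purely for illustration; the actual pipeline computes such histograms on densely sampled video frames.

```python
import numpy as np

# A synthetic 120x160 RGB "frame" with byte-valued pixels (illustration only).
frame = np.random.default_rng(1).integers(0, 256, size=(120, 160, 3))

levels = frame // 32                                  # quantize 0..255 into 8 levels
joint = levels[..., 0] * 64 + levels[..., 1] * 8 + levels[..., 2]  # joint bin 0..511
hist = np.bincount(joint.ravel(), minlength=512).astype(float)
hist /= hist.sum()                                    # normalize to a distribution
```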
Wisdom of the young padawans
Scenes from images belonging to different topics and sub-topics:
Topic: Town hall meeting
– The video is about a man answering a question from the podium by using a microphone
– Two camera men film a cop taking a camera from a woman sitting in a group
Topic: Rock climbing
– A man climbs a boulder outdoors with a friend spotting
– A young man climbs an artificial rock wall indoors
Wisdom of the young padawans
Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001]
Eight perceptual dimensions capture most of the 3D structures of real-world scenes: naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity
GIST in general terms: an energy space that pervades the arrangements of objects; it does not really care about the specificity of the objects; it helps us summarize an image even after it has disappeared from our sight
Yoda’s wisdom
For my ally is the Force. Its energy surrounds us and binds us.
Luminous beings are we, not this crude matter.
It will be nice to have the Force as a “feature”!
Datasets: NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets
The training set is organized into 15 event categories: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Woodworking project, 6) Birthday party, 7) Changing a vehicle tire, 8) Flash mob gathering, 9) Getting a vehicle unstuck, 10) Grooming an animal, 11) Making a sandwich, 12) Parade, 13) Parkour, 14) Repairing an appliance, 15) Working on a sewing project
Each video has its own high-level summary – varying from 2 to 40 words, with 10 words on average
2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set
Dev-T summaries are only used as reference summaries for evaluation with up to 10 predicted keywords
The summarization perspective
Events: Skateboarding, Feeding animals, Landing fishes, Wedding ceremony, Woodworking project
Multimedia Topic Model – permute event-specific vocabularies
Sub-events (e.g. skateboarding, snowboarding, surfing) and multiple sets of documents (sets of frames in videos) → bag-of-keywords multi-document summaries
Multiple sentences (groups of segments in frames) → natural language multi-document summaries
Why event specific vocabularies?

Model: One school of thought
Actual synopsis: man feeds fish bread
Predicted words (top 10): fish jump bread fishing skateboard pole machine car dog cat

Model: Another school of thought
Actual synopsis: man feeds fish bread
Predicted words (top 10): bread shampoo sit condiment place fill plate jump pole fishing

Intuitively, multiple objects and actions are shared, and many different words across events get associated semantically
Prediction quality degenerates rapidly!
Previously
Words forming
other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
[P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011]
Afterwards
Words forming
other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
[P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and Back through Latent Topics,” WSDM, Rome, Italy, 2013]
The family of multimedia topic models
• Corr-MMGLDA: if a single topic generates a scene, the same topic generates all text in the document – a considerable strongpoint, but a drawback for summary generation if this is not the case
• MMGLDA: more diffuse translation of both visual and textual patterns through the latent translation spaces – intuitively it aids frequency-based summarization
Key is to use an asymmetric Dirichlet prior
Document specific topic proportions
Synopses words
GIST features
Visual “words”
Indicator variables
Topic Parameters for explaining latent structure within observation ensembles
Corr- MMGLDA
MMGLDA
Topic modeling performance
In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods – clearly NOT a measure of keyword summary generation power
For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling
This allows only the responsible topic-Gaussians to contribute to the likelihood
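That responsibility computation can be sketched for one real-valued feature vector (e.g. a GIST descriptor), assuming toy diagonal-covariance topic-Gaussians; every parameter below is invented for the illustration, not a learnt value.

```python
import numpy as np

# Toy setup: K topic-Gaussians over D-dimensional real-valued features.
rng = np.random.default_rng(2)
K, D = 3, 4
mu = rng.normal(size=(K, D))        # per-topic means
var = np.full((K, D), 0.5)          # per-topic diagonal covariances
pi = np.full(K, 1.0 / K)            # topic proportions

x = rng.normal(size=D)              # one observed feature vector

# log N(x | mu_i, diag(var_i)) for each topic, kept in log space for stability.
log_pdf = -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=1)
log_w = np.log(pi) + log_pdf
resp = np.exp(log_w - log_w.max())
resp /= resp.sum()                  # posterior responsibilities, sum to 1
```

Only topics with non-negligible entries in `resp` contribute appreciably to the likelihood of this observation, which is the effect described above.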
Test ELBOs on events 1-5 in the Dev-T set
Measuring held-out log likelihoods on both videos and associated human summaries
Prediction ELBOs on events 1-5 in the Dev-T set
Measuring held-out log likelihoods on just videos in absence of the text
Translating Related Words to Videos

                          1         2         3         4         5         6         7         8         9        10
Corr-MMGLDA-α:         0.445936  0.451391  0.462443  0.397392  0.374922  0.573839  0.425912  0.375423  0.38186   0.189047
MMGLDA-α:              0.414354  0.422954  0.427442  0.359592  0.353317  0.552872  0.39681   0.349695  0.345466  0.163971
Corr-MMGLDA log(α/|Λ|): 12.6479   61.7312   50.0512   58.7659   60.1194   104.628   28.2949   31.3856   18.9223    8.164
MMGLDA log(α/|Λ|):      12.498    61.4666   49.8858   58.643    59.9248   104.623   28.2264   31.2219   18.6953    8.1025
Corr-MMGLDA is able to capture more variance relative to MMGLDA
log(α/|Λ|) for Corr-MMGLDA is also slightly higher than that for MMGLDA
This can allow related but topically unique concepts to appear upfront
Related Words to Videos – Difficult Examples
measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop
sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread
Related Words to Videos – Difficult Examples
clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man
gear machine guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire
A few words is worth a thousand frames!
From MMGLDA
Event classification and summarization
A C-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy is easily achievable
Evaluation uses ROUGE-1. HEXTAC 2009 (100-word human references vs. 100-word manually extracted summaries): Average Recall 0.37916 (95% confidence interval 0.37187–0.38661), Average Precision 0.39142 (95% confidence interval 0.38342–0.39923)
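ROUGE-1 boils down to clipped unigram overlap between a reference and a candidate summary; a minimal sketch follows, with two hypothetical short summaries standing in for the dataset's synopses.

```python
from collections import Counter

def rouge1(reference, candidate):
    """ROUGE-1 recall/precision as clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Each candidate unigram counts at most as often as it appears in the reference.
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

# Hypothetical reference synopsis vs. predicted short summary.
r, p = rouge1("man climbs an artificial rock wall indoors",
              "man climbs a rock wall at an indoor gym")
```

Reported ROUGE scores average such per-summary values across the evaluation set, with confidence intervals from bootstrap resampling.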
- ROUGE-1 usually changes from dataset to dataset, but maxes out around 40–45% for 100-word system summaries
- If we can achieve 10% of this for 10-word summaries, we are doing pretty well!
- Caveat: the text multi-document summarization task is much more complex than this simpler task
Future Directions:
- Typically lots of features help in classification, but do we need all of them for better summary generation?
- Does better event classification performance always mean better summarization performance?
ROUGE-1 performance
MMLDA can show poor ELBO – a bit misleading, since it performs quite well on predicting summary-worthy keywords
Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MGLDA w.r.t. quantitative evaluation
Its summary worthiness of keywords is not good, but its topics are good
Different but related topics can model GIST features almost equally (strong overlap in the tails of the Gaussians)
MMGLDA produces better topics and higher ELBO
Its summary worthiness of keywords is almost the same as MMLDA for lower n
Future Directions
Need better initialization of priors governing parameters for real-valued data
[N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849–862, 2006]
Model usefulness and applications
• Applications:
– Label topics through document-level multimedia
– Movie recommendations through semantically related frames
– Video analysis: word prediction given video features
– Adword creation through semantics of multimedia (using transcripts only can be noisy)
– Semantic compression of videos
– Allowing the visually impaired to hear the world through text
Long list of acknowledgements
• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extraction, event classification evaluations and many fruitful discussions throughout this project
• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections
• Jack Gallant (UC Berkeley)
• Francisco Pereira (Siemens Corporate Research)
For allowing us to reuse some of their illustrations in this presentation
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.
We also thank the anonymous reviewers for their comments
Thanks!