Translating Related Words to Videos and Back through Latent Topics
Pradipto Das, Rohini K. Srihari and Jason J. Corso, SUNY Buffalo
WSDM 2013, Rome, Italy

TRANSCRIPT

Page 1: Translating Related Words to Videos and Back through Latent Topics

Pradipto Das, Rohini K. Srihari and Jason J. Corso
SUNY Buffalo
WSDM 2013, Rome, Italy

Page 2: WiSDoM is beyond words

"Master Yoda, how do I find wisdom from so many things happening around us?"

"Go to the center of the data and find your wisdom you will."

Page 3: WiSDoM is beyond words (continued)

The big data problem: there is a lot of data around us, but which of it is meaningful?

We need statistics from the data that meaningfully encode multiple views.

Page 4: What do the centers look like?

Topic 1: parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground

Topic 2: lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic

Topic 3: floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab duck

Topic 4: make dog sandwich man outdoors guy bench black sit park white disgustingly toe cough feed rub contest parody

Be careful what people do with their sandwiches! Interviews indoors can be tough!

Page 5: What do the centers look like? (continued)

The same four topics, with the actual ground-truth synopses overlaid:

- Tutorial: man explains how to make lobster rolls from scratch
- One guy is making a sandwich outdoors
- Man performs parkour in various locations
- Montage of guys free running up a tree and through the woods
- Interview with parkour contestants
- Kid does parkour around the park
- Footage of a group performing parkour outdoors
- A family holds a strange burger assembly and wrapping contest at Christmas

Page 6: Back to conventional wisdom: Translation

S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, "Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies," Current Biology, Vol. 21(19), 2011

F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," Frontiers in Human Neuroscience, Vol. 5(72), 2011

Pipeline: a topic model (LDA) plus regression, with separate training and testing phases.

There is some model that captures the correspondence between blood-flow patterns in the brain and the world being observed. Given a slightly different pattern, we are able to translate it, through concepts present in our vocabulary, to a lingual description.

The three basic assumptions of machine learning are satisfied: 1) there is a pattern, 2) we do not know the target function, and 3) there is data to learn from.

Page 7: Back to conventional wisdom: Translation (continued)

Giving back to the community: driverless cars are already helping the visually impaired to drive around. It would be good to enable visually impaired drivers to hear the scenery in front of them, in a non-invasive way!

Page 8: Do we speak all that we see?

Multiple human summaries (max 10 words, i.e., imposing a length constraint):

1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.

Page 9: Centers of attention (topics)

The same five human summaries, with details of the scene called out: the hand holding the climbing surface, how many rocks there are, the sketch on the board, the wrist-watch, what's in the background, the color of the floor, the dress of the climber. Not so important! The summaries leave these slots empty.

Summaries point toward information needs!

Page 10: From patterns to topics to sentences

Example sentence: "A young man climbs an artificial rock wall indoors"
- direct subject: man
- direct object: wall
- adjective modifier (what kind of wall?): artificial rock
- adverb modifier (climbing where?): indoors

Major topic: rock climbing. Sub-topics: artificial rock wall, indoor rock climbing gym.

Spoken language is complex: it is structured according to various grammars and dependent on active topics. Different paraphrases describe the same visual input.
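
These grammatical relations can be recovered automatically by a dependency parser. A minimal sketch using spaCy (an illustration only; the deck does not name a parser):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed
    doc = nlp("A young man climbs an artificial rock wall indoors")

    for token in doc:
        # token.dep_ is the dependency label, token.head the governing word;
        # expect roughly: man -> nsubj(climbs), wall -> dobj(climbs),
        # artificial -> amod(wall), indoors -> advmod(climbs)
        print(f"{token.text:12s} {token.dep_:10s} head={token.head.text}")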

Page 11: Object detection models

Annotations (e.g., "man with microphone", "climbing person") are used to train object/concept models. But:

- Frame-wise manual annotation by drawing bounding boxes is expensive.
- Difficulties: camera shake, camera motion, zooming.
- Which objects/concepts to annotate needs careful consideration.
- A focus on object/concept detection is noisy for videos in the wild.
- It does not answer which objects/concepts are important for summary generation.

Page 12: Translating across modalities: learning latent translation spaces, a.k.a. topics

Human synopsis: "A young man is climbing an artificial rock wall indoors"

- Each document has a mixed membership over the K latent topics.
- Some topics capture observations that commonly co-occur.
- Other topics allow for discrimination.
- Different topics can be responsible for different modalities.

No annotations are needed, only a clip-level summary.

Page 13: Translating across modalities: using learnt translation spaces for prediction

Text translation: given a test video d with O real-valued observations w_d^{(O)} and H discrete video features w_d^{(H)}, each text-vocabulary word w_v is scored by (equation reconstructed from the slide):

    p(w_v \mid \mathbf{w}_d^{(O)}, \mathbf{w}_d^{(H)}) \;\propto\;
    \sum_{o=1}^{O} \sum_{i=1}^{K} p(w_{d,o}^{(O)} \mid i)\, p(w_v \mid i)
    \;+\;
    \sum_{h=1}^{H} \sum_{i=1}^{K} p(w_{d,h}^{(H)} \mid i)\, p(w_v \mid i)

- Topics are marginalized out to permute the vocabulary for predictions.
- The lower the correlation among topics, the better the permutation.
- Sensitive to priors for real-valued data.

Page 14: Translating across modalities: using learnt translation spaces for prediction (continued)

In the equation above:
- p(w_{d,o}^{(O)} | i) is the responsibility of topic i for the real-valued observations;
- p(w_{d,h}^{(H)} | i) is the responsibility of topic i for the discrete video features;
- p(w_v | i) is the probability of learnt topic i explaining words in the text vocabulary.
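
A minimal sketch of this prediction rule, assuming the topic responsibilities and per-topic word distributions have already been inferred (shapes and names are placeholders, not the paper's code):

    import numpy as np

    def rank_vocabulary(resp_real, resp_disc, beta, vocab, n=10):
        """Permute the text vocabulary for one test video.

        resp_real: (K,) summed responsibilities of each topic for the
                   real-valued (GIST) observations
        resp_disc: (K,) summed responsibilities of each topic for the
                   discrete video features
        beta:      (K, V) per-topic distributions over the text vocabulary
        """
        # Marginalize the topics out, as in the slide's equation:
        # score(w_v) = sum_i (resp_real[i] + resp_disc[i]) * p(w_v | i)
        scores = (resp_real + resp_disc) @ beta
        return [vocab[j] for j in np.argsort(scores)[::-1][:n]]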

Page 15: Wisdom of the young padawans

Color histogram: 512-bin RGB color histograms are computed on densely sampled frames; large deviations in the extremities of the color spectrum are discarded.

HOG3D (histograms of oriented gradients in 3D): effective action-recognition features for videos. [A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008]

OB (Object Bank): a high-level semantic representation of images from low-level features. [L-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]
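
A sketch of the color-histogram feature above, using OpenCV's calcHist for 8x8x8 = 512 joint RGB bins (the sampling density and the outlier-discarding step are omitted; the file name is hypothetical):

    import cv2

    def frame_color_histogram(frame_bgr):
        # 8 bins per channel -> 512-bin joint color histogram
        hist = cv2.calcHist([frame_bgr], [0, 1, 2], None,
                            [8, 8, 8], [0, 256, 0, 256, 0, 256]).flatten()
        return hist / (hist.sum() + 1e-12)  # normalize to a distribution

    cap = cv2.VideoCapture("clip.mp4")  # hypothetical input clip
    histograms = []
    ok, frame = cap.read()
    while ok:
        histograms.append(frame_color_histogram(frame))
        ok, frame = cap.read()
    cap.release()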

Page 16: Wisdom of the young padawans (continued)

[Figure: scenes from images belonging to different topics (Rock climbing, Town hall meeting) and sub-topics, with example synopses:]

- A man climbs a boulder outdoors with a friend spotting
- A young man climbs an artificial rock wall indoors
- The video is about a man answering a question from the podium using a microphone
- Two cameramen film a cop taking a camera from a woman sitting in a group

Page 17: Wisdom of the young padawans (continued)

Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001]: eight perceptual dimensions capture most of the 3D structure of real-world scenes: naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry, and complexity.

GIST in general terms: an energy space that pervades the arrangement of objects. It does not really care about the specificity of the objects, and it helps us summarize an image even after it has disappeared from our sight.

Page 18: Yoda's wisdom

(The same scenes as on Page 16.)

"For my ally is the Force. Its energy surrounds us and binds us. Luminous beings are we, not this crude matter."

It would be nice to have the Force as a "feature"!

Page 19: Datasets

NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets.

The training set is organized into 15 event categories: 1) Attempting a board trick, 2) Feeding an animal, 3) Landing a fish, 4) Wedding ceremony, 5) Woodworking project, 6) Birthday party, 7) Changing a vehicle tire, 8) Flash mob gathering, 9) Getting a vehicle unstuck, 10) Grooming an animal, 11) Making a sandwich, 12) Parade, 13) Parkour, 14) Repairing an appliance, 15) Working on a sewing project.

Each video has its own high-level summary, varying from 2 to 40 words but averaging 10 words.

There are 2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set.

Dev-T summaries are used only as reference summaries for evaluation, with up to 10 predicted keywords.

Page 20: The summarization perspective

Events (skateboarding, feeding animals, landing fish, wedding ceremony, woodworking project) break into sub-events, e.g., skateboarding, snowboarding, surfing.

The analogy with text summarization:
- Multiple sets of documents correspond to sets of frames in videos.
- A multimedia topic model permutes event-specific vocabularies to yield bag-of-keywords multi-document summaries.
- Natural language multi-document summaries correspond to multiple sentences (groups of segments in frames).

Page 21: The summarization perspective (continued)

Why event-specific vocabularies?

Model                       Actual synopsis       Predicted words (top 10)
One school of thought       man feeds fish bread  fish jump bread fishing skateboard pole machine car dog cat
Another school of thought   man feeds fish bread  bread shampoo sit condiment place fill plate jump pole fishing

Intuitively, multiple objects and actions are shared, and many different words across events get associated semantically. Prediction quality degenerates rapidly!

Page 22: Previously

Diagram: words forming other Wiki articles; article-specific content words; words corresponding to the embedded multimedia.

[P. Das, R. K. Srihari and Y. Fu. "Simultaneous Joint and Conditional Modeling of Documents Tagged from Two Perspectives," CIKM, Glasgow, Scotland, 2011]

Page 23: Afterwards

The same division of words, now translated to and from videos:

[P. Das, R. K. Srihari and J. J. Corso. "Translating Related Words to Videos and Back through Latent Topics," WSDM, Rome, Italy, 2013]

Page 24: The family of multimedia topic models

• Corr-MMGLDA: if a single topic generates a scene, the same topic generates all text in the document. This is a considerable strong point, but a drawback for summary generation when that assumption does not hold.
• MMGLDA: a more diffuse translation of both visual and textual patterns through the latent translation spaces; intuitively, this aids frequency-based summarization.

The key is to use an asymmetric Dirichlet prior.

Plate-diagram components (shared by Corr-MMGLDA and MMGLDA): document-specific topic proportions; synopsis words; GIST features; visual "words"; indicator variables; topic parameters for explaining latent structure within observation ensembles. A schematic of this generative story is sketched below.
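
A schematic of the shared generative story, as far as it can be read off this slide (dimensions and hyperparameters are placeholders; the Corr- variant's coupling of text topics to scene topics and the full indicator-variable structure are in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V_text, V_vis, P = 20, 5000, 1000, 512   # topics, vocabularies, GIST dims

    alpha = rng.gamma(1.0, 1.0, size=K)            # asymmetric Dirichlet prior
    beta_text = rng.dirichlet(np.ones(V_text), K)  # topic -> synopsis-word dists
    beta_vis = rng.dirichlet(np.ones(V_vis), K)    # topic -> visual-"word" dists
    mu = rng.normal(size=(K, P))                   # topic Gaussians over GIST
    sigma = np.ones((K, P))

    theta = rng.dirichlet(alpha)                   # document topic proportions
    # Each observation draws its own topic indicator (mixed membership):
    words = [rng.choice(V_text, p=beta_text[rng.choice(K, p=theta)])
             for _ in range(10)]                   # synopsis words
    vis = [rng.choice(V_vis, p=beta_vis[rng.choice(K, p=theta)])
           for _ in range(50)]                     # discrete visual words
    z = rng.choice(K, p=theta)                     # indicator for a GIST descriptor
    gist = rng.normal(mu[z], sigma[z])             # real-valued GIST feature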

Page 25: Topic modeling performance

In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods. This is clearly NOT a measure of keyword-summary generation power.

For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling. This allows only the responsible topic-Gaussians to contribute to the likelihood.

Test ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on both videos and associated human summaries.

Prediction ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on just the videos, in the absence of text.
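
For reference, "ELBO" here is the usual variational evidence lower bound on the held-out log likelihood; the exact variational family for these models is specified in the paper. In general form:

    \log p(\mathbf{w}) \;\ge\;
    \mathbb{E}_{q(\mathbf{z},\theta)}\bigl[\log p(\mathbf{w},\mathbf{z},\theta)\bigr]
    \;-\;
    \mathbb{E}_{q(\mathbf{z},\theta)}\bigl[\log q(\mathbf{z},\theta)\bigr]
    \;=\; \mathrm{ELBO}(q)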

Page 26: Translating Related Words to Videos

Per-topic quantities for both models:

Topic   Corr-MMGLDA α   MMGLDA α   Corr-MMGLDA log(α/|Λ|)   MMGLDA log(α/|Λ|)
1       0.445936        0.414354   12.6479                  12.498
2       0.451391        0.422954   61.7312                  61.4666
3       0.462443        0.427442   50.0512                  49.8858
4       0.397392        0.359592   58.7659                  58.643
5       0.374922        0.353317   60.1194                  59.9248
6       0.573839        0.552872   104.628                  104.623
7       0.425912        0.39681    28.2949                  28.2264
8       0.375423        0.349695   31.3856                  31.2219
9       0.38186         0.345466   18.9223                  18.6953
10      0.189047        0.163971   8.164                    8.1025

Page 27: Translating Related Words to Videos (continued)

Corr-MMGLDA is able to capture more variance relative to MMGLDA, and α for Corr-MMGLDA is also slightly higher than that for MMGLDA. This can allow related but topically unique concepts to appear up front.

Page 28: Related Words to Videos: difficult examples

Keyword lists for sewing-project videos:

- measure project lady tape indoor sew marker pleat highwaist zigzag scissor card mark teach cut fold stitch pin woman skirt machine fabric inside scissors make leather kilt man beltloop
- sew woman fabric make machine show baby traditional loom blouse outdoors blanket quick rectangle hood knit indoor stitch scissors pin cut iron studio montage measure kid penguin dad stuff thread

Page 29: Related Words to Videos: difficult examples (continued)

Keyword lists for appliance-repair videos:

- clock mechanism repair computer tube wash machine lapse click desk mouse time front wd40 pliers reattach knob make level video water control person clip part wire inside indoor whirlpool man
- gear machine guy repair sew fan test make replace grease vintage motor box indoor man tutorial fuse bypass brush wrench repairman lubricate workshop bottom remove screw unscrew screwdriver video wire

Page 30: A few words are worth a thousand frames!

[Figure-only slide: example outputs from MMGLDA.]

Page 31: A few words are worth a thousand frames! (continued)

[Figure-only slide: further example outputs from MMGLDA.]

Page 32: Event classification and summarization

(The summarization-perspective diagram from Page 20, repeated with bag-of-words multi-document summaries.)

A C-SVM classifier from the libSVM package is used with default settings for multiclass (15-class) classification; 55% test accuracy is easily achievable.

Summaries are evaluated using ROUGE-1. For calibration, HEXTAC 2009 (100-word human references vs. 100-word manually extracted summaries) reports:
- Average recall: 0.37916 (95% confidence interval 0.37187 - 0.38661)
- Average precision: 0.39142 (95% confidence interval 0.38342 - 0.39923)
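
A minimal sketch of the classification step; scikit-learn's SVC (which wraps libSVM's C-SVC) stands in for the libSVM package here, and the random matrices are placeholders for the real per-clip feature vectors:

    import numpy as np
    from sklearn.svm import SVC  # wraps libsvm's C-SVC

    rng = np.random.default_rng(0)
    # Placeholder features; in the deck these are per-clip video descriptors.
    X_train = rng.normal(size=(2062, 128))     # 2062 training clips
    y_train = rng.integers(0, 15, size=2062)   # 15 event categories
    X_dev = rng.normal(size=(530, 128))        # 530 Dev-T clips
    y_dev = rng.integers(0, 15, size=530)

    clf = SVC()                      # default settings, as the slide states
    clf.fit(X_train, y_train)        # multiclass handled one-vs-one internally
    print(clf.score(X_dev, y_dev))   # ~55% on the real features per the slide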

Page 33: Event classification and summarization (continued)

- The achievable ROUGE-1 score usually changes from dataset to dataset, but it maxes out around 40-45% for 100-word system summaries.
- If we can achieve 10% of this for 10-word summaries, we are doing pretty well!
- Caveat: the text multi-document summarization task is much more complex than this simpler task.

A minimal ROUGE-1 computation is sketched below.
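
A minimal ROUGE-1 recall/precision sketch for keyword summaries (the reported numbers come from the standard ROUGE toolkit, which also handles stemming and confidence intervals):

    from collections import Counter

    def rouge1(predicted, reference):
        """Unigram overlap: returns (recall, precision)."""
        p, r = Counter(predicted), Counter(reference)
        overlap = sum((p & r).values())  # clipped unigram matches
        return overlap / max(len(reference), 1), overlap / max(len(predicted), 1)

    recall, precision = rouge1(
        "man climb rock wall indoor gym young artificial boulder practice".split(),
        "a man is bouldering at an indoor rock climbing gym".split())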

Page 34: Event classification and summarization (continued)

Future directions:
- Typically lots of features help in classification, but do we need all of them for better summary generation?
- Does better event-classification performance always mean better summarization performance?

Page 35: ROUGE-1 performance

- MMLDA can show a poor ELBO, which is a bit misleading: it performs quite well at predicting summary-worthy keywords.
- Sum-normalizing the real-valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation. The summary worthiness of its keywords is not good, but its topics are good.
- Different but related topics can model GIST features almost equally well (strong overlap in the tails of the Gaussians).
- MMGLDA produces better topics and a higher ELBO; the summary worthiness of its keywords is almost the same as MMLDA's for lower n.

Page 36: ROUGE-1 performance (continued)

Future directions: we need better initialization of the priors governing the parameters for real-valued data.

[N. Nasios and A. G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006]

Page 37: Model usefulness and applications

Applications:
• Label topics through document-level multimedia
• Movie recommendations through semantically related frames
• Video analysis: word prediction given video features
• Adword creation through the semantics of multimedia (using transcripts only can be noisy)
• Semantic compression of videos
• Allowing the visually impaired to hear the world through text

Page 38: Long list of acknowledgements

• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extraction, event-classification evaluations, and many fruitful discussions throughout this project.

• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the summarization problem outside of the competitions on newswire collections.

• Jack Gallant (UC Berkeley)
• Francisco Pereira (Siemens Corporate Research)
For allowing us to reuse some of their illustrations in this presentation.

Page 39: Long list of acknowledgements (continued)

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20069. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

We also thank the anonymous reviewers for their comments.

Page 40: Thanks!