![Page 1: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/1.jpg)
Modeling Images, Videos and Text
Using the Caffe Deep Learning Library
(Part 1)
Kate Saenko
Microsoft Summer Machine Learning School, St Petersburg 2015
![Page 2: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/2.jpg)
about me
BOSTON, Massachusetts
![Page 3: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/3.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 4: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/4.jpg)
![Page 5: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/5.jpg)
Machine Learning: What is it?
• Program a computer to learn from experience
• Learn from “big data”
![Page 6: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/6.jpg)
Machine Learning: It is used in more ways than you think!
![Page 7: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/7.jpg)
Computer Vision: Teach Machine to “See” Like a Human
![Page 8: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/8.jpg)
Computer Vision: Teach Machine to “See” Like a Human
Hollywood version…
Terminator 2, Enemy of the State (from UCSD “Fact or Fiction” DVD)
![Page 9: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/9.jpg)
Computer Vision in Real Life:
Face Tagging in Social Media
![Page 10: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/10.jpg)
Computer Vision in Real Life: Surveillance and Security
![Page 11: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/11.jpg)
Computer Vision in Real Life: Smart Cars
• Stanford/Google were among the first to develop self-driving cars
• Cars “see” using many sensors: radar, laser, cameras
![Page 12: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/12.jpg)
Computer Vision in Real Life:
Scientific Images
![Page 13: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/13.jpg)
Computer Vision in Real Life: Medical Imaging
3D imaging: MRI, CT
Image guided surgery (Grimson et al., MIT)
slide by S. Seitz
![Page 14: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/14.jpg)
Computer Vision in Real Life: Robot Vision
http://www.robocup.org/
NASA’s Mars Spirit Rover: http://en.wikipedia.org/wiki/Spirit_rover
slide by S. Seitz
![Page 15: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/15.jpg)
Computer Vision in Real Life: many other applications!
• 3D Shape Analysis
• 3D Face Reconstruction (http://grail.cs.washington.edu/projects/totalmoving/)
• Handwriting Recognition
• 3D Panoramas
![Page 16: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/16.jpg)
How Do We Do It?
![Page 17: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/17.jpg)
Computer Vision: Machine Learning from Big Data
Artificial Neural Network
Support Vector Machine
![Page 18: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/18.jpg)
Machine Learning from Big Data: Achievements
Artificial Neural Network
![Page 19: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/19.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 20: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/20.jpg)
http://arxiv.org/abs/1411.4389
Image Description
![Page 21: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/21.jpg)
Video Description
Input: video
Output: A woman shredding chicken in a kitchen
![Page 22: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/22.jpg)
Social media analysis
• A person dancing in a studio
• Machine sharpening a pencil
• Ballerina dancing on stage
• Man playing guitar
• Woman chopping onion
• Train passing by Mt. Fuji
Petabytes of video, very little text
![Page 23: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/23.jpg)
Social media: retrieval
Show me all video clips of a person playing guitar and singing
![Page 24: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/24.jpg)
Social media: summarization
• A car is driving down a road
• A man is riding a bike through the woods
• A skateboarder jumps and falls
![Page 25: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/25.jpg)
Surveillance and Security
Smart camera alerts:
• A woman wearing a red coat walked past
• A woman carrying a large bag entered a building
![Page 26: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/26.jpg)
Question answering
How many times did Darth Vader use a light saber?
![Page 27: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/27.jpg)
Assistive technology
Descriptive Video Service (DVS)
![Page 28: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/28.jpg)
Object/action detection not enough
• Does not model interaction between entities and scene
• Does not model what is important to say
• Natural language is much richer
![Page 29: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/29.jpg)
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
![Page 30: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/30.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 31: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/31.jpg)
Dealing with uncertainty in “in-the-wild” YouTube video
Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell, and Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV).
![Page 32: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/32.jpg)
A template model
Template: A <SUBJECT> is <VERB> a <OBJECT> .
SUBJECT = person, VERB = ride, OBJECT = motorbike
Result: A person is riding a motorbike .
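The template idea can be sketched in a few lines of Python. This is only an illustration: the helper names `to_gerund` and `fill_template` are hypothetical, and the gerund rule is a crude assumption that only covers regular verbs such as "ride".

```python
def to_gerund(verb):
    """Crude gerund rule (an assumption): drop a silent final 'e', then add 'ing'."""
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"
    return verb + "ing"


def fill_template(subject, verb, obj):
    """Fill the slide's template: A <SUBJECT> is <VERB> a <OBJECT> ."""
    return "A {} is {} a {}.".format(subject, to_gerund(verb), obj)


sentence = fill_template("person", "ride", "motorbike")
print(sentence)
```

A real system would need proper morphology and article selection ("a" vs. "an"); the sketch skips both.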
![Page 33: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/33.jpg)
OBJECT DETECTIONS
cow 0.11
person 0.42
table 0.07
aeroplane 0.05
dog 0.15
motorbike 0.51
train 0.17
car 0.29
![Page 34: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/34.jpg)
SORTED OBJECT DETECTIONS
motorbike 0.51
person 0.42
car 0.29
aeroplane 0.05
… …
![Page 35: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/35.jpg)
VERB DETECTIONS
hold 0.23
drink 0.11
move 0.34
dance 0.05
slice 0.13
climb 0.17
shoot 0.07
ride 0.19
![Page 36: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/36.jpg)
SORTED VERB DETECTIONS
move 0.34
hold 0.23
ride 0.19
dance 0.05
… …
![Page 37: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/37.jpg)
SORTED VERB DETECTIONS
move 0.34
hold 0.23
ride 0.19
dance 0.05
… …
SORTED OBJECT DETECTIONS
motorbike 0.51
person 0.42
car 0.29
aeroplane 0.05
… …
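Ranking detections by confidence, as on the preceding slides, is a plain sort. The sketch below copies the confidence values from the slides; the function name `sort_detections` is an assumption made for illustration.

```python
# Confidence values copied from the detection slides.
object_scores = {"cow": 0.11, "person": 0.42, "table": 0.07, "aeroplane": 0.05,
                 "dog": 0.15, "motorbike": 0.51, "train": 0.17, "car": 0.29}
verb_scores = {"hold": 0.23, "drink": 0.11, "move": 0.34, "dance": 0.05,
               "slice": 0.13, "climb": 0.17, "shoot": 0.07, "ride": 0.19}


def sort_detections(scores):
    """Return (label, confidence) pairs, most confident first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sorted_objects = sort_detections(object_scores)
sorted_verbs = sort_detections(verb_scores)
print(sorted_objects[:3])
print(sorted_verbs[:3])
```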
![Page 38: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/38.jpg)
Vision pipeline
![Page 39: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/39.jpg)
Problem: detection mistakes
Input: video
Output sentence: “Woman sharpens baby”
SORTED VERB DETECTIONS
sharpen 0.34
cut 0.23
… …
SORTED OBJECT DETECTIONS
woman 0.51
baby 0.42
… …
![Page 40: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/40.jpg)
Idea: trade off accuracy and specificity
Learn hierarchies from S, V, O co-occurrence
![Page 41: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/41.jpg)
[Figure: semantic hierarchies for the input video]
Subject: being → person, animal, … → woman, man, …
Verb: do → work, play, … → sharpen, clamp, …
Object: entity → person, tool, … → baby, knife, …
Most specific prediction: “Woman sharpens baby”
Humans: “Man clamps knife”, “Person clamps knife”, “Man working”
![Page 42: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/42.jpg)
[Figure: semantic hierarchies for the input video]
Subject: being → person, animal, … → woman, man, …
Verb: do → work, play, … → sharpen, clamp, …
Object: entity → person, tool, … → baby, knife, …
Humans: “Man clamps knife”, “Person clamps knife”, “Man working”
Our prediction: “Person working with a tool”
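One way to picture the accuracy/specificity trade-off is to climb a hierarchy until a node is confident enough. The sketch below is not the procedure from the paper: the hierarchy fragment, the "scissors" leaf, the scores, the threshold, and the rule "a node's confidence is the sum of its leaves" are all assumptions chosen for illustration.

```python
# Hypothetical fragment of an object hierarchy (child -> parent).
PARENT = {"baby": "person", "knife": "tool", "scissors": "tool",
          "person": "entity", "tool": "entity"}


def node_confidence(node, leaf_scores):
    """Confidence of a node: its own score if it is a leaf, else the sum over its children."""
    if node in leaf_scores:
        return leaf_scores[node]
    children = [c for c, p in PARENT.items() if p == node]
    return sum(node_confidence(c, leaf_scores) for c in children)


def back_off(leaf, leaf_scores, threshold):
    """Climb from a leaf toward the root until the node is confident enough."""
    node = leaf
    while node_confidence(node, leaf_scores) < threshold and node in PARENT:
        node = PARENT[node]
    return node


leaf_scores = {"knife": 0.35, "scissors": 0.30, "baby": 0.33}  # made-up detector outputs
best_leaf = max(leaf_scores, key=leaf_scores.get)
print(back_off(best_leaf, leaf_scores, threshold=0.6))
```

With these numbers no single leaf is reliable, so the sketch settles on the more general "tool", in the spirit of “Person working with a tool”.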
![Page 43: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/43.jpg)
Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)
• Collected for paraphrase and machine translation
• 2,089 YouTube videos with 122K multi-lingual descriptions
Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
![Page 44: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/44.jpg)
Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)
(a) Hollywood (8 actions)
(b) TRECVID MED (6 actions)
(c) YouTube (218 actions)
Example descriptions, one group per video (worker text kept verbatim):
• A train is rolling by. / A train passes by Mount Fuji. / A bullet train zooms through the countryside. / A train is coming down the tracks.
• A man is sitting and playing a guitar / A man is playing guitar / Street artists play guitar. / A man is playing a guitar. / a lady is playing the guitar.
• A woman is cooking onions. / Someone is cooking in a pan. / someone preparing something / a person coking. / racipe for katsu curry
• A girl is ballet dancing. / A girl is dancing on a stage. / A girl is performing as a ballerina. / A woman dances.
We cluster words to obtain about 200 verbs and 300 nouns.
![Page 45: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/45.jpg)
Video Collection Task
• Asked Amazon Mechanical Turk workers to submit video clips from YouTube
• Single, unambiguous action/event
• Short (4-10 seconds)
• Generally accessible
• No dialogue
• No words (subtitles, overlaid text, titles)
![Page 46: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/46.jpg)
Generalization results on MSFT Youtube
![Page 47: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/47.jpg)
Generalization results on MSFT Youtube
![Page 48: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/48.jpg)
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
![Page 49: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/49.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 50: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/50.jpg)
Modeling common sense
J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
![Page 51: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/51.jpg)
Problem: no common sense
Input: video
• Output sentence: “Woman sharpens baby”
• Common sense: babies cannot be sharpened
• Idea: learn common SVO statistics from very large text-only corpora
![Page 52: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/52.jpg)
Adding a linguistic prior using a
Factor Graph Model (FGM)
Visual confidence values are observed (gray potentials) and inform sentence components.
Language potentials (dashed) connect latent words between sentence components.
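The effect of a linguistic prior can be illustrated with a much simpler stand-in than a factor graph: enumerate every subject-verb-object triple and score it by the product of visual confidences and a text-derived plausibility. Everything in this sketch is an assumption: the candidate words beyond those on the slides, all numeric values, and the brute-force search itself.

```python
from itertools import product

# Made-up visual confidences, loosely following the "Woman sharpens baby" example.
vision = {
    "S": {"woman": 0.51, "man": 0.30},
    "V": {"sharpen": 0.34, "cut": 0.23},
    "O": {"baby": 0.42, "knife": 0.38},
}
# Made-up language prior: plausibility of each (verb, object) pair in text.
verb_object_prior = {("sharpen", "baby"): 0.001, ("sharpen", "knife"): 0.60,
                     ("cut", "baby"): 0.01, ("cut", "knife"): 0.05}


def score(s, v, o, use_prior):
    """Product of visual confidences, optionally weighted by the language prior."""
    visual = vision["S"][s] * vision["V"][v] * vision["O"][o]
    return visual * verb_object_prior[(v, o)] if use_prior else visual


def best_triple(use_prior):
    """Exhaustively search all (S, V, O) combinations for the highest score."""
    triples = product(vision["S"], vision["V"], vision["O"])
    return max(triples, key=lambda t: score(t[0], t[1], t[2], use_prior))


print(best_triple(use_prior=False))
print(best_triple(use_prior=True))
```

Vision alone picks the implausible triple; the prior shifts the object to one that text corpora say can actually be sharpened.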
![Page 53: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/53.jpg)
Mining Text Corpora

| Corpora | Size of text parsed |
| --- | --- |
| British National Corpus (BNC) | 1.5GB |
| GigaWord | 26GB |
| ukWaC | 5.5GB |
| WaCkypedia_EN | 2.6GB |
| GoogleNgrams | 10^12 words |

Stanford dependency parses from the first 4 corpora are used to build the SVO language model.
The full language model used for surface realization is trained on GoogleNgrams using BerkeleyLM.
![Page 54: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/54.jpg)
Evaluation
● Subject, Verb, Object accuracy
  o Compare to SVO extracted from ground truth sentences
  o Example: “A woman shredding chicken in a kitchen” (S = woman, V = shredding, O = chicken)
  o Most Common SVO, or Any Valid SVO
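The two scoring modes can be sketched as follows. This is an assumption-laden illustration: it scores whole triples rather than each of S, V, and O separately, the function names are hypothetical, and the reference triples are invented.

```python
from collections import Counter


def most_common_svo(reference_svos):
    """The triple that appears most often among a video's reference sentences."""
    return Counter(reference_svos).most_common(1)[0][0]


def binary_accuracy(predictions, references, mode="most_common"):
    """Fraction of videos whose predicted triple counts as correct."""
    hits = 0
    for pred, refs in zip(predictions, references):
        if mode == "most_common":
            hits += pred == most_common_svo(refs)
        else:  # "any_valid": any reference triple is accepted
            hits += pred in refs
    return hits / len(predictions)


# Invented reference triples for two videos.
references = [
    [("woman", "shred", "chicken"), ("woman", "shred", "chicken"), ("person", "cut", "meat")],
    [("man", "play", "guitar"), ("man", "play", "guitar")],
]
predictions = [("person", "cut", "meat"), ("man", "play", "guitar")]
print(binary_accuracy(predictions, references, "most_common"))
print(binary_accuracy(predictions, references, "any_valid"))
```

"Any Valid" is the more lenient mode, since a prediction only has to match one of the human descriptions.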
![Page 55: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/55.jpg)
Results of SVO prediction
HVC: highest vision confidence
FGM: factor graph model with language prior
Binary accuracy of “Most Common” SVO
![Page 56: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/56.jpg)
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
Sequence-to-sequence
• Both input AND output are sequences
• So far we have assumed video and sentence are both fixed length
• What are good features? Can we learn them?
![Page 57: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/57.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 58: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/58.jpg)
Neurons in the Brain
Artificial Neural Network
![Page 59: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/59.jpg)
Neuron in the brain
“Input wire”
“Output wire”
![Page 60: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/60.jpg)
Artificial Neuron
Input → Multiply by weights → Sum → Threshold → Output
Weights: 0, -2, +4, 0, +2
![Page 61: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/61.jpg)
Artificial Neuron: Activation
Input → Multiply by weights → Sum → Threshold
Input: +4, +2, 0, +3, -2
Weights: 0, -2, +4, 0, +2
Sum: -8; after threshold: 0
![Page 62: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/62.jpg)
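Artificial Neuron: Activation
Input → Multiply by weights → Sum → Threshold
Input: +4, +2, 0, +3, -2; weights: 0, -2, +4, 0, +2; sum: -8; after threshold: 0
Input: 0, -2, 0, +2, +2; weights: 0, -2, +4, 0, +2; sum: +8; after threshold: +8
Neurons learn patterns!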
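The two activation examples can be reproduced with a few lines of Python. The slides only say "Threshold"; reading it as `max(0, sum)` is an assumption, chosen because it matches the numbers shown (-8 becomes 0, +8 stays +8).

```python
def neuron(inputs, weights):
    """Multiply inputs by weights, sum, then threshold at zero (assumed max(0, sum))."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return max(0, total)


weights = [0, -2, 4, 0, 2]                 # weights from the slide
print(neuron([4, 2, 0, 3, -2], weights))   # weighted sum is -8
print(neuron([0, -2, 0, 2, 2], weights))   # weighted sum is +8
```

The second input lines up with the pattern encoded in the weights, so the neuron fires; the first does not.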
![Page 63: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/63.jpg)
Artificial Neuron: Pattern Classification
• Classify input into class 0 or 1
• Teach neuron to predict correct class label
Example
[Figure: two example inputs, each extracted as +4, +2, 0, -3, -2 and annotated “values decrease”, with class labels 0 and 1]
![Page 64: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/64.jpg)
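Artificial Neuron: Learning
Input → Multiply by weights → Sum → Threshold
[Figure: values shown are input +4, +2, 0, -3, -2; weights 0, -2, +4, 0, +2; sum -4; activation 0; class 1. The activation is compared with the class and the weights are adjusted.]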
![Page 65: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/65.jpg)
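Artificial Neuron: Learning
Input → Multiply by weights → Sum → Threshold
[Figure: values shown are input +4, +2, 0, -3, -2; weights -2, +4, 0, +1; sum 0; activation 0; class 1; and a +1 next to “Adjust weights”. The activation is compared with the class and the weights are adjusted.]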
![Page 66: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/66.jpg)
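Artificial Neuron: Learning
Input → Multiply by weights → Sum → Threshold
[Figure: values shown are input +4, +2, 0, -3, -2; weights 0, +4, 0, +1; sum +2; activation 1; class 1; and a +1 next to “Adjust weights”. The activation is compared with the class and the weights are adjusted.]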
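The learning slides do not name an update rule. As an assumption, the sketch below uses the classic perceptron rule: when the predicted class differs from the label, nudge each weight in the direction of the input. The first sample and the starting weights come from the slides; the second sample, the bias term, and the learning rate are hypothetical.

```python
def predict(weights, bias, x):
    """Step activation: class 1 if the weighted sum is positive, else class 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train(samples, weights, bias=0, lr=1, epochs=10):
    """Perceptron rule: w <- w + lr * (target - prediction) * x."""
    weights = list(weights)
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


samples = [([4, 2, 0, -3, -2], 1),    # decreasing values, class 1 (from the slides)
           ([-2, -3, 0, 2, 4], 0)]    # hypothetical second example, class 0
weights, bias = train(samples, [0, -2, 4, 0, 2])
print(weights, bias)
print([predict(weights, bias, x) for x, _ in samples])
```

The numbers here will not match the slide figures step for step, since the slides' exact update is not recoverable from the transcript.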
![Page 67: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/67.jpg)
Artificial Neural Network
[Figure: network diagram with Input, Weights, and Output labeled, then simplified]
![Page 68: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/68.jpg)
Artificial Neural Network
Single Neuron → Neural Network
Layers: Input Layer, Hidden Layer, Output Layer
Deep Network: many hidden layers!
![Page 69: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/69.jpg)
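Artificial Neural Network
Input Layer: x1, x2, x3, x4, x5; Hidden Layer: a1, a2, a3; Output Layer: h1, h2, h3

input:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_5 \end{bmatrix}$$

weights:
$$\Theta^{(1)} = \begin{bmatrix} \theta_{11} & \cdots & \theta_{15} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{35} \end{bmatrix}, \qquad \Theta^{(2)} = \begin{bmatrix} \theta_{11} & \cdots & \theta_{13} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{33} \end{bmatrix}$$

hidden layer activations:
$$a = g(\Theta^{(1)} x)$$

output:
$$h_\Theta(x) = g(\Theta^{(2)} a)$$

$$g(z) = \frac{1}{1 + \exp(-z)}$$

[Plot: g(z), vertical axis ticks 0, 0.5, 1]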
![Page 70: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/70.jpg)
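Artificial Neural Network

input:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_5 \end{bmatrix}$$

weights:
$$\Theta^{(1)} = \begin{bmatrix} \theta_{11} & \cdots & \theta_{15} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{35} \end{bmatrix}, \qquad \Theta^{(2)} = \begin{bmatrix} \theta_{11} & \cdots & \theta_{13} \\ \vdots & \ddots & \vdots \\ \theta_{31} & \cdots & \theta_{33} \end{bmatrix}$$

hidden layer activations:
$$a = g(\Theta^{(1)} x)$$

output:
$$h_\Theta(x) = g(\Theta^{(2)} a)$$

$$g(z) = \frac{1}{1 + \exp(-z)}$$

$h_\Theta(x)$ = estimated probability that class=1 for input x

[Plot: g(z) against z, vertical axis ticks 0, 0.5, 1]
$h_\Theta(x) = 0.2$: predict class=0
$h_\Theta(x) = 0.8$: predict class=1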
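The forward pass defined by these equations fits in a short, dependency-free Python sketch. The weight values are random placeholders and the input vector is made up; only the shapes (5 inputs, 3 hidden units, 3 outputs), the sigmoid `g`, and the 0.5 decision threshold implied by the 0.2 / 0.8 examples follow the slide.

```python
import math
import random


def g(z):
    """Sigmoid activation: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, v):
    """Apply one layer: g(theta @ v), with theta given as a list of rows."""
    return [g(sum(t * x for t, x in zip(row, v))) for row in theta]


def forward(theta1, theta2, x):
    """a = g(Theta1 x), then h = g(Theta2 a)."""
    a = layer(theta1, x)
    return layer(theta2, a)


random.seed(0)
theta1 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]  # 3 x 5
theta2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # 3 x 3
h = forward(theta1, theta2, [1.0, 0.0, -1.0, 0.5, 2.0])
predicted = [1 if p >= 0.5 else 0 for p in h]
print(h, predicted)
```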
![Page 71: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/71.jpg)
Network architectures
Convolutional: Layer 1, Layer 2, Layer 3, Layer 4
Recurrent: input → hidden → output at each step, repeated over time
![Page 72: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/72.jpg)
Representing Images
Input Layer
Reshape into
a vector
![Page 73: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/73.jpg)
Convolutional Neural Network
Input Layer → Convolve with kernel → Threshold → Output Layer
Kernel:
1 0 -1
1 0 -1
1 0 -1
slide by Abi-Roozgard
![Page 74: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/74.jpg)
Convolutional Neural Network
Input Layer → Convolve with kernel → Threshold → Output Layer
Example kernel:
1 0 -1
1 0 -1
1 0 -1
General kernel weights:
w13 w12 w11
w23 w22 w21
w33 w32 w31
slide by Abi-Roozgard
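A minimal sketch of the "convolve, then threshold" step, using the 3×3 kernel from the slides. The tiny input image is made up, the output covers only positions where the kernel fits fully inside the image, and the threshold is assumed to be `max(0, value)`.

```python
def convolve2d(image, kernel):
    """True 2D convolution (kernel flipped in both axes), 'valid' positions only."""
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            total = sum(image[i + a][j + b] * flipped[a][b]
                        for a in range(kh) for b in range(kw))
            row.append(total)
        out.append(row)
    return out


def threshold(feature_map):
    """Assumed threshold: keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


kernel = [[1, 0, -1]] * 3          # kernel from the slide
image = [[0, 0, 1, 1, 0]] * 3      # made-up image: a bright vertical stripe
raw = convolve2d(image, kernel)
response = threshold(raw)
print(raw, response)
```

The kernel responds to dark-to-bright vertical edges; the bright-to-dark edge on the right gives a negative value that the threshold removes.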
![Page 75: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/75.jpg)
Convolutional Neural Network
LeNet
![Page 76: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/76.jpg)
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Rich visual structure of features deep in hierarchy.
[R-CNN]
[Zeiler-Fergus]
Maximal activations of pool5 units
conv5 DeConv visualization
![Page 77: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/77.jpg)
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
![Page 78: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/78.jpg)
Deep Convolutional-Recurrent Network
Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
![Page 79: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/79.jpg)
Deep Convolutional Neural Networks
• Learn robust high-level image representations
• Achieve state-of-the-art results on image tasks
• But, do not handle sequences
• Idea: combine with Recurrent Neural Network
[Figure: Convolutional network and Recurrent network (input → hidden → output at each step, repeated over time)]
![Page 80: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/80.jpg)
Contributions
● End-to-end deep video description
  o deep image and language model
● Leverage still image caption data
![Page 81: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/81.jpg)
Recurrent neural networks
• Distributed hidden state stores past information efficiently
• Non-linear dynamics
• With enough neurons and time, can compute any function
[Figure: input → hidden → output at each step, repeated over time]
based on slide by Geoff Hinton
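The input → hidden → output recurrence can be sketched as a simple (Elman-style) RNN step. The slide does not give equations, so the tanh non-linearity, the tiny layer sizes, and all weight values below are assumptions for illustration.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, w_hy):
    """One time step: h = tanh(W_xh x + W_hh h_prev), y = W_hy h."""
    h = [math.tanh(sum(w * xi for w, xi in zip(w_xh[k], x)) +
                   sum(w * hj for w, hj in zip(w_hh[k], h_prev)))
         for k in range(len(h_prev))]
    y = [sum(w * hk for w, hk in zip(row, h)) for row in w_hy]
    return h, y


# Made-up weights: 2 inputs, 2 hidden units, 1 output.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [-0.4, 0.6]]
w_hy = [[1.0, -1.0]]

h = [0.0, 0.0]
outputs = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, y = rnn_step(x, h, w_xh, w_hh, w_hy)   # the hidden state carries past information
    outputs.append(y[0])
print(outputs)
```

Because `h` is fed back in at every step, the same input can produce different outputs depending on what came before.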
![Page 82: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/82.jpg)
Model
1. Extract deep features from each frame (CNN applied to the input frames)
2. Create a fixed length vector representation of the video
3. Decode the vector to a sentence (RNN: input → hidden → output at each step)
![Page 83: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/83.jpg)
Background - LSTM Unit
● Easier to train than a standard RNN
● Impressive results for
  ○ speech recognition
  ○ handwriting recognition
  ○ translation
● Our model
  ○ 2 layers of LSTM units (hidden state of first is input to second)
  ○ Output: Softmax, a probability distribution over vocabulary of words
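The softmax output layer turns raw scores into a probability distribution over the vocabulary. In this sketch the five-word vocabulary and the scores are made up; a real model would produce one score per vocabulary word from the top LSTM layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["a", "boy", "is", "playing", "ball"]   # toy vocabulary
probs = softmax([2.0, 0.5, 0.1, -1.0, 0.3])          # made-up scores
next_word = vocabulary[probs.index(max(probs))]
print(next_word, probs)
```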
![Page 84: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/84.jpg)
Input Video → Convolutional Net → Recurrent Net → Output
[Figure: a CNN is applied to each video frame; the frame features are mean pooled and fed to two layers of LSTM units, which output the sentence “A boy is playing a ball”]
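Mean pooling is what turns a variable number of per-frame CNN features into one fixed length video vector. A minimal sketch, with made-up three-dimensional features standing in for real CNN activations:

```python
def mean_pool(frame_features):
    """Average each feature dimension across all frames."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


# Made-up features for three frames, three dimensions each.
frame_features = [[1.0, 0.0, 2.0],
                  [3.0, 2.0, 2.0],
                  [2.0, 4.0, 2.0]]
video_vector = mean_pool(frame_features)
print(video_vector)
```

The result has the same length no matter how many frames the video has, at the cost of discarding frame order.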
![Page 85: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/85.jpg)
Evaluation
● Subject, Verb, Object (SVO) accuracy
○ SVO extracted from the generated sentence
○ “A woman shredding chicken in a kitchen” (S: woman, V: shredding, O: chicken)
○ Most Common or Any Valid
● BLEU
● METEOR
● Human Evaluation
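One plausible reading of the two SVO scoring modes is sketched below: under "Most Common" a predicted word counts as correct only if it matches the most frequent word among the human descriptions, and under "Any Valid" it is correct if it matches any of them. This interpretation, the function names and the example words are assumptions, not the talk's evaluation code or data.

```python
from collections import Counter

def svo_correct(predicted, references, mode="most_common"):
    """Score one predicted word against the words from the human descriptions."""
    if mode == "most_common":
        target, _ = Counter(references).most_common(1)[0]
        return predicted == target
    if mode == "any_valid":
        return predicted in references
    raise ValueError("unknown mode: " + mode)

def svo_accuracy(predictions, reference_lists, mode="most_common"):
    """Fraction of videos whose predicted word is judged correct."""
    hits = [svo_correct(p, refs, mode) for p, refs in zip(predictions, reference_lists)]
    return sum(hits) / len(hits)

# Hypothetical verbs for two videos (made up for illustration).
predicted_verbs = ["shred", "play"]
reference_verbs = [["cut", "cut", "shred"], ["play", "play", "kick"]]

print(svo_accuracy(predicted_verbs, reference_verbs, "most_common"))  # 0.5
print(svo_accuracy(predicted_verbs, reference_verbs, "any_valid"))    # 1.0
```

The same function can be applied separately to subjects, verbs and objects.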
![Page 86: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/86.jpg)
Results - SVO (Binary, Most Common)
![Page 87: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/87.jpg)
Results - Generation
| Model | BLEU | METEOR |
| --- | --- | --- |
| FGM | 13.68 | 23.9 |
| LSTMFlickr | 10.29 | 19.52 |
| LSTMCOCO | 12.66 | 20.96 |
| LSTM-YT | 31.19 | 26.87 |
| LSTM-YTFlickr | 32.03 | 27.87 |
| LSTM-YTCOCO | 33.29 | 29.07 |
| LSTM-YTCOCO+Flickr | 33.29 | 28.88 |

More fluent, but not enough training data in the YouTube dataset to train a good language model
![Page 88: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/88.jpg)
Idea: pre-learn on still image captions
| Dataset | Train | Validation | Test |
| --- | --- | --- | --- |
| Flickr30k | ~28000 | 1000 | 1000 |
| COCO2014 | 82783 | 40504 | - |
| YouTube | 1200 | 100 | 670 |
![Page 89: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/89.jpg)
Pre-Train on Still Images (Flickr30k and COCO2014)
Input Image → Convolutional Net → Recurrent Net → Output
[Figure: a CNN encodes the input image and a two-layer stack of LSTMs outputs the caption "A man is scaling a cliff"]
![Page 90: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/90.jpg)
Input Image → Convolutional Net → Recurrent Net → Output
[Figure: two-layer stack of LSTMs]
![Page 91: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/91.jpg)
Fine-Tune on Videos
Input Video → Convolutional Net → Recurrent Net → Output
[Figure: a CNN is applied to each video frame, the frame features are mean pooled, and a two-layer stack of LSTMs outputs the sentence "A boy is playing a ball"]
![Page 92: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/92.jpg)
Results - SVO (Binary, Most Common)
See also Guadarrama et al. ICCV 2013
![Page 93: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/93.jpg)
Results - Generation
| Model | BLEU | METEOR |
| --- | --- | --- |
| FGM | 13.68 | 23.9 |
| LSTMFlickr | 10.29 | 19.52 |
| LSTMCOCO | 12.66 | 20.96 |
| LSTM-YT | 31.19 | 26.87 |
| LSTM-YTFlickr | 32.03 | 27.87 |
| LSTM-YTCOCO | 33.29 | 29.07 |
| LSTM-YTCOCO+Flickr | 33.29 | 28.88 |
![Page 94: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/94.jpg)
Results - Human Eval.
| Model | Relevance | Grammar |
| --- | --- | --- |
| FGM | 3.20 | 3.99 |
| LSTM-YT | 2.88 | 3.84 |
| LSTM-YTCOCO | 2.83 | 3.46 |
| LSTM-YTCOCO+Flickr | - | 3.64 |
| GroundTruth | 1.10 | 4.61 |
![Page 95: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/95.jpg)
Generated image captions
http://arxiv.org/abs/1411.4389
![Page 96: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/96.jpg)
Generated image captions
http://arxiv.org/abs/1411.4389
![Page 97: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/97.jpg)
Sequence-to-Sequence Video-to-Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko; arXiv 2015
![Page 98: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/98.jpg)
Sequence to Sequence Video to Text
![Page 99: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/99.jpg)
Results
MSVD dataset (YouTube videos)
Movie description datasets
![Page 100: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/100.jpg)
Qualitative results
Figure 3. MSVD YouTube video dataset. We present examples where the S2VT model (RGB on VGG net) generates correct descriptions involving different objects and actions for several videos (in column a). The center column (b) shows examples where the model predicts relevant but incorrect descriptions. The last column (c) shows examples where the model generates descriptions that are irrelevant to the event in the video.
Figure 4. M-VAD Movie corpus: We show a representative frame from 6 contiguous clips from the movie “Big Mommas: Like Father, Like Son”. Soft Attention (GNet + 3D-Conv) are sentences from the model in [40]. S2VT (MPII+MVAD) represents the sentences generated by our model trained on both the MPII and M-VAD datasets. DVS represents the original ground truth sentences in the dataset for each clip.
![Page 101: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/101.jpg)
Results
![Page 102: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/102.jpg)
Results
![Page 103: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/103.jpg)
Results
![Page 104: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/104.jpg)
Qualitative results on Hollywood Movies
[Figure: frames from clips (1), (2), (3), (4), (5), (6a), (6b)]
![Page 105: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/105.jpg)
![Page 106: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/106.jpg)
![Page 107: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)](https://reader035.vdocument.in/reader035/viewer/2022062515/55ccda67bb61eb07318b45c2/html5/thumbnails/107.jpg)
Thanks
Subhashini Venugopalan (UT Austin)
Huijuan Xu (UMass Lowell)
Jeff Donahue (UC Berkeley)
Marcus Rohrbach (UC Berkeley)
Raymond Mooney (UT Austin)
Trevor Darrell (UC Berkeley)
Sergio Guadarrama
References
[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[2] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV), 2013.
[3] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, 2015.
[4] J. Donahue, L. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[5] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. arXiv, 2015.