![Page 1: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/1.jpg)
Ivan Laptev
WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision:
Weakly-supervised learning from
video and images
CSClub
Saint Petersburg
November 17, 2014
Joint work with: Piotr Bojanowski – Rémi Lajugie – Maxime Oquab –
Francis Bach – Leon Bottou – Jean Ponce –
Cordelia Schmid – Josef Sivic
![Page 2: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/2.jpg)
Contacts: official site http://visionlabs.ru/; contact person: Alexander Khanin; e-mail: [email protected]; tel.: +7 (926) 988-7891
VisionLabs is a team of professionals with significant knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening up new opportunities to change the world around us for the better.
About the company
– Advertisement –
![Page 3: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/3.jpg)
The team
Alexander Khanin, Chief Executive Officer
Alexey Nekhaev, Executive Officer
Slava Kazmin, Chief Technical Officer
Ivan Laptev, Scientific advisor
Sergey Milyaev, Senior CV engineer
Alexey Kordichev, Financial advisor
Ivan Truskov, Software developer
Sergey Cherepanov, Software developer
Our team is a symbiosis of science and business.
Areas of activity:
Face recognition technology: a system for detecting fraudsters in banks
License plate recognition technology: a system for vehicle access control and accounting
Safe-city technologies: a system for detecting violations and dangerous situations
– Advertisement –
![Page 4: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/4.jpg)
– Advertisement –
Nation-scale projects
Achievements
![Page 5: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/5.jpg)
– Advertisement –
We are looking for like-minded people
Creating and deploying intelligent systems
Solving interesting practical problems
Working in a friendly, ambitious team
Thank you for your attention!
![Page 6: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/6.jpg)
What is Computer Vision?
![Page 7: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/7.jpg)
What is Computer Vision?
![Page 8: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/8.jpg)
![Page 9: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/9.jpg)
What is the recent progress?
1990s: Research: recognition at the level of a few toy objects (COIL-20 dataset). Industry: automated quality inspection (controlled lighting, scale, …).
Now: Research: ImageNet, 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge. Industry: face recognition in social media.
![Page 10: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/10.jpg)
Why image and video analysis? Data:
• ~5K image uploads every minute
• >34K hours of video uploaded every day
• TV channels recorded since the 60's
• ~30M surveillance cameras in the US => ~700K video hours/day
• ~2.5 billion new images per month
• And even more with future wearable devices
![Page 11: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/11.jpg)
Movies, TV, YouTube
Why looking at people?
How many person-pixels are in the video?
![Page 12: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/12.jpg)
Why looking at people?
How many person-pixels are in the video? Movies: 40%, TV: 35%, YouTube: 34%
![Page 13: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/13.jpg)
How many person pixels
in our daily life?
Wearable camera data: Microsoft SenseCam dataset
![Page 14: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/14.jpg)
How many person pixels in our daily life?
Wearable camera data (Microsoft SenseCam dataset): ~4%
![Page 15: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/15.jpg)
What are the difficulties?
• Large variations in appearance: occlusions, non-rigid motion, view-point changes, clothing, …
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• The action vocabulary is not well-defined (examples: Action "Open", Action "Hugging")
![Page 16: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/16.jpg)
This talk:
Brief overview of recent techniques
Weakly-supervised learning from video
and scripts
Weakly-supervised learning with
convolutional neural networks
![Page 17: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/17.jpg)
Standard visual recognition pipeline
Example classes: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar
• Collect image/video samples and corresponding class labels
• Design an appropriate data representation with certain invariance properties
• Design or use existing machine learning methods for learning and classification
![Page 18: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/18.jpg)
Bag-of-Features action recognition [Laptev, Marszałek, Schmid, Rozenfeld 2008]:
extraction of local space-time features (patches) → feature description → feature quantization (k-means clustering, k = 4000) → occurrence histogram of visual words → non-linear SVM with χ² kernel
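The pipeline above can be sketched end-to-end with scikit-learn. This is a minimal illustration on synthetic descriptors, not the authors' implementation: a real system would extract HOG/HOF space-time descriptors from video and use k = 4000 visual words.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for local space-time descriptors (e.g. HOG/HOF),
# one array of descriptors per video clip.
def fake_descriptors(center, n=200, dim=16):
    return rng.normal(center, 1.0, size=(n, dim))

train_clips = [fake_descriptors(c) for c in (0.0, 0.0, 3.0, 3.0)]
train_labels = [0, 0, 1, 1]

# 1) Build the visual vocabulary by k-means over all training descriptors
#    (a tiny k suffices for this sketch).
vocab = KMeans(n_clusters=8, n_init=10, random_state=0).fit(np.vstack(train_clips))

# 2) Quantize each clip's descriptors and form a normalized
#    occurrence histogram of visual words.
def bof_histogram(descs):
    words = vocab.predict(descs)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

X_train = np.array([bof_histogram(d) for d in train_clips])

# 3) Non-linear SVM with a chi-squared kernel (precomputed Gram matrix).
svm = SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train), train_labels)

# Classify a new clip against the training histograms.
X_test = np.array([bof_histogram(fake_descriptors(3.0))])
pred = svm.predict(chi2_kernel(X_test, X_train))
```

The χ² kernel compares histograms bin by bin, which is why the per-clip histograms are L1-normalized before the SVM.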
![Page 19: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/19.jpg)
Action classification
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”,
“Indiana Jones and the Last Crusade”
![Page 20: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/20.jpg)
Where to get training data?
• Shoot actions in the lab: KTH dataset, Weizmann dataset, … - limited variability, unrealistic
• Manually annotate existing content: HMDB, Olympic Sports, UCF50, UCF101, … - very time-consuming
• Use readily-available video scripts: www.dailyscript.com, www.movie-page.com, www.weeklyscript.com - scripts are available for 1000's of hours of movies and TV series, and describe both dynamic and static video content
![Page 21: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/21.jpg)
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
![Page 22: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/22.jpg)
![Page 23: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/23.jpg)
![Page 24: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/24.jpg)
![Page 25: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/25.jpg)
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
01:20:17 - 01:20:23
(left: subtitles; right: movie script)
• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
• Subtitles (with time information) are available for most movies
• Time can be transferred to scripts by text alignment
Script-based video annotation
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
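The text-alignment idea can be sketched with Python's difflib: each script line receives the timestamp of its best-matching subtitle when word-level similarity is high enough. The data below is a hypothetical fragment; a real aligner would also enforce monotonic ordering (e.g., via dynamic time warping).

```python
import difflib

# Hypothetical timed subtitles (start time, text) and untimed script lines.
subtitles = [
    ("01:20:17", "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20", "It wasn't my secret, Richard. Victor wanted it that way."),
    ("01:20:23", "Not even our closest friends knew about our marriage."),
]
script_lines = [
    "RICK: Why weren't you honest with me? Why did you keep your marriage a secret?",
    "Rick sits down with Ilsa.",
    "ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way.",
]

def transfer_times(script_lines, subtitles, threshold=0.5):
    """Assign each script line the time of its best-matching subtitle,
    if the word-level similarity exceeds the threshold."""
    timed = []
    for line in script_lines:
        best_time, best_score = None, threshold
        for time, text in subtitles:
            # Compare word sequences, ignoring case.
            score = difflib.SequenceMatcher(
                None, line.lower().split(), text.lower().split()
            ).ratio()
            if score > best_score:
                best_time, best_score = time, score
        timed.append((best_time, line))
    return timed

timed = transfer_times(script_lines, subtitles)
```

Scene descriptions like "Rick sits down with Ilsa." match no subtitle and stay untimed; their times must be interpolated from the surrounding dialogue, which is exactly the source of the temporal uncertainty discussed on the next slide.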
![Page 26: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/26.jpg)
Scripts as weak supervision
Uncertainty: 24:25 - 24:51
Challenges:
• Imprecise temporal localization
• No explicit spatial localization
• NLP problems, scripts ≠ training labels, e.g. "… Will gets out of the Chevrolet. …" vs. "… Erin exits her new truck …" for Get-out-car
![Page 27: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/27.jpg)
Previous work
Sivic, Everingham, and Zisserman,
''Who are you?'' -- Learning Person Specific
Classifiers from Video, In CVPR 2009.
Buehler, Everingham, and Zisserman "Learning
sign language by watching TV (using weakly
aligned subtitles)", In CVPR 2009.
Duchenne, Laptev, Sivic, Bach and Ponce,
"Automatic Annotation of Human Actions in
Video", In ICCV 2009.
…wanted to know about the history of the trees
![Page 28: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/28.jpg)
Joint Learning of Actors and Actions
Rick? Rick?
Walks? Walks?
[Bojanowski et al. ICCV 2013]
Rick walks up behind Ilsa
![Page 29: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/29.jpg)
Rick
Walks
Rick walks up behind Ilsa
Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]
![Page 30: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/30.jpg)
Formulation: Cost function
Terms: actor labels (Rick, Ilsa, Sam), actor image features, actor classifier.
![Page 31: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/31.jpg)
Formulation: Cost function
Weak supervision from scripts: person p appears at least once in clip N (e.g., p = Rick).
![Page 32: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/32.jpg)
Formulation: Cost function
Weak supervision from scripts: action a appears at least once in clip N (e.g., a = Walk).
![Page 33: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/33.jpg)
Formulation: Cost function
Weak supervision from scripts: person p appears in clip N; action a appears in clip N; person p and action a appear together in clip N.
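The transcript loses the slide's equations. As a rough sketch, assuming the model follows the square-loss discriminative clustering (DIFFRAC-style) formulation used in Bojanowski et al. ICCV 2013, the cost with the person constraint might read:

```latex
% Sketch (not the slide's exact equation): square-loss discriminative
% clustering over person tracks with weak script constraints.
% Z: binary track-to-name assignment matrix, X: track features, W: classifier.
\min_{Z \in \{0,1\}^{N \times P},\; W}\;
  \frac{1}{N}\,\lVert Z - X W \rVert_F^2 + \lambda \lVert W \rVert_F^2
\quad \text{s.t.} \quad
  \sum_{i \in \mathcal{N}} Z_{ip} \ge 1
  \;\text{ for every person } p \text{ mentioned in clip } \mathcal{N}.
```

Analogous "at least once in the clip" constraints encode action mentions and joint person-action mentions from the script.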
![Page 34: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/34.jpg)
Image and video features
• Face features: facial features [Everingham'06]; HOG descriptor on a normalized face image
• Action features: Dense Trajectory features in the person bounding box [Wang et al.'11]
![Page 35: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/35.jpg)
Results for Person Labelling
American Beauty (11 character names), Casablanca (17 character names)
![Page 36: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/36.jpg)
Results for Person + Action Labelling
Casablanca, Walking
![Page 37: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/37.jpg)
Finding Actions and Actors in Movies
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
![Page 38: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/38.jpg)
Action Learning with Ordering Constraints
[Bojanowski et al. ECCV 2014]
![Page 39: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/39.jpg)
![Page 40: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/40.jpg)
Cost Function
Weak supervision from ordering constraints on Z: each clip is annotated with an ordered sequence of action labels (action indices, e.g., 2, 4, 1, 2, 3, 2) that must be assigned, in order, to consecutive video time intervals.
![Page 41: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/41.jpg)
![Page 42: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/42.jpg)
![Page 43: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/43.jpg)
Is the optimization tractable?
• Path constraints are implicit
• Cannot use off-the-shelf solvers
• Frank-Wolfe optimization algorithm
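Frank-Wolfe only needs a linear minimization oracle over the feasible set, which is why it suits constraint sets where projection is awkward, such as the ordering constraints above. A generic sketch on the probability simplex with a toy quadratic objective (not the paper's exact problem):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=2000):
    """Minimize a smooth convex function over the probability simplex.
    The linear oracle over the simplex is a vertex: the coordinate
    with the smallest gradient entry."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0      # linear oracle: best vertex
        gamma = 2.0 / (t + 2.0)    # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy objective: f(x) = ||x - b||^2 with b inside the simplex,
# so the optimum is x* = b.
b = np.array([0.2, 0.5, 0.3])
grad = lambda x: 2.0 * (x - b)
x = frank_wolfe_simplex(grad, np.ones(3) / 3)
```

Each iterate is a convex combination of simplex vertices, so feasibility is maintained without any projection step.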
![Page 44: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/44.jpg)
Results
• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated by a sequence of n actions (2 ≤ n ≤ 11)
![Page 45: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/45.jpg)
![Page 46: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/46.jpg)
Object recognition
![Page 47: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/47.jpg)
Convolutional Neural Networks
• The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
• Krizhevsky et al.'s ILSVRC12 results improved on other methods by a large margin
2012 → 2014: GoogLeNet, 6% top-5 error
![Page 48: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/48.jpg)
CNN of Krizhevsky et al. NIPS'12
• Learns low-level features at the first layer
• Has some tricks, but the main principle is similar to LeCun'88
• Has 60M parameters and 650K neurons
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementations; neither was available until very recently
![Page 49: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/49.jpg)
Approach
1. Design training/test procedure using sliding windows
2. Train adaptation layers to map labels
See also [Girshick et al.'13], [Donahue et al.'13], [Sermanet et al.'14], [Zeiler and Fergus'13]; transfer learning workshop at ICCV'13, ImageNet workshop at ICCV'13
![Page 50: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/50.jpg)
Approach – sliding window training / testing
![Page 51: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/51.jpg)
Results
Object localization
![Page 52: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/52.jpg)
Results
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
![Page 53: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/53.jpg)
Results
![Page 54: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/54.jpg)
Vision works?
![Page 55: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/55.jpg)
Vision works?
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
![Page 56: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/56.jpg)
VOC Action Classification Taster Challenge
Given the bounding box of a person, predict whether they are performing a given action (Playing Instrument? Reading?).
Encourages research on still-image activity recognition: a more detailed understanding of images.
![Page 57: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/57.jpg)
Nine Action Classes
Phoning Playing Instrument Reading Riding Bike Riding Horse
Running Taking Photo Using Computer Walking
![Page 58: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/58.jpg)
CNN action recognition and localization
Qualitative results: reading
![Page 59: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/59.jpg)
CNN action recognition and localization
Qualitative results: phoning
![Page 60: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/60.jpg)
CNN action recognition and localization
Qualitative results: playing instrument
![Page 61: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/61.jpg)
Results PASCAL VOC 2012
Object classification
Action classification
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
![Page 62: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/62.jpg)
Are bounding boxes needed for training CNNs?
Image-level labels: Bicycle, Person
[Oquab, Bottou, Laptev, Sivic, 2014]
![Page 63: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/63.jpg)
Motivation: labeling bounding boxes is tedious
![Page 64: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/64.jpg)
Motivation: image-level labels are plentiful
“Beautiful red leaves in a back street of Freiburg”
[Kuznetsova et al., ACL 2013]
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
![Page 65: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/65.jpg)
Motivation: image-level labels are plentiful
“Public bikes in Warsaw during night”
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
![Page 66: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/66.jpg)
Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]
Example training images with bounding boxes
The locations of objects or their parts learnt by the CNN
NB: Related to multiple instance learning, e.g. [Viola et al.'05], and weakly supervised object localization, e.g. [Pandey and Lazebnik'11], [Prest et al.'12], [Oh Song et al. ICML'14], …
![Page 67: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/67.jpg)
Approach: search over object’s location
1. Efficient window sliding to find object location hypothesis
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allow multiple objects in image)
See also [Sermanet et al.'14] and [Chatfield et al.'14]
Architecture: C1-C2-C3-C4-C5 → FC6 → FC7 → FCa → FCb, with 9216-, 4096-, and 4096-dim intermediate vectors; the resulting per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …) are max-pooled over the image to give a per-image score for each class.
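Steps 1-2 can be sketched with NumPy on hypothetical score maps: running the fully convolutional network over a larger image yields a grid of per-window scores for each class, and a global max-pool keeps one score and one localization hypothesis per class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the convolutional layers applied to a larger
# image: one spatial score map per class (K classes over an H x W grid
# of sliding-window positions).
K, H, W = 20, 6, 8
score_maps = rng.normal(size=(K, H, W))

# Image-level aggregation: global max-pool over all window positions,
# keeping a single score per class.
per_image_scores = score_maps.max(axis=(1, 2))

# The arg-max position per class is the window the network "blames"
# for its decision, i.e. the localization hypothesis.
flat = score_maps.reshape(K, -1).argmax(axis=1)
locations = np.stack(np.unravel_index(flat, (H, W)), axis=1)  # (K, 2) rows of (row, col)
```

Because only the maximum survives the pooling, the gradient during training flows to exactly one window per class, which is what drives the implicit localization.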
![Page 68: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/68.jpg)
1. Efficient window sliding to find object location
Figure 2: Network architecture. The convolutional feature extraction layers C1-C5, FC6, and FC7 are trained on 1512 ImageNet classes (Oquab et al., 2014); the adaptation layers FCa, FCb are trained on Pascal VOC, followed by a final global max-pooling over the 20 class score maps. The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), or dropout, and reports its subsampling ratio with respect to the input image. See [21, 26] and Section 3 for full details.
Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learning from images containing prominent and centered objects with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in training data that often contains multiple objects in different spatial configurations and cluttered backgrounds. While these works are promising, their performance is still far from that of fully supervised methods. Our work is also related to recent methods that find distinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weakly supervised [10, 20] setting.
In contrast to the above methods, we develop a weakly supervised learning method based on convolutional neural networks (CNNs) [22, 24]. Convolutional neural networks have recently demonstrated excellent performance on a number of visual recognition tasks, including classification of entire images [11, 21, 40], predicting presence/absence of objects in cluttered scenes [4, 26, 31, 32], and localizing objects by bounding boxes [16, 32]. However, current CNN architectures assume in training a single prominent object in the image with limited background clutter [11, 21, 32, 40], or require fully annotated object locations in the image [16, 26]. Learning from images containing multiple objects in cluttered scenes with only weak object presence/absence labels has so far been limited to representing entire images without explicitly searching for the locations of individual objects [4, 31, 40], though some level of robustness to the scale and position of objects is gained by jittering.
In this work, we develop a weakly supervised convolutional neural network pipeline that learns from complex scenes containing multiple objects by explicitly searching over possible object locations and scales in the image. We demonstrate that our weakly supervised approach achieves the best published result on the Pascal VOC 2012 object classification dataset, outperforming methods training from entire images [4, 31, 40] as well as performing on par with or better than fully supervised methods [26].
3 Network architecture for weakly supervised learning
We build on the fully supervised network architecture of [26], which consists of five convolutional and four fully connected layers and assumes as input a fixed-size image patch containing a single relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce the following three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrarily sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next, and the network architecture is illustrated in Figure 2.
![Page 69: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/69.jpg)
2. Image-level aggregation using global max-pool
![Page 70: Computer Vision](https://reader036.vdocument.in/reader036/viewer/2022062513/559137411a28ab14498b45c9/html5/thumbnails/70.jpg)
3. Multi-label loss function
(to allow for multiple objects in image)108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
192normpool1:8
3256
normpool1:16
3841:16
3841:16
6144dropout
1:32
6144dropout
1:32
2048dropout
1:32
201:32
20final-pool
Convolutional feature extraction layerstrained on 1512 ImageNet classes (Oquab et al., 2014)
Adaptation layerstrained on Pascal VOC.
256pool1:32
C1 C2 C3 C4 C5 FC6 FC7 FCa FCb
Figure 2: Network architecture. The layer legend indicates the number of maps, whether the layer performscross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio withrespect to the input image. See [21, 26] and Section 3 for full details.
Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learningfrom images containing prominent and centered objects in images with limited background clut-ter. More recent efforts attempt to learn from images containing multiple objects embedded incomplex scenes [2, 9, 28] or from video [30]. Thesemethods typically localize objects with visuallyconsistent appearance in the training data that often contains multiple objects in different spatialconfigurations and cluttered backgrounds. While these works are promising their performance isstill far from the fully supervised methods. Our work is also related to recent methods that find dis-tinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weaklysupervised [10, 20] setting.
In contrast to theabovemethodswedevelop aweakly supervised learning method based on convolu-tional neural networks (CNNs) [22, 24]. Convolutional neural networkshaverecently demonstratedexcellent performance on a number of visual recognition tasks that include classification of entireimages [11, 21, 40], predicting presence/absence of objects in cluttered scenes [4, 26, 31, 32] orlocalizing objects by bounding boxes [16, 32]. However, the current CNN architectures assumein training a single prominent object in the image with limited background clutter [11, 21, 32, 40]or require fully annotated object locations in the image [16, 26]. Learning from images contain-ing multiple objects in cluttered scenes with only weak object presence/absence labels has been sofar limited to representing entire images without explicitly searching for location of individual ob-jects [4, 31, 40], though some level of robustness to the scale and position of objects is gained byjittering.
In this work, we develop a weakly supervised convolutional neural network pipeline that learnsfrom complex scenes containing multiple objects by explicitly searching over possible object loca-tions and scales in the image. We demonstrate that our weakly supervised approach achieves thebest published result on the Pascal VOC 2012 object classification dataset outperforming methodstraining from entire images [4, 31, 40] as well as performing on par or better than fully supervisedmethods [26].
3 Network architecture for weakly supervised learning
We build on the fully supervised network architecture of [26], which consists of five convolutional and four fully connected layers and takes as input a fixed-size image patch containing a single, relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce the following three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrarily sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next, and the network architecture is illustrated in Figure 2.
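The first modification, treating a fully connected layer as a convolution, can be illustrated with a small numpy sketch. The layer sizes here are illustrative, not the actual dimensions of the network in [26]: an FC layer applied to a flattened feature map is numerically identical to a convolution whose kernel spans the whole map, evaluated at one position.

```python
import numpy as np

# Hypothetical sizes: an FC layer consuming a flattened 6x6x256 feature map.
h, w, c, n_out = 6, 6, 256, 10
rng = np.random.default_rng(0)
W = rng.standard_normal((n_out, h * w * c))   # FC weight matrix
x = rng.standard_normal((h, w, c))            # feature map at one position

fc_out = W @ x.ravel()                        # the fully connected view

# The same weights viewed as n_out convolution kernels of shape (h, w, c),
# evaluated at a single spatial position of the input:
K = W.reshape(n_out, h, w, c)
conv_out = np.einsum('ohwc,hwc->o', K, x)

assert np.allclose(fc_out, conv_out)
```

On a larger input, sliding this kernel produces one n_out-vector per window position, which is exactly the score map the paper exploits.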
The cost is a sum of K (= 20) log-loss functions, one for each of the K classes:

ℓ(f(x), y) = Σ_{k=1}^{K} log(1 + exp(−y_k f_k(x)))

where f(x) is the K-vector of network outputs for image x and y is the K-vector of (+1, −1) labels indicating the presence/absence of each class.
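The per-class log-loss sum described above is straightforward to compute; this is a minimal sketch, not the authors' implementation:

```python
import math

def multilabel_log_loss(scores, labels):
    """Sum of K independent log-losses, one per class.

    scores: per-class network outputs f_k(x)
    labels: +1 (class present) or -1 (class absent)
    """
    return sum(math.log1p(math.exp(-y * f)) for f, y in zip(scores, labels))

scores = [2.0, -1.0, 0.0]   # illustrative outputs for K = 3 classes
labels = [+1, -1, +1]
loss = multilabel_log_loss(scores, labels)
```

Each term penalizes a class independently, so the loss can model several objects being present in the same image, unlike a softmax over mutually exclusive classes.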
Search for objects using max-pooling

Per-class score maps (e.g. an aeroplane map and a car map) are aggregated by max-pooling; the receptive field of the maximum-scoring neuron indicates where the network «found something». During training, a correct label increases the score for that class («Keep up the good work!»), while an incorrect label decreases it («Wrong!»).
Search for objects using max-pooling: what is the effect of errors?
Multi-scale training and testing

Figure 3: Weakly supervised training. Images are rescaled by factors in the range [0.7…1.4] and trained against per-class presence labels (chair, diningtable, sofa, pottedplant, person, car, bus, train, …).

Figure 4: Multiscale object recognition. The same rescaling is applied at test time, producing per-class scores (chair, diningtable, person, pottedplant, car, bus, train, …).
Convolutional adaptation layers. The network architecture of [26] takes as input a fixed-size image patch of 224×224 RGB pixels and outputs a 1×1×N vector of per-class scores, where N is the number of classes. The aim is to apply the network to bigger images in a sliding-window manner, thus extending its output to n×m×N, where n and m denote the number of sliding-window positions in the x- and y-directions of the image, respectively, computing the N per-class scores at all input window positions. While this type of sliding was performed in [26] by applying the network to independently extracted image patches, here we achieve the same effect by treating the fully connected adaptation layers as convolutions. For a given input image size, a fully connected layer can be seen as a special case of a convolution layer where the size of the kernel is equal to the size of the layer input. With this procedure the output of the final adaptation layer FC7 becomes a 2×2×N output score map for a 256×256 RGB input image, as shown in Figure 2. As the global stride of the network is 32 pixels, adding 32 pixels to the image width or height increases the width or height of the output score map by one. Hence, for example, a 2048×1024-pixel input would lead to a 58×26 output score map containing the scores of the network for all classes at the different locations of the input 224×224 window with a stride of 32 pixels. While this architecture is typically used for efficient classification at test time, see e.g. [32], here we also use it at training time (as discussed in Section 4) to efficiently examine the entire image for possible object locations during weakly supervised training.
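The stride arithmetic above can be checked with a small helper; the window and stride values are those stated in the text:

```python
def score_map_size(width, height, window=224, stride=32):
    """Spatial size (n, m) of the output score map for a width x height input,
    assuming a 224x224 window sliding with a global stride of 32 pixels."""
    return ((width - window) // stride + 1,
            (height - window) // stride + 1)

# 256x256 input -> 2x2 map; 2048x1024 input -> 58x26 map, matching the text.
```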
Explicit search for object’s position via max-pooling. The aim is to output a single image-level score for each of the object classes independently of the input image size. This is achieved by aggregating the n×m×N matrix of output scores for the n×m different positions of the input window into a single 1×1×N vector using a global max-pooling operation, where N is the number of classes. Note that the max-pooling operation effectively searches for the best-scoring candidate object position within the image, which is crucial for weakly supervised learning where the exact position of the object within the image is not given at training time. In addition, due to the max-pooling operation the output of the network becomes independent of the size of the input image, which will be used for multi-scale learning in Section 4.
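The global max-pooling step can be sketched with numpy; random scores stand in for the network's n×m×N output here, and the map size is the 2048×1024 example from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, n_classes = 58, 26, 20
score_map = rng.standard_normal((n, m, n_classes))   # n x m x N scores

# Global max-pooling: one image-level score per class, independent of n and m.
class_scores = score_map.max(axis=(0, 1))

# The argmax recovers, for each class, the best-scoring window position,
# which is the implicit object search exploited by weakly supervised training.
best_positions = [
    np.unravel_index(score_map[:, :, k].argmax(), (n, m))
    for k in range(n_classes)
]
```

Because only the maximum survives, the gradient flows to the single best-scoring window per class, which is what drives the localization behaviour.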
Multi-label classification cost function. The Pascal VOC classification task consists in telling whether at least one instance of a class is present in the image or not. We treat the task as a separate binary classification problem for each class.
Training videos
Test results on 80 classes in Microsoft COCO dataset