Deep Learning Lab: Computer Vision
TRANSCRIPT
![Page 1: Deep Learning Lab: Computer Vision - uni-freiburg.deais.informatik.uni-freiburg.de/teaching/ss19/deep_learning_lab/presentation_lectureCV.pdfHuman machine interaction: – Autonomous](https://reader033.vdocument.in/reader033/viewer/2022042302/5ecd5e8ba9671c5f1d4b8d3d/html5/thumbnails/1.jpg)
1 / 40
Deep Learning Lab: Computer Vision
Christian Zimmermann and Silvio Galesso
2 / 40
Outline
● Introduction to Human Pose Estimation (HPE)
● Single HPE
● Multi HPE
  – Top down
  – Bottom up
● Semantic Segmentation
● Exercise
3 / 40
Intro: What is HPE?
● Given a single color image, infer the body pose (HPE)
4 / 40
Intro: What is HPE?
● Given a single color image, infer the body pose (HPE)
● 2D pose $P$ is defined as:

$$P = \begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix} \in \mathbb{R}^{n \times 2}$$
5 / 40
● Given a single color image, infer the body pose (HPE)
● 2D pose $P$ is defined as (each row $p_i$ is one keypoint, e.g. the "nose"):

$$P = \begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix} = \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix} \in \mathbb{R}^{n \times 2}$$
6 / 40
Intro: Why HPE?
● Human machine interaction:
– Autonomous driving: infer people, their heading direction, and their intentions
From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019
7 / 40
Intro: Why HPE?
● Human machine interaction:
– Autonomous driving: infer people, their heading direction, and their intentions
– Pose based gaming: Microsoft Xbox Kinect
From: xbox.com
8 / 40
Intro: Why HPE?
● Human machine interaction:
– Autonomous driving: infer people, their heading direction, and their intentions
– Pose based gaming: Microsoft Xbox Kinect
– In robotics: learning from demonstration
From: Zimmermann et al., 3D Human Pose Estimation in RGBD Images for Robotic Task Learning, ICRA 2018
9 / 40
Intro: Why HPE?
● Human machine interaction
● Quantify movement:
  – Sport action analysis: track players during sports
  – Medicine: what is the stage of the ALS disease?
10 / 40
Intro: What makes it hard?
● Large variation in appearance
● Ambiguities
11 / 40
Intro: What makes it hard?
● Large variation in appearance
● Ambiguities
● Occlusions
● Crowding
12 / 40
Single HPE: Regression
● Directly regress Cartesian image coordinates
● Network outputs one vector of coordinates
From: Toshev et al., DeepPose: Human Pose Estimation via Deep Neural Networks, CVPR 2014
$$L = \sum_i \| \hat{P}_i - P_i \|_2^2$$

($\hat{P}_i$: prediction, $P_i$: ground truth)
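As a minimal NumPy sketch (illustrative only, not DeepPose's actual implementation), the regression loss above sums squared L2 distances over keypoints:

```python
import numpy as np

def regression_loss(pred, gt):
    """Sum of squared L2 distances between predicted and ground-truth
    keypoints (DeepPose-style direct regression loss).
    pred, gt: arrays of shape (n, 2) holding (x, y) per keypoint."""
    return np.sum(np.sum((pred - gt) ** 2, axis=1))

# Example: two keypoints, prediction off by 1 px in x for the first one
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[11.0, 20.0], [30.0, 40.0]])
print(regression_loss(pred, gt))  # 1.0
```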
13 / 40
Single HPE: Scoremap
● Network estimates per pixel keypoint likelihood
● For each keypoint there is one map
● Ground truth maps are created from point annotations
From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016
$S \in \mathbb{R}^{M \times M \times K}$, $I \in \mathbb{R}^{N \times N \times 3}$
14 / 40
Single HPE: Scoremap
● Network estimates per pixel keypoint likelihood
● For each keypoint there is one map
● Ground truth maps are created from point annotations
$S \in \mathbb{R}^{M \times M \times K}$, $I \in \mathbb{R}^{N \times N \times 3}$

From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016

$$L = \sum_i^K \| \hat{S}_i - S_i \|_2^2$$

($\hat{S}_i$: prediction, $S_i$: ground truth)
15 / 40
Single HPE: Scoremap
● Network estimates per pixel keypoint likelihood
● For each keypoint there is one map
● Ground truth maps are created from point annotations
$S \in \mathbb{R}^{M \times M \times K}$, $I \in \mathbb{R}^{N \times N \times 3}$

From: Newell et al., Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016

$$S_i(\tilde{p}) = \exp\left( -\frac{\| \tilde{p} - p_i \|^2}{\sigma^2} \right)$$

$$L = \sum_i^K \| \hat{S}_i - S_i \|_2^2$$
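Creating one ground-truth scoremap from a point annotation, per the Gaussian above, can be sketched in NumPy (function name, map size, and σ are illustrative choices):

```python
import numpy as np

def make_scoremap(keypoint, size, sigma=2.0):
    """Render one ground-truth scoremap: an unnormalized Gaussian
    centered at the annotated keypoint location.
    keypoint: (x, y); size: map side length M; returns an (M, M) array."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    d2 = (xs - keypoint[0]) ** 2 + (ys - keypoint[1]) ** 2
    return np.exp(-d2 / sigma ** 2)

smap = make_scoremap((12, 5), size=32)
# The map peaks (value 1.0) at the annotated pixel, here row=5, col=12
print(np.unravel_index(smap.argmax(), smap.shape))  # (5, 12)
```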
16 / 40
Single HPE: Softargmax
● Heat maps are learned implicitly
● Softmax squashes the map into a probability distribution
● Elementwise multiplication with the coordinate grid and summation reduce the map to the predicted coordinate
From: Luvizon, 2d/3d pose estimation and action recognition using multitask deep learning, CVPR 2018
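The softargmax readout described above can be sketched in NumPy (a minimal illustration, not Luvizon et al.'s implementation):

```python
import numpy as np

def soft_argmax(scoremap):
    """Differentiable 'soft' argmax: softmax turns the map into a
    probability distribution; the expected (x, y) coordinate is then
    the probability-weighted sum of the pixel coordinate grid."""
    m, n = scoremap.shape
    p = np.exp(scoremap - scoremap.max())  # stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:m, 0:n]
    return np.sum(p * xs), np.sum(p * ys)

# A sharply peaked map recovers (approximately) the argmax location
smap = np.zeros((16, 16))
smap[5, 12] = 50.0  # strong peak at x=12, y=5
x, y = soft_argmax(smap)
print(round(x), round(y))  # 12 5
```

Because every step is differentiable, the heat map is learned implicitly from a coordinate-level loss, combining the regression and scoremap views.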
17 / 40
Multi HPE
18 / 40
Multi HPE
● Top-Down
  – Detects persons first
  – Estimates pose for each person independently
  – Suffers from early commitment
  – Runtime scales linearly with the number of people
  – Struggles when people crowd
● Bottom-Up
  – Detects keypoints first
  – Subsequently groups keypoints into individuals
  – Keypoint detection is harder because less prior knowledge is available
  – Exhaustive grouping is an NP-hard problem
19 / 40
Top-down approach
● Example for top-down: Mask R-CNN
  – First part of the network detects bounding boxes
  – Then pool features from each bounding box and apply a sub-network ("head") on them
  – There are heads for classification, segmentation, and pose estimation
From: He et al., Mask R-CNN, ICCV 2017
20 / 40
Bottom-up approach
● Single network that estimates two entities:
  – Keypoint locations (Part Intensity Fields, also called scoremaps) → gives joint estimates
  – Association scores between keypoints that should form a limb (Part Affinity Fields) → enables grouping
From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019
21 / 40
Bottom-up approach
● Part Intensity Fields
  – Likelihood that the keypoint is present at each location
  – One scoremap per keypoint is needed
From: Kreiss et al., PifPaf: Composite Fields for Human Pose Estimation, CVPR 2019
22 / 40
Bottom-up approach
● Part Affinity Fields
  – Vector field pointing in the direction from the "start" to the "end" of a limb
  – Two maps per keypoint (one for each vector component)
From: Cao et al., Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, CVPR 2017
23 / 40
Bottom-up approach
● Grouping keypoint candidates into instances
  – Finding the optimal parse of the detected keypoint candidates is NP-hard.
  – Therefore relax complete matching to greedy bipartite graph matching, i.e. match one limb at a time
● Practical implementation:
  – Start with the most confident keypoint locations
  – Greedily grow the person instance using the PAF-based score
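The greedy bipartite matching for a single limb type can be sketched as follows (a toy illustration, not the PifPaf or OpenPose implementation; the score matrix is assumed to already contain the PAF-based association scores):

```python
import numpy as np

def greedy_limb_matching(scores):
    """Greedy bipartite matching for one limb type: scores[i, j] is the
    PAF-based association score between start-keypoint candidate i and
    end-keypoint candidate j. Repeatedly pick the highest-scoring pair
    whose endpoints are both still unmatched."""
    pairs, used_i, used_j = [], set(), set()
    # All candidate pairs, sorted by descending score
    order = np.dstack(np.unravel_index(
        np.argsort(-scores, axis=None), scores.shape))[0]
    for i, j in order:
        if i not in used_i and j not in used_j:
            pairs.append((int(i), int(j)))
            used_i.add(i)
            used_j.add(j)
    return pairs

# Two candidates per endpoint; the greedy pass resolves the conflict
scores = np.array([[0.9, 0.1],
                   [0.8, 0.7]])
print(greedy_limb_matching(scores))  # [(0, 0), (1, 1)]
```

Note the greedy pass picks (0, 0) first, which blocks the second-best pair (1, 0) and forces (1, 1); a globally optimal matcher could decide differently, which is exactly the trade-off the slide describes.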
24 / 40
Evaluation
● Common measure: Mean per joint position error (MPJPE)
$$\mathrm{MPJPE} = \frac{1}{|V|} \sum_{i \in V} \| \hat{p}_i - p_i \|_2$$

($\hat{p}_i$: prediction, $p_i$: ground truth, $V$: set of visible keypoints)
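A minimal NumPy sketch of this metric (the visibility mask and array shapes are illustrative assumptions):

```python
import numpy as np

def mpjpe(pred, gt, visible):
    """Mean per joint position error over the set V of visible keypoints.
    pred, gt: (n, 2) arrays; visible: boolean mask selecting V."""
    v = np.asarray(visible, dtype=bool)
    d = np.linalg.norm(pred[v] - gt[v], axis=1)  # per-joint L2 error
    return d.mean()

gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pred = np.array([[3.0, 4.0], [10.0, 0.0], [99.0, 99.0]])
# Third keypoint is not visible -> excluded from the average
print(mpjpe(pred, gt, [True, True, False]))  # (5 + 0) / 2 = 2.5
```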
25 / 40
Semantic Segmentation
26 / 40
Image Segmentation
Input image Segmentation mask
Definition: partitioning the image into coherent regions/subsets of pixels
Source: COCO dataset, http://cocodataset.org/#explore
27 / 40
Segmentation Tasks
● Binary segmentation
  – Assign a class to each pixel
  – 2 classes: foreground/background
● Semantic segmentation
  – Assign a class to each pixel
  – Multiple classes with semantic meaning: person, dog, sheep, pig, background, ...
● Instance segmentation
  – Predict a segmentation mask for each foreground object
  – Instance specific (usually coupled with detection)
Sources: Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015He et al., Mask R-CNN, ICCV 2017
28 / 40
Segmentation with CNNs
Encoder-Decoder architecture: input image → Encoder → bottleneck → Decoder → prediction (segmentation mask)
29 / 40
Encoder Network
Built from conv. layers (+ReLU) and pooling layers.
● Input: original resolution, low-level representation (RGB)
● Output/bottleneck: low resolution, high receptive field, high-level feature representation (high number of channels)
30 / 40
Decoder Network
Built from transposed conv. layers (+ReLU) and upsampling layers.
● Uses the feature representation to solve the task
● Increases resolution via upsampling operations and/or transposed convolutions
31 / 40
Transposed Convolutions
● Also known as upconvolutions or (wrongly) deconvolutions
● They map the input to a higher resolution output
● Can be seen as "learned upsampling" operations

"Transposed" because a convolutional filter can be unrolled into a convolutional matrix:
32 / 40
Transposed Convolutions
● Also known as upconvolutions or (wrongly) deconvolutions
● They map the input to a higher resolution output
● Can be seen as "learned upsampling" operations

"Transposed" because, writing a convolution as a matrix product $y = Cx$, the transposed convolution can be computed using the transposed matrix $C^T$ of some convolution.
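This matrix view can be made concrete in a small 1-D NumPy example (the filter values and sizes are arbitrary, chosen only for illustration):

```python
import numpy as np

# A stride-1 "valid" convolution with a length-3 filter on a length-5
# input can be written as a 3x5 matrix C, so y = C @ x.
# The transposed convolution then uses C.T, mapping length 3 back to
# length 5, i.e. it increases resolution.
f = np.array([1.0, 2.0, 3.0])
C = np.zeros((3, 5))
for i in range(3):
    C[i, i:i + 3] = f  # shifted copies of the filter

x = np.arange(5.0)
y = C @ x           # convolution: length 5 -> 3
up = C.T @ y        # transposed convolution: length 3 -> 5
print(y.shape, up.shape)  # (3,) (5,)
```

In a network the entries of $C$ are learned weights, which is why transposed convolutions are "learned upsampling"; only the sparsity pattern is fixed by the filter size and stride.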
33 / 40
Skip Connections
● "Shortcuts" from encoder activations to corresponding decoder stages
● Preserve high-res information, useful for refinement
● Improve sharpness of the output
34 / 40
Skip Connections
[Figure: a skip connection concatenates a 64×64 encoder feature map with the 64×64 decoder feature map at the same resolution; the concatenated tensor is fed to the rest of the network.]
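The concatenation step shown on this slide can be sketched in NumPy (the 64×64 resolution comes from the slide; the channel counts and function name are illustrative assumptions):

```python
import numpy as np

def skip_concat(encoder_feat, decoder_feat):
    """Skip connection by channel concatenation: encoder activations
    (carrying high spatial detail) are concatenated to the decoder
    activations at the matching resolution."""
    assert encoder_feat.shape[:2] == decoder_feat.shape[:2]
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)

enc = np.random.rand(64, 64, 32)  # 64x64 encoder features, 32 channels
dec = np.random.rand(64, 64, 32)  # 64x64 decoder features, 32 channels
out = skip_concat(enc, dec)
print(out.shape)  # (64, 64, 64)
```

The channel dimensions add up, so the next decoder convolution sees both the high-level decoder features and the high-resolution encoder detail.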
35 / 40
Example: U-Net
Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015
36 / 40
Example: ECRU
Source: Robail Yasrab, “ECRU: An Encoder-Decoder Based Convolution Neural Network (CNN) for Road-Scene Understanding”, Journal of Imaging 2018
37 / 40
Training for Segmentation
The last decoder layer is followed by a convolution with sigmoid activation to produce the mask prediction, which is compared to the ground truth mask via the Cross Entropy Loss.
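A minimal NumPy sketch of this training loss for binary segmentation (names are illustrative; in practice frameworks provide numerically stable sigmoid cross-entropy built-ins):

```python
import numpy as np

def bce_loss(logits, target, eps=1e-7):
    """Per-pixel binary cross-entropy: apply a sigmoid to the last conv
    layer's logits, then compare to the ground-truth mask."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid activation
    p = np.clip(p, eps, 1 - eps)       # avoid log(0)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

logits = np.array([[10.0, -10.0], [10.0, -10.0]])  # confident logits
target = np.array([[1.0, 0.0], [1.0, 0.0]])        # matching GT mask
print(bce_loss(logits, target) < 1e-3)  # True: near-zero loss
```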
38 / 40
Evaluating Segmentation
Common evaluation metric for detection and segmentation: Intersection over Union (IoU)

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

E.g. in object detection, $A$ is the predicted bounding box and $B$ the ground-truth box; in segmentation they are the predicted and ground-truth masks.
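A minimal NumPy sketch of IoU on binary masks (the function name and the empty-mask convention are assumptions):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union for two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return np.logical_and(a, b).sum() / union

gt = np.zeros((4, 4)); gt[:2, :2] = 1      # 4-pixel square
pred = np.zeros((4, 4)); pred[:2, 1:3] = 1  # same square, shifted right
print(iou(pred, gt))  # intersection 2, union 6 -> 0.333...
```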
39 / 40
A Few References
● Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", MICCAI 2015
● Robail Yasrab, "ECRU: An Encoder-Decoder Based Convolution Neural Network (CNN) for Road-Scene Understanding", Journal of Imaging 2018
● He et al., "Mask R-CNN", ICCV 2017
● He et al., "Deep Residual Learning for Image Recognition", CVPR 2016
● Dumoulin et al., "A guide to convolution arithmetic for deep learning", arXiv:1603.07285
40 / 40
Exercise
● Set up and train a model for pose estimation using the direct scalar regression approach.
● Implement the Softargmax loss and use it to train a second model. Compare it with the previous approach.
● Implement different encoder-decoder networks for segmentation and compare their performance.