3D Vision
• Understanding geometric relations
• between images and the 3D world
• between images
• Obtaining 3D information describing our 3D world
• from images
• from dedicated sensors
3D Vision
• Extremely important in robotics and AR / VR
• Visual navigation
• Sensing / mapping the environment
• Obstacle detection, …
• Many further application areas
• A few examples …
Raw Kinect Output:
Color + Depth
http://grouplab.cpsc.ucalgary.ca/cookbook/index.php/Technologies/Kinect
Interactive 3D Modeling
(Sinha et al., SIGGRAPH Asia 2008)
collaboration with Microsoft Research (and licensed to MS)
Johannes Schönberger
CAB G [email protected]
Martin Oswald
Torsten Sattler
Federico Camposeco
CAB G [email protected]
Peidong Liu
CAB G [email protected]
Nikolay Savinov
CAB G [email protected]
3D Vision Course Team
Katarina Tóthová
CAB G [email protected]
• Understand the concepts that relate images to the 3D world and images to other images
• Explore the state of the art in 3D vision
• Implement a 3D vision system/algorithm
Course Objectives
Learning Approach
• Introductory lectures:
• Cover basic 3D vision concepts and approaches
• Further lectures:
• Short introduction to topic
• Paper presentations (you)
(seminal papers and state of the art, related to your projects)
• 3D vision project:
• Choose topic, define scope (by week 4)
• Implement algorithm/system
• Presentation/demo and paper report
Grade distribution:
• Paper presentation & discussions: 25%
• 3D vision project & report: 75%
Slides and more: http://www.cvg.ethz.ch/teaching/3dvision/
Also check out the on-line “shape-from-video” tutorial:
http://www.cs.unc.edu/~marc/tutorial.pdf
http://www.cs.unc.edu/~marc/tutorial/
Textbooks:
• Hartley & Zisserman, Multiple View Geometry
• Szeliski, Computer Vision: Algorithms and Applications
Materials
Feb 19 Introduction
Feb 26 Geometry, Camera Model, Calibration
Mar 5 Features, Tracking / Matching
Mar 12 Project Proposals by Students
Mar 19 Structure from Motion (SfM) + papers
Mar 26 Dense Correspondence (stereo / optical flow) + papers
Apr 2 Bundle Adjustment & SLAM + papers
Apr 9 Student Midterm Presentations
Apr 16 Easter break
Apr 23 Multi-View Stereo & Volumetric Modeling + papers
Apr 30 Whitsuntide
May 7 3D Modeling with Depth Sensors + papers
May 14 3D Scene Understanding + papers
May 21 4D Video & Dynamic Scenes + papers
May 28 Student Project Demo Day = Final Presentations
Schedule
• Given known 2D/3D correspondences, compute the projection matrix
• Also estimate radial distortion (non-linear)
Camera Calibration
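As a concrete illustration (not part of the course material), here is a minimal sketch of the Direct Linear Transform for estimating the projection matrix from known 2D/3D correspondences. Radial distortion is ignored, and all names are illustrative:

```python
import numpy as np

def calibrate_dlt(X, x):
    """Estimate the 3x4 projection matrix P from n >= 6 correspondences.

    X: (n, 3) world points, x: (n, 2) image points (pixels).
    Stacks two equations per correspondence into A p = 0 and solves
    the homogeneous system with SVD (Direct Linear Transform).
    """
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = np.array([Xw, Yw, Zw, 1.0])
        A.append([*Xh, 0.0, 0.0, 0.0, 0.0, *(-u * Xh)])
        A.append([0.0, 0.0, 0.0, 0.0, *Xh, *(-v * Xh)])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)      # right singular vector of smallest singular value
    return P / np.linalg.norm(P)  # fix the overall scale

# Usage: project synthetic points with a known camera and recover it.
P_true = np.array([[800., 0., 320., 10.],
                   [0., 800., 240., 20.],
                   [0., 0., 1., 1.]])
X = np.random.default_rng(0).uniform(-1, 1, (10, 3)) + [0, 0, 5]
xh = (P_true @ np.c_[X, np.ones(10)].T).T
x = xh[:, :2] / xh[:, 2:]
P_est = calibrate_dlt(X, x)
# Reproject with the estimate; for noise-free data this matches exactly.
xh_est = (P_est @ np.c_[X, np.ones(10)].T).T
x_rep = xh_est[:, :2] / xh_est[:, 2:]
```

With noisy data the same system is solved in a least-squares sense, and the result is refined non-linearly (including radial distortion terms).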
Harris corners, KLT features, SIFT features
Key concepts: invariance of extraction and descriptors
to viewpoint, exposure and illumination changes
Feature Tracking and Matching
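As a minimal sketch of the matching side (assuming nothing beyond NumPy, and illustrative names throughout): nearest-neighbor descriptor matching with Lowe's ratio test, the standard way to keep only distinctive SIFT-style matches:

```python
import numpy as np

def match_ratio_test(desc1, desc2, ratio=0.8):
    """Nearest-neighbor descriptor matching with Lowe's ratio test.

    desc1: (n1, d), desc2: (n2, d). Returns (i, j) index pairs where the
    best match in desc2 is sufficiently better than the second best.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[k]:  # keep only distinctive matches
            matches.append((i, j))
    return matches

# Usage: two query descriptors that are near-copies of database rows 2 and 0.
rng = np.random.default_rng(1)
desc2 = rng.normal(size=(5, 8))
desc1 = desc2[[2, 0]] + 0.01
matches = match_ratio_test(desc1, desc2)
```

Real systems replace the Python loop with a k-d tree or brute-force GPU matcher, but the ratio-test logic is the same.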
Initialize motion (P1, P2 compatible with F)
Initialize structure (minimize reprojection error)
Extend motion (compute pose through matches seen in 2 or more previous views)
Extend structure (initialize new structure, refine existing structure)
Structure from Motion
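The "initialize structure" step above can be illustrated with linear two-view triangulation (a sketch with illustrative names; real pipelines refine the result by minimizing reprojection error):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from two views (DLT).

    P1, P2: 3x4 projection matrices; x1, x2: 2D image points.
    Each view contributes two rows of the homogeneous system A X = 0.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]  # dehomogenize

# Usage: two calibrated cameras with a baseline along x.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])
X = np.array([0.3, -0.2, 4.0])

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

X_est = triangulate(P1, P2, project(P1, X), project(P2, X))
```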
Stereo and Rectification
Warp images to simplify epipolar geometry
Compute correspondences for all pixels
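A sketch of the simplest dense-correspondence approach on a rectified pair: brute-force block matching with a sum-of-absolute-differences cost (illustrative only; practical stereo methods add regularization and sub-pixel refinement):

```python
import numpy as np

def disparity_sad(left, right, max_disp=8, win=3):
    """Brute-force block matching on a rectified stereo pair (SAD cost).

    For each pixel, slide a window along the same scanline in the right
    image and pick the disparity with the lowest sum of absolute differences.
    """
    h, w = left.shape
    r = win // 2
    disp = np.zeros((h, w), dtype=int)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disp + 1)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Usage: a synthetic pair where the right image is the left shifted by 3 px.
rng = np.random.default_rng(2)
left = rng.uniform(size=(12, 30))
right = np.roll(left, -3, axis=1)  # true disparity = 3 everywhere
disp = disparity_sad(left, right, max_disp=6, win=3)
```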
Joint 3D Reconstruction and Class Segmentation (Häne et al., CVPR 2013)
joint reconstruction and segmentation (ground, building, vegetation, clutter)
reconstruction only (isotropic smoothness prior)
■ Building ■ Ground ■ Vegetation ■ Clutter
Papers and Discussion
• Will cover recent state of the art
• Each student team will present a paper (5 min per team member), followed by discussion
• “Adversary” to lead the discussion
• Papers will be related to projects/topics
• Will distribute papers later (depending on chosen projects)
Projects and reports
• Project on 3D Vision-related topic
• Implement algorithm / system
• Evaluate it
• Write a report about it
• 3 Presentations / Demos:
• Project Proposal Presentation (week 4)
• Midterm Presentation (week 8)
• Project Demos (week 15)
• Ideally: Groups of 3 students
Goal:
Description:
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent Convolutional Neural Networks
The goal is to implement a deep recurrent convolutional neural network for end-to-end visual
odometry [1]
Most existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, local optimization, etc. Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This project is to implement a novel end-to-end framework for monocular VO using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module of the conventional VO pipeline. Based on the RCNNs, it not only automatically learns an effective feature representation for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show competitive performance to state-of-the-art methods, verifying that the end-to-end deep learning technique can be a viable complement to traditional VO systems.
[1] Wang et al., DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, ICRA 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102
Goal:
Description:
Deep Relative Pose Estimation
for Stereo Camera
Design a neural network to estimate the relative pose between two frames for a stereo camera.
Recently there has been some work on neural-network-based relative pose estimation between two images/frames, aimed at applications in autonomous driving. However, compared to traditional geometric methods (e.g. the 5-point algorithm), these methods have much worse accuracy. With a stereo camera we can obtain two frames captured at the same time and recover the depth for each frame without scale ambiguity, which helps the pose estimation.
This project aims to design a neural network to estimate the relative pose between two frames of a stereo camera. The students will start by studying existing neural networks for disparity/depth estimation and for pose estimation with a monocular camera. Then they will focus on the design of a neural network for the stereo camera.
[1] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. In CVPR 2017.
[2] Ummenhofer B, Zhou H, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, Brox T. DeMoN: Depth and motion network for learning monocular stereo. In CVPR 2017.
[3] Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A, Brox T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR 2016.
Required: Python, Linux
Recommended: Experience with TensorFlow, PyTorch or other deep
learning frameworks
Zhaopeng Cui
CNB G104
Goal:
Description:
Differential Rolling-Shutter SfM
Model the Rolling Shutter (RS) effect to create
RS-artifact-free images
It is well known that moving RS cameras create distorted images. The effect is typically visible when vertical structures
appear slanted.
In this work we want to model the RS effect and compensate for it. The input to the algorithm is a short image burst,
from which we will first compute the optical flow, then estimate the camera pose and camera motion parameters.
Finally we want to create a global shutter image by warping the RS image over the estimated depth into a global
shutter reference frame.
[1] Zhuang et al., “Rolling-Shutter-Aware Differential SfM and Image Rectification”, ICCV 2017
Required: C++, some experience with image processing
Recommended: Experience with OpenCV
Olivier Saurer <[email protected]>
Goal:
Description:
3DScanBox
Build a multi-camera 3D scan box
Implement a simple 3D scanner using an aluminum
frame and a set of cameras. The necessary material is
provided.
Depending on the group size and interest, the focus of the
project can be put on different aspects, ranging from
multi-camera online calibration to multi-view stereo or
fusion.
Required: C++, some experience with image processing
Recommended: OpenCV, maybe Google Ceres
Petri Tanskanen, [email protected]
Goal:
Description:
Camera Pose Estimation
For Artistic Purposes
Create a Blender plugin that finds poses and focal lengths
(extrinsic and intrinsic parameters) of a set of reference
images.
3D artists take reference images of scenes and objects they want to model. For modelling it is helpful to align the
virtual cameras in the modelling tool of choice (Blender, Maya, …) to the reference images.
There exists a Blender plugin [1] that utilizes two-point perspective and user input to find the pose and focal length of a
single reference image.
This project aims to implement an SfM plugin to find relative poses and focal lengths of a set of reference images.
Open questions:
• Rely on user input for point matches, or use SIFT + feature matching?
• Images might need to be undistorted, since Blender’s camera model does not include lens distortion.
[1] Per Gantelius, “BLAM”, https://github.com/stuffmatic/blam
Required: Python, C++
Recommended: Blender, OpenCV or COLMAP
Daniel Thul
Goal:
Description:
Transfer from Recognition to
Optical Flow by Matching Neural
Paths
Implementation of Optical Flow Method by Matching Neural Paths
The goal is to extend the stereo method of Savinov et al. [1] to optical flow. The main challenge is handling the large memory requirements by passing only a restricted subset of the most probable labels during the back-propagation phase of the label likelihoods. The method can be implemented in any deep learning framework.
[1] Savinov et al., “Matching Neural Paths: Transfer from Recognition to Correspondence Search”, NIPS 2017
Required: C++, CUDA, any deep learning framework
Lubor Ladicky, [email protected]
Goal:
Description:
Navigation by
Reinforcement Learning
Benchmark different RL algorithms on their ability to learn to navigate to a goal
You will take one of the popular RL libraries like Tensorforce [2] or OpenAI Baselines [3] and benchmark them on the
3D navigation tasks proposed in [1]. Those tasks are implemented as maps in a ViZDoom environment.
The agent is given a high reward for reaching the image-specified goal and a small reward for collecting items like
healthkits (which should ignite its curiosity and make it explore). Its goal is to maximize rewards.
You will compare the following RL methods: A3C, A2C, PPO.
[1] Savinov et al., "Semi-parametric topological memory for navigation", ICLR 2018, https://openreview.net/pdf?id=SygwwGbRW
[2] https://github.com/reinforceio/tensorforce
[3] https://github.com/openai/baselines
Required: Python
Recommended: knowledge in Machine Learning, experience with
tensorflow and RL
Nikolay Savinov
Goal:
Description:
Appearance Representation based
on Auto-Encoders
Improve appearance model with deep auto-encoder
This project aims to build efficient appearance representations of shapes observed from
multiple viewpoints and over time. Recent work [1] has addressed this using Principal
Components Analysis (PCA). The goal of this project is to explore, as an alternative,
deep auto-encoders for dimensionality reduction.
The students will build on existing tools using MATLAB and python / tensorflow to
explore appearance representations obtained from auto-encoders and compare the results
to [1].
[1] Boukhayma et al., “Eigen appearance maps of dynamic shapes”, ECCV 2016
Required: Python and MATLAB, some experience with image processing
Recommended: Some experience with deep learning / tensorflow
Dr. Vagia Tsiminaki ([email protected])
Dr. Lisa Koch ([email protected])
Auto-encoder for dimensionality reduction
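For reference, the PCA baseline of [1] boils down to projecting onto the top-k principal components. A minimal sketch via SVD (illustrative names only, not code from [1]):

```python
import numpy as np

def pca_compress(X, k):
    """Project data onto the top-k principal components (the PCA baseline
    that the auto-encoder in this project would replace).

    X: (n_samples, n_features). Returns (codes, components, mean) so that
    X is approximated by codes @ components + mean.
    """
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]                # top-k principal directions
    codes = (X - mean) @ components.T  # low-dimensional representation
    return codes, components, mean

# Usage: data lying in a 2D subspace of a 5D space reconstructs exactly with k = 2.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(50, 2)) @ basis + 3.0
codes, comps, mean = pca_compress(X, 2)
X_rec = codes @ comps + mean
```

An auto-encoder replaces the linear map `codes = (X - mean) @ components.T` with a learned non-linear encoder, and the reconstruction with a learned decoder.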
Goal:
Description:
SuperPoint: Self-Supervised Interest Point Detection
and Description
The goal is to implement a self-supervised fully convolutional neural network for interest
point detection and description [1]
This project is to implement a self-supervised framework for training interest point detectors and
descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed
to patch-based neural networks, this fully-convolutional model operates on full-sized images and jointly
computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce
Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection
repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on
the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much
richer set of interest points than the initial pre-adapted deep model and any other traditional corner
detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when
compared to LIFT, SIFT and ORB.
[1] DeTone et al., SuperPoint: Self-Supervised Interest Point Detection and Description, arXiv 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102
Goal:
Description:
Real-Time Surface Reconstruction
Perform depth-map fusion directly into the mesh.
Traditional approaches rely on volumetric representations or point-clouds to represent
the environment and fuse the different depth measurements.
In this work we want to use a mesh representation to model the environment. The goal
is to fuse new depth estimates directly into the mesh. Adaptive tessellation is used to
represent different levels of geometric details in the scene.
The input to the algorithm is a set of calibrated RGBD images. The focus of the project
is the implementation of the fusion algorithm. For this we will closely follow
Zienkiewicz et al. [1].
[1] Zienkiewicz et al., “Monocular, Real-Time Surface Reconstruction using Dynamic Level of
Detail”, 3DV 2016
Required: C++, some experience with image processing
Recommended: Experience with OpenCV
Olivier Saurer <[email protected]>
Goal:
Description:
Data Generation with a Virtual
Simulator for Autonomous Driving
Generate 3D training data with a recent open urban
driving simulator for autonomous driving.
CARLA [1] is a recently released open-source simulator for autonomous driving research. It supports
real-time acquisition of RGB images, semantic segmentation, and depth maps, which can be used as
training data for deep learning methods.
This project aims to utilize the virtual simulator to generate more kinds of training data including
2D/3D instance-level segmentation, 3D bounding boxes, 3D shapes and poses of vehicles, etc.,
which will be used for the training of deep 3D detection methods.
[1] Dosovitskiy A, Ros G, Codevilla F, López A, Koltun V. CARLA: An open urban driving simulator. Conference on Robot Learning (CoRL), 2017
Required: C++, Python
Recommended: Familiar with UE4 programming
Zhaopeng Cui
CNB G104
Goal:
Description:
Data fusion for semantic
3D reconstruction
Improve the data fusion pipeline for semantic
3D reconstruction using a learning approach
Semantic 3D reconstruction is the task of jointly reconstructing and segmenting a 3D model. It has been shown in [1] that both
tasks can benefit from each other: the 3D structure offers ground for regularizing the segmentation, while the semantic information
gives access to shape priors (e.g. the ground is flat and horizontal).
Methods presented in [1] or [2] take as input multiple depth maps and corresponding 2D semantic segmentations, and fuse them
into a modified Truncated Signed Distance Function (TSDF) [3]. Though very efficient, this fusion could be improved in order to
obtain better input for the methods.
In this project we propose to leverage the availability of semantic 3D data, and machine learning libraries such as tensorflow, in
order to learn a method to fuse the data used for semantic 3D segmentation.
[1] Dense Semantic 3D Reconstruction, Häne et al., TPAMI 2017
[2] Learning Priors for semantic 3D reconstruction, Cherabier et al., unpublished 2018
[3] A volumetric method for building complex models from range images, Curless and Levoy, SIGGRAPH 1996
Required: Python, Tensorflow
Recommended: Optimization (Maths), C++
Ian Cherabier ([email protected])
Martin Oswald ([email protected])
Goal:
Description:
Surface Reconstruction in Medical Imaging:
Data and CNNs
Creation of synthetic MR datasets and their use in testing
various surface reconstruction architectures
Required: Python, Matlab
Recommended: Experience with machine learning and TensorFlow
Katarina Tothova
Reconstruction of organ surfaces is an important task in medical image analysis,
especially in cardiac and neuro-imaging. Besides their significance in diagnosis and
surgical planning, high-quality organ surface models provide powerful measures for
statistical analysis or disease tracking.
Thanks to recent advances in machine learning, we are devising a deep neural
network–based approach for direct organ surface reconstruction from MRI data.
To test the efficacy of the proposed network architectures, it is necessary to design
and produce relevant synthetic MR data.
Goal:
Description:
3D Appearance Super-resolution Benchmark
Generate appearance super-resolution benchmark datasets
This project aims to generate a Super-Resolution Appearance dataset and provide a systematic
benchmark for evaluation. Previous work [1] presents a framework for synthetic generation of
realistic benchmarks for 3D reconstruction from images. ETH3D Benchmark [2] covers a
variety of indoor and outdoor scenes. The goal of this project is to build on these works and
generate a super-resolved appearance dataset for the multi-view case.
[1] A. Ley et al. “SyB3R: A Realistic Synthetic Benchmark for 3D Reconstruction from
Images” ECCV 2016
[2] T. Schöps et al. “A Multi-View Stereo Benchmark with High-Resolution Images and
Multi-Camera Videos” CVPR 2017
Required: Matlab/Python, some experience with image processing
Recommended: Experience with C++, scripting language
Dr. Vagia Tsiminaki ([email protected])
Goal:
Description:
Super-resolving Appearance of 3D Faces for
Dermatology App
Super-resolve appearance for mobile phone applications
This project aims to implement a super-resolution algorithm for appearance representations of 3D
faces. Previous work [1] presented a method to retrieve high resolution textures of objects observed
in multiple videos under small object deformations. The goal of this project is to implement the
proposed method for mobile phone applications where performance in terms of time and memory
are important.
The students will implement the super-resolution framework [1] using C++.
The project can be built upon an existing C++/CUDA implementation of [2].
[1] Tsiminaki et al. “High resolution 3D shape texture from multiple videos” CVPR 2014
[2] D. Mitzel and T. Pock and T. Schoenemann and D. Cremers, Video Super Resolution using
Duality based TV-L1 Optical Flow, DAGM, pages 432-441, 2009
Required: C++, some experience with image processing
Recommended: Experience with Matlab
Dr. Vagia Tsiminaki ([email protected])
Dr. Martin Oswald ([email protected])
Goal:
Description:
Motion blur aware camera pose tracking
The goal is to implement a camera pose tracker for motion blurred images
A camera pose tracker is usually the front-end of a visual odometry (VO) algorithm. Most existing works
assume that the input images to VO are sharp. However, images can easily be blurred if the camera moves
too fast with a long exposure time, which in turn makes the VO fail.
In this project, we plan to investigate and implement a motion blur aware camera pose tracker. To make
the problem tractable, we assume the reference image is sharp and only current image is being motion
blurred. Furthermore, we assume the depth map corresponding to the reference image is already known.
All the required data can be generated with a simulation tool, which is already set up for you.
Required: Good programming skills in C++
Peidong Liu and Vagia Tsiminaki, CNB D102
3D Vision, Spring Semester 2018
Goal:
Requirements / Tools:
Supervisor:
Description:
Your Own Project
Learn about the techniques presented in the lecture
Choose your own topic!
Available hardware:
Google Tango Tablets
Microsoft HoloLens
GoPro Cameras
Intel RealSense Sensor
We find one for you
Required: Related to 3D Vision / topics of the lecture
Your Next Steps
• Find a group (ideally: groups of 3)
• Find a project (one of ours or your own)
• Topic subscription via doodle in a few days:
• For questions contact us via the lecture Moodle (preferred) or contact Nikolay per email
• First come first serve!
• Do not contact supervisors directly!
• After topic assignment: talk with your supervisor
• Write a project proposal
• Don’t worry: You’ll get reminders!
Feb 19 Introduction
Feb 26 Geometry, Camera Model, Calibration
Mar 5 Features, Tracking / Matching
Mar 12 Project Proposals by Students
Mar 19 Structure from Motion (SfM) + papers
Mar 26 Dense Correspondence (stereo / optical flow) + papers
Apr 2 Easter break
Apr 9 Bundle Adjustment & SLAM + papers
Apr 16 Student Midterm Presentations
Apr 23 Multi-View Stereo & Volumetric Modeling + papers
Apr 30 3D Modeling with Depth Sensors + papers
May 7 3D Scene Understanding + papers
May 14 4D Video & Dynamic Scenes + papers
May 21 Whitsuntide
May 28 Student Project Demo Day = Final Presentations
Schedule