3D Vision
• Understanding geometric relations
• between images and the 3D world
• between images
• Obtaining 3D information describing our 3D world
• from images
• from dedicated sensors
3D Vision
• Extremely important in robotics and AR / VR
• Visual navigation
• Sensing / mapping the environment
• Obstacle detection, …
• Many further application areas
• A few examples …
Raw Kinect Output:
Color + Depth
http://grouplab.cpsc.ucalgary.ca/cookbook/index.php/Technologies/Kinect
Interactive 3D Modeling
(Sinha et al., SIGGRAPH Asia 2008)
collaboration with Microsoft Research (and licensed to MS)
Johannes Schönberger
CAB G [email protected]
Martin Oswald
Torsten Sattler
Federico Camposeco
CAB G [email protected]
Peidong Liu
CAB G [email protected]
Nikolay Savinov
CAB G [email protected]
3D Vision Course Team
Katarina Tóthová
CAB G [email protected]
• Understand the concepts that relate images to the 3D world and images to other images
• Explore the state of the art in 3D vision
• Implement a 3D vision system/algorithm
Course Objectives
Learning Approach
• Introductory lectures:
• Cover basic 3D vision concepts and approaches
• Further lectures:
• Short introduction to topic
• Paper presentations (you)
(seminal papers and state of the art, related to your projects)
• 3D vision project:
• Choose topic, define scope (by week 4)
• Implement algorithm/system
• Presentation/demo and paper report
Grade distribution:
• Paper presentation & discussions: 25%
• 3D vision project & report: 75%
Slides and more: http://www.cvg.ethz.ch/teaching/3dvision/
Also check out the on-line “shape-from-video” tutorial:
http://www.cs.unc.edu/~marc/tutorial.pdf
http://www.cs.unc.edu/~marc/tutorial/
Textbooks:
• Hartley & Zisserman, Multiple View Geometry
• Szeliski, Computer Vision: Algorithms and Applications
Materials
Feb 19 Introduction
Feb 26 Geometry, Camera Model, Calibration
Mar 5 Features, Tracking / Matching
Mar 12 Project Proposals by Students
Mar 19 Structure from Motion (SfM) + papers
Mar 26 Dense Correspondence (stereo / optical flow) + papers
Apr 2 Bundle Adjustment & SLAM + papers
Apr 9 Student Midterm Presentations
Apr 16 Easter break
Apr 23 Multi-View Stereo & Volumetric Modeling + papers
Apr 30 Whitsuntide
May 7 3D Modeling with Depth Sensors + papers
May 14 3D Scene Understanding + papers
May 21 4D Video & Dynamic Scenes + papers
May 28 Student Project Demo Day = Final Presentations
Schedule
• Given known 2D/3D correspondences, compute the projection matrix
• Also estimate radial distortion (non-linear)
Camera Calibration
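As a concrete illustration (not part of the course material), here is a minimal sketch of the Direct Linear Transform for estimating the projection matrix from known 2D/3D correspondences. Radial distortion is ignored, and all names are illustrative:

```python
import numpy as np

def calibrate_dlt(X, x):
    """Estimate the 3x4 projection matrix P from n >= 6 correspondences.

    X: (n, 3) world points, x: (n, 2) image points (pixels).
    Stacks two equations per correspondence into A p = 0 and solves
    the homogeneous system with SVD (Direct Linear Transform).
    """
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = np.array([Xw, Yw, Zw, 1.0])
        A.append([*Xh, 0.0, 0.0, 0.0, 0.0, *(-u * Xh)])
        A.append([0.0, 0.0, 0.0, 0.0, *Xh, *(-v * Xh)])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)      # right singular vector of smallest singular value
    return P / np.linalg.norm(P)  # fix the overall scale

# Usage: project synthetic points with a known camera and recover it.
P_true = np.array([[800., 0., 320., 10.],
                   [0., 800., 240., 20.],
                   [0., 0., 1., 1.]])
X = np.random.default_rng(0).uniform(-1, 1, (10, 3)) + [0, 0, 5]
xh = (P_true @ np.c_[X, np.ones(10)].T).T
x = xh[:, :2] / xh[:, 2:]
P_est = calibrate_dlt(X, x)
# Reproject with the estimate; for noise-free data this matches exactly.
xh_est = (P_est @ np.c_[X, np.ones(10)].T).T
x_rep = xh_est[:, :2] / xh_est[:, 2:]
```

With noisy data the same system is solved in a least-squares sense, and the result is refined non-linearly (including radial distortion terms).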
Harris corners, KLT features, SIFT features
Key concepts: invariance of extraction and descriptors
to viewpoint, exposure and illumination changes
Feature Tracking and Matching
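As a minimal sketch of the matching side (assuming nothing beyond NumPy, and illustrative names throughout): nearest-neighbor descriptor matching with Lowe's ratio test, the standard way to keep only distinctive SIFT-style matches:

```python
import numpy as np

def match_ratio_test(desc1, desc2, ratio=0.8):
    """Nearest-neighbor descriptor matching with Lowe's ratio test.

    desc1: (n1, d), desc2: (n2, d). Returns (i, j) index pairs where the
    best match in desc2 is sufficiently better than the second best.
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[k]:  # keep only distinctive matches
            matches.append((i, j))
    return matches

# Usage: two query descriptors that are near-copies of database rows 2 and 0.
rng = np.random.default_rng(1)
desc2 = rng.normal(size=(5, 8))
desc1 = desc2[[2, 0]] + 0.01
matches = match_ratio_test(desc1, desc2)
```

Real systems replace the Python loop with a k-d tree or brute-force GPU matcher, but the ratio-test logic is the same.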
Initialize motion (P1, P2 compatible with F)
Initialize structure (minimize reprojection error)
Extend motion (compute pose through matches seen in 2 or more previous views)
Extend structure (initialize new structure, refine existing structure)
Structure from Motion
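The "initialize structure" step above can be illustrated with linear two-view triangulation (a sketch with illustrative names; real pipelines refine the result by minimizing reprojection error):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from two views (DLT).

    P1, P2: 3x4 projection matrices; x1, x2: 2D image points.
    Each view contributes two rows of the homogeneous system A X = 0.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]  # dehomogenize

# Usage: two calibrated cameras with a baseline along x.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])
X = np.array([0.3, -0.2, 4.0])

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

X_est = triangulate(P1, P2, project(P1, X), project(P2, X))
```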
Stereo and Rectification
Warp images to simplify epipolar geometry
Compute correspondences for all pixels
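A sketch of the simplest dense-correspondence approach on a rectified pair: brute-force block matching with a sum-of-absolute-differences cost (illustrative only; practical stereo methods add regularization and sub-pixel refinement):

```python
import numpy as np

def disparity_sad(left, right, max_disp=8, win=3):
    """Brute-force block matching on a rectified stereo pair (SAD cost).

    For each pixel, slide a window along the same scanline in the right
    image and pick the disparity with the lowest sum of absolute differences.
    """
    h, w = left.shape
    r = win // 2
    disp = np.zeros((h, w), dtype=int)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disp + 1)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Usage: a synthetic pair where the right image is the left shifted by 3 px.
rng = np.random.default_rng(2)
left = rng.uniform(size=(12, 30))
right = np.roll(left, -3, axis=1)  # true disparity = 3 everywhere
disp = disparity_sad(left, right, max_disp=6, win=3)
```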
Joint 3D Reconstruction and Class Segmentation (Häne et al., CVPR 2013)
joint reconstruction and segmentation (ground, building, vegetation, clutter)
reconstruction only (isotropic smoothness prior)
■ Building ■ Ground ■ Vegetation ■ Clutter
Papers and Discussion
• Will cover recent state of the art
• Each student team will present a paper (5 min per team member), followed by discussion
• “Adversary” to lead the discussion
• Papers will be related to projects/topics
• Will distribute papers later (depending on chosen projects)
Projects and reports
• Project on 3D Vision-related topic
• Implement algorithm / system
• Evaluate it
• Write a report about it
• 3 Presentations / Demos:
• Project Proposal Presentation (week 4)
• Midterm Presentation (week 8)
• Project Demos (week 15)
• Ideally: Groups of 3 students
Goal:
Description:
DeepVO: Towards End-to-End Visual Odometry with
Deep Recurrent Convolutional Neural Networks
The goal is to implement a deep recurrent convolutional neural network for end-to-end visual
odometry [1]
Most existing VO algorithms are developed under a standard pipeline including feature extraction, feature matching, motion estimation, local optimization, etc. Although some of them have demonstrated superior performance, they usually need to be carefully designed and specifically fine-tuned to work well in different environments. Some prior knowledge is also required to recover an absolute scale for monocular VO. This project is to implement a novel end-to-end framework for monocular VO using deep Recurrent Convolutional Neural Networks (RCNNs). Since it is trained and deployed in an end-to-end manner, it infers poses directly from a sequence of raw RGB images (videos) without adopting any module of the conventional VO pipeline. Based on the RCNNs, it not only automatically learns an effective feature representation for the VO problem through Convolutional Neural Networks, but also implicitly models sequential dynamics and relations using deep Recurrent Neural Networks. Extensive experiments on the KITTI VO dataset show competitive performance to state-of-the-art methods, verifying that the end-to-end deep learning technique can be a viable complement to traditional VO systems.
[1] Wang et al., DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, ICRA 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102
Goal:
Description:
Deep Relative Pose Estimation
for Stereo Camera
Design a neural network to estimate the relative pose between two frames for a stereo camera.
Recently there has been some work on neural-network-based relative pose estimation between two images/frames, aimed at applications in autonomous driving. However, compared to traditional geometric methods (e.g. the 5-point algorithm), these methods have much worse accuracy. With a stereo camera we can obtain two frames captured at the same time and recover the depth for each frame without scale ambiguity, which helps the pose estimation.
This project aims to design a neural network to estimate the relative pose between two frames of a stereo camera. The students will start by studying existing neural networks for disparity/depth estimation and for pose estimation with a monocular camera. Then they will focus on the design of a neural network for the stereo camera.
[1] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. In CVPR 2017.
[2] Ummenhofer B, Zhou H, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, Brox T. DeMoN: Depth and motion network for learning monocular stereo. In CVPR 2017.
[3] Mayer N, Ilg E, Hausser P, Fischer P, Cremers D, Dosovitskiy A, Brox T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR 2016.
Required: Python, Linux
Recommended: Experience with TensorFlow, PyTorch or other deep
learning frameworks
Zhaopeng Cui
CNB G104
Goal:
Description:
Differential Rolling-Shutter SfM
Model the Rolling Shutter (RS) effect to create
RS-artifact-free images
It is well known that moving RS cameras create distorted images. The effect is typically visible when vertical structures
appear slanted.
In this work we want to model the RS effect and compensate for it. The input to the algorithm is a short image burst,
from which we will first compute the optical flow, then estimate the camera pose and camera motion parameters.
Finally we want to create a global shutter image by warping the RS image over the estimated depth into a global
shutter reference frame.
[1] Zhuang et al., “Rolling-Shutter-Aware Differential SfM and Image Rectification”, ICCV 2017
Required: C++, some experience with image processing
Recommended: Experience with OpenCV
Olivier Saurer <[email protected]>
Goal:
Description:
3DScanBox
Build a multi-camera 3D scan box
Implement a simple 3D scanner using an aluminum
frame and a set of cameras. The necessary material is
provided.
Depending on the group size and interest, the focus of the
project can be put on different aspects, ranging from
multi-camera online calibration to multi-view stereo or
fusion.
Required: C++, some experience with image processing
Recommended: OpenCV, maybe Google Ceres
Petri Tanskanen, [email protected]
Goal:
Description:
Camera Pose Estimation
For Artistic Purposes
Create a Blender plugin that finds poses and focal lengths
(extrinsic and intrinsic parameters) of a set of reference
images.
3D artists take reference images of scenes and objects they want to model. For modelling it is helpful to align the
virtual cameras in the modelling tool of choice (Blender, Maya, …) to the reference images.
There exists a Blender plugin [1] that utilizes two-point perspective and user input to find the pose and focal length of a
single reference image.
This project aims to implement an SfM plugin to find relative poses and focal lengths of a set of reference images.
Open questions:
• Rely on user input for point matches, or use SIFT + feature matching?
• Images might need to be undistorted, since Blender’s camera model does not include lens distortion.
[1] Per Gantelius, “BLAM”, https://github.com/stuffmatic/blam
Required: Python, C++
Recommended: Blender, OpenCV or COLMAP
Daniel Thul
Goal:
Description:
Transfer from Recognition to
Optical Flow by Matching Neural
Paths
Implementation of Optical Flow Method by Matching Neural Paths
The goal is to extend the stereo method of Savinov et al. [1] to optical flow. The main challenge is handling the large memory requirements by passing only a restricted subset of the most probable labels during the back-propagation phase of the label likelihoods. The method can be implemented in any deep learning framework.
[1] Savinov et al., “Matching Neural Paths: Transfer from Recognition to Correspondence Search”, NIPS 2017
Required: C++, CUDA, any deep learning framework
Lubor Ladicky, [email protected]
Goal:
Description:
Navigation by
Reinforcement Learning
Benchmark different RL algorithms on their ability to learn to navigate to a goal
You will take one of the popular RL libraries like Tensorforce [2] or OpenAI Baselines [3] and benchmark them on the
3D navigation tasks proposed in [1]. Those tasks are implemented as maps in a ViZDoom environment.
The agent is given a high reward for reaching the image-specified goal and a small reward for collecting items like
healthkits (which should ignite its curiosity and make it explore). Its goal is to maximize rewards.
You will compare the following RL methods: A3C, A2C, PPO.
[1] Savinov et al., "Semi-parametric topological memory for navigation", ICLR 2018, https://openreview.net/pdf?id=SygwwGbRW
[2] https://github.com/reinforceio/tensorforce
[3] https://github.com/openai/baselines
Required: Python
Recommended: knowledge in Machine Learning, experience with
tensorflow and RL
Nikolay Savinov
Goal:
Description:
Appearance Representation based
on Auto-Encoders
Improve appearance model with deep auto-encoder
This project aims to build efficient appearance representations of shapes observed from
multiple viewpoints and over time. Recent work [1] has addressed this using Principal
Components Analysis (PCA). The goal of this project is to explore, as an alternative,
deep auto-encoders for dimensionality reduction.
The students will build on existing tools using MATLAB and python / tensorflow to
explore appearance representations obtained from auto-encoders and compare the results
to [1].
[1] Boukhayma et al., “Eigen appearance maps of dynamic shapes”, ECCV 2016
Required: Python and MATLAB, some experience with image processing
Recommended: Some experience with deep learning / tensorflow
Dr. Vagia Tsiminaki ([email protected])
Dr. Lisa Koch ([email protected])
Auto-encoder for dimensionality reduction
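For reference, the PCA baseline of [1] boils down to projecting onto the top-k principal components. A minimal sketch via SVD (illustrative names only, not code from [1]):

```python
import numpy as np

def pca_compress(X, k):
    """Project data onto the top-k principal components (the PCA baseline
    that the auto-encoder in this project would replace).

    X: (n_samples, n_features). Returns (codes, components, mean) so that
    X is approximated by codes @ components + mean.
    """
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]                # top-k principal directions
    codes = (X - mean) @ components.T  # low-dimensional representation
    return codes, components, mean

# Usage: data lying in a 2D subspace of a 5D space reconstructs exactly with k = 2.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(50, 2)) @ basis + 3.0
codes, comps, mean = pca_compress(X, 2)
X_rec = codes @ comps + mean
```

An auto-encoder replaces the linear map `codes = (X - mean) @ components.T` with a learned non-linear encoder, and the reconstruction with a learned decoder.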
Goal:
Description:
SuperPoint: Self-Supervised Interest Point Detection
and Description
The goal is to implement a self-supervised fully convolutional neural network for interest
point detection and description [1]
This project is to implement a self-supervised framework for training interest point detectors and
descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed
to patch-based neural networks, this fully-convolutional model operates on full-sized images and jointly
computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce
Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection
repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on
the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much
richer set of interest points than the initial pre-adapted deep model and any other traditional corner
detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when
compared to LIFT, SIFT and ORB.
[1] DeTone et al., SuperPoint: Self-Supervised Interest Point Detection and Description, arXiv 2017
Recommended: Python and prior knowledge in machine learning
Peidong Liu, CNB D102
Goal:
Description:
Real-Time Surface Reconstruction
Perform depth-map fusion directly into the mesh.
Traditional approaches rely on volumetric representations or point-clouds to represent
the environment and fuse the different depth measurements.
In this work we want to use a mesh representation to model the environment. The goal
is to fuse new depth estimates directly into the mesh. Adaptive tessellation is used to
represent different levels of geometric details in the scene.
The input to the algorithm is a set of calibrated RGBD images. The focus of the project
is the implementation of the fusion algorithm. For this we will closely follow
Zienkiewicz et al. [1].
[1] Zienkiewicz et al., “Monocular, Real-Time Surface Reconstruction using Dynamic Level of
Detail”, 3DV 2016
Required: C++, some experience with image processing
Recommended: Experience with OpenCV
Olivier Saurer <[email protected]>
Goal:
Description:
Data Generation with a Virtual
Simulator for Autonomous Driving
Generate 3D training data with a recent open urban
driving simulator for autonomous driving.
CARLA [1] is a recently released open-source simulator for autonomous driving research. It supports
real-time acquisition of RGB images, semantic segmentation, and depth maps, which can be used as
training data for deep learning methods.
This project aims to utilize the virtual simulator to generate more kinds of training data including
2D/3D instance-level segmentation, 3D bounding boxes, 3D shapes and poses of vehicles, etc.,
which will be used for the training of deep 3D detection methods.
[1] Dosovitskiy A, Ros G, Codevilla F, López A, Koltun V. CARLA: An open urban driving simulator. Conference on Robot Learning (CoRL), 2017
Required: C++, Python
Recommended: Familiar with UE4 programming
Zhaopeng Cui
CNB G104
Goal:
Description:
Data fusion for semantic
3D reconstruction
Improve the data fusion pipeline for semantic
3D reconstruction using a learning approach
Semantic 3D reconstruction is the task of jointly reconstructing and segmenting a 3D model. It has been shown in [1] that both
tasks can benefit from each other: the 3D structure offers ground for regularizing the segmentation, while the semantic information
gives access to shape priors (e.g. the ground is flat and horizontal).
Methods presented in [1] or [2] take as input multiple depth maps and corresponding 2D semantic segmentations, and fuse them
into a modified Truncated Signed Distance Function (TSDF) [3]. Though very efficient, this fusion could be improved in order to
obtain better input for the methods.
In this project we propose to leverage the availability of semantic 3D data, and machine learning libraries such as tensorflow, in
order to learn a method to fuse the data used for semantic 3D segmentation.
[1] Dense Semantic 3D Reconstruction, Häne et al., TPAMI 2017
[2] Learning Priors for semantic 3D reconstruction, Cherabier et al., unpublished 2018
[3] A volumetric method for building complex models from range images, Curless and Levoy, SIGGRAPH 1996
Required: Python, Tensorflow
Recommended: Optimization (Maths), C++
Ian Cherabier ([email protected])
Martin Oswald ([email protected])
Goal:
Description:
Surface Reconstruction in Medical Imaging:
Data and CNNs
Creation of synthetic MR datasets and their use in testing
various surface reconstruction architectures
Required: Python, Matlab
Recommended: Experience with machine learning and TensorFlow
Katarina Tothova
Reconstruction of organ surfaces is an important task in medical image analysis,
especially in cardiac and neuro-imaging. Besides their significance in diagnosis and
surgical planning, high-quality organ surface models provide powerful measures for
statistical analysis or disease tracking.
Thanks to recent advances in machine learning, we are devising a deep neural
network–based approach for direct organ surface reconstruction from MRI data.
To test the efficacy of the proposed network architectures, it is necessary to design
and produce relevant synthetic MR data.
Goal:
Description:
3D Appearance Super-resolution Benchmark
Generate appearance super-resolution benchmark datasets
This project aims to generate a Super-Resolution Appearance dataset and provide a systematic
benchmark for evaluation. Previous work [1] presents a framework for synthetic generation of
realistic benchmarks for 3D reconstruction from images. ETH3D Benchmark [2] covers a
variety of indoor and outdoor scenes. The goal of this project is to build on these works and
generate a super-resolved appearance dataset for the multi-view case.
[1] A. Ley et al. “SyB3R: A Realistic Synthetic Benchmark for 3D Reconstruction from
Images” ECCV 2016
[2] T. Schöps et al. “A Multi-View Stereo Benchmark with High-Resolution Images and
Multi-Camera Videos” CVPR 2017
Required: Matlab/Python, some experience with image processing
Recommended: Experience with C++, scripting language
Dr. Vagia Tsiminaki ([email protected])
Goal:
Description:
Super-resolving Appearance of 3D Faces for
Dermatology App
Super-resolve appearance for mobile phone applications
This project aims to implement a super-resolution algorithm for appearance representations of 3D
faces. Previous work [1] presented a method to retrieve high resolution textures of objects observed
in multiple videos under small object deformations. The goal of this project is to implement the
proposed method for mobile phone applications where performance in terms of time and memory
are important.
The students will implement the super-resolution framework [1] using C++.
The project can be built upon an existing C++/CUDA implementation of [2].
[1] Tsiminaki et al. “High resolution 3D shape texture from multiple videos” CVPR 2014
[2] D. Mitzel and T. Pock and T. Schoenemann and D. Cremers, Video Super Resolution using
Duality based TV-L1 Optical Flow, DAGM, pages 432-441, 2009
Required: C++, some experience with image processing
Recommended: Experience with Matlab
Dr. Vagia Tsiminaki ([email protected])
Dr. Martin Oswald ([email protected])
Goal:
Description:
Motion blur aware camera pose tracking
The goal is to implement a camera pose tracker for motion blurred images
A camera pose tracker is usually the front-end of a visual odometry (VO) algorithm. Most existing works
assume that the input images to VO are sharp. However, images can easily be blurred if the camera moves
too fast with a long exposure time, which in turn makes the VO fail.
In this project, we plan to investigate and implement a motion blur aware camera pose tracker. To make
the problem tractable, we assume the reference image is sharp and only current image is being motion
blurred. Furthermore, we assume the depth map corresponding to the reference image is already known.
All the required data can be generated with a simulation tool, which is already set up for you.
Required: Good programming skills in C++
Peidong Liu and Vagia Tsiminaki, CNB D102
3D Vision, Spring Semester 2018
Goal:
Requirements / Tools:
Supervisor:
Description:
Your Own Project
Learn about the techniques presented in the lecture
Choose your own topic!
Available hardware:
Google Tango Tablets
Microsoft HoloLens
GoPro Cameras
Intel RealSense Sensor
We find one for you
Required: Related to 3D Vision / topics of the lecture
Your Next Steps
• Find a group (ideally: groups of 3)
• Find a project (one of ours or your own)
• Topic subscription via doodle in a few days:
• For questions contact us via the lecture Moodle (preferred) or contact Nikolay per email
• First come first serve!
• Do not contact supervisors directly!
• After topic assignment: talk with your supervisor
• Write a project proposal
• Don’t worry: You’ll get reminders!
Feb 19 Introduction
Feb 26 Geometry, Camera Model, Calibration
Mar 5 Features, Tracking / Matching
Mar 12 Project Proposals by Students
Mar 19 Structure from Motion (SfM) + papers
Mar 26 Dense Correspondence (stereo / optical flow) + papers
Apr 2 Easter break
Apr 9 Bundle Adjustment & SLAM + papers
Apr 16 Student Midterm Presentations
Apr 23 Multi-View Stereo & Volumetric Modeling + papers
Apr 30 3D Modeling with Depth Sensors + papers
May 7 3D Scene Understanding + papers
May 14 4D Video & Dynamic Scenes + papers
May 21 Whitsuntide
May 28 Student Project Demo Day = Final Presentations
Schedule