TRANSCRIPT
Carsten Rother Microsoft Research Cambridge
~140 employees (~100 Researchers, ~30 RSDEs, ~10 Admin)
Six different groups:
Computer-Mediated Living
Machine Learning & Perception
Cambridge Innovation Development
Computational Science
Programming Principles & Tools
Systems & Networking
• Computer Vision group: medical vision, recognition, reconstruction, image editing, …
• Machine learning group: Infer.Net, Online Services and Advertisement, Xbox Ranking
• Constrained Reasoning group: Planning and Optimization
• Socio-Digital Systems: Understanding human needs for future technology
• Sensors and Devices: SenseCam, Gadgeteer, …
• Interactive 3D Technologies group
I3D mission: new user experiences, at the intersection of machine learning, hardware design, human studies, graphics, and computer vision
Intersection workshop (May 2012, Cambridge): http://research.microsoft.com/en-us/events/intersection12/
• All factors in the graph are trees
• Discriminative training of millions of parameters
• We can handle many loss functions
Decision/Regression Trees + Random Fields
Discrete labelling tasks: [figure: noisy input; ours; [Zoran, Weiss, ICCV ’11]]
Continuous labelling tasks: [figure: test input; ground truth; trees only vs. trees & field]
• PatchMatch stereo, BMVC ’11 • PatchMatchBP stereo, BMVC ’12
• SurfaceStereo, ObjectStereo, CVPR ‘10,’11 – a review • SceneStereo, ECCV ‘12
• Learning interactive image segmentation, IJCV ‘12
[Figure: left view and right view → depth map]
Local stereo matching: for a rectangular region (patch) around each pixel, check photo-consistency
Fails at discontinuities
Fails at non-fronto-parallel planes
No continuous depth label
Slow
Adaptive support weights [Yoon, CVPR ‘05]
3 continuous parameters (depth + normal) for each pixel
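The three continuous parameters per pixel can be viewed as a disparity plane: a (disparity, normal) sample at a pixel fixes a plane, and the disparity of any other pixel on that plane follows directly. A minimal sketch (function and variable names are my own, not the paper's):

```python
def plane_from_depth_normal(x, y, disparity, normal):
    """Convert a (disparity, normal) sample at pixel (x, y) into plane
    coefficients (a, b, c) such that d(qx, qy) = a*qx + b*qy + c.
    Derived from the plane equation n . ((qx, qy, dq) - (x, y, d)) = 0."""
    nx, ny, nz = normal  # nz must be non-zero (plane not viewed edge-on)
    a = -nx / nz
    b = -ny / nz
    c = (nx * x + ny * y + nz * disparity) / nz
    return a, b, c

def plane_disparity(plane, qx, qy):
    """Disparity induced at pixel (qx, qy) by a slanted plane."""
    a, b, c = plane
    return a * qx + b * qy + c
```

A fronto-parallel surface is the special case normal = (0, 0, 1), which gives a constant disparity over the whole patch.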
Depth map
A red pixel means that a better solution exists in its 4-neighbourhood
1. Random initialization
2. Go through pixels in sequential order:
2a. Consider the solution from the left/top neighbour
2b. Sample around the current solution
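The three steps above can be sketched as follows. This is a minimal toy version with hypothetical names: the label here is a single scalar per pixel, whereas the real method optimizes plane parameters and aggregates patch photo-consistency costs.

```python
import numpy as np

def patchmatch(cost, shape, label_range, num_iters=3, seed=0):
    """Minimal PatchMatch-style sketch for one continuous label per pixel.
    cost(y, x, label) -> float, e.g. a photo-consistency cost (assumed).
    Steps: random init; sequential propagation from left/top neighbour;
    random sampling around the current solution with a shrinking radius."""
    rng = np.random.default_rng(seed)
    h, w = shape
    lo, hi = label_range
    labels = rng.uniform(lo, hi, size=(h, w))          # 1. random init
    best = np.array([[cost(y, x, labels[y, x]) for x in range(w)]
                     for y in range(h)])

    def try_label(y, x, lab):
        c = cost(y, x, lab)
        if c < best[y, x]:                             # keep only improvements
            labels[y, x], best[y, x] = lab, c

    for _ in range(num_iters):
        for y in range(h):                             # 2. sequential order
            for x in range(w):
                if x > 0:                              # 2a. propagation
                    try_label(y, x, labels[y, x - 1])
                if y > 0:
                    try_label(y, x, labels[y - 1, x])
                radius = (hi - lo) / 2.0               # 2b. random search
                while radius > 1e-3:
                    try_label(y, x, labels[y, x] + rng.uniform(-radius, radius))
                    radius /= 2.0
    return labels
```

With a cost that favours one label value, all pixels converge towards it; propagation spreads lucky random guesses across the image, which is why the random initialization works.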
Left image – Reindeer (Middlebury): left and right disparity maps (intermediate step of iteration 1)
Left image – Sawtooth (Middlebury)
Image consists of 3 planes: ~80,000 initial guesses land on the yellow plane (ground-truth disparities)
Randomization is in our favour
No cost volume needed: well suited for large images and large depth range
[Figure: left view and right view → PatchMatch Stereo result]
Add a Markov Random Field over the continuous 3-dimensional label space:
Unary term (photo-consistency)
Pairwise term (local curvature):
Cost = 0: both planes are aligned in 3D
Cost ≠ 0: local curvature or discontinuity
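In symbols, the resulting energy has the usual MRF form (my notation, not the slides'):

```latex
E(u) \;=\; \sum_{p} \underbrace{\rho_p(u_p)}_{\text{photo-consistency}}
\;+\; \lambda \sum_{(p,q)\in\mathcal{N}} \underbrace{\psi(u_p, u_q)}_{\text{local curvature}},
\qquad \psi(u_p, u_q) = 0 \;\Leftrightarrow\; \text{planes } u_p, u_q \text{ are aligned in 3D}.
```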
So far, we have been running with λ = 0
For non-zero λ, with super high-dimensional u:
Gradient descent
Gradient descent + Fusion move
Relaxation + Gradient descent
Simulated Annealing
Continuous Belief Propagation
Operation 1: compute neg-log belief Bs(us)
Operation 2: re-compute message Mt->s
Sequential schedule: messages M1->2, M2->3, M1->4, …
Final output: us* = argmin_us Bs(us)
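On a chain with discrete labels, the two operations reduce to the standard min-sum (neg-log max-product) recursions. A toy sketch, with my own names, on a 1D chain instead of the 4-connected grid and with the quadratic smoothness term from the slides:

```python
import numpy as np

def min_sum_bp_chain(unary, lam, num_sweeps=5):
    """Min-sum BP on a chain with pairwise cost lam * (u_s - u_t)^2.
    unary: (n_nodes, n_labels) array of neg-log unary costs.
    Returns the label index u_s* = argmin_u B_s(u) per node."""
    n, k = unary.shape
    labels = np.arange(k, dtype=float)
    pair = lam * (labels[:, None] - labels[None, :]) ** 2
    msg_l = np.zeros((n, k))  # message into node s from its left neighbour
    msg_r = np.zeros((n, k))  # message into node s from its right neighbour
    for _ in range(num_sweeps):
        for s in range(1, n):            # sequential left-to-right sweep
            # M_{t->s}(u_s) = min_{u_t} [ psi(u_s,u_t) + B_t(u_t) - M_{s->t}(u_t) ]
            b = unary[s - 1] + msg_l[s - 1]
            msg_l[s] = np.min(pair + b[:, None], axis=0)
        for s in range(n - 2, -1, -1):   # right-to-left sweep
            b = unary[s + 1] + msg_r[s + 1]
            msg_r[s] = np.min(pair + b[:, None], axis=0)
    beliefs = unary + msg_l + msg_r      # Operation 1: neg-log belief B_s
    return np.argmin(beliefs, axis=1)    # final output
```

For example, a node with a flat unary cost between two neighbours that strongly prefer the same label adopts that label through the messages.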
[Figure: toy example, source = target shifted by 4.0 + noise; ground truth; unary only: error 0.618; full model: error 0.251; 12x12 discrete labels]
[Figure: source shifted by 4.2 + noise; unary only: error 1.9; full model: error 0.66; 12x12 discrete labels]
[Figure: unary only: error 3.46; discrete labels: error 5.68; 12x12 discrete labels]
Each pixel has a different set of particles.
Comment: we do max-product, hence we may not want to approximate true continuous distributions.
[Diagram: particles of nodes s and t on [0, 1], with neg-log beliefs Bs(us) and Bt(ut)]
Pairwise term: (us - ut)²
[Diagram: the message Mt->s is evaluated at the particle locations of node s]
Sample around current particles.
Final output: us* = argmin_us Bs(us)
[Figure: GT; discrete: error 5.68; random init: energy 47308, error 0.9713; best unary init (144 discrete): energy 42628, error 0.8259]
The message Mt->s is peaked around the particles of node t, since the smoothness term is (us - ut)²
PM idea: sample also at your neighbours’ solutions!
We call this variant of Particle BP PatchMatch BP (PMBP)
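One PMBP-style resampling pass can be sketched like this (hypothetical helper names; particles are scalar labels, one per node as in the 1-particle variant, and `score` stands in for the neg-log belief, which in stereo combines photo-consistency and incoming messages):

```python
import numpy as np

def pmbp_resample(particles, neighbours, score, radius, rng):
    """One PMBP-style pass: for each node, propose new particles both by
    sampling around the node's own current particle (Particle BP) and by
    copying its neighbours' particles (the PatchMatch idea), keeping the
    best-scoring candidate.  score(s, u) -> neg-log belief of u at node s."""
    for s in range(len(particles)):
        candidates = [particles[s],
                      particles[s] + rng.uniform(-radius, radius)]
        candidates += [particles[t] for t in neighbours[s]]  # PatchMatch step
        particles[s] = min(candidates, key=lambda u: score(s, u))
    return particles
```

Because the smoothness term favours us ≈ ut, a neighbour's good particle is an excellent proposal, which is exactly why the PatchMatch sampling step helps.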
[Figure: GT; best unary init: energy 42628, error 0.8259; random init, 1 particle: energy 22593, error 0.3864; random init, 50 particles: energy 21959, error 0.4159]
PatchMatch is a special form of Particle BP:
• λ = 0
• 1 particle per node
• Sample from neighbour nodes
Iterate two steps (in a nutshell):
1) Run full BP until convergence (convex version which solves the LP relaxation)
2) Sample all nodes individually
Highly ranked in the Middlebury table
• PatchMatch stereo, BMVC ‘11 • PatchMatchBP stereo, BMVC ‘12
• SurfaceStereo, ObjectStereo, CVPR ‘10,’11 • SceneStereo, ECCV ‘12
• Learning interactive image segmentation, IJCV ‘12
Ultimate goal: recover geometry, light, material; recognise object instances, attributes; … and do all of that jointly
Theoretical challenges: statistical models of the world and the captured images; combining statistical priors and physical constraints
Practical challenges: robustness; real-time inference; task-driven settings, e.g. robotics
To achieve this: latest machine learning and latest optimization techniques
Assignment of pixels to surfaces
Simple explanation: describe the scene by a few low-degree surfaces (splines, planes). Goal: improved depth estimation.
[Figure: without prior vs. with prior]
Simple explanation: describe the scene by a few objects:
- compact in 3D
- connected in 3D
- each object has a colour model
Goal: improved depth estimation
[Diagram: objects o, depth d]
Simple explanation: describe the scene by a few objects:
- compact in 3D (use bounding box)
- each object has a colour model
- physical constraints
Goals: 1) improved depth estimation, 2) improved object extraction
1) Create proposal pool
2) Rank proposal pool
3) Combine best objects and recognize
Use stereo images
[Figure: object labels boat, sky, water]
Goals:
• Reason in 3D with physical constraints
• Improve depth estimation
[Figure: left input image; object labelling proposals 1 and 2]
Output:
- object labelling
- depth labelling
- object 3D bounding boxes
- object colour distribution
Stereo: photo-consistency
Objects: colour model, prior on number of objects
[Figure: left input image; PatchMatch Stereo result; object mask; depth map]
Physical properties:
Bounding Box tightness
Bounding Box intersection
Bounding Box Gravity
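Collecting the terms listed above, the joint energy over the depth map d and object labelling o has roughly this shape (a schematic paraphrase of the slides, not the paper's exact notation):

```latex
E(d, o) \;=\; \underbrace{E_{\text{photo}}(d)}_{\text{stereo}}
\;+\; \underbrace{E_{\text{colour}}(o) + E_{\#\text{objects}}(o)}_{\text{objects}}
\;+\; \underbrace{E_{\text{tightness}}(d,o) + E_{\text{intersection}}(d,o) + E_{\text{gravity}}(d,o)}_{\text{bounding-box physics}}
```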
Merging (simulated annealing, patchmatch)
Exploration (mean-shift, patchmatch)
Object maps
Multiple Scene Proposals by varying the prior on number of objects
Good rank in Middlebury table
Green: this term is useful
All terms are useful
[Figure: images, ground truth, our labelling; comparison of 2D baseline, Object Stereo, ours, and GT]
Large Scale Train and Test
Real-time
Do full 3D reconstruction (KinectFusion)
Model all physical properties: Light, Material
Use graphics engine for train and test “analysis by synthesis”
• PatchMatch stereo, BMVC ’11 • PatchMatchBP stereo, BMVC ’12
• SurfaceStereo, ObjectStereo, CVPR ’10,’11 • SceneStereo, ECCV ’12
• Learning interactive image segmentation, IJCV ‘12
How much user input shall we use for learning?
[Diagram: training time (weights w, predictions) vs. testing time (prediction); static brush, static trimap]
Goal: User should reach a satisfying result in as few interactions as possible
Define: “interaction” and “satisfying”
[Diagram: human (averaged over 6 users) vs. computer (simulated brush strokes); algorithmic state, suggested action, ground truth, current solution]
What type of user? (novice user, advanced user)
Adjusting weights with the learning curve of the user
Other interactive systems
• PatchMatch stereo, BMVC ’11 • PatchMatchBP stereo, BMVC ’12
• SurfaceStereo, ObjectStereo, CVPR ’10,’11 • SceneStereo, ECCV ’12
• Learning interactive image segmentation, IJCV ‘12