total capture - cvssp · 2017-10-03 · related work tuesday, 03 october 2017 6 wei et al....

26
Total Capture 3D Human Pose Estimation Fusing Video and Inertial Sensors M. Trumble, A. Gilbert, C. Malleson, A. Hilton and J. Collomosse Centre for Vision, Speech and Signal Processing Tuesday, 03 October 2017 1

Upload: others

Post on 17-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Total Capture3D Human Pose Estimation Fusing Video and Inertial Sensors

M. Trumble, A. Gilbert, C. Malleson, A. Hilton and J. CollomosseCentre for Vision, Speech and Signal Processing

Tuesday, 03 October 2017 1

Page 2: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Motivation

Tuesday, 03 October 2017 2

Image credit: Electronic Arts

Page 3: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Related Work

Tuesday, 03 October 2017 6

Wei et al. Convolutional Pose Machines, CVPR 2016

Pavlakos et al. Harvesting Multiple Views for Marker-less 3D Human Pose Annotations, CVPR 2017

Andrews et al. Real-time Physics-based Motion Capture with Sparse Sensors, CVMP 2016

Page 4: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Contributions

Tuesday, 03 October 2017 7

• Accurate 3D human pose estimation

• 3D Convolutional Neural Network

• Fusion of video and IMUs

• New multi-modal dataset

Page 5: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Contributions

Total Capture Dataset

Tuesday, 03 October 2017 8

• 4 x 6 metre capture volume

• 8 x 1080p60 video cameras

• 13 IMU sensors

• Vicon ground truth labelling

• 5 subjects x 12 sequences

http://cvssp.org/data/totalcapture

Page 6: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Contributions

Total Capture Dataset

Tuesday, 03 October 2017 9

Xsens MTw Awinda wireless motion

trackers

• Calibrated orientation and

acceleration per unit at 60Hz

Vicon motion capture for testing

• Solved skeleton provided in BVH

format, also 60Hz

http://cvssp.org/data/totalcapture

Page 7: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Overview

Tuesday, 03 October 2017 10

Page 8: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Volumetric Pose Estimation – Probabilistic Visual Hull (PVH)

Tuesday, 03 October 2017 11

• Geometric proxy constructed from MVV

• Capture volume decimated into 1cm3 grid

• Voxels assigned probability of occupancy

• Downsampled to 30x30x30 grid for CNN

0 1

Page 9: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Volumetric Pose Estimation – 3D CNN Training

Tuesday, 03 October 2017 12

• Trained with stochastic gradient descent to minimize mean squared error

over 26 3D joint positions

• 100K unique training poses / 50K test from Total Capture dataset

• Augmented during training with random rotation around vertical axis

Page 10: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Inertial Pose Estimation

Tuesday, 03 October 2017 13

• 13 inertial measurement units (IMUs)

• Arms and legs, feet, head, sternum and pelvis

• Manual calibration to an initial T-pose

• Joint angles inferred by forward kinematics

Page 11: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Inertial Pose Estimation – forward kinematics

Tuesday, 03 October 2017 14

Assume fixed relative orientation between

each IMU 𝑘 ∈ 1,13 and bone: 𝑅𝑖𝑏𝑘

Global bone orientation 𝑹𝒃𝒌 = (𝑹𝒊𝒃

𝒌 )−𝟏 𝑹𝒊𝒘𝒌 𝑹𝒊𝒎

𝒌

where 𝑅𝑖𝑤𝑘 is IMU reference frame in global coordinates

and local IMU measurement 𝑅𝑖𝑚𝑘

Page 12: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Inertial Pose Estimation – forward kinematics

Tuesday, 03 October 2017 15

Local joint rotation 𝑹𝒉𝒊 = 𝑹𝒃

𝒊 (𝑹𝒃𝒑𝒂𝒓(𝒊)

)−𝟏

Inferred from parent bone, 𝑝𝑎𝑟(𝑖)by forward kinematics beginning at root node

Page 13: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Temporal Sequence Prediction (TSP)

Tuesday, 03 October 2017 16

• Long Short Term Memory RNN (LSTM)

• Exploits temporal nature of motion

• Independent model for each modality

• Learns joint locations based on

previous 5 frames

Page 14: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Temporal Sequence Prediction (TSP) – LSTM details

Tuesday, 03 October 2017 17

Memory cell,

𝒄𝒕 = 𝒇𝒕 ∘ 𝒄𝒕−𝟏 + 𝒊𝒕 ∘ 𝝈𝒉(𝑾𝒙𝒙𝒕 + 𝑼𝒄𝒉𝒕−𝟏 + 𝒃𝒄)

Input vector 𝑥𝑡, output vector ℎ𝑡 = 𝑜𝑡 ∘ 𝜎ℎ(𝑐𝑡),learnt weights 𝑊 and 𝑈

sigmoid function 𝜎𝑔,hyperbolic tangent 𝜎ℎ,

vector constant 𝑏

Page 15: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Temporal Sequence Prediction (TSP) – LSTM details

Tuesday, 03 October 2017 18

Memory cell,

𝒄𝒕 = 𝒇𝒕 ∘ 𝒄𝒕−𝟏 + 𝒊𝒕 ∘ 𝝈𝒉(𝑾𝒙𝒙𝒕 + 𝑼𝒄𝒉𝒕−𝟏 + 𝒃𝒄)

Input gate 𝒊𝒕 = 𝝈𝒈(𝑾𝒊𝒙𝒕 + 𝑼𝒊𝒉𝒕−𝟏 + 𝒃𝒊)

Input vector 𝑥𝑡, output vector ℎ𝑡 = 𝑜𝑡 ∘ 𝜎ℎ(𝑐𝑡),learnt weights 𝑊 and 𝑈

sigmoid function 𝜎𝑔,hyperbolic tangent 𝜎ℎ,

vector constant 𝑏

Page 16: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Temporal Sequence Prediction (TSP) – LSTM details

Tuesday, 03 October 2017 19

Memory cell,

𝒄𝒕 = 𝒇𝒕 ∘ 𝒄𝒕−𝟏 + 𝒊𝒕 ∘ 𝝈𝒉(𝑾𝒙𝒙𝒕 + 𝑼𝒄𝒉𝒕−𝟏 + 𝒃𝒄)

Input vector 𝑥𝑡, output vector ℎ𝑡 = 𝑜𝑡 ∘ 𝜎ℎ(𝑐𝑡),learnt weights 𝑊 and 𝑈

sigmoid function 𝜎𝑔,hyperbolic tangent 𝜎ℎ,

vector constant 𝑏

Forget gate 𝒇𝒕 = 𝝈𝒈(𝑾𝒇𝒙𝒕 +𝑼𝒇𝒉𝒕−𝟏 + 𝒃𝒇)

Page 17: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Temporal Sequence Prediction (TSP) – LSTM details

Tuesday, 03 October 2017 20

Memory cell,

𝒄𝒕 = 𝒇𝒕 ∘ 𝒄𝒕−𝟏 + 𝒊𝒕 ∘ 𝝈𝒉(𝑾𝒙𝒙𝒕 + 𝑼𝒄𝒉𝒕−𝟏 + 𝒃𝒄)

Input vector 𝑥𝑡, output vector ℎ𝑡 = 𝑜𝑡 ∘ 𝜎ℎ(𝑐𝑡),learnt weights 𝑊 and 𝑈

sigmoid function 𝜎𝑔,hyperbolic tangent 𝜎ℎ,

vector constant 𝑏

Output gate 𝒐𝒕 = 𝝈𝒈(𝑾𝒐𝒙𝒕 + 𝑼𝒐𝒉𝒕−𝟏 + 𝒃𝒐)

Page 18: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation – video branch

Human 3.6M

Tuesday, 03 October 2017 21

PVH Only PVH + TSP Ground Truth Source

Page 19: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation – video branch

Human 3.6M

Tuesday, 03 October 2017 22

Average per joint error in millimetres

Page 20: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Pipeline

Fusion layer

Tuesday, 03 October 2017 23

Page 21: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation – full pipeline

Total Capture Dataset – Full Pipeline

Tuesday, 03 October 2017 24

PVH + TSP IMU + TSP Fusion Source

Page 22: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation – full pipeline

Total Capture Dataset

Tuesday, 03 October 2017 25

Average per joint error in millimetres

Page 23: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation – full pipeline

Total Capture Dataset – Full Pipeline

Tuesday, 03 October 2017 26

PVH + TSP IMU + TSP Fusion Source

Page 24: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation

Training data volume PVH resolution

Tuesday, 03 October 2017 27

Training data randomly sampledfrom ~100k MVV frames

16x16x16 48x48x48

Page 25: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Evaluation

Camera ablation study

Tuesday, 03 October 2017 28

4 cameras 6 cameras 8 cameras

Relative accuracy change (mm/joint)

Page 26: Total Capture - CVSSP · 2017-10-03 · Related Work Tuesday, 03 October 2017 6 Wei et al. Convolutional Pose Machines, CVPR 2016 Pavlakos et al. Harvesting Multiple Views for Marker-less

Conclusion

• Novel 3D human pose estimation fusing MVV

and IMU signals

• Demonstrates high accuracy and

complementary nature of the two modalities

• New hybrid MVV dataset including video, IMU

and 3D ground truth

Tuesday, 03 October 2017 29

http://cvssp.org/data/totalcapture