TRANSCRIPT
University of Amsterdam / Efstratios Gavves
THE MACHINE LEARNING OF TIME
IN VIDEOS
EFSTRATIOS GAVVES
UNIVERSITY OF AMSTERDAM
THE PROBLEM
THE ROLE OF TIME IN SPATIOTEMPORAL SEQUENCES?
What is the difference between time (video) and space (image)?
Image models work well for two reasons
− Convolutions → Images are spatially stationary signals: P_{x+δ} = P_x, over a bounded spatial extent
− Backprop → Full supervision → easy to get millions of labels
Temporal stationarity is hard to guarantee in long & complex videos (without unlimited data)
− Over short ranges, P_{t+δ} ≈ P_t; over long ranges, P_{t+δ+δ+⋯+δ} ≠ P_t, and the temporal extent is unbounded
− (Short) Convolutions are not enough
− Model temporal transitions directly instead
− Quite hard for technical and conceptual reasons
VIDEOGRAPH … OR TIME AS A GRAPH
ONGOING
A. Smeulders, E. Gavves, N. Hussein
COMPLEX ACTIONS
Preparing Breakfast: a long & complex action. Stirring Food: a one-action.
COMPLEX ACTIONS
[Figure: "Cooking a Meal" decomposed into one-actions: get, wash, cook, put]
1. LONG-RANGE: the complex actions of Charades are 30 sec, compared to 5 sec for Kinetics.
2. EXTENT: the one-actions differ in their temporal extents.
3. DEPENDENCY: there is a temporal dependency, albeit weak, between the one-actions.
A YEAR AGO: TIMECEPTION
1. EFFICIENT MODULAR LAYER: grouped conv and concat+shuffle (Grouped Conv, Channel Shuffle) to reduce the computational cost of typical 3D conv.
2. TEMPORAL-ONLY CONV: reduce the complexity of 3D conv by factorizing it into depth-wise separable 1D conv over time (C channels, T time steps, kernel k_t).
3. MULTI-SCALE KERNELS: use different kernel sizes k or dilation rates d to account for the varieties in temporal extents of one-actions: k = {1, 3, 5, 7}, d = {1, 2, 3}.
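The temporal-only factorization above can be sketched in a few lines; this is an illustrative numpy sketch, not the Timeception implementation, and the randomly initialized kernels stand in for learned ones.

```python
import numpy as np

def depthwise_temporal_conv(x, kernels):
    """Depth-wise separable 1D convolution over time: one kernel per channel,
    no mixing across channels or spatial positions.

    x:       (T, C) feature map, one C-dim feature per time step.
    kernels: (C, k) one temporal kernel per channel.
    Returns: (T, C), 'same'-padded convolution along time only.
    """
    T, C = x.shape
    k = kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((T, C))
    for c in range(C):          # each channel is convolved independently
        for t in range(T):
            out[t, c] = xp[t:t + k, c] @ kernels[c]
    return out

def multiscale_temporal_block(x, kernel_sizes=(1, 3, 5, 7), seed=0):
    """Concatenate depth-wise temporal convs with several kernel sizes,
    so branches with different temporal receptive fields coexist."""
    rng = np.random.default_rng(seed)
    C = x.shape[1]
    branches = [depthwise_temporal_conv(x, rng.standard_normal((C, k)) * 0.1)
                for k in kernel_sizes]
    return np.concatenate(branches, axis=1)   # (T, C * len(kernel_sizes))

x = np.random.default_rng(1).standard_normal((16, 8))   # T=16 steps, C=8 channels
y = multiscale_temporal_block(x)
print(y.shape)   # (16, 32): four branches of 8 channels each
```

Each branch touches only one channel at a time along the temporal axis, which is what makes the factorized layer far cheaper than a full 3D convolution of the same receptive field.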
TIMECEPTION ON CHARADES
[Charts: performance (mAP %) vs. number of timesteps for TC and ResNet-2D, and mAP % vs. number of parameters (20M to 40M) for TC, I3D, R3D, and NL]
Results: Processing 1K frames → better accuracies
Insight: longer range temporal relations are important
Question: how to go longer range?
10x longer videos possible
STRETCHING TIME: TIME AS A GRAPH
Nodes: represent key one-actions in the activity
Edges: encode the temporal relationship between one-actions
Sublinear temporal representation
− Nodes are temporal concept-clusters
− Similar/irrelevant frames discarded
− Scales up to hours-long videos (hopefully)
Increased interpretability
− Graphs are discrete
− Straightforward to reason
Graph-based Representation
VIDEOGRAPH
[Pipeline figure: the video v is split over time into segments s_1, s_2, …, s_n; a 3D CNN extracts a feature x_i per segment; node attention compares each feature to learned concepts C, giving activations α (e.g. 0.4, 0.7, 0.1, 0.0, 0.0); graph-embedding layers model the transitions between nodes; an MLP outputs the predictions]
VIDEOGRAPH: GRAPH-EMBEDDING
[Figure: two stacked graph-embedding blocks map the node activations to the output Y; each block applies TimeConv1D, NodeConv1D, ChannelConv3D, and MaxPool to an N × T × HW × C tensor]
Learn a set of latent concept representations ĉ from a randomly initialized c.
Dot product ⊙ for the similarity between a feature x and all latent concepts ĉ, followed by an activation σ.
Reweight all learned concepts c using the activation values α.
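The three steps above can be sketched as follows; this is a minimal numpy illustration of the node-attention idea (dot-product similarity, activation, reweighting), not the VideoGraph code, and the sigmoid activation and sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_attention(x, concepts):
    """Attend one segment feature to a set of latent concepts.

    x:        (D,)   a 3D-CNN segment feature.
    concepts: (K, D) latent concept representations c (randomly
              initialized here, learned end-to-end in practice).
    Returns:  alpha (K,), the activations sigma(x . c_k), and
              nodes (K, D), the concepts reweighted by alpha.
    """
    alpha = sigmoid(concepts @ x)       # dot-product similarity + activation
    nodes = alpha[:, None] * concepts   # reweight each concept by its alpha
    return alpha, nodes

rng = np.random.default_rng(0)
K, D = 5, 16                                   # 5 latent concepts, 16-dim features
concepts = rng.standard_normal((K, D)) * 0.1   # randomly initialized c
x = rng.standard_normal(D)
alpha, nodes = node_attention(x, concepts)
print(alpha.shape, nodes.shape)   # (5,) (5, 16)
```

Concepts with activations near zero contribute almost nothing to the graph, which is how similar or irrelevant frames get discarded into a sublinear representation.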
EXPERIMENTS
Datasets: Breakfast, Charades
TIME-ALIGNED NETS … OR NEURAL LAYERS AS TIME STEPS
VIDEO TIME: PROPERTIES, ENCODERS AND EVALUATION, BMVC 2018
C. Snoek, E. Gavves, A. Ghodrati
WHAT ARE (SOME) PROPERTIES OF TIME?
− Temporal Asymmetry
− Temporal Continuity
− Temporal Causality
− Temporal Redundancy
DOMINANT APPROACHES FOR MODELLING THESE PROPERTIES?
LSTMs learn transitions between subsequent states. 3D convolutions learn spatiotemporal correlations.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
Ji et al. 3D convolutional neural networks for human action recognition. PAMI, 2013.
Tran et al. Learning spatiotemporal features with 3D convolutional networks. ICCV, 2015.
Carreira and Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. CVPR, 2017.
LSTM AND C3D: ARROW OF TIME?
[Figure: LSTM vs. C3D on the arrow-of-time task]
REVISITING RECURRENT NEURAL NETWORKS
Recurrent nets are highly sensitive dynamical systems (Pascanu and Bengio, 2013)
− Even with highly discriminative symbolic (one-hot vector) inputs
− Gradients very sensitive to initialization → Poor learning! → No generalization
Visual features over time, even the best ones, are:
− much noisier
− much less discriminative
− much more redundant
Learning an LSTM on videos is orders of magnitude harder
− Chaotic regime → no useful gradients → no learning
− Forward and backward (or even shuffled) LSTMs score the same accuracy on arrow of time
Basically, with high-dimensional noisy inputs LSTMs do not do sequence modelling but some weird entangled pooling
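The sensitivity argument can be made concrete with a toy linear RNN: because the same weight matrix is applied at every step (an iterated map), backpropagated gradient norms scale roughly like the spectral radius raised to the number of steps. A minimal numpy sketch, with arbitrary sizes and a made-up radius of 1.1 vs. 0.9:

```python
import numpy as np

def backprop_norms(W, T):
    """Gradient norms of a linear RNN h_{t+1} = W h_t backpropagated through
    T shared-weight steps: the same matrix is applied T times, so the norm
    scales roughly like rho(W)**T (rho = spectral radius)."""
    g = np.ones(W.shape[0])   # gradient arriving at the last state
    norms = []
    for _ in range(T):
        g = W.T @ g           # one step of backprop through the recurrence
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
n, T = 32, 50
A = rng.standard_normal((n, n)) / np.sqrt(n)
rho = np.max(np.abs(np.linalg.eigvals(A)))
explode = backprop_norms(A / rho * 1.1, T)   # spectral radius 1.1: norms blow up
vanish = backprop_norms(A / rho * 0.9, T)    # spectral radius 0.9: norms die out
print(explode[-1], vanish[-1])
```

A tiny change in initialization flips the regime, which is one way to see why gradients are "very sensitive to initialization" in recurrent nets.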
PROPOSAL: TIME-ALIGNED NEURAL NETS
ConvNets cope much better with vanishing and exploding gradients, and with noisy and redundant inputs
− No parameter sharing → no iterated maps → no chaotic regime
Moreover, the premise of LSTM parameter sharing is an infinite Markov chain
In practice, however, we chop it off at T steps → like a ConvNet with T layers
Idea: why not flip the ConvNet to align the layers with time steps?
Hypothesis: ConvNets can handle vanishing/exploding gradients and noisy/redundant inputs because they do not share parameters.
No vanishing/exploding gradients, no problems with noisy and redundant inputs
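The flipped-ConvNet idea can be sketched as a forward pass in which layer index equals time step; this is an illustrative numpy sketch under assumed shapes, not the Time-Aligned DenseNet itself.

```python
import numpy as np

def time_aligned_forward(frames, W_h, W_x):
    """Forward pass of a time-aligned net: layer t has its OWN weights
    (no sharing, hence no iterated map) and fuses the running representation
    with the frame feature of its own time step.

    frames:   (T, D) per-frame features (e.g. from a 2D CNN).
    W_h, W_x: lists of T (D, D) weight matrices, one pair per layer/time step.
    """
    h = np.zeros(frames.shape[1])
    for t, x_t in enumerate(frames):                   # layer index == time step
        h = np.maximum(W_h[t] @ h + W_x[t] @ x_t, 0)   # ReLU layer t
    return h

rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.standard_normal((T, D))
W_h = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(T)]  # distinct per layer
W_x = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(T)]
h = time_aligned_forward(frames, W_h, W_x)
print(h.shape)   # (16,)
```

Because each step has distinct parameters, the backward pass multiplies T different matrices rather than the same one T times, so the chaotic iterated-map regime of shared-weight RNNs does not arise.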
RECHECKING ARROW OF TIME
Time-Aligned DenseNet gives much cleaner temporal clusters
Conclusion: poor temporal modelling is likely due to hard, and thus unsuccessful, optimization
SPIKING NEURAL NETS … OR EVENT-DRIVEN TIME
ICLR 2017, 2018, 2019
M. Welling, E. Gavves, P. O’Connor
MOTIVATION
Most sensory data is temporally redundant
For today’s models temporal redundancy is a bug
For spiking models temporal redundancy is a feature
Great promise of human brain-like efficiency
Unfortunately, spiking networks are still in the pre-alpha release stage :P
− How to backprop through spikes is not clear
− Also, in theory one must backprop to infinity, which clearly creates lots of problems
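One common workaround for "how to backprop through spikes is not clear" is a surrogate gradient: keep the hard, non-differentiable spike in the forward pass but pretend, in the backward pass, that it was a steep sigmoid. A minimal numpy sketch (the threshold and steepness beta are illustrative choices, not values from any of the papers above):

```python
import numpy as np

def spike(v, threshold=1.0):
    """Non-differentiable spiking nonlinearity: a neuron fires iff its
    membrane potential v crosses the threshold. Its true derivative is
    zero almost everywhere, so plain backprop learns nothing."""
    return (v > threshold).astype(float)

def surrogate_spike_grad(v, threshold=1.0, beta=5.0):
    """Surrogate gradient: the derivative of a steep sigmoid centered at
    the threshold, used in place of the step's derivative during backprop."""
    s = 1.0 / (1.0 + np.exp(-beta * (v - threshold)))
    return beta * s * (1.0 - s)

v = np.array([0.2, 0.9, 1.1, 3.0])   # membrane potentials
print(spike(v))                       # fires only for 1.1 and 3.0
print(surrogate_spike_grad(v))        # largest near the threshold
```

The surrogate is largest for neurons hovering near the threshold, exactly the ones whose firing decision a small weight update can still change.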
AND IF YOU DON’T BELIEVE IT: YOU ONLY NEED MNIST EXPERIMENTS TO PUBLISH
WHAT IS MISSING & CHALLENGES?
Too much focus on image (space-only) models; moving from images to videos brings a curse of problem dimensionality
[Figure: the spatiotemporal problem space ("coordinate system") spans Images and Videos, at the intersection of Deep Learning, Machine Learning, Computer Vision, and Temporal Learning & Dynamics]
#1. Transition to models focused on the spatiotemporal
This requires a deeper understanding of the role of time in spatiotemporal sequences:
Role of Time == long spatiotemporal sequences == temporal non-stationarity / no labels
Maybe time should be integrated in the neural network structure, not the neural network input/output
#2. Tighter community and better communication channels
This initiative is a great step forward
#3. Balanced datasets? Video is not only Action. Trade-off between
(i) larger datasets → better performance
(ii) larger datasets → hard to run and compare equally
Maybe having a "temporal property" spatiotemporal decathlon?