Learning Joint Spatial-Temporal Transformations for Video ... · STTN consists of 1) a frame-level encoder, 2) multi-layer multi-head spatial-temporal transformers and 3) a frame-level