Spatio-Temporal Convolution-Attention Video Network
Ali Diba*
Vivek Sharma
Mohammad M. Arzani*
Luc Van Gool*
* External authors
ICCV 2023
Abstract
In this paper, we present a hierarchical neural network based on convolutional and attention modeling for short- and long-range video reasoning, called the Spatio-Temporal Convolution-Attention Video Network (STCA). The proposed method learns appearance and temporal cues in two stages with different temporal depths, so that both short-range and long-range video sequences are fully exploited. It combines the strengths of convolutional and attention networks for spatial and temporal modeling in a new form of spatio-temporal sequence learning: a mixer architecture that retains the robust inductive biases of convolution (such as translational equivariance) while gaining the generalization and sequence-modeling ability of transformers to handle dynamic variations in videos. The network exploits spatio-temporal information in two stages: (1) a Short Clip Stage (SCS) and (2) a Long Video Stage (LVS). SCS captures spatio-temporal cues in short-range video clips, applying 3D convolutions and multi-headed self-attention to the frames of each clip; because self-attention is restricted to individual short clips, its quadratic cost remains tractable. LVS then performs long-range temporal reasoning over the representations (i.e., tokens) produced by SCS, using variants of long-range temporal modeling mechanisms to learn compact and robust global temporal representations of the entire video. We conduct experiments on six challenging video recognition datasets: HVU, Kinetics (400, 600, 700), Something-Something V2, and the Long Video Understanding dataset. Through extensive evaluations and ablation studies, we show strong performance in comparison to state-of-the-art methods on these datasets.
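To make the two-stage design concrete, below is a minimal PyTorch sketch of the SCS/LVS decomposition described in the abstract. The module names (ShortClipStage, LongVideoStage, STCA), layer sizes, number of heads, and the mean-pooling scheme are illustrative assumptions, not the paper's exact architecture; the sketch only shows how per-clip convolution-plus-attention features can feed a long-range temporal transformer.

```python
# Illustrative sketch of the two-stage STCA design. All sizes and module
# choices are assumptions for demonstration; the published architecture
# may differ in depth, pooling, and attention variants.
import torch
import torch.nn as nn


class ShortClipStage(nn.Module):
    """SCS: 3D convolutions + multi-head self-attention within a short clip."""

    def __init__(self, in_channels=3, dim=256, num_heads=4):
        super().__init__()
        # 3D conv stem extracts local spatio-temporal features from one clip.
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(dim),
            nn.ReLU(inplace=True),
            nn.Conv3d(dim, dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.BatchNorm3d(dim),
            nn.ReLU(inplace=True),
        )
        # Attention runs only within the clip, so the quadratic cost of
        # self-attention is bounded by the small number of clip tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        feat = self.conv3d(clip)                  # (B, D, T', H', W')
        tokens = feat.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D)
        tokens = self.norm(tokens)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + attn_out                # residual connection
        return tokens.mean(dim=1)                 # one token per clip


class LongVideoStage(nn.Module):
    """LVS: long-range temporal self-attention over per-clip tokens."""

    def __init__(self, dim=256, num_heads=4, num_layers=2, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip_tokens):  # clip_tokens: (B, num_clips, D)
        video_repr = self.encoder(clip_tokens).mean(dim=1)  # global pooling
        return self.head(video_repr)


class STCA(nn.Module):
    """Compose SCS and LVS: encode each short clip, then reason globally."""

    def __init__(self, num_classes=400):
        super().__init__()
        self.scs = ShortClipStage()
        self.lvs = LongVideoStage(num_classes=num_classes)

    def forward(self, video):  # video: (B, num_clips, C, T, H, W)
        b, n = video.shape[:2]
        clips = video.flatten(0, 1)              # (B*n, C, T, H, W)
        tokens = self.scs(clips).view(b, n, -1)  # (B, n, D)
        return self.lvs(tokens)                  # (B, num_classes)


if __name__ == "__main__":
    model = STCA(num_classes=400)
    video = torch.randn(2, 4, 3, 8, 64, 64)  # 2 videos, 4 clips of 8 frames
    print(model(video).shape)                # torch.Size([2, 400])
```

Splitting the video into clips before attention is the key efficiency choice here: attention cost grows quadratically in token count, so restricting it to short clips in SCS and attending only over one token per clip in LVS keeps both stages cheap.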