Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Authors: Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik

Venue: CVPRW-24

Date: 2025

Abstract

Weakly supervised video anomaly detection (WS-VAD) has recently emerged as a research direction for identifying anomalous events, such as violence and nudity, in videos using only video-level labels. However, the task poses substantial challenges, including handling imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect such anomalies. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby improving feature separation. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
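To make the hyperbolic idea behind HLGAtt more concrete, the following is a minimal sketch of attention computed from distances in the Lorentz (hyperboloid) model. It is not the paper's implementation: the unit curvature, the function names, and the scoring rule (a softmax over negative geodesic distances) are all assumptions chosen for illustration, and the abstract does not specify how HLGAtt computes its attention weights.

```python
import math

# Illustrative sketch only: curvature is fixed to -1 (unit hyperboloid), and
# all function names and the scoring rule are assumptions, not the paper's
# HLGAtt module.

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean feature vector onto the unit hyperboloid
    using the exponential map at the origin (1, 0, ..., 0)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the argument is clamped to
    at least 1 to guard against floating-point round-off."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def attention_weights(query, keys):
    """Hypothetical scoring rule: softmax over negative geodesic distances,
    so keys closer to the query on the hyperboloid receive larger weights."""
    scores = [-lorentz_distance(query, k) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Usage: one query snippet feature attending over two neighbour features.
query = exp_map_origin([0.3, 0.4])
neighbours = [exp_map_origin([0.3, 0.5]), exp_map_origin([-1.0, 2.0])]
weights = attention_weights(query, neighbours)
```

Because hyperbolic distance grows rapidly away from the origin, features that sit at different depths of a hierarchy end up far apart, which is the usual motivation for using this geometry when separating normal from abnormal representations.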

Related Publications

Open-Set Object Detection By Aligning Known Class Representations

WACV, 2025
Vishal Chudasama, Naoyuki Onoe*, Pankaj Wasnik, Hiran Sarkar, Vineeth N Balasubramanian

Open-Set Object Detection (OSOD) has emerged as a contemporary research direction to address the detection of unknown objects. Recently, a few works have achieved remarkable performance in the OSOD task by employing contrastive clustering to separate unknown classes. In contra…

Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization

ICASSP, 2025
Pankaj Wasnik, Kumud Tripathi, Raj Gothi

Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enha…

EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion

AAAI, 2025
Ashishkumar Gudmalwar, Nirmesh Shah*, Pankaj Wasnik, Ishan Biyani, Rajiv R. Shah

Emotional Voice Conversion (EVC) aims to convert the discrete emotional state from the source emotion to the target for a given speech utterance while preserving linguistic content. In this paper, we propose regularizing emotion intensity in the diffusion-based EVC frame…
