EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion

Ashishkumar Gudmalwar

Nirmesh Shah*

Pankaj Wasnik

Ishan Biyani

Rajiv R. Shah

* External authors

AAAI-25

2025

Abstract

Emotional Voice Conversion (EVC) aims to convert the discrete emotional state of a given speech utterance from the source emotion to the target emotion while preserving its linguistic content. In this paper, we propose regularizing emotion intensity in a diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels, which often leads to inept style manipulation and degraded quality. In contrast, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction vector. The updated embeddings can then be fused into the reverse diffusion process to generate speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in a diffusion-based EVC framework, which is the first work of its kind. The effectiveness of the proposed method is demonstrated against state-of-the-art (SOTA) baselines in subjective and objective evaluations for English and Hindi.
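The idea of modifying an emotion embedding from a target intensity and a direction vector can be illustrated with a small sketch. This is not the authors' DVM procedure, which the abstract does not specify. It assumes, purely for illustration, that the direction is the normalized difference between two mean embeddings and that intensity acts as a scalar step size. The function names, the toy 3-dimensional vectors, and the intensity values are all hypothetical.

```python
import math

def unit_direction(source_mean, target_mean):
    """Unit vector pointing from the source-emotion mean to the target-emotion mean.

    Assumption: the direction is a simple difference of means. The paper's
    unsupervised DVM may derive its direction vectors differently.
    """
    diff = [t - s for s, t in zip(source_mean, target_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("source and target means coincide; direction is undefined")
    return [x / norm for x in diff]

def shift_embedding(embedding, direction, intensity):
    """Move an emotion embedding along `direction`, scaled by `intensity`."""
    return [e + intensity * d for e, d in zip(embedding, direction)]

# Toy 3-dimensional embeddings (made-up numbers, for illustration only).
neutral_mean = [0.0, 0.0, 0.0]
happy_mean = [3.0, 4.0, 0.0]
direction = unit_direction(neutral_mean, happy_mean)

utterance_embedding = [1.0, 1.0, 1.0]
weak = shift_embedding(utterance_embedding, direction, 0.5)    # small step toward "happy"
strong = shift_embedding(utterance_embedding, direction, 2.0)  # larger step toward "happy"
```

In this sketch a larger intensity moves the embedding further along the same direction. In the framework the abstract describes, the shifted embedding would then condition the reverse diffusion process, a step that is not modeled here.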

