VECL-TTS: Voice identity and Emotional style aware Cross-Lingual TTS

Ashishkumar Gudmalwar

Nirmesh Shah*

Sai Akarsh

Pankaj Wasnik

* External authors

INTERSPEECH September 2024

Abstract

Despite significant advances in Text-to-Speech (TTS) systems, their use in automatic dubbing remains limited. Dubbing requires extracting voice identity and emotional style from reference speech in a source language and transferring both to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to further enhance the quality of synthesized speech. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
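
The abstract names content and style consistency losses but does not give their form. As an illustration only, the sketch below shows one common way such a consistency term could be expressed: a cosine distance between an embedding of the reference speech and an embedding of the synthesized speech, combined as a weighted sum. The function names, the choice of cosine distance, and the weights are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def combined_consistency_loss(ref_style, synth_style,
                              ref_content, synth_content,
                              style_weight=1.0, content_weight=1.0):
    """Hypothetical weighted sum of a style term and a content term.

    Each term compares an embedding of the reference with an embedding of
    the synthesized output. The weights are placeholders (assumption).
    """
    style_term = cosine_distance(ref_style, synth_style)
    content_term = cosine_distance(ref_content, synth_content)
    return style_weight * style_term + content_weight * content_term
```

With this formulation each term is zero when the two embeddings point in the same direction and grows as they diverge, so minimizing the sum would push the synthesized speech toward the style and content of the reference. How the embeddings are produced is left open here.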
