Authors

* External authors

Venue

Date

Share

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

Neha Sahipjohn

Ashishkumar Gudmalwar

Nirmesh Shah*

Pankaj Wasnik

* External authors

INTERSPEECH September 2024

2024

Abstract

Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the speech duration of synthesized speech in such a way that it aligns well with the speakers lip movements given in the reference video even when the spoken text is different or in a different language. To accomplish this, we propose to utilize cross-modal attention techniques in a pre-trained GPT-based TTS. We combine linguistic tokens from text, speaker identity tokens via a voice cloning network, and video tokens via a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. Also, the proposed method achieves improved lip sync and naturalness compared to the SOTAs for the same language but different text (i.e., non-parallel) and the different language, different text (i.e., cross-lingual) scenarios.

Related Publications

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Interspeech, 2026
Ishan Biyani, Nirmesh Shah*, Ashishkumar Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah

Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed. However, they still retain tonal patter…

Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation

AAAI, 2025
Kumud Tripathi, Aditya Srinivas Menon, Aman Gupta, Raj Prakash Gohil, Pankaj Wasnik

The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous work…

In-Domain African Languages Translation Using LLMs and Multi-armed Bandits

ACL, 2025
Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik

Neural Machine Translation (NMT) systems face significant challenges when working with low-resource languages, particularly in domain adaptation tasks. These difficulties arise due to limited training data and suboptimal model generalization, As a result, selecting an opti- …

  • HOME
  • Publications
  • DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.