Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

Authors

Shivam R Mhaskar

Nirmesh Shah*

Mohammadi Zaki

Ashishkumar Gudmalwar

Pankaj Wasnik

* External authors

Venue

NAACL-2024

Date

2024

Abstract

A traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules: Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric NMT algorithms are employed to regulate the length of the output text, so that audio and video remain synchronized after dubbing. Previous approaches have focused on aligning the number of characters and words in the source and target language texts of Machine Translation models. Our approach instead aims to align the number of phonemes, as they are closely associated with speech duration. In this paper, we present an isometric NMT system developed using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in source and target language sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to state-of-the-art models on the English-Hindi language pair. Moreover, we propose a student-teacher architecture within our RL framework to maintain a trade-off between phoneme count and translation quality.
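To make the idea of a phoneme-count-based length compliance measure concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's definition: the abstract does not give the exact PCC formula, so the sketch assumes a sentence pair is compliant when the target-to-source phoneme count ratio lies within a tolerance band around 1, and it reports the percentage of compliant pairs. The function names, the 0.2 default tolerance, and the convention of passing precomputed phoneme counts are all hypothetical.

```python
from typing import Sequence, Tuple


def phoneme_count_ratio(source_count: int, target_count: int) -> float:
    """Ratio of target to source phoneme counts (hypothetical helper)."""
    if source_count <= 0:
        raise ValueError("source_count must be positive")
    return target_count / source_count


def compliance_score(
    pairs: Sequence[Tuple[int, int]], tolerance: float = 0.2
) -> float:
    """Percentage of (source, target) phoneme-count pairs whose ratio
    lies within 1 +/- tolerance.

    ASSUMPTION: the tolerance band and its default value are illustrative;
    the abstract does not state the threshold used for the PCC score.
    """
    if not pairs:
        return 0.0
    compliant = sum(
        1
        for src, tgt in pairs
        if abs(phoneme_count_ratio(src, tgt) - 1.0) <= tolerance
    )
    return 100.0 * compliant / len(pairs)


# Example: phoneme counts would come from a phonemizer of your choice
# (not shown here); these numbers are made up for illustration.
example_pairs = [(10, 11), (10, 15), (20, 18), (10, 5)]
print(compliance_score(example_pairs))
```

In an RL setting like the one the abstract describes, a quantity such as the phoneme count ratio could plausibly feed into the reward, but how the reward is actually shaped is not specified on this page.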




Related Publications

VECL-TTS: Voice identity and Emotional style aware Cross-Lingual TTS

Interspeech, 2024
Ashishkumar Gudmalwar, Nirmesh Shah*, Sai Akarsh, Pankaj Wasnik

Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transf…

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

Interspeech, 2024
Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah*, Pankaj Wasnik

Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the speech duration of synthesized speech in such a way that it aligns…

Nonparallel Emotional Voice Conversion for unseen speaker-emotion pairs using dual domain adversarial network Virtual Domain …

ICASSP, 2023
Nirmesh Shah*, Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*

The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emo…

