SilentCipher: Deep Audio Watermarking

Interspeech, 2024
Mayank Kumar Singh*, Naoya Takahashi, Weihsiang Liao, Yuki Mitsufuji

In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional…

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

NeurIPS, 2023
Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama*, Naoya Takahashi, Shusuke Takahashi*, Tuomas Virtanen*, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper pro…

Iteratively Improving Speech Recognition and Voice Conversion

Interspeech, 2023
Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*

Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, for low-resource domains, training a high-quality ASR model remains a challenging task…

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

Ryosuke Sawata*, Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which…

Nonparallel Emotional Voice Conversion for unseen speaker-emotion pairs using dual domain adversarial network Virtual Domain …

ICASSP, 2023
Nirmesh Shah*, Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*

The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emo…

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

ICASSP, 2023
Kin Wai Cheuk, Ryosuke Sawata*, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi*, Dorien Herremans*, Yuki Mitsufuji

In this paper, we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative t…

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

ICASSP, 2023
Naoya Takahashi, Mayank Kumar Singh*, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciations. In this work, we…

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

ICLR, 2023
Hao-Wen Dong*, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley*, Taylor Berg-Kirkpatrick*

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query…

