Naoya Takahashi
Publications
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
NEURIPS, 2025 | Mayank Kumar Singh*, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji
This paper presents a novel approach to deter unauthorized deepfakes and enable user tracking in generative models, even when the user has full access to the model parameters, by integrating key-based model authentication with watermarking techniques. Our method involves pro...
The whole is greater than the sum of its parts: improving music source separation by bridging networks
EURASIP, 2024 | Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation...
SilentCipher: Deep Audio Watermarking
INTERSPEECH, 2024 | Mayank Kumar Singh*, Naoya Takahashi, Yuki Mitsufuji, Wei-Hsiang Liao
In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional...
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
NEURIPS, 2023 | Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama*, Naoya Takahashi, Shusuke Takahashi*, Tuomas Virtanen*, Yuki Mitsufuji
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded by a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper pro...
Iteratively Improving Speech Recognition and Voice Conversion
INTERSPEECH, 2023 | Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*
Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models for ensuring linguistic consistency between source and converted samples. However, for low-data-resource domains, training a high-quality ASR remains a challenging task...
The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation
IEEE TASLP, 2023 | Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which...
Nonparallel Emotional Voice Conversion for unseen speaker-emotion pairs using dual domain adversarial network Virtual Domain …
ICASSP, 2023 | Nirmesh Shah*, Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*
The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emo...
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
ICASSP, 2023 | Kin Wai Cheuk, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi*, Dorien Herremans*, Yuki Mitsufuji
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative t...
Hierarchical Diffusion Models for Singing Voice Neural Vocoder
ICASSP, 2023 | Naoya Takahashi, Mayank Kumar Singh*, Yuki Mitsufuji
Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciation. In this work, we...
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
ICLR, 2023 | Hao-Wen Dong*, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley*, Taylor Berg-Kirkpatrick*
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query...