Naoya Takahashi
Publications
LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
NEURIPS, 2025 | Mayank Kumar Singh*, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji
This paper presents a novel approach to deter unauthorized deepfakes and enable user tracking in generative models, even when the user has full access to the model parameters, by integrating key-based model authentication with watermarking techniques. Our method involves pro...
The whole is greater than the sum of its parts: improving music source separation by bridging networks
EURASIP, 2024 | Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation...
SilentCipher: Deep Audio Watermarking
INTERSPEECH, 2024 | Mayank Kumar Singh*, Naoya Takahashi, Yuki Mitsufuji, Wei-Hsiang Liao
In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional...
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
NEURIPS, 2023 | Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama*, Naoya Takahashi, Shusuke Takahashi*, Tuomas Virtanen*, Yuki Mitsufuji
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded by a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper pro...
Iteratively Improving Speech Recognition and Voice Conversion
INTERSPEECH, 2023 | Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*
Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models for ensuring linguistic consistency between source and converted samples. However, for low-data-resource domains, training a high-quality ASR remains a challenging task...
The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation
IEEE TASLP, 2023 | Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which...
Nonparallel Emotional Voice Conversion for unseen speaker-emotion pairs using dual domain adversarial network Virtual Domain …
ICASSP, 2023 | Nirmesh Shah*, Mayank Kumar Singh*, Naoya Takahashi, Naoyuki Onoe*
The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emo...
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
ICASSP, 2023 | Kin Wai Cheuk, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi*, Dorien Herremans*, Yuki Mitsufuji
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative t...
Hierarchical Diffusion Models for Singing Voice Neural Vocoder
ICASSP, 2023 | Naoya Takahashi, Mayank Kumar Singh*, Yuki Mitsufuji
Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciation. In this work, we...
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
ICLR, 2023 | Hao-Wen Dong*, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley*, Taylor Berg-Kirkpatrick*
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query...