Yuhta Takida
Publications
PAVAS: Physics-Aware Video-to-Audio Synthesis
CVPR, 2026 | Oh Hyun-Bin*, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh*, Yuki Mitsufuji
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds...
Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models
ICLR, 2026 | Kevin Rojas*, Ye He*, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao*
Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete ...
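For orientation, a minimal sketch of the standard CFG rule applied to categorical logits, as commonly done in masked discrete diffusion; this is the generic heuristic the paper analyzes, not its proposed corrections (the guidance weight w and the two-model setup are the usual convention):

    import numpy as np

    def cfg_logits(cond_logits, uncond_logits, w):
        # w = 0: unconditional; w = 1: conditional; w > 1: amplify the condition
        return uncond_logits + w * (cond_logits - uncond_logits)

    def sample_tokens(logits, rng):
        # Sample one token per position from the guided categorical distributions.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        return np.array([rng.choice(p.shape[-1], p=p) for p in probs])

    rng = np.random.default_rng(0)
    cond = rng.normal(size=(4, 10))    # (positions, vocab) logits from the conditional pass
    uncond = rng.normal(size=(4, 10))  # logits from the unconditional (null-token) pass
    tokens = sample_tokens(cfg_logits(cond, uncond, w=2.0), rng)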
SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator
ICLR, 2026 | Yuhta Takida, Satoshi Hayakawa*, Takashi Shibuya, Masaaki Imaizumi*, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji
Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and c...
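As a hedged illustration of what a mismatching-aware objective can look like, here is the classic matching-aware conditional discriminator loss (cf. Reed et al., 2016), not SONA's actual formulation: matched real pairs count as real, while mismatched real pairs and generated samples count as fake. The bilinear D is a toy stand-in:

    import torch
    import torch.nn.functional as F

    def matching_aware_d_loss(D, x_real, x_fake, c):
        c_mismatch = c[torch.randperm(len(c))]                    # shuffle conditions in-batch
        loss_real = F.relu(1 - D(x_real, c)).mean()               # matched pair -> real
        loss_mismatch = F.relu(1 + D(x_real, c_mismatch)).mean()  # mismatched pair -> fake
        loss_fake = F.relu(1 + D(x_fake, c)).mean()               # generated sample -> fake
        return loss_real + loss_mismatch + loss_fake

    D = lambda x, c: (x * c).sum(dim=1)  # toy bilinear score, for illustration only
    loss = matching_aware_d_loss(D, torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16))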
Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
ICLR, 2026 | Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon*, Yuki Mitsufuji
Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), ...
Concept-TRAK: Understanding How Diffusion Models Learn Concepts through Concept-Level Attribution
ICLR, 2026 | Yonghyun Park*, Chieh-Hsin Lai, Satoshi Hayakawa*, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji
While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to...
Large-Scale Training Data Attribution for Music Generative Models via Unlearning
NEURIPS, 2025 | Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji
This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed to the generation of a particular output from a specific mod...
G2D2: Gradient-Guided Discrete Diffusion for Image Inverse Problem Solving
TMLR, 2025 | Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon*, Yuki Mitsufuji
Recent literature has effectively leveraged diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete co...
Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
ICCV, 2025 | Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji
Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl...
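For reference, a minimal sketch of the vanilla LoRA update this line of work builds on: a frozen weight W plus a trainable low-rank correction scaled by alpha/r. The paper's transformed variant adds tensor-decomposition-based transforms on top of such updates; that part is not shown here:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, d_in, d_out, r=4, alpha=8.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen base
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable down-projection
            self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable up-projection, zero-init
            self.scale = alpha / r

        def forward(self, x):
            return x @ (self.weight + self.scale * self.B @ self.A).T

    layer = LoRALinear(64, 64)
    y = layer(torch.randn(2, 64))  # only A and B receive gradients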
Training Consistency Models with Variational Noise Coupling
ICML, 2025 | Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and impr...
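A minimal sketch of a standard consistency-training step, where one shared Gaussian draw couples two adjacent noise levels and the model matches a stop-gradient target at the lower level; as we read the abstract, the paper's contribution is to replace this independent noise draw with a learned, data-dependent (variational) coupling, which is not shown here:

    import torch

    def ct_loss(model, x, t_hi, t_lo):
        z = torch.randn_like(x)                 # shared noise couples the two levels
        pred = model(x + t_hi * z, t_hi)        # online prediction at the higher noise level
        with torch.no_grad():
            target = model(x + t_lo * z, t_lo)  # stop-gradient target at the lower level
        return ((pred - target) ** 2).mean()

    model = lambda x, t: x / (1.0 + t)  # toy consistency function, for illustration only
    loss = ct_loss(model, torch.randn(8, 2), t_hi=1.0, t_lo=0.5)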
Distillation of Discrete Diffusion through Dimensional Correlations
ICML, 2025 | Satoshi Hayakawa*, Yuhta Takida, Masaaki Imaizumi*, Hiromi Wakaki*, Yuki Mitsufuji
Diffusion models have demonstrated exceptional performance across various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenge...
Music Foundation Model as Generic Booster for Music Downstream Tasks
TMLR, 2025 | Wei-Hsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong*, Chieh-Hsin Lai, Giorgio Fabbro*, Kazuki Shimada, Keisuke Toyama*, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi*, Stefan Uhlich*, Taketo Akama*, Woosung Choi, Yuichiro Koyama*, Yuki Mitsufuji
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging ...
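The "generic booster" pattern described here, sketched generically (this is not SoniDo's API; the encoder and probe below are hypothetical stand-ins): freeze a pretrained model, tap an intermediate layer via a forward hook, and train a small task head on the tapped features:

    import torch
    import torch.nn as nn

    foundation = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
    for p in foundation.parameters():
        p.requires_grad = False  # the foundation model stays frozen

    features = {}
    foundation[0].register_forward_hook(lambda m, i, o: features.update(mid=o))

    head = nn.Linear(256, 10)  # small downstream head trained on intermediate features
    x = torch.randn(4, 128)    # stand-in for a batch of music inputs
    _ = foundation(x)
    logits = head(features["mid"])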
Improving Vector-Quantized Image Modeling with Latent Consistency-Matching Diffusion
IJCNN, 2025 | Bac Nguyen, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji
By embedding discrete representations into a continuous latent space, we can leverage continuous-space latent diffusion models to handle generative modeling of discrete data. However, despite their initial success, most latent diffusion methods rely on fixed pretrained embed...
MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training
CVPR, 2025 | Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian Tanke, Shusuke Takahashi*, Yuki Mitsufuji
In text-to-motion generation, controllability, as well as generation quality and speed, has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions accordi...
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
CVPR, 2025 | Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Diffusion models are prone to exactly reproducing images from their training data. This exact reproduction is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to unders...
VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression
ICASSP, 2025 | Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong*, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee*, Wei-Hsiang Liao, Yuki Mitsufuji
Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly ...
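For orientation, a minimal sketch of plain residual vector quantization with a fixed number of stages; VRVQ's contribution, per the abstract, is making the number of codebooks per frame variable, which is not shown here:

    import numpy as np

    def rvq_encode(x, codebooks):
        # Each stage quantizes the residual left by the previous stages.
        residual = x.copy()
        codes, recon = [], np.zeros_like(x)
        for cb in codebooks:  # cb: (codebook_size, dim)
            d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            idx = d2.argmin(axis=1)  # nearest code per vector
            q = cb[idx]
            codes.append(idx)
            recon += q
            residual -= q  # pass the remaining error to the next stage
        return codes, recon

    rng = np.random.default_rng(0)
    codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # fixed 4 stages here
    codes, recon = rvq_encode(rng.normal(size=(5, 8)), codebooks)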
Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
NEURIPS, 2025 | Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama*, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji
Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations ...
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
NEURIPS, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji
Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often ...
Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
ICLR, 2025 | Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji
In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, a one-point representation has difficulty capturing the relationships and the similarity structure of a huge number of instances in...
Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models
ICLR, 2025 | Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa*, Yuhta Takida, Yuki Mitsufuji
Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like τ-leaping ac...
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
ICLR, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji
Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Rece...
PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
NEURIPS, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose ...
On the Language Encoder of Contrastive Cross-modal Models
ACL, 2024 | Mengjie Zhao*, Junya Ono*, Zhi Zhong*, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Takashi Shibuya, Hiromi Wakaki*, Yuki Mitsufuji, Wei-Hsiang Liao
Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descri...
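For context, the symmetric InfoNCE objective that trains such contrastive cross-modal models, in its standard CLIP/CLAP-style form (the temperature tau and batch-diagonal labels are the usual convention, not anything specific to this paper):

    import torch
    import torch.nn.functional as F

    def clip_loss(emb_a, emb_b, tau=0.07):
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.T / tau         # pairwise cosine similarities
        labels = torch.arange(len(a))  # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

    loss = clip_loss(torch.randn(8, 64), torch.randn(8, 64))  # e.g., audio vs. text embeddings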
BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
ICASSP, 2024 | Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between re...
HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes
TMLR, 2024 | Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Yuki Mitsufuji, Wei-Hsiang Liao
Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity recon...
SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer
ICLR, 2024 | Yuhta Takida, Masaaki Imaizumi*, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji
Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its d...
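A rough sketch of the discriminative normalized linear layer, under our reading of the paper (treat the loss split as an assumption): the last-layer direction is kept on the unit sphere, the feature extractor is trained with the usual hinge loss while the direction is detached, and the direction is trained with a linear, Wasserstein-like objective while the features are detached:

    import torch
    import torch.nn.functional as F

    class SANHead(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.w = torch.nn.Parameter(torch.randn(dim))

        def forward(self, feat_real, feat_fake):
            omega = F.normalize(self.w, dim=0)  # unit-norm direction
            # Feature loss: hinge GAN loss with the direction detached.
            loss_feat = (F.relu(1 - feat_real @ omega.detach()).mean()
                         + F.relu(1 + feat_fake @ omega.detach()).mean())
            # Direction loss: linear objective with the features detached.
            loss_dir = (feat_fake.detach() @ omega).mean() - (feat_real.detach() @ omega).mean()
            return loss_feat + loss_dir

    head = SANHead(16)
    loss = head(torch.randn(8, 16), torch.randn(8, 16))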
Manifold Preserving Guided Diffusion
ICLR, 2024 | Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim*, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter*, Ruslan Salakhutdinov*, Stefano Ermon*
Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework th...
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
ICLR, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon*
Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encomp...
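A toy illustration of the anytime-to-anytime jump g(x_t, t, s) that CTM learns, using the exact trajectory map of the linear ODE dx/dt = -x as a hypothetical stand-in for the trained network; one long jump and two chained jumps land at the same point, which is the self-consistency property CTM trains for:

    import numpy as np

    def g(x_t, t, s):
        return x_t * np.exp(t - s)  # exact map along dx/dt = -x: x(s) = x(t) * e^{t-s}

    x1 = np.array([1.0])
    one_jump = g(x1, 1.0, 0.0)                # single step from t = 1 to t = 0
    two_jumps = g(g(x1, 1.0, 0.5), 0.5, 0.0)  # chained steps through t = 0.5
    assert np.allclose(one_jump, two_jumps)   # trajectory consistency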
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
ISMIR, 2023 | Keisuke Toyama*, Taketo Akama*, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji
Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in polyphonic piano content. In this case, we may rely on the capabilit...
Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement
INTERSPEECH, 2023 | Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji
Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to impro...
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
ICML, 2023 | Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji
Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we prop...
FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation
ICML, 2023 | Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*
Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differentia...
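For reference, the FPE in question, written for the perturbed densities p_t under the forward SDE dx = f(x,t) dt + g(t) dw (this density-level form is the textbook identity; the paper's regularizer enforces the corresponding constraint on the learned scores s_t = ∇_x log p_t):

    \partial_t p_t(x) = -\nabla_x \cdot \big( f(x,t)\, p_t(x) \big) + \tfrac{1}{2}\, g(t)^2\, \Delta_x p_t(x)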
Unsupervised vocal dereverberation with diffusion-based generative models
ICASSP, 2023 | Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui*, Yuki Mitsufuji
Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its...
SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization
ICML, 2022 | Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura*, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi*, Toshiyuki Kumakura*, Yuki Mitsufuji
One noted issue of the vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some...
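A minimal sketch of stochastic quantization as we understand SQ-VAE's core idea (details simplified; the variance handling is an assumption): code assignments are sampled from a softmax over negative squared distances, and as the learned variance shrinks during training the assignment self-anneals toward deterministic nearest-neighbor VQ:

    import numpy as np

    def stochastic_quantize(z, codebook, var, rng):
        # z: (n, d), codebook: (K, d); sample one code index per vector.
        d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        logits = -d2 / (2.0 * var)  # small var -> near-deterministic VQ
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        idx = np.array([rng.choice(len(codebook), p=pi) for pi in p])
        return codebook[idx], idx

    rng = np.random.default_rng(0)
    z_q, idx = stochastic_quantize(rng.normal(size=(6, 4)), rng.normal(size=(32, 4)), var=0.1, rng=rng)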
Blog Posts
Revolutionizing Creativity with CTM and SAN: Sony AI's Groundbreaking Advances in Generative AI for Creators
May 10, 2024 | Takashi Shibuya, Naoki Murata, Stefano Ermon*, Masaaki Imaizumi*, Yuki Mitsufuji, Yuhta Takida, Toshimitsu Uesaka, Chieh-Hsin Lai, Dongjun Kim*
In the dynamic world of generative AI, the quest for more efficient, versatile, and high-quality models continues to push forward without any ...