Yuki Mitsufuji – Sony AI

Publications

Woosh: A Sound Effects Foundation Model

ARXIV, 2026 | Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji

Woosh is Sony AI's open sound effects foundation model featuring high-quality audio encoding, text-to-audio, and video-to-audio generation. Optimized for sound effects, it offers competitive performance against models like StableAudio-Open and TangoFlux, with distilled models for fast, low-resource inference.

Emergent, not Immanent: A Baradian Reading of Explainable AI

CHI, 2026 | Fabio Morreale, Joan Serrà, Yuki Mitsufuji

This paper challenges conventional assumptions in Explainable AI (XAI) by applying Barad's agential realism, arguing that AI interpretations are not fixed within models but emerge from dynamic entanglements of humans, context, and technology. It critiques existing XAI methods and proposes ethical design directions for interfaces that support emergent interpretation, illustrated through a speculative text-to-music case study.

Diffusion-based Signal Refiner for Speech Enhancement and Separation

IEEE, 2026 | Ryosuke Sawata, Masato Hirano*, Naoki Murata, Shusuke Takahashi*, Yuki Mitsufuji

Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion mod...

PAVAS: Physics-Aware Video-to-Audio Synthesis

CVPR, 2026 | Oh Hyun-Bin*, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh*, Yuki Mitsufuji

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds...

MeanFlow Transformers with Representation Autoencoders

CVPR, 2026 | Zheyuan Hu*, Chieh-Hsin Lai, Ge Wu*, Yuki Mitsufuji, Stefano Ermon*

MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE)...

Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models

ICLR, 2026 | Kevin Rojas*, Ye He*, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao*

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete ...

3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

ICLR, 2026 | JoungBin Lee*, Jaewoo Jung*, Jisang Han*, Takuya Narihira, Kazumi Fukuda, Junyoung Seo*, Sunghwan Hong*, Yuki Mitsufuji, Seungryong Kim*

We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditio...

LLM2Fx-Tools: Tool Calling For Music Post-Production

ICLR, 2026 | Seungheon Doh*, Junghyun Koo, Marco A. Martínez-Ramírez, Woosung Choi, Wei-Hsiang Liao, Qiyu Wu*, Juhan Nam*, Yuki Mitsufuji

This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine...

SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator

ICLR, 2026 | Yuhta Takida, Satoshi Hayakawa*, Takashi Shibuya, Masaaki Imaizumi*, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji

Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and c...

Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

ICLR, 2026 | Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon*, Yuki Mitsufuji

Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), ...

Concept-TRAK: Understanding How Diffusion Models Learn Concepts through Concept-Level Attribution

ICLR, 2026 | Yonghyun Park*, Chieh-Hsin Lai, Satoshi Hayakawa*, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji

While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to...

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

ICLR, 2026 | Zheyuan Hu*, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon*

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion...

VIRTUE: Visual-Interactive Text-Image Universal Embedder

ICLR, 2026 | Wei-Yao Wang*, Kazuya Tateishi*, Qiyu Wu*, Shusuke Takahashi*, Yuki Mitsufuji

Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack v...

Tracing the Principles Behind Modern Diffusion Models

ICLR, 2026 | Chieh-Hsin Lai, Yang Song*, Dongjun Kim*, Yuki Mitsufuji, Stefano Ermon*

Diffusion models can feel like a jungle of acronyms, but the core idea is simple: start from noise and gradually move a cloud of samples until it looks like real data. This post gives an intuition-first tour showing that DDPMs, score-based models, and flow matching are the s...

FoleyBench: A Benchmark For Video-to-Audio Models

ICASSP, 2026 | Satvik Dixit, Koichi Saito, Zhi Zhong*, Yuki Mitsufuji, Chris Donahue

Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically a...

Automatic Music Mixing Using a Generative Model of Effect Embeddings

ICASSP, 2026 | Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Vesa Välimäki*, Yuki Mitsufuji

Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring thi...

Automatic Music Sample Identification with Multi-Track Contrastive Learning

ICASSP, 2026 | Alain Riou, Joan Serrà, Yuki Mitsufuji

Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and...

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

ICASSP, 2026 | Akira Takahashi*, Shusuke Takahashi*, Yuki Mitsufuji

We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can t...

Towards Blind Data Cleaning: A Case Study in Music Source Separation

ICASSP, 2026 | Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typical...

SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

ICASSP, 2026 | Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audi...

Do Foundational Audio Encoders Understand Music Structure?

ICASSP, 2026 | Keisuke Toyama*, Zhi Zhong*, Akira Takahashi*, Shusuke Takahashi*, Yuki Mitsufuji

In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automat...

Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry

AAAI, 2025 | Junyoung Seo*, Jisang Han*, Jaewoo Jung*, Siyoon Jin, JoungBin Lee*, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim*, Yuki Mitsufuji

We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditiona...

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

AAAI, 2025 | Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro*, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji

Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models by involving forward-backward diffusion processes for editing...

Music Arena: Live Evaluation for Text-to-Music

NEURIPS, 2025 | Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue

We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare...

Large-Scale Training Data Attribution for Music Generative Models via Unlearning

NEURIPS, 2025 | Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed to the generation of a particular output from a specific mod...

Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

NEURIPS, 2025 | Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter*, Ruslan Salakhutdinov*

Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thu...

Enhancing neural audio fingerprint robustness to audio degradation for music identification

ISMIR, 2025 | R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where repres...

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

TMLR, 2025 | Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov*, J. Zico Kolter*

Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transf...

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

TMLR, 2025 | Muhammad Jehanzeb Mirza, Mengjie Zhao*, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang*, Saurav Jha, Hiromi Wakaki*, Yuki Mitsufuji

In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g.,...

G2D2: Gradient-Guided Discrete Diffusion for Image Inverse Problem Solving

TMLR, 2025 | Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon*, Yuki Mitsufuji

Recent literature has effectively leveraged diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete co...

Reductive, Exclusionary, Normalising: The Limits of Generative AI

TISMIR, 2025 | Fabio Morreale, Marco A. Martínez-Ramírez, Raul Masu, Wei-Hsiang Liao, Yuki Mitsufuji

Up until recently, most approaches to music generation were based on deductive logic: generative rules were devised on the basis of musicians’ preferences, subjective appreciation and dominant music theories. Machine learning (ML) introduced a paradigm shift: vast datasets o...

Reverse Engineering of Music Mixing Graphs With Differentiable Processors and Iterative Pruning

JAES, 2025 | Sungho Lee*, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Giorgio Fabbro*, Kyogu Lee*, Yuki Mitsufuji

Reverse engineering of music mixes aims to uncover how dry source signals are processed and combined to produce a final mix. In this paper, prior works are extended to reflect the compositional nature of mixing and search for a graph of audio processors. First, a mixing cons...

DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions

DAFX, 2025 | Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas, Yuki Mitsufuji

This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implement...

Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

WASPAA, 2025 | Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and t...

Can Large Language Models Predict Audio Effects Parameters from Natural Language?

WASPAA, 2025 | Seungheon Doh*, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam*, Yuki Mitsufuji

In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual desc...

Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures

ISMIR, 2025 | Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji

General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have empha...

ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

ISMIR, 2025 | Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro*, Michele Mancusi, Yuki Mitsufuji

Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to...

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

ISMIR, 2025 | Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been co...

CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

ACMMM, 2025 | Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. The success of the BAG systems depends on the effectiveness of cross-modal reasoning and spatial understan...

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

ICCV, 2025 | Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong*, Shusuke Takahashi*, Takashi Shibuya, Yuki Mitsufuji

Recent conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on th...

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

ICCV, 2025 | Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl...

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

INTERSPEECH, 2025 | Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh*, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with var...

Training Consistency Models with Variational Noise Coupling

ICML, 2025 | Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and impr...

Supervised Contrastive Learning from Weakly-labeled Audio Segments for Musical Version Matching

ICML, 2025 | Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require to ...

Distillation of Discrete Diffusion through Dimensional Correlations

ICML, 2025 | Satoshi Hayakawa*, Yuhta Takida, Masaaki Imaizumi*, Hiromi Wakaki*, Yuki Mitsufuji

Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenge...

Music Foundation Model as Generic Booster for Music Downstream Tasks

TMLR, 2025 | Wei-Hsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong*, Chieh-Hsin Lai, Giorgio Fabbro*, Kazuki Shimada, Keisuke Toyama*, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi*, Stefan Uhlich*, Taketo Akama*, Woosung Choi, Yuichiro Koyama*, Yuki Mitsufuji

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging ...

Improving Vector-Quantized Image Modeling with Latent Consistency-Matching Diffusion

IJCNN, 2025 | Bac Nguyen, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji

By embedding discrete representations into a continuous latent space, we can leverage continuous-space latent diffusion models to handle generative modeling of discrete data. However, despite their initial success, most latent diffusion methods rely on fixed pretrained embed...

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

IJCNN, 2025 | Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance align...

MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

CVPR, 2025 | Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian Tanke, Shusuke Takahashi*, Yuki Mitsufuji

In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions accordi...

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

CVPR, 2025 | Silin Gao*, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao*, Hiromi Wakaki*, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut*

Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to th...

Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space

ICLR, 2025 | Yangming Li, Chieh-Hsin Lai, Carola-Bibiane Schönlieb, Yuki Mitsufuji, Stefano Ermon*

Deep Generative Models (DGMs), including Energy-Based Models (EBMs) and Score-based Generative Models (SGMs), have advanced high-fidelity data generation and complex continuous distribution approximation. However, their application in Markov Decision Processes (MDPs), partic...

Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

CVPR, 2025 | Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

Diffusion models are prone to exactly reproducing images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel way to unders...

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

CVPR, 2025 | Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained wit...

VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

ICASSP, 2025 | Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong*, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee*, Wei-Hsiang Liao, Yuki Mitsufuji

Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly ...

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

ICASSP, 2025 | Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Eloi Moliner, Chieh-Hsin Lai, Stefan Uhlich*, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro*, Yuki Mitsufuji

Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which ...

30+ Years of Source Separation Research: Achievements and Future Challenges

ICASSP, 2025 | Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, Yuki Mitsufuji

Source separation (SS) of acoustic signals is a research field that emerged in the mid-1990s and has flourished ever since. On the occasion of ICASSP's 50th anniversary, we review the major contributions and advancements in the past three decades in the speech, audio, and mu...

Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

NEURIPS, 2025 | Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama*, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations ...

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

NEURIPS, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji

Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often ...

LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking

NEURIPS, 2025 | Mayank Kumar Singh*, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji

This paper presents a novel approach to deter unauthorized deepfakes and enable user tracking in generative models, even when the user has full access to the model parameters, by integrating key-based model authentication with watermarking techniques. Our method involves pro...

Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric

ICLR, 2025 | Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, one-point representation has difficulty in capturing the relationship and the similarity structure of a huge amount of instances in...

Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

ICLR, 2025 | Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim*, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji, Ayano Hiranaka

Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward model...

Mining your own secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

ICLR, 2025 | Saurav Jha, Shiqi Yang*, Masato Ishii, Mengjie Zhao*, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi*, Yuki Mitsufuji

Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a ti...

Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

ICLR, 2025 | Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa*, Yuhta Takida, Yuki Mitsufuji

Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like τ-leaping ac...

Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

ICLR, 2025 | Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

In this study, we aim to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively genera...

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

ICLR, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji

Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Rece...

Discogs-VINet-MIREX

MIREX, 2025 | Xavier Serra, Yuki Mitsufuji, R.O. Araz, J. Serrà, D. Bogdanov

This technical report presents our submission to the cover song identification task for the 2024 edition of the Music Information Retrieval Evaluation eXchange (MIREX). For this submission, we enhanced our Discogs-VINet model by changing the definition of an epoch, incorpora...

PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

NEURIPS, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*

To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose ...

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

NEURIPS, 2024 | Junyoung Seo*, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim*, Yuki Mitsufuji

Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth e...

The whole is greater than the sum of its parts: improving music source separation by bridging networks

EURASIP, 2024 | Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation...

On the Language Encoder of Contrastive Cross-modal Models

ACL, 2024 | Mengjie Zhao*, Junya Ono*, Zhi Zhong*, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Takashi Shibuya, Hiromi Wakaki*, Yuki Mitsufuji, Wei-Hsiang Liao

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descri...

DiffuCOMET: Contextual Commonsense Knowledge Diffusion

ACL, 2024 | Silin Gao*, Mete Ismayilzada*, Mengjie Zhao*, Hiromi Wakaki*, Yuki Mitsufuji, Antoine Bosselut*

Inferring contextually-relevant and diverse commonsense to understand narratives remains challenging for knowledge models. In this work, we develop a series of knowledge models, DiffuCOMET, that leverage diffusion to learn to reconstruct the implicit semantic connections bet...

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

ISMIR, 2024 | Marco Comunità*, Zhi Zhong*, Akira Takahashi*, Shiqi Yang*, Mengjie Zhao*, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high...

Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

ISMIR, 2024 | Roser Batlle-Roca*, Wei-Hsiang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez*

Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant challenge is the potential replication and plagiarism o...

SilentCipher: Deep Audio Watermarking

INTERSPEECH, 2024 | Mayank Kumar Singh*, Naoya Takahashi, Yuki Mitsufuji, Wei-Hsiang Liao

In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional...

Searching for Music Mixing Graphs: A Pruning Approach

DAFX, 2024 | Sungho Lee*, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Giorgio Fabbro*, Kyogu Lee*, Yuki Mitsufuji

Music mixing is compositional -- experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available pro...

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

ICASSP, 2024 | Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between re...

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

TMLR, 2024 | Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Yuki Mitsufuji, Wei-Hsiang Liao

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity recon...

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

ICASSP, 2024 | Eleonora Grassucci*, Yuki Mitsufuji, Ping Zhang*, Danilo Comminiello*

Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems. Its challenge of extracting semantic information from the original complex content and regenerating semantically consistent data at the receiver, possi...

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

ICASSP, 2024 | Hao Shi*, Kazuki Shimada, Masato Hirano*, Takashi Shibuya, Yuichiro Koyama*, Zhi Zhong*, Shusuke Takahashi*, Tatsuya Kawahara*, Yuki Mitsufuji

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the compl...

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP, 2024 | Carlos Hernandez-Olivan*, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrin...

Zero- and Few-shot Sound Event Localization and Detection

ICASSP, 2024 | Kazuki Shimada, Kengo Uchida, Yuichiro Koyama*, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji, Tatsuya Kawahara*

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and tempor...

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP, 2024 | Frank Cwitkowitz*, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama*, Wei-Hsiang Liao, Yuki Mitsufuji

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several wor...

SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

ICLR, 2024 | Yuhta Takida, Masaaki Imaizumi*, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its d...

Manifold Preserving Guided Diffusion

ICLR, 2024 | Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim*, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter*, Ruslan Salakhutdinov*, Stefano Ermon*

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework th...

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

ICLR, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon*

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encomp...

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

NEURIPS, 2023 | Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama*, Naoya Takahashi, Shusuke Takahashi*, Tuomas Virtanen*, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper pro...

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

ISMIR, 2023 | Keisuke Toyama*, Taketo Akama*, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capabilit...

Extending Audio Masked Autoencoders Toward Audio Restoration

WASPAA, 2023 | Zhi Zhong*, Hao Shi*, Masato Hirano*, Kazuki Shimada, Kazuya Tateishi*, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced b...

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

IEEE TASLP, 2023 | Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which...

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

INTERSPEECH, 2023 | Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to impro...

GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

ICML, 2023 | Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we prop...

FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation

ICML, 2023 | Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*

Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differentia...

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

ICASSP, 2023 | Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Kyogu Lee*, Yuki Mitsufuji

We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a r...

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

ICASSP, 2023 | Kin Wai Cheuk, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi*, Dorien Herremans*, Yuki Mitsufuji

In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative t...

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

ICASSP, 2023 | Naoya Takahashi, Mayank Kumar Singh*, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciations. In this work, we...

Unsupervised vocal dereverberation with diffusion-based generative models

ICASSP, 2023 | Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui*, Yuki Mitsufuji

Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its...

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

ICLR, 2023 | Hao-Wen Dong*, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley*, Taylor Berg-Kirkpatrick*

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query...

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

ICML, 2022 | Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura*, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi*, Toshiyuki Kumakura*, Yuki Mitsufuji

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some...

Blog Posts

Revolutionizing Creativity with CTM and SAN: Sony AI's Groundbreaking Advances in Generative AI for Creators

May 10, 2024 | Sony AI, Events, Takashi Shibuya, Naoki Murata, Stefano Ermon*, Masaaki Imaizumi*, Yuki Mitsufuji, Yuhta Takida, Toshimitsu Uesaka, Chieh-Hsin Lai, Dongjun Kim*

In the dynamic world of generative AI, the quest for more efficient, versatile, and high-quality models continues to push forward without any ...

Sony AI Reveals New Research Contributions at NeurIPS 2023

December 13, 2023 | Peter Stone, Alice Xiang, Jerone Andrews, Events, Kazuki Shimada, Apostolos Modas, Tarek Besold, William Thong, Dora Zhao*, Lingjuan Lyu, Orestis Papakyriakopoulos*, Xin Dong, Nidham Gazagnadou, Weiming Zhuang, Vivek Sharma, Yuki Mitsufuji, Chen Chen

Sony Group Corporation and Sony AI have been active participants in the annual NeurIPS Conference for years, contributing pivotal research that has ...