
Koichi Saito

Publications

FoleyBench: A Benchmark For Video-to-Audio Models

ICASSP, 2026 | Satvik Dixit, Koichi Saito, Zhi Zhong*, Yuki Mitsufuji, Chris Donahue

Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically a...

Music Arena: Live Evaluation for Text-to-Music

NEURIPS, 2025 | Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue

We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare...

Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

NEURIPS, 2025 | Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama*, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations ...

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

NEURIPS, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji

Sound content is an indispensable element of multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often ...

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

ICLR, 2025 | Koichi Saito, Dongjun Kim*, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong*, Yuhta Takida, Yuki Mitsufuji

Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Rece...

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

ISMIR, 2024 | Marco Comunità*, Zhi Zhong*, Akira Takahashi*, Shiqi Yang*, Mengjie Zhao*, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high...

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP, 2024 | Carlos Hernandez-Olivan*, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrin...

GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

ICML, 2023 | Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we prop...

Unsupervised vocal dereverberation with diffusion-based generative models

ICASSP, 2023 | Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui*, Yuki Mitsufuji

Removing reverb from reverberant music is a necessary technique for cleaning up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its...