Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo

Kin Wai Cheuk

Woosung Choi

Toshimitsu Uesaka

Keisuke Toyama*

Koichi Saito

Chieh-Hsin Lai

Yuhta Takida

Wei-Hsiang Liao

Simon Dixon

Yuki Mitsufuji

* External authors

Venue

NeurIPS-24

Date

2025

Abstract

Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source; together, these building blocks form a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model on both a simple dataset of isolated chords and a realistic dataset of four-part chorales in the style of J.S. Bach, identify the key components for successful disentanglement, and demonstrate mixture transformation based on source-level attribute manipulation.
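
To make the idea of source-level attribute manipulation concrete, the following is a minimal sketch, not the DisMix implementation. It assumes each instrument in a mixture is described by a separate pitch vector and timbre vector, and shows how exchanging timbres between two sources leaves their pitches untouched. The names `SourceLatent` and `swap_timbre`, and the toy two-dimensional vectors, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SourceLatent:
    """Hypothetical container for one instrument (source) in a mixture."""
    pitch: Tuple[float, ...]   # stands in for a learned pitch representation
    timbre: Tuple[float, ...]  # stands in for a learned timbre representation


def swap_timbre(sources: List[SourceLatent], i: int, j: int) -> List[SourceLatent]:
    """Return a new list in which sources i and j have exchanged timbres.

    Pitch representations are left unchanged, and the input list is not mutated.
    """
    out = list(sources)
    out[i] = replace(sources[i], timbre=sources[j].timbre)
    out[j] = replace(sources[j], timbre=sources[i].timbre)
    return out


# Toy mixture of two sources (values are arbitrary placeholders).
mixture = [
    SourceLatent(pitch=(0.1, 0.9), timbre=(1.0, 0.0)),
    SourceLatent(pitch=(0.7, 0.2), timbre=(0.0, 1.0)),
]

# Each source keeps its melody (pitch) but takes the other's instrument (timbre).
edited = swap_timbre(mixture, 0, 1)
```

In the framework described in the abstract, a set like `edited` would then condition the decoder that generates the new mixture; that decoding step is omitted here because the abstract gives no detail of its interface.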
