Authors

* External authors

Venue

Date

Share

The whole is greater than the sum of its parts: improving music source separation by bridging networks

Ryosuke Sawata

Naoya Takahashi

Stefan Uhlich*

Shusuke Takahashi*

Yuki Mitsufuji

* External authors

EURASIP

2024

Abstract

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed that the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size.

Related Publications

Schemato -- An LLM for Netlist-to-Schematic Conversion

MLCAD, 2025
Ryoga Matsuo, Stefan Uhlich*, Arun Venkitaraman, Andrea Bonetti, Chia-Yu Hsieh, Ali Momeni, Lukas Mauch*, Augusto Capone, Eisaku Ohbuchi, Lorenzo Servadei

Machine learning models are advancing circuit design, particularly in analog circuits. They typically generate netlists that lack human interpretability. This is a problem as human designers heavily rely on the interpretability of circuit diagrams or schematics to intuitivel…

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

ICCV, 2025
Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong*, Shusuke Takahashi*, Takashi Shibuya, Yuki Mitsufuji

In the recent development of conditional diffusion models still require heavy supervised fine-tuning for performing control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on th…

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

ICCV, 2025
Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl…

  • HOME
  • Publications
  • The whole is greater than the sum of its parts: improving music source separation by bridging networks

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.