Music Foundation Model as Generic Booster for Music Downstream Tasks

Authors

WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong*, Chieh-Hsin Lai, Giorgio Fabbro*, Kazuki Shimada, Keisuke Toyama*, Kinwai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi*, Stefan Uhlich*, Taketo Akama*, Woosung Choi, Yuichiro Koyama*, Yuki Mitsufuji

* External authors

Venue

TMLR

Date

2025

Abstract

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.

Related Publications

Schemato -- An LLM for Netlist-to-Schematic Conversion

MLCAD, 2025
Ryoga Matsuo, Stefan Uhlich*, Arun Venkitaraman, Andrea Bonetti, Chia-Yu Hsieh, Ali Momeni, Lukas Mauch*, Augusto Capone, Eisaku Ohbuchi, Lorenzo Servadei

Machine learning models are advancing circuit design, particularly in analog circuits. They typically generate netlists that lack human interpretability. This is a problem as human designers heavily rely on the interpretability of circuit diagrams or schematics to intuitivel…

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

ICCV, 2025
Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong*, Shusuke Takahashi*, Takashi Shibuya, Yuki Mitsufuji

Recent conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on th…

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

ICCV, 2025
Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl…
