Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Authors: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Venue: ISMIR-25

Date: 2025

Abstract

Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths of both lines of work and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach modifies the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines and demonstrates performance comparable to models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
