
Wei-Hsiang Liao

Publications

LLM2Fx-Tools: Tool Calling For Music Post-Production

ICLR, 2026 | Seungheon Doh*, Junghyun Koo, Marco A. Martínez-Ramírez, Woosung Choi, Wei-Hsiang Liao, Qiyu Wu*, Juhan Nam*, Yuki Mitsufuji

This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine...
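The Fx-chain idea above, an ordered and executable sequence of audio effects, can be illustrated as simple function composition. This is a minimal sketch, not the paper's system: the toy effects and parameters here are hypothetical, and the LLM-driven selection step is not reproduced.

```python
import numpy as np

# Hypothetical toy effects; the real framework chooses effect types and
# parameters with an LLM, which is out of scope for this sketch.
def gain(x, db=0.0):
    """Scale the signal by a gain given in decibels."""
    return x * 10 ** (db / 20)

def soft_clip(x, drive=2.0):
    """Tanh-style soft clipper, normalised to unity at full scale."""
    return np.tanh(drive * x) / np.tanh(drive)

# An Fx-chain is just an ordered list of effects applied in sequence.
fx_chain = [lambda x: gain(x, db=-6.0), soft_clip]

def apply_chain(x, chain):
    for fx in chain:
        x = fx(x)
    return x

audio = np.linspace(-1.0, 1.0, 8)   # stand-in for an audio buffer
out = apply_chain(audio, fx_chain)
```

Because the chain is an explicit data structure rather than a fixed pipeline, a tool-calling model can construct, reorder, or re-parameterise it per request.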

Concept-TRAK: Understanding How Diffusion Models Learn Concepts through Concept-Level Attribution

ICLR, 2026 | Yonghyun Park*, Chieh-Hsin Lai, Satoshi Hayakawa*, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji

While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to...

Automatic Music Mixing Using a Generative Model of Effect Embeddings

ICASSP, 2026 | Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Vesa Välimäki*, Yuki Mitsufuji

Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring thi...

Towards Blind Data Cleaning: A Case Study in Music Source Separation

ICASSP, 2026 | Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typical...

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

AAAI, 2025 | Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro*, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji

Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models by involving forward-backward diffusion processes for editing...

Large-Scale Training Data Attribution for Music Generative Models via Unlearning

NEURIPS, 2025 | Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed to the generation of a particular output from a specific mod...

Reverse Engineering of Music Mixing Graphs With Differentiable Processors and Iterative Pruning

JAES, 2025 | Sungho Lee*, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Giorgio Fabbro*, Kyogu Lee*, Yuki Mitsufuji

Reverse engineering of music mixes aims to uncover how dry source signals are processed and combined to produce a final mix. In this paper, prior works are extended to reflect the compositional nature of mixing and search for a graph of audio processors. First, a mixing cons...

DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions

DAFX, 2025 | Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas, Yuki Mitsufuji

This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implement...

Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

WASPAA, 2025 | Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and t...
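The core mechanic, gradient descent on effect parameters regularised by a Gaussian prior, can be sketched with a toy stand-in. Everything below is hypothetical: a linear map plays the role of the style-embedding network, and the prior strength `lam` is an illustrative choice, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: W maps effect parameters to a "style embedding",
# target is the reference audio's embedding, and N(mu, sigma^2) is a
# Gaussian prior keeping the parameters theta in a plausible range.
W = rng.normal(size=(3, 5))
target = rng.normal(size=3)
mu, sigma, lam = np.zeros(5), 1.0, 0.1

def objective(theta):
    fit = np.sum((W @ theta - target) ** 2)              # embedding distance
    prior = lam * np.sum((theta - mu) ** 2) / sigma**2   # Gaussian prior
    return fit + prior

theta = rng.normal(size=5)   # initial effect parameters
loss0 = objective(theta)
lr = 0.01
for _ in range(500):
    grad = 2 * W.T @ (W @ theta - target) + 2 * lam * (theta - mu) / sigma**2
    theta -= lr * grad
loss = objective(theta)
# The fit term pulls the processed audio's embedding toward the reference;
# the prior term penalises implausible parameter settings.
```

The prior acts as a soft constraint: without it, inference-time optimisation is free to reach degenerate parameter settings that match the embedding but sound unnatural.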

Can Large Language Models Predict Audio Effects Parameters from Natural Language?

WASPAA, 2025 | Seungheon Doh*, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam*, Yuki Mitsufuji

In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual desc...

Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures

ISMIR, 2025 | Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji

General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have empha...

ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

ISMIR, 2025 | Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro*, Michele Mancusi, Yuki Mitsufuji

Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to...

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

ISMIR, 2025 | Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been co...

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

INTERSPEECH, 2025 | Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh*, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with var...

VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

ICASSP, 2025 | Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong*, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee*, Wei-Hsiang Liao, Yuki Mitsufuji

Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly ...
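The fixed-codebook baseline being improved upon, plain residual vector quantization, can be sketched in a few lines. This is an illustrative toy with random codebooks, not the VRVQ model: sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 codebooks of 8 codewords over 4-dim frames.
codebooks = rng.normal(size=(3, 8, 4))

def rvq_encode(x, codebooks):
    """Quantize x by successively coding the residual with each codebook."""
    residual = x.copy()
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

x = rng.normal(size=4)
codes, xq = rvq_encode(x, codebooks)
```

In standard RVQ every frame spends the same number of codebooks; a variable-bitrate scheme instead decides per frame how many stages to use, trading bits for distortion where it matters.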

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

ICASSP, 2025 | Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Eloi Moliner, Chieh-Hsin Lai, Stefan Uhlich*, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro*, Yuki Mitsufuji

Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which ...

Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

NEURIPS, 2025 | Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama*, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations ...

LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking

NEURIPS, 2025 | Mayank Kumar Singh*, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji

This paper presents a novel approach to deter unauthorized deepfakes and enable user tracking in generative models, even when the user has full access to the model parameters, by integrating key-based model authentication with watermarking techniques. Our method involves pro...

Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

ICLR, 2025 | Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim*, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji, Ayano Hiranaka

Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward model...

PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

NEURIPS, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*

To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose ...

On the Language Encoder of Contrastive Cross-modal Models

ACL, 2024 | Mengjie Zhao*, Junya Ono*, Zhi Zhong*, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Takashi Shibuya, Hiromi Wakaki*, Yuki Mitsufuji, Wei-Hsiang Liao

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descri...

Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

ISMIR, 2024 | Roser Batlle-Roca*, Wei-Hsiang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez*

Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant challenge is the potential replication and plagiarism o...

SilentCipher: Deep Audio Watermarking

INTERSPEECH, 2024 | Mayank Kumar Singh*, Naoya Takahashi, Yuki Mitsufuji, Wei-Hsiang Liao

In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional...

Searching for Music Mixing Graphs: A Pruning Approach

DAFX, 2024 | Sungho Lee*, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Giorgio Fabbro*, Kyogu Lee*, Yuki Mitsufuji

Music mixing is compositional -- experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available pro...

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

TMLR, 2024 | Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Yuki Mitsufuji, Wei-Hsiang Liao

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity recon...

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP, 2024 | Carlos Hernandez-Olivan*, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrin...

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP, 2024 | Frank Cwitkowitz*, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama*, Wei-Hsiang Liao, Yuki Mitsufuji

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several wor...

Manifold Preserving Guided Diffusion

ICLR, 2024 | Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim*, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter*, Ruslan Salakhutdinov*, Stefano Ermon*

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework th...

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

ICLR, 2024 | Dongjun Kim*, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon*

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encomp...

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP, 2023 | Carlos Hernandez-Olivan*, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrin...

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP, 2023 | Frank Cwitkowitz*, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama*, Wei-Hsiang Liao, Yuki Mitsufuji

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several wor...

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

ISMIR, 2023 | Keisuke Toyama*, Taketo Akama*, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capabilit...

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

ICASSP, 2023 | Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Kyogu Lee*, Yuki Mitsufuji

We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a r...

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

ICML, 2022 | Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura*, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi*, Toshiyuki Kumakura*, Yuki Mitsufuji

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some...
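The codebook-collapse symptom described above is easy to observe with the deterministic VQ-VAE baseline that SQ-VAE builds on. The sketch below uses random data and a hypothetical codebook size purely for illustration; it implements nearest-codeword assignment, not SQ-VAE's stochastic quantization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 16 codewords over 2-dim latents, 100 encoder outputs.
codebook = rng.normal(size=(16, 2))
latents = rng.normal(size=(100, 2))

def quantize(z, codebook):
    """Deterministic nearest-codeword assignment, as in plain VQ-VAE."""
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

idx, zq = quantize(latents, codebook)
usage = len(np.unique(idx)) / len(codebook)
# usage < 1.0 means some codewords are never selected -- the collapse
# symptom that self-annealed stochastic quantization aims to mitigate.
```

Replacing the hard argmin with a temperature-annealed stochastic assignment gives unused codewords a chance to receive gradient signal early in training, which is the intuition behind the paper's approach.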