Enhancing neural audio fingerprint robustness to audio degradation for music identification

Authors: R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

Venue: ISMIR-25

Date: 2025

Abstract

Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT-Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first systematic evaluation of various metric learning approaches in the context of AFP, demonstrating that a self-supervised adaptation of the triplet loss yields superior performance. Our results also reveal that training with multiple positive samples per anchor has critically different effects across loss functions. Our approach is built upon these insights and achieves state-of-the-art performance on both a large, synthetically degraded dataset and a real-world dataset recorded using microphones in diverse music venues.
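To make the two loss families named in the abstract concrete, the following is a minimal, illustrative sketch of a per-anchor NT-Xent loss and a margin-based triplet loss over fingerprint vectors. It is not the authors' implementation: the abstract does not specify their self-supervised adaptation of the triplet loss, so the hardest-negative selection, the cosine distance, and the `temperature` and `margin` values below are all assumptions chosen for illustration. Here the "positive" stands for the fingerprint of a degraded copy of the anchor segment, and the "negatives" for fingerprints of other segments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(anchor, positive, negatives, temperature=0.1):
    """NT-Xent for a single anchor: softmax cross-entropy that asks the
    positive to be the most similar candidate among positive + negatives.
    The temperature value is an assumed, illustrative choice."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def triplet(anchor, positive, negatives, margin=0.5):
    """Triplet loss on cosine distance. Picking the hardest (closest)
    negative is an assumption of this sketch, not taken from the paper."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = min(1.0 - cosine(anchor, n) for n in negatives)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy 2-D "fingerprints": the positive mimics a degraded copy of the anchor.
    anchor = [1.0, 0.0]
    positive = [0.9, 0.1]
    negatives = [[0.0, 1.0], [-1.0, 0.0]]
    print("NT-Xent:", nt_xent(anchor, positive, negatives))
    print("Triplet:", triplet(anchor, positive, negatives))
```

One difference the sketch makes visible: NT-Xent weighs every negative through the softmax denominator, whereas this triplet variant only reacts to a negative that violates the margin, so a well-separated anchor contributes zero loss.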
