Joan Serrà – Sony AI

Profile

Joan completed an MSc and a PhD in machine learning for audio at the Music Technology Group of Universitat Pompeu Fabra (2006-2011), followed by a postdoc in artificial intelligence at IIIA-CSIC (2011-2015). He then joined Telefónica R&D as a machine learning researcher (2015-2019) and Dolby Laboratories as an AI researcher and research manager (2019-2024). He is currently with Sony AI, where he conducts research on machine learning, focusing on audio and multimedia analysis, synthesis, and retrieval.

Publications

Woosh: A Sound Effects Foundation Model

ARXIV, 2026 | Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji

Woosh is Sony AI's open sound effects foundation model featuring high-quality audio encoding, text-to-audio, and video-to-audio generation. Optimized for sound effects, it offers competitive performance against models like StableAudio-Open and TangoFlux, with distilled models for fast, low-resource inference.

Emergent, not Immanent: A Baradian Reading of Explainable AI

CHI, 2026 | Fabio Morreale, Joan Serrà, Yuki Mitsufuji

This paper challenges conventional assumptions in Explainable AI (XAI) by applying Barad's agential realism, arguing that AI interpretations are not fixed within models but emerge from dynamic entanglements of humans, context, and technology. It critiques existing XAI methods and proposes ethical design directions for interfaces that support emergent interpretation, illustrated through a speculative text-to-music case study.

Automatic Music Mixing Using a Generative Model of Effect Embeddings

ICASSP, 2026 | Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Vesa Välimäki*, Yuki Mitsufuji

Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring thi...

Automatic Music Sample Identification with Multi-Track Contrastive Learning

ICASSP, 2026 | Alain Riou, Joan Serrà, Yuki Mitsufuji

Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and...

Leveraging Whisper Embeddings for Audio-based Lyrics Matching

ICASSP, 2026 | Eleonora Mancini*, Joan Serrà, Paolo Torroni*

Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages ...

Towards Blind Data Cleaning: A Case Study in Music Source Separation

ICASSP, 2026 | Azalea Gui, Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji

The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typical...

Large-Scale Training Data Attribution for Music Generative Models via Unlearning

NEURIPS, 2025 | Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed to the generation of a particular output from a specific mod...

Enhancing neural audio fingerprint robustness to audio degradation for music identification

ISMIR, 2025 | R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where repres...

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

INTERSPEECH, 2025 | Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh*, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with var...

Supervised Contrastive Learning from Weakly-labeled Audio Segments for Musical Version Matching

ICML, 2025 | Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require to ...
