
Sony AI at ICASSP 2026: Research Roundup

May 6, 2026

Introduction

Sony AI will present 11 accepted papers at ICASSP 2026 in Barcelona this May. The work spans music understanding, generative audio, audio-visual alignment, and data quality. Each theme reflects a different pressure point in the field.

The accepted research does not focus on a single problem. It shares a common disposition: that audio AI systems need to be more honest about what they can and cannot do. That means better benchmarks. It means training methods that reflect how music is actually made. It means evaluation tools that track human perception rather than proxy metrics that drift from it.

The four themes in this roundup each address a different layer of that challenge.

Music understanding asks what models actually know about musical content—not just whether they can tag or transcribe it, but whether they grasp structure, lyrical meaning, and the relationships between recordings.

Generative audio asks whether creative tools can be made both controllable and fast enough for real use.

Audio-visual alignment asks how sound and image can be evaluated together, and whether the benchmarks the field relies on measure the right things. Data quality asks what happens before training begins: how contaminated datasets are identified, and how speech models are fine-tuned efficiently when data is scarce.

These are not independent questions. A better encoder improves both generation and evaluation. A better benchmark reveals where generation models actually fail. A better data cleaning method changes what training is even possible. The work presented here reflects that interconnection.

Several of these papers also extend a line of protective AI research Sony AI advanced in 2025. That earlier work appeared at NeurIPS, ICML, and INTERSPEECH. It examined how generative music models can be made more accountable through training-data attribution. It studied how musical relationships can be recognized at the segment level. It tested how audio watermarking holds up against neural codecs. This ICASSP research continues that trajectory. WEALY builds directly on CLEWS, the segment-level version-matching model introduced at ICML 2025. Automatic Music Sample Identification extends recognition to the harder case of fragments embedded in new mixes. Blind Data Cleaning addresses what happens upstream of attribution. It identifies contaminated training data when the nature of the contamination is unknown.

Together with the 2025 portfolio, these papers move toward systems that can trace influence, recognize relationships, and safeguard integrity across the music lifecycle.


1. Music Understanding & Structure Analysis

Research Title
Leveraging Whisper Embeddings for Audio-Based Lyrics Matching

Researchers: Eleonora Mancini (University of Bologna); Joan Serrà, Yuki Mitsufuji (Sony AI); Paolo Torroni (University of Bologna)

Link to Paper

Code: https://github.com/helemanc/audio-based-lyrics-matching

 

Introduction to WEALY

Lyrics matching (i.e., identifying songs that share lyrical content, themes or structure) has applications in copyright enforcement, music discovery and creative assistance. The field has a reproducibility problem. Existing approaches either rely on outdated techniques, depend on text or transcriptions that may be unavailable or restricted by copyright, or use pipelines too complex to reliably replicate or compare.

 

 

This research introduces WEALY (Whisper Embeddings for Audio-based LYrics matching), a fully reproducible end-to-end pipeline that extracts lyrics-aware representations directly from raw audio without requiring intermediate transcription. The pipeline leverages decoder embeddings from Whisper and trains a transformer encoder using contrastive learning on the musical version identification (MVI) task as a proxy for lyrics matching evaluation.
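
To make the pipeline concrete, here is a minimal sketch of pulling decoder hidden states out of Whisper with the Hugging Face transformers API. The checkpoint, decoding length, and the choice to use the last decoder layer are illustrative assumptions; the layers, pooling, and decoding setup WEALY actually uses are specified in the paper.

```python
# Sketch: extracting "lyrics-aware" Whisper decoder states from raw audio.
# Checkpoint, decoding length, and layer choice are assumptions, not the
# exact WEALY configuration.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").eval()

def lyrics_aware_latents(audio, sampling_rate=16000):
    # audio: 1-D float array; Whisper expects 16 kHz, 30-second windows.
    feats = processor(audio, sampling_rate=sampling_rate,
                      return_tensors="pt").input_features
    with torch.no_grad():
        gen = model.generate(feats, max_new_tokens=64,
                             return_dict_in_generate=True,
                             output_hidden_states=True)
    # One tuple of per-layer hidden states per generated token; keep the last
    # layer's state at each decoding step, before it is turned into a word.
    states = [step[-1][:, -1, :] for step in gen.decoder_hidden_states]
    return torch.cat(states, dim=0)   # (num_tokens, d_model)
```

In WEALY, sequences of such latents are then fed to a transformer encoder trained with a contrastive objective, as described below.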

Why WEALY Matters

Lyrics and audio are protected under different legal frameworks. Identifying lyrical similarity across a large music catalog requires tools that can operate at scale without depending on manually curated text databases. WEALY addresses this by working directly from audio. It extracts what the researchers describe as "lyrics-aware Whisper latents": representations of a track's lyrical content captured before the model commits to specific word predictions.

A central contribution is reproducibility. Prior Whisper-based approaches to lyrics matching and version identification reported strong results but with unclear methodology and no released code or model checkpoints. WEALY releases both, establishing transparent baselines that future research can build on.

Key Challenges

Lyrics matching requires capturing multilingual semantic relationships. The same content expressed in different languages should register as similar. The task also demands scalability across large catalogs where manual annotation is impractical. The field lacks large-scale, high-quality evaluation datasets. WEALY addresses this by using musical version identification as a proxy task across three publicly available datasets.

The pipeline also navigated a design choice common in audio-language research: whether to apply vocal source separation before feature extraction. The researchers found that even with a strong separator, Whisper's transcription quality improved only modestly. WEALY operates directly on the raw mixture, simplifying the pipeline without sacrificing performance.

Results and Conclusion

WEALY consistently outperforms transcription-based baselines across all three evaluation datasets. Ablation studies confirm that the NT-Xent contrastive loss, GeM pooling and temporal modeling via the transformer encoder are each important to final performance. Restricting Whisper to English-only decoding degrades results, indicating that multilingual cues carry meaningful retrieval signal.
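
Both ablated components are standard building blocks. Below is a minimal PyTorch sketch of GeM pooling over frame-level embeddings and the NT-Xent loss for a batch of positive pairs; the exponent and temperature values are common defaults, not necessarily the ones WEALY uses.

```python
import torch
import torch.nn.functional as F

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the time axis.
    x: (batch, time, dim) frame-level embeddings."""
    return x.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss for a batch of positive pairs (z1[i], z2[i])."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                  # (2B, dim)
    sim = z @ z.t() / temperature                   # cosine similarities
    sim.fill_diagonal_(float("-inf"))               # exclude self-pairs
    batch = z1.size(0)
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(0, batch)])   # index of each positive
    return F.cross_entropy(sim, targets)
```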

When WEALY's lyrics-aware embeddings are combined with CLEWS (an audio content-based model) via distance-level late fusion, the result surpasses both unimodal approaches. As the researchers conclude, the combination "underlin[es] the complementarity of lyric and audio cues, and point[s] to promising multimodal extensions for MVI." The fusion achieves a MAP of 0.912 on SHS without introducing additional model complexity.


Research Title
Automatic Music Sample Identification with Multi-Track Contrastive Learning

Researchers: Alain Riou, Joan Serrà, Yuki Mitsufuji (Sony AI)

Link to Paper
Full training code and pretrained models are available at: github.com/sony/sampleid

 

Introduction

Sampling, or reusing fragments of existing recordings to create new music, is a defining practice in hip-hop, electronic music, and many other genres. It is also a persistent challenge for intellectual property attribution. Identifying sampled content in a finished track requires a system that can match short audio fragments across large catalogs, even when the original material has been pitch-shifted, time-stretched, or blended into a new mix. Existing approaches have made progress on this problem, but they share a common limitation: positive training pairs are constructed from single songs, where the sample is simply a subset of the original. In practice, sampled material is embedded within a new mix or a different audio context that prior methods do not model.

This research proposes a self-supervised approach that addresses this gap directly. By leveraging a multi-track dataset, the researchers construct positive training pairs as artificial mixes on-the-fly, combining stems from different songs to simulate the conditions under which sampling actually occurs. A custom contrastive learning objective trains the model to recognize shared source material across these mixes. The result outperforms prior state-of-the-art by 15% in mean average precision.
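
The core of the training recipe is how positive pairs are built. The sketch below only illustrates the idea, two artificial mixes that share one stem but otherwise combine material from different songs; the actual gain ranges, stem selection, and augmentations are described in the paper.

```python
# Illustrative sketch: build a positive pair of artificial mixes that share
# one stem. All stems are assumed trimmed to the same length.
import random
import torch

def make_positive_pair(shared_stem, other_stems_a, other_stems_b, gain_db=(-6, 0)):
    """shared_stem: (channels, samples) tensor for the sampled material.
    other_stems_a / other_stems_b: lists of stems from *different* songs.
    Returns two mixtures that form a positive pair for contrastive training."""
    def mix(stems):
        out = torch.zeros_like(shared_stem)
        for s in stems:
            gain = 10 ** (random.uniform(*gain_db) / 20)  # random level per stem
            out = out + gain * s
        return out
    mix_a = mix([shared_stem] + other_stems_a)
    mix_b = mix([shared_stem] + other_stems_b)
    return mix_a, mix_b
```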

Why It Matters

Sample identification at scale is relevant wherever music catalogs need to be searched for unauthorized or unlicensed use. A system that works robustly across genres, handles pitch and tempo transformations, and scales to large databases addresses a real operational need. Prior methods achieved meaningful results but were trained on datasets where sampling relationships were simpler than they are in practice. The multi-track training approach closes that gap by constructing training examples that reflect how samples actually appear in finished music.

The decision to release code and pretrained models also matters. As the WEALY paper in this same group of ICASSP acceptances demonstrates, reproducibility is a known problem in audio research. Open-sourcing the full training pipeline removes one barrier to future progress.

Key Challenges

Time-stretching and pitch-shifting are the transformations most commonly applied when sampling; removing either from the training pipeline substantially degrades performance. Rather than applying these operations in the audio domain — which is computationally expensive — the researchers work in the Variable-Q Transform (VQT) domain, where pitch shifts become frequency-axis crops and time stretches become temporal interpolations. This design choice makes training tractable without sacrificing the augmentation diversity that robustness requires.
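
The reason this is cheap is that, on a log-frequency representation like the VQT, a pitch shift becomes a shift along the frequency axis and a time stretch becomes an interpolation along the frame axis. A rough sketch, with bin counts and ranges chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def augment_vqt(vqt, max_shift_bins=12, stretch_range=(0.8, 1.25)):
    """Cheap pitch-shift / time-stretch on a VQT magnitude spectrogram.
    vqt: (freq_bins, frames)."""
    # Pitch shift: shift along the frequency axis, zeroing the wrapped bins.
    shift = torch.randint(-max_shift_bins, max_shift_bins + 1, (1,)).item()
    shifted = torch.roll(vqt, shifts=shift, dims=0)
    if shift > 0:
        shifted[:shift] = 0.0
    elif shift < 0:
        shifted[shift:] = 0.0
    # Time stretch: interpolate along the frame axis.
    rate = float(torch.empty(1).uniform_(*stretch_range))
    n_frames = max(1, int(round(vqt.shape[1] / rate)))
    stretched = F.interpolate(shifted.unsqueeze(0).unsqueeze(0),
                              size=(vqt.shape[0], n_frames),
                              mode="bilinear", align_corners=False)
    return stretched.squeeze(0).squeeze(0)
```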

A second finding concerns data quality over data quantity. When the researchers varied training set size, performance dropped only modestly at 20% of the full dataset. By contrast, reducing the granularity of available stems had a much larger effect. As the researchers note, "having high-quality ground truth separated stems yields more improvement than simply scaling up data," a finding with broad implications for how multi-track datasets are curated.

Results and Conclusion

On the standard Sample100 benchmark, the model achieves a mean average precision of 0.603, compared to 0.442 for the previous best result. Performance on SamplePairs, a newly released evaluation set spanning more diverse genres, shows that the model's representations generalize beyond hip-hop. Hit rate at rank 1 remains stable as the number of noise songs in the reference database grows, indicating that the learned embeddings are tightly clustered around their true references even at scale.

The researchers also identify a conceptual boundary that future work will need to address: if two tracks each sample a different instrument from the same original, an ideal system should map the original close to both, but the two samples should remain far from each other, which standard contrastive loss formulations cannot simultaneously satisfy. Resolving this while maintaining scalability is flagged as an open direction for future research.


2. Generative Audio & Creative Production

Research Title
SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

Researchers: Kazuki Shimada, Takashi Shibuya, Yuki Mitsufuji (Sony AI); Christian Simon, Shusuke Takahashi (Sony Group Corporation)

Link to Paper

 

Introduction to SAVGBench

Generative models have made significant progress in video synthesis. Most work on audio-video generation, however, overlooks a fundamental aspect of how sound behaves in the real world: it comes from somewhere. A voice arrives from the left. An instrument sounds from the right. For generated content to feel immersive and spatially coherent, audio and video must be aligned not just temporally but spatially. It’s an art and a science.

This research establishes a new research direction, Spatially Aligned Audio-Video Generation (SAVG), and introduces SAVGBench, the first benchmark designed to evaluate it. SAVGBench comprises a curated stereo-audio and perspective-video dataset derived from STARSS23, a dataset of spatial scene recordings, along with a novel evaluation metric, Spatial AV-Align, that measures the spatial correspondence between sound events in audio and objects in video. Two baseline methods are benchmarked: a joint audio-video generation model and a two-stage pipeline combining separate video and audio generation models.

 

Why SAVGBench Matters

Spatial audio-visual alignment is essential for immersive content. Virtual reality, world simulation and multimedia applications all depend on sound arriving from the correct direction relative to on-screen events. No existing benchmark was designed to evaluate this property directly. Prior audio-video generation datasets are dominated by speech and music. Most also lack the annotations needed to assess spatial correspondence.

SAVGBench addresses this by curating content where sound events are onscreen and can be reliably tracked. The Spatial AV-Align metric evaluates alignment using an object detector and a sound event localization and detection (SELD) model. Crucially, it does not require ground truth audio for the generated output, making it applicable in fully generative settings.
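
The exact definition of Spatial AV-Align is given in the paper; the sketch below only illustrates the kind of per-frame comparison it relies on, matching an estimated sound direction from a SELD model against detected object boxes, under a simplified linear azimuth-to-pixel mapping.

```python
# Illustrative sketch only, not the metric's actual definition.
def frame_aligned(sound_azimuth_deg, boxes, fov_deg=90.0):
    """Check whether an estimated sound direction falls inside any detected
    object box for one video frame.
    sound_azimuth_deg: direction from a SELD model, 0 = straight ahead.
    boxes: list of (x_min, x_max) horizontal extents, normalized to [0, 1]."""
    # Map azimuth to a horizontal image coordinate (simplified linear mapping).
    x = min(max(0.5 + sound_azimuth_deg / fov_deg, 0.0), 1.0)
    return any(x_min <= x <= x_max for (x_min, x_max) in boxes)

def spatial_av_align(per_frame_azimuths, per_frame_boxes):
    """Fraction of sound-event frames whose direction matches an on-screen box."""
    hits = [frame_aligned(az, boxes)
            for az, boxes in zip(per_frame_azimuths, per_frame_boxes)]
    return sum(hits) / max(1, len(hits))
```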

Key Challenges

Creating a spatially aligned dataset requires more than collecting video clips with audio. The researchers converted 360-degree video and first-order Ambisonics audio from STARSS23 into perspective video and stereo audio, tracking sound event positions throughout. Only clips where sound events were onscreen were retained. Classes were restricted to speech and instruments, the categories most reliably detected by both the object detector and SELD model.

On the modeling side, the joint generation approach required significant GPU memory, limiting video resolution to 64x64 during training and requiring a super-resolution model to produce usable outputs. The two-stage method faced a different challenge: generated video provides weaker spatial grounding than real video, which disadvantages the audio generation stage.

Results and Conclusion

The joint method produces better spatial audio-visual alignment than the two-stage approach, and this advantage shows up in both objective evaluation and a subjective listening test. Both methods still lag noticeably behind ground-truth alignment, which indicates substantial headroom. As the researchers note, results "indicate potential for further improvement" in spatial audio-visual alignment. SAVGBench provides the dataset, metric and baselines the research community needs to close that gap.


Research Title
Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

Researchers: Shuyang Cui, Zhi Zhong, Qiyu Wu, Keisuke Toyama, Christian Simon, Chihiro Nagashima, Shusuke Takahashi (Sony Group Corporation); Zachary Novack (UC San Diego)

Link to Paper

 

Introduction

Creating drum loop audio in digital music production is time-consuming. Existing methods such as using one-shot samples or resampling require significant effort from creators, while recent generative models (despite achieving high fidelity) lack the fine-grained control that drum production specifically demands. Symbolic-to-audio research has also tended to focus on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis largely unaddressed.

The Break-the-Beat! model addresses this gap directly by rendering drum MIDI with the timbre of a reference audio recording. It is built by fine-tuning a pretrained text-to-audio model with a proposed content encoder and a hybrid conditioning mechanism. To support training, the researchers constructed a new dataset of paired target-reference drum audio drawn from existing drum audio datasets.

Experiments show that the model generates high-quality drum audio that follows high-resolution drum MIDI, with strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. The result is a controllable, generative tool for creative drum production.


Research Title
Automatic Music Mixing Using a Generative Model of Effect Embeddings

Researchers: Eloi Moliner (Aalto University / Sony AI); Marco A. Martinez-Ramirez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Yuki Mitsufuji (Sony AI); Vesa Valimaki (Aalto University)

Link to Paper
Code and audio examples are available at github.com/SonyResearch/MEGAMI

 

Introduction

Music mixing is the process of combining individual instrument tracks into a finished song using audio effects: equalization, compression, panning, reverb, and more. It is a task defined as much by artistic judgment as by technical criteria, and a defining feature of it is that there is no single correct answer. Multiple valid mixes can exist for the same set of input tracks, reflecting different production styles and creative choices.

By treating mixing as a deterministic regression problem (i.e., learning a single mapping from input tracks to output mix), prior approaches tend to produce conservative, averaged-out results that flatten the diversity of professional practice.

MEGAMI (Multitrack Embedding Generative Auto MIxing) addresses this directly. Rather than predicting a single mix, it models the conditional distribution of professional mixing decisions using a diffusion model that operates in an effect-embedding space. The result is the first generative framework for automatic music mixing: a system that can produce varied, high-quality mixes that reflect the genuine range of how professionals approach the same material.

Why It Matters

The shift from regression to generative modeling is not incidental. It reflects a more accurate understanding of what mixing actually is. A system that learns the distribution of professional decisions, rather than their average, can produce outputs that feel like genuine mixing choices rather than statistical compromises.

The framework also resolves a practical constraint that has limited prior work. Training automatic mixing systems typically requires paired dry (unprocessed) and wet (processed) multitrack recordings, which are rare. MEGAMI's domain adaptation strategy allows training directly on wet-only stems, which are far more widely available, by aligning their embeddings toward the dry domain in representation space, without relying on signal processing-based effect removal that can introduce its own artifacts.

Key Challenges

A central design problem was how to disentangle mixing decisions from musical content. MEGAMI addresses this through a latent-variable formulation: rather than generating audio directly, the diffusion model operates over effect embeddings that capture mixing characteristics separately from the musical material itself. A dedicated effect encoder (FxEncoder++) extracts these embeddings from processed tracks; a separate CLAP encoder captures content-level track semantics such as instrument type, without requiring explicit labels.

The architecture also needed to handle songs with varying numbers of tracks in no fixed order. A permutation-equivariant transformer handles this by performing self-attention across tracks and randomly permuting track order during training, preventing the model from associating a fixed position with a specific instrument role. This makes MEGAMI applicable to arbitrary multitrack sessions without requiring labeled stem categories.

Objective evaluation presented its own challenge. Conventional pairwise metrics, which compare system outputs to a single human reference, are not well suited for evaluating a generative approach where multiple valid outputs exist. The researchers instead use Kernel Audio Distance (KAD), a distributional metric that measures, as the paper describes, "the distributional distance between the set of human mixes in the benchmark and the set of mixes produced by each system."
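
KAD is a kernel-based distributional distance over sets of audio embeddings. The sketch below computes a simple MMD-style estimate between the set of human-mix embeddings and a system's mix embeddings; the kernel choice, bandwidth, and estimator are illustrative assumptions rather than the exact KAD definition.

```python
import torch

def rbf_kernel(x, y, bandwidth=10.0):
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def kernel_distance(human_embs, system_embs, bandwidth=10.0):
    """MMD-style distance between two sets of audio embeddings: the human
    mixes in the benchmark and the mixes produced by one system."""
    k_hh = rbf_kernel(human_embs, human_embs, bandwidth).mean()
    k_ss = rbf_kernel(system_embs, system_embs, bandwidth).mean()
    k_hs = rbf_kernel(human_embs, system_embs, bandwidth).mean()
    return k_hh + k_ss - 2 * k_hs
```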

Results and Conclusion

MEGAMI consistently outperforms all automatic mixing baselines across objective metrics, achieving the lowest distributional distances to human mixes. The variant trained on the large internal dataset of approximately 20,000 professionally mixed songs performs best among data-dependent configurations, confirming that the domain adaptation strategy scales effectively with dataset size.

The generative framework also opens directions that deterministic approaches cannot support: generating synthetic mix datasets from available recordings, enabling time-varying embeddings for dynamic mixing decisions, and modeling album-level coherence across multiple songs.


Research Title
FlashFoley: Fast Interactive Sketch2Audio Generation

Researchers: Zachary Novack, Koichi Saito, Zhi Zhong, Takashi Shibuya, Shuyang Cui, Julian McAuley, Taylor Berg-Kirkpatrick, Christian Simon, Shusuke Takahashi, Yuki Mitsufuji (UC San Diego / Sony Group Corporation / Sony AI)

Link to Paper
Audio examples are available at anonaudiogen.github.io/web

 

Introduction

Text-to-audio generation has matured to the point where models can produce rich, high-quality soundscapes from natural language prompts. Two capabilities that creative workflows depend on, however, have developed in isolation from each other: fine-grained control over the generated audio, and inference fast enough for real-time interaction. As the researchers put it directly, "controllable models are not fast, and fast models are not controllable." Neither is well-suited to live sound design, interactive Foley, or real-time jamming, where a creator needs to sketch a sound (humming a pitch contour, indicating volume and brightness) and hear a result immediately.

FlashFoley resolves this tradeoff. It is the first open-source, accelerated sketch-to-audio model, capable of generating 11 seconds of stereo audio in 75 milliseconds while accepting time-varying sketch controls (like pitch, volume, and spectral brightness) extracted from a vocal or audio input. It also supports streaming generation, allowing audio output to begin while sketch controls are still being captured.

Why It Matters

Speed and control have been treated as competing objectives because the techniques used to achieve each have different requirements. FlashFoley demonstrates that they can coexist in a single open-source system. The 75ms generation latency is fast enough for offline interactive use; the streaming mode enables the first audio output to play within approximately 6 seconds as input continues. For practitioners working in real-time audio contexts — game audio, live performance, interactive installations — this combination opens workflows that existing systems cannot support. The open-source release is also significant; the leading prior system in this space, Sketch2Sound, is fully closed source.

Key Challenges

Achieving both fast inference and sketch controllability required solving them in sequence rather than simultaneously. The researchers first fine-tuned a pretrained text-to-audio model with pitch, volume, and brightness controls using a lightweight linear adaptation method — adding a single linear layer per control before the model's transformer blocks, rather than introducing the larger parameter counts that other conditioning methods require. Adversarial post-training was then applied to the already-controlled model, enabling generation in as few as 8 steps without meaningful loss of control accuracy.
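
The conditioning mechanism itself is small. The sketch below shows the general shape of such a lightweight adapter, one linear projection per time-varying control added to the token sequence before the transformer blocks; the placement, dimensions, and initialization used in FlashFoley are described in the paper.

```python
import torch
import torch.nn as nn

class SketchControlAdapter(nn.Module):
    """One linear layer per time-varying control (pitch, volume, brightness).
    Each control is a per-frame scalar whose projection is added to the
    token sequence before the transformer blocks."""
    def __init__(self, d_model, controls=("pitch", "volume", "brightness")):
        super().__init__()
        self.proj = nn.ModuleDict({c: nn.Linear(1, d_model) for c in controls})

    def forward(self, tokens, control_signals):
        # tokens: (batch, frames, d_model); control_signals[c]: (batch, frames)
        for name, signal in control_signals.items():
            tokens = tokens + self.proj[name](signal.unsqueeze(-1))
        return tokens
```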

A non-obvious design decision arose in adapting the adversarial training objective to time-varying controls. The contrastive component of the training procedure should apply only to the text conditioning, not the sketch controls. The researchers found that contrasting on sketch controls caused the discriminator to overfit to them, making "the generator largely ignore the text inputs and negatively affecting its ability to model higher frequency timbral information."

Streaming generation presented a separate challenge: flow-based models are not inherently streamable, as they require the full input context before generating output. Rather than retraining a causal variant, the researchers developed a zero-shot block-autoregressive algorithm that generates audio in overlapping chunks, using an equal-power crossfade to smooth boundaries. The continuous stream of local sketch controls provides fine-grained supervision that mitigates the artifacts this approach would otherwise introduce.
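
The equal-power crossfade used to join overlapping chunks is a standard audio operation; a minimal sketch follows (the chunk and overlap lengths FlashFoley uses are not assumed here).

```python
import torch

def equal_power_crossfade(prev_chunk, next_chunk, overlap):
    """Join two generated audio chunks with an equal-power crossfade over
    `overlap` samples, smoothing the block boundary."""
    t = torch.linspace(0.0, 1.0, overlap)
    fade_out = torch.cos(t * torch.pi / 2)   # power fades from 1 to 0
    fade_in = torch.sin(t * torch.pi / 2)    # power fades from 0 to 1
    blended = prev_chunk[..., -overlap:] * fade_out + next_chunk[..., :overlap] * fade_in
    return torch.cat([prev_chunk[..., :-overlap], blended,
                      next_chunk[..., overlap:]], dim=-1)
```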

Results and Conclusion

Evaluated on the VimSketch dataset, FlashFoley improves control accuracy across all sketch dimensions relative to the controlled-but-slow baseline, while achieving approximately a 10x reduction in generation latency. The block-autoregressive streaming mode halves streaming latency with only modest degradation in audio quality. FlashFoley establishes a foundation for real-time, controllable audio generation and opens questions about how sketch-to-audio interfaces can be integrated into practical creative workflows.


3. Audio-Visual Alignment & Evaluation

Research Title
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Researchers: Akira Takahashi, Shusuke Takahashi (Sony Group Corporation); Yuki Mitsufuji (Sony AI / Sony Group Corporation)

Link to Paper
Code Available at: https://github.com/sony/mmaudiosep

 

Introduction to MMAudioSep

Sound separation and audio generation have developed largely in parallel. Separation models learn to isolate individual sources from mixed signals, while generation models learn to synthesize audio that corresponds to video and text. The two capabilities share a common foundation: understanding the relationship between multimodal inputs and audio content. That shared knowledge has not previously been exploited across tasks.

MMAudioSep bridges this gap. It adapts MMAudio, a state-of-the-art video-to-audio generation model, for video/text-queried sound separation through fine-tuning, using a channel-concatenation conditioning mechanism that incorporates the mixture signal as an additional input. The result is a single model that performs both sound separation and video-to-audio generation using identical parameters, the first research to unify these two capabilities in one system.

Why MMAudioSep Matters

Fine-tuning a pretrained generative model for separation is more efficient than training from scratch. It also brings the rich multimodal knowledge of the generation model directly to bear on the separation task. This cross-domain knowledge transfer addresses a gap that has limited separation models: they are typically trained only on audio, without access to the broader semantic understanding that generation models develop through multimodal training.

The retention of video-to-audio generation capability after fine-tuning is also significant. As the researchers note, this "highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks" without sacrificing their original functionality.

Key Challenges

The central technical challenge was incorporating the mixture signal into MMAudio's architecture without disrupting its learned multimodal representations. The researchers achieved this through a channel-concatenation conditioning mechanism, concatenating noise and mixture latents along the channel dimension before processing. During fine-tuning, only the audio projection layer and multimodal transformer blocks were updated. Other parameters remained frozen, preserving the model's visual and textual knowledge.
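
In code terms, the recipe amounts to two small steps, sketched below with placeholder module names (the actual attribute names in the MMAudio implementation will differ): concatenate the mixture latent onto the noise latent along the channel axis, and freeze everything except the audio projection and the multimodal transformer blocks.

```python
import torch

def build_separation_input(noise_latent, mixture_latent):
    """Channel-concatenation conditioning: the mixture latent rides along
    with the noise latent as extra input channels."""
    # noise_latent, mixture_latent: (batch, channels, frames)
    return torch.cat([noise_latent, mixture_latent], dim=1)

def freeze_for_finetuning(model, trainable_prefixes=("audio_proj", "transformer_blocks")):
    """Freeze all parameters except the audio projection layer and the
    multimodal transformer blocks; the prefixes are placeholders."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
```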

Evaluation presented a separate challenge. As a generative approach, MMAudioSep is not well-served by conventional separation metrics like signal-to-distortion ratio. The researchers instead used generative metrics including Frechet Audio Distance, CLAP-based semantic alignment scores and audio-visual alignment measures.

Results and Conclusion

MMAudioSep outperforms AudioSep and FlowSep across nearly all metrics on the VGGSound-Clean and MUSIC evaluation datasets. Performance improves further when both text and video conditions are used together. On the video-to-audio generation benchmark, MMAudioSep performs comparably to traditional generation methods despite having been fine-tuned for separation, confirming that the core generation capability is preserved.

Pretrained initialization consistently outperforms training from scratch, validating the hypothesis that cross-domain knowledge transfer from generation to separation is beneficial. The authors identify universal sound separation, extending the model to a broader range of sound categories, as a key direction for future work.


Research Title
FoleyBench: A Benchmark for Video-to-Audio Models

Researchers: Satvik Dixit (Carnegie Mellon University); Koichi Saito (Sony AI); Zhi Zhong (Sony Group Corporation); Yuki Mitsufuji (Sony AI / Sony Group Corporation); Chris Donahue (Carnegie Mellon University)

Link to Paper

Dataset samples are available at gclef-cmu.org/foleybench

 

Introduction

Video-to-audio generation has advanced rapidly, with a growing family of models capable of synthesizing sound from visual input. Evaluating these models has not kept pace. The field's de facto benchmark—the VGGSound test set—was not designed for Foley-style evaluation and contains significant content that falls outside its intended scope: speech, music, and clips where the audio is not causally linked to on-screen events. The researchers find that 74% of VGGSound clips have poor audio-visual correspondence by Foley standards. A model can score well on this benchmark by handling speech and music (which are handled separately in professional production workflows) while performing poorly on the non-speech, non-music sound effects that Foley actually requires.

FoleyBench is the first large-scale benchmark built specifically for Foley-style video-to-audio evaluation: 5,000 curated video-audio-caption triplets in which every clip contains visible sound sources with audio causally tied to on-screen events, free of speech and music. Each clip is labeled with metadata capturing source complexity, sound envelope type, and category, enabling analysis of model behavior that aggregate scores cannot provide.

Why It Matters

Benchmarks shape research priorities. A benchmark dominated by speech and music steers model development toward those domains, even when the stated goal is Foley synthesis. FoleyBench reorients evaluation toward the specific properties that Foley requires: temporal synchronization between sound and visible action, semantic accuracy relative to on-screen events, and audio fidelity across a diverse range of non-musical sound classes.

The fine-grained metadata is also a meaningful contribution in its own right. By tagging clips for source complexity (single vs. multi-source) and sound envelope type (discrete events vs. continuous ambience), FoleyBench makes it possible to identify where specific models fail, not just how they rank overall.

Key Challenges

Constructing a clean Foley benchmark from internet video required a two-stage filtering pipeline. A first pass using the YAMNet audio classifier removed clips containing speech or music, eliminating 97.7% of source material. A second pass using Gemini 2.5 Pro assessed whether the remaining audio was causally grounded in visible on-screen events. The researchers describe this as checking whether "the sounds are causally and temporally grounded in visible on-screen actions" — for example, if the audio is clapping, the video must show visible hands clapping in sync. Together these stages yield a 72% precision rate for Foley relevance, compared to 25.5% when the same pipeline is applied to VGGSound.
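
The filtering logic itself is simple once the two classifiers exist. The skeleton below uses stand-in callables for the YAMNet-based audio pass and the Gemini-based causality check; both are hypothetical interfaces for illustration, not the released pipeline.

```python
def foley_filter(clips, audio_classifier, causality_checker):
    """Two-stage filtering skeleton. `audio_classifier` stands in for the
    YAMNet-based pass (returns predicted audio classes); `causality_checker`
    stands in for the vision-language pass that asks whether the sounds are
    causally and temporally grounded in visible on-screen actions."""
    kept = []
    for clip in clips:
        labels = audio_classifier(clip.audio)
        # Stage 1: drop anything containing speech or music.
        if "Speech" in labels or "Music" in labels:
            continue
        # Stage 2: keep only clips whose audio is grounded in visible events.
        if causality_checker(clip.video, clip.audio):
            kept.append(clip)
    return kept
```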

Category diversity was a separate concern. VGGSound, even after filtering, is heavily skewed toward a small number of sound classes. FoleyBench achieves a Shannon entropy of 5.35 across UCS categories, compared to 4.73 for the filtered VGGSound subset — a more balanced distribution that prevents evaluation from being dominated by a model's performance on common sounds.

Results and Conclusion

Across twelve state-of-the-art V2A models, FoleyBench surfaces patterns that VGGSound does not. MMAudio achieves the strongest overall performance; however, the fine-grained analysis reveals consistent failure modes. On discrete sound events, models improve at temporal synchronization but degrade significantly in audio fidelity — suggesting that while visual cues provide a clear temporal signal, current models "fail to render the corresponding high-fidelity impact." Performance on background ambience and multi-source scenes is consistently weaker than on single-source action clips.

FoleyBench-Long, a supplementary set of 650 thirty-second clips, reveals a further gap: models that perform well on short clips suffer substantial quality degradation at longer durations, with MMAudio's Frechet Audio Distance worsening from 8.76 to 27.5 in the long-form setting. By releasing both the benchmark and the full data pipeline, the researchers aim to provide the infrastructure for more targeted and reliable progress in video-to-audio generation.


Research Title
Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

Researchers: Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik (Sony Research India)

Link to Paper

 

Introduction

AI dubbing has advanced significantly with progress in neural machine translation, text-to-speech synthesis, and audio-visual synchronization. Evaluating the quality of dubbed content is the next frontier. Existing automatic metrics assess isolated dimensions such as speech naturalness, lip sync, and intelligibility, but none captures the holistic impression that a human viewer forms when watching dubbed content. That impression is shaped simultaneously by prosody, speaker identity, emotional consistency, semantic accuracy, and temporal alignment. No single metric accounts for all of these at once.

Human Mean Opinion Scores (MOS) remain the standard for perceptual evaluation, but collecting them at scale is costly and impractical. This research addresses both problems: it introduces a hierarchical multimodal architecture that predicts perceived dubbing quality from audio, video, and text inputs, and a scalable weak supervision strategy called Proxy MOS that enables training without requiring large volumes of human annotations.

 

Why It Matters

Dubbing adoption is increasing across the industry at varying levels of maturity, and the need for automatic evaluation that reflects human perception grows alongside it. The framework presented here is not limited to AI-generated content; it applies equally to manually dubbed output, positioning it as a broader perceptual quality metric for dubbed content rather than a tool tied solely to automation. A metric that correlates reliably with how viewers actually experience dubbed content enables faster iteration, more meaningful system comparisons, and quality control at scale. The proposed architecture achieves a Pearson correlation coefficient above 0.75 with human ratings, a meaningful threshold for a task where even human raters show only moderate agreement with one another (ICC1 = 0.69).

The Proxy MOS strategy also has broader relevance. The active learning approach used to learn weights across objective metrics (incrementally expanding the labeled pool by prioritizing uncertain and diverse samples) is a general framework for situations where human annotations are scarce but multiple automatic signals are available.

Key Challenges

Dubbing quality is multidimensional in a way that creates a specific modeling challenge: fusing audio, video, and text signals directly risks one modality dominating or obscuring the others. The researchers address this through staged fusion, reflecting the observation that "fusing these directly risks information loss or modality dominance." The architecture first consolidates cues within each modality through intra-modal attention-based gating, then integrates across modalities through a transformer encoder, mirroring how human evaluators likely process dubbed content before forming an overall judgment.

A separate challenge was scale. The full dataset of approximately 12,000 dubbed clips (drawn from the MELD and M2H2 datasets, dubbed bidirectionally between Hindi and English) was too large to annotate with human MOS. Proxy MOS fills this gap by aggregating five objective metrics covering audio-visual synchrony, emotional alignment, speaker consistency, speech quality, and semantic accuracy. Weights for each metric are learned through an active learning scheme guided by a small subset of human ratings, rather than assigned equally.
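
The aggregation step is a weighted combination of the five objective metrics. The sketch below stands in a plain least-squares fit on the small human-rated subset for the paper's active-learning scheme; the metric names and the fitting procedure are illustrative assumptions.

```python
import numpy as np

METRICS = ["av_sync", "emotion_align", "speaker_consistency",
           "speech_quality", "semantic_accuracy"]   # names are illustrative

def fit_proxy_weights(metric_matrix, human_mos):
    """Fit non-negative weights mapping objective metrics to human MOS on the
    small labeled subset (the paper uses an active-learning scheme; a plain
    least-squares fit stands in for it here).
    metric_matrix: (n_clips, 5) objective scores; human_mos: (n_clips,)."""
    w, *_ = np.linalg.lstsq(metric_matrix, human_mos, rcond=None)
    w = np.clip(w, 0.0, None)
    return w / (w.sum() + 1e-8)

def proxy_mos(metric_matrix, weights):
    """Weighted combination of the objective metrics for unlabeled clips."""
    return metric_matrix @ weights
```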

Results and Conclusion

Ablation experiments across unimodal, bimodal, and full multimodal configurations confirm that each modality contributes, and that their combination outperforms any subset. Audio provides the strongest individual signal; video alone contributes little in isolation; text adds complementary semantic grounding. The full audio-video-text system achieves PCC = 0.76 and SRCC = 0.77 against human MOS.

Active learning consistently outperforms random sampling for Proxy MOS weight estimation. At full budget, the active learning strategy reaches PCC = 0.82 and SRCC = 0.81 — a statistically significant improvement over random sampling. The researchers conclude that "AL-based proposed Proxy MOS is more effective than simple averaging and that its combination with human supervision provides the most reliable perceptual predictions."


4. Data Quality, Robustness & Speech Processing

Research Title
Do Foundational Audio Encoders Understand Music Structure?

Researchers: Keisuke Toyama, Zhi Zhong, Akira Takahashi, Shusuke Takahashi (Sony Group Corporation); Yuki Mitsufuji (Sony AI / Sony Group Corporation)

Link to Paper

 

Introduction

Pretrained foundational audio encoders (FAEs) have become widely used across music information retrieval tasks, improving performance on music tagging, transcription and source separation. Their application to music structure analysis (MSA), the task of segmenting a recording into functional sections like verse, chorus and bridge, is largely unexplored. Only two FAEs had been examined for MSA prior to this work. The factors that determine whether a given encoder suits the task were not well understood.

This research conducts a comprehensive evaluation of 11 FAEs for music structure analysis, examining the effect of learning method, training data and model context length on performance. The study uses linear probing, connecting a single linear layer to frozen FAE features, so that observed differences in performance can be attributed to the encoders themselves rather than downstream model complexity.
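
Linear probing keeps the comparison fair by construction: the only trainable parameters sit in a single linear head. A minimal sketch, assuming frame-level features and frame-level section labels:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Single linear layer on top of frozen encoder features, so that
    differences in performance reflect the encoder rather than the head."""
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, frozen_features):
        # frozen_features: (batch, frames, feature_dim) from a frozen FAE.
        return self.head(frozen_features)

def probe_step(encoder, probe, audio, labels, optimizer,
               loss_fn=nn.CrossEntropyLoss()):
    with torch.no_grad():            # the encoder stays frozen
        feats = encoder(audio)
    logits = probe(feats)
    loss = loss_fn(logits.flatten(0, 1), labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```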

Why It Matters

Understanding which FAEs are effective for music structure analysis has implications beyond MSA. Many of these encoders serve as evaluation metrics, as the backbone of Frechet Audio Distance (FAD) and related measures for assessing the quality of generated music. If the encoders used in these metrics do not actually understand music structure, their validity for evaluating long-form music generation is open to question. This research provides the first systematic evidence to guide those choices.

Key Challenges

Music structure analysis is a long-term task. Identifying a chorus requires understanding how it relates to the verse that precedes it, which requires processing music over time spans that many encoders were not designed to handle. The study investigates whether context length, frame rate and training objective each contribute to this capacity, and finds that they do in distinct ways.

A particular challenge is distinguishing between encoders that capture semantic musical content and those that capture fine-grained acoustic detail. Codec-based models like EnCodec and DAC perform poorly on MSA despite strong performance on other audio tasks.

Results and Conclusion

FAEs trained with self-supervised masked language modeling on music data, particularly MusicFM, which processes 30-second contexts, achieve the strongest performance across boundary detection and function prediction. Supervised and contrastive learning models fall consistently short, as do codec-based encoders. The researchers attribute this to their optimization for acoustic detail rather than semantic structure.

Context length is a meaningful factor. MusicFM's 30-second context window appears to allow the model to capture the repeating, section-level structure of music in ways that 5-second context models cannot. Training on full-track or long-form music data is also identified as a contributing factor. As the researchers conclude, "FAEs trained with self-supervised masked language models trained on music data with a long context length achieve the strongest performance," and suggest MLM-trained models as improved backbones for music generation evaluation metrics.


Research Title
Towards Blind Data Cleaning: A Case Study in Music Source Separation

Researchers: Azalea Gui (University of Toronto / Sony AI); Woosung Choi, Junghyun Koo, Kazuki Shimada, Takashi Shibuya, Joan Serrà, Wei-Hsiang Liao, Yuki Mitsufuji (Sony AI)

Link to Paper

 

Introduction

Model performance in music source separation depends heavily on training data quality. In practice, datasets are often contaminated by artifacts that are difficult to detect automatically: audio bleeding between stems, label noise or other corruptions whose type and extent are unknown in advance. Targeted cleaning methods that address specific artifact types are impractical when the nature of the contamination is unclear.

This research formalizes the problem as blind data cleaning and proposes two noise-agnostic approaches. The first is based on data attribution via unlearning. The second uses the Frechet Audio Distance as a distributional similarity measure. Both methods require only a small set of trusted clean reference samples and operate without prior knowledge of what kind of noise is present.

Why It Matters

The dominant approach to improving model performance is architectural: better models, more parameters, refined training procedures. This work makes the case that data quality is at least as important, and that it can be addressed systematically without needing to know what is wrong with the data in advance. The noise-agnostic design is deliberate. A method that requires knowing the artifact type cannot scale to the messy, varied reality of large real-world datasets.

The unlearning-based approach introduces an efficient inversion of standard influence estimation. Rather than unlearning each training sample to estimate its influence, which is computationally prohibitive at scale, the method unlearns a small set of clean reference samples and measures the resulting change in loss across the training set.
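
A rough sketch of that inversion follows: run a few gradient-ascent steps on the clean reference loss to "unlearn" it, then score every training sample by how much its loss moves. The learning rate, step count, and scoring sign are illustrative; the actual method also applies elastic weight consolidation, discussed below, which this sketch omits.

```python
import copy
import torch

def unlearning_attribution(model, clean_refs, train_set, loss_fn, lr=1e-4, steps=10):
    """Score training samples by how their loss changes after 'unlearning'
    a small trusted reference set (gradient ascent on the reference loss).
    Samples whose loss changes least like the clean data are removal candidates."""
    unlearned = copy.deepcopy(model)
    opt = torch.optim.SGD(unlearned.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in clean_refs:
            loss = loss_fn(unlearned(x), y)
            opt.zero_grad()
            (-loss).backward()       # ascend: forget the clean references
            opt.step()
    scores = []
    with torch.no_grad():
        for x, y in train_set:
            before = loss_fn(model(x), y)
            after = loss_fn(unlearned(x), y)
            scores.append((after - before).item())
    return scores
```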

Key Challenges

Identifying which training samples to remove without knowing what is wrong with them is inherently difficult. The researchers address this through the mirrored influence hypothesis: training samples that appear less consistent with the trusted clean reference set tend to have lower attribution scores, making them candidates for removal.

Avoiding catastrophic forgetting during the unlearning process presented a secondary challenge. The researchers use elastic weight consolidation to preserve the model's general knowledge while selectively updating its representation of specific samples.

Results and Conclusion

Both the unlearning-based and FAD-based cleaning methods improve music source separation performance on a semi-synthetic contaminated dataset. As the researchers note, the result "closes approximately 66.7% of the performance gap between the contaminated baseline and a model trained on the same dataset without any contamination."

Generalization experiments using a dataset with unseen audio effects confirm that the noise-agnostic methods remain effective on artifact types they were not designed for. A specialist MLP-based classifier, by contrast, fails to generalize beyond its training conditions. This distinction is a central finding: broadly applicable methods outperform specialist approaches when the nature of the data corruption is unknown.


Research Title
Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-Resource Speech Recognition

Researchers: Aditya Srinivas Menon, Kumud Tripathi, Raj Gohil, Pankaj Wasnik (Sony Research India)

Link to Paper

 

Introduction

Self-supervised learning models have become a foundational tool in speech processing, enabling strong performance on tasks from automatic speech recognition to speaker verification with minimal labeled data. Their core limitation is computational: the self-attention mechanism that powers these models scales quadratically with input length, making fine-tuning expensive in terms of both memory and time. This cost is particularly acute in low-resource settings, where training data is scarce and hardware constraints are real.

SummaryMixing, a prior linear-time alternative to self-attention, addressed part of this problem by replacing pairwise attention with a global mean summary across the full utterance. But its global summary, as the researchers explain, "lacks adequate local context, limiting fine-grained temporal modeling essential for effective speech representation," a meaningful limitation for speech, where timing carries phonetic and prosodic information.

 

 

This research introduces Windowed SummaryMixing (WSM), an extension that preserves SummaryMixing's linear complexity while adding a neighborhood-level summary computed over a local window of frames around each timestep. Paired with a selective fine-tuning strategy that replaces only the final two self-attention layers of a pretrained SSL model with WSM blocks, the approach improves speech recognition performance while reducing peak VRAM usage by 40%.
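
A minimal sketch of the block's structure follows: a per-utterance global mean summary plus a local mean over a window of frames around each timestep, both linear in sequence length. The projections, the way the summaries are combined, and the default window of five frames per side are simplified assumptions here; the paper gives the exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSummaryMixing(nn.Module):
    """Linear-time alternative to self-attention: each frame sees a global
    mean summary of the utterance plus a local mean over +/- `window` frames."""
    def __init__(self, d_model, window=5):
        super().__init__()
        self.window = window
        self.local_proj = nn.Linear(d_model, d_model)
        self.global_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(3 * d_model, d_model)

    def forward(self, x):
        # x: (batch, time, d_model)
        g = self.global_proj(x).mean(dim=1, keepdim=True)          # global summary
        k = 2 * self.window + 1                                    # 11 frames total
        local = F.avg_pool1d(self.local_proj(x).transpose(1, 2),   # local summary
                             kernel_size=k, stride=1,
                             padding=self.window).transpose(1, 2)
        return self.out(torch.cat([x, local, g.expand_as(x)], dim=-1))
```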

Why It Matters

The practical case for WSM is straightforward: it makes SSL fine-tuning more accessible. Models fine-tuned with WSM require 30-32 GB of GPU VRAM compared to 50 GB for attention-based variants, and they train faster. For teams working in low-resource language settings where labeled data is limited and compute budgets may be constrained, this efficiency is not incidental; it determines what is feasible.

The selective fine-tuning strategy also addresses a known failure mode of SSL adaptation. Fine-tuning all layers of a large pretrained model on limited data tends to cause overfitting and poor generalization. By freezing the bulk of the network and updating only the newly introduced WSM layers, the approach preserves the model's pretrained representations while adapting its temporal processing to the target task.

Key Challenges

A core design question was how much local context to include in the window summary. The researchers swept window sizes across multiple values and found that a window of five frames in each direction, 11 frames total, consistently produced the best word error rates across both monolingual and multilingual settings. Wider windows did not improve results and increased computation; narrower windows left temporal dependencies underspecified.

A second challenge was determining how many attention layers to replace. The results show that "replacing only the last two layers offers the best trade-off between WER and computational cost": replacing more reintroduces overfitting risk, while replacing fewer limits the efficiency gains.

Results and Conclusion

Evaluated across six languages (Hindi, Tamil, Mexican Spanish, Mandarin, Arabic, and American English) and six SSL models (wav2vec 2.0, HuBERT, data2vec, XLS-R, mHuBERT, and MMS), WSM-based fine-tuning consistently matches or improves on standard attention baselines. Multilingual models show the largest relative gains: XLS-R reduces word error rate on Spanish from 28.09% to 26.42% and on Arabic from 40.34% to 38.21%. Inference speed also improves at longer input lengths; at 100-second inputs, WSM-based models are approximately 25% faster than attention-based variants.

The results confirm that adding local temporal context to SummaryMixing's global summary produces a more capable and still computationally efficient alternative to self-attention, one well-suited for the practical demands of low-resource speech recognition.


Conclusion

Taken together, these 11 papers address the gap between what audio AI systems produce and what they actually understand.

The music understanding research makes this explicit. The generative papers push in a different direction. MEGAMI, Break-the-Beat!, and FlashFoley each expand what creators can potentially do: automating mixing decisions that were previously manual, synthesizing drum audio with controllable timbre, enabling real-time sound design. The evaluation papers are honest about current limits. FoleyBench documents what the field's standard benchmark gets wrong. The AI dubbing paper builds a perceptual metric that tracks human judgment rather than replacing it. Both are acts of measurement before improvement. The data quality papers address the foundation. Blind data cleaning works without knowing what is wrong.

But none of these papers claims to have outright solved the problem it addresses. That restraint is part of the contribution. The work is more useful for being specific about where progress ends and open questions begin, even as it offers concrete tools for pervasive challenges in audio AI.

Sony AI will be presenting this research at ICASSP 2026, May 4-8, in Barcelona. Read the full papers and follow our work at ai.sony.

Related Reading:

Unlocking the Future of Video-to-Audio Synthesis: Inside the MMAudio Model – Sony AI

Sights on AI: Yuki Mitsufuji Shares Inspiration for AI Research into Music and Sound