Introducing Woosh: Sony AI's Sound Effect Foundation Model

Written by Admin | May 18, 2026 4:46:22 PM

Sony AI has been working on a problem most generative audio models have overlooked: sound effects. Specifically, the tools that sound designers, game audio directors, and creative professionals rely on to build the sonic worlds behind games and film.

Most audio AI research focuses on music or general audio generation, and the open-source tools available to researchers and developers reflect that focus. Woosh is Sony AI's attempt to fill the gap: a foundation model built from the ground up for sound effect generation, designed around the workflows and quality standards that professional sound designers actually use. The team trained two versions in parallel: a private model optimized on licensed, studio-grade libraries for professional-grade output, and a public counterpart trained on openly available datasets. The public model, released with open weights and inference code, is intended to give the research community a meaningful starting point and to demonstrate what purpose-built sound effect models can do. The name itself comes from the team: "Woosh" was chosen because it is one of the most common sound effects in gaming and film, a small signal of what the model is actually for.

"There's no model that really is specific to sound effects," said Hakim Missoum, Strategy & Partnerships Manager at Sony AI. "And in order to get good quality output, you need good quality input."

That hypothesis shaped the project from the start. The team licensed professional sound effect libraries, including Pro Sound Effects and BOOM, and used that data to train a private model optimized for studio-grade output. Woosh is the public counterpart: the same architecture, trained on publicly available datasets, and released for the research community to access.

Quality Data = Professional Results

One of Woosh's clearest findings is that the gap between public and private training data is significant, and it is deliberate.

Public audio datasets, while large, tend to capture real-world recordings: ambient sound, overlapping noise, loosely labeled audio scenes. Professional sound effect libraries are vastly different. They contain isolated, purpose-recorded sounds: a cat meowing in three different registers, a door closing under controlled conditions. They also carry precise technical annotations that reflect how professionals actually search for and describe audio. As the researchers note, the annotation style provided in professional libraries can be "highly mismatched" to that found in public datasets, a gap that affects not just audio quality but also how well a model understands the vocabulary professionals use to describe sound.

"Many of these models can feel like a gimmick; you just input text and get something," said Marc Ferras, Staff AI Engineer, Sony AI. "We really don't believe that professionals are going to use that. We want to create solutions that are going to support specific controls."

The evaluation results in the technical report reflect this divide. The private model, trained on commercial libraries, significantly outperforms public alternatives on professional sound effect data. The public model outperforms comparable open-source models on public benchmarks.

Ferras put it plainly: in each of those two categories, professional sound effects and public benchmarks, Woosh is currently leading.

From Text to Video

Woosh addresses two generation tasks.

The first is text-to-audio: generating a sound effect from a written description.

The second is video-to-audio: generating sound directly from a video sequence, with an optional text prompt to guide the output.

The video-to-audio capability is particularly relevant for game and film production workflows, where sound designers are frequently working from visual content rather than abstract descriptions.

On the FoleyBench benchmark, a dataset designed specifically for visually grounded audio generation, Woosh's video-to-audio model outperforms the comparable baseline across audio quality and semantic alignment metrics, while using fewer parameters.

Built for Professional Workflows

Beyond benchmark performance, the team has been working toward integration with the tools sound designers already use.

"The main feedback we got most of the time is that they want more control; more intuitive control," Missoum said. The team is developing a plugin for digital audio workstations, with planned support for variation generation, inpainting (the ability to complete a region of audio so that it stitches smoothly with an existing sound), and personalization.

"With this plugin we can integrate seamlessly into those pipelines and workflows and tools in a way that sound designers can use more intuitively," Missoum explained.

The initial public release focuses on the core generative models; additional controls are planned as the ecosystem develops. The roadmap includes precise time controls, morphing (transforming one sound into another using a semantic description of the target), generation of perfect loops, and personalization from one or a small number of audio samples. These capabilities reflect the kind of granular creative control professionals have said they need.

An Open Foundation

The public release is non-commercial, with inference code and model weights available for research and experimentation. The licensing reflects a deliberate strategy. The public model is designed to demonstrate what the technology can do and invite the community to build on it; the private model, trained on licensed studio-quality data, points toward commercial application. As Missoum puts it, the public release "prepares the ground" for what comes next.

That framing also shapes how the team thinks about the broader conversation around generative AI and creative work. The team is conscious that tools like Woosh will draw scrutiny from sound professionals who worry about displacement. Their goal is to understand where AI can work as a tool to support the human creative process. The controls being built into the plugin, and the decision to train on licensed, professionally curated libraries rather than scraped public data, are both expressions of that commitment.

For teams interested in professional-grade output, Sony AI's private model, trained on licensed studio-quality data from Pro Sound Effects and BOOM, represents the performance ceiling the public release points toward.

"It prepares the ground for the professional model we're developing," Missoum said. "The performance is not the same; and that's the point."

To explore Woosh, access the model weights, and listen to demo samples, visit: https://sonyresearch.github.io/Woosh/

To access Woosh-Flow Private, visit:
https://sonyresearch.github.io/Woosh/flow-private.html

To dive into the paper, visit:
Woosh: A Sound Effects Foundation Model - Sony AI