Meta brings Segment Anything to audio, letting editors pull sounds from video with a click or text prompt

Meta Extends Segment Anything Model to Audio, Enabling Precise Sound Isolation in Videos

Meta AI researchers have introduced AudioSAM, an extension of the Segment Anything Model (SAM) family tailored for audio segmentation. It lets video editors and content creators isolate individual sound sources from mixed audio tracks with remarkable precision, using either a click on a visual representation of the sound or a natural-language text prompt. By bridging visual and auditory segmentation, AudioSAM makes advanced audio editing accessible without extensive expertise in audio engineering.

The Evolution from Visual to Auditory Segmentation

The original Segment Anything Model, released by Meta in 2023, revolutionized computer vision by enabling zero-shot segmentation of any object in an image or video through point, box, or mask prompts. SAM 2 further advanced this capability into the video domain, supporting real-time object tracking and segmentation across frames. AudioSAM builds directly on SAM 2’s architecture, adapting it to handle audio signals by integrating spectrogram visualizations—time-frequency representations of sound—as the primary input modality.

In essence, AudioSAM treats audio segmentation as a visual task. It converts raw audio waveforms into spectrograms, which are 2D images depicting frequency content over time. Users interact with these spectrograms much like they would with images in SAM: by clicking on regions corresponding to specific sounds. The model then generates masks that delineate the desired audio source, effectively separating it from the background noise or other overlapping sounds. This approach leverages the proven efficacy of vision transformers in SAM while addressing the unique challenges of audio, such as overlapping frequencies and transient events.
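AudioSAM’s internals aren’t published in detail, but the spectrogram-masking idea the article describes can be sketched in a few lines of Python: convert the waveform to a time-frequency representation with an STFT, zero out the bins outside the mask, and invert back to a waveform. The hand-built low-pass mask below stands in for a model-predicted one; none of this is AudioSAM’s actual code.

```python
import numpy as np
from scipy.signal import stft, istft

def isolate_with_mask(audio, mask, fs=16000, nperseg=512):
    """Apply a time-frequency mask to an audio signal:
    STFT -> mask the spectrogram -> inverse STFT."""
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg)
    _, recovered = istft(Z * mask, fs=fs, nperseg=nperseg)
    return recovered[:len(audio)]  # trim STFT padding back to input length

# Toy mixture: a 440 Hz tone buried in broadband noise.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
mix = tone + 0.3 * np.random.default_rng(0).normal(size=fs)

# Hand-built mask standing in for a model prediction:
# keep every frequency bin below 1 kHz, drop the rest.
freqs, _, _ = stft(mix, fs=fs, nperseg=512)
mask = (freqs < 1000)[:, None].astype(float)

isolated = isolate_with_mask(mix, mask)
```

Because the Hann window with 50% overlap satisfies the COLA condition, the STFT/iSTFT round trip reconstructs the masked signal essentially losslessly; all of the separation quality therefore lives in how good the mask is.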

User Interface and Interaction Paradigms

AudioSAM’s interface is intuitive and browser-based, requiring no specialized software installation. Upon uploading a video file, the tool automatically extracts and displays the audio spectrogram alongside the video waveform. Editors can then:

  • Point-and-Click Segmentation: Hover over the spectrogram and click on the visual “blob” representing a sound source, such as a speaking voice, musical instrument, or ambient noise. The model instantly generates a precise mask, highlighting the selected audio in both the spectrogram and waveform views.

  • Text-Prompt Segmentation: For hands-free operation, users input descriptive phrases like “male voice,” “guitar riff,” or “crowd cheers.” AudioSAM’s multimodal capabilities process these prompts to identify and isolate the corresponding sounds, even in complex mixtures.

The segmentation results are interactive: masks can be refined by adding multiple clicks or combining prompts. Once satisfied, users export the isolated audio track as a standalone WAV file, ready for further editing in digital audio workstations (DAWs) like Adobe Audition or Reaper.
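The export step itself is conventional: the isolated waveform just needs to be written out as PCM WAV. A minimal version follows; the peak normalization, 16-bit depth, and filename are illustrative choices, not anything AudioSAM specifies.

```python
import numpy as np
from scipy.io import wavfile

def export_wav(path, audio, fs=16000):
    """Peak-normalize a float waveform and write it as 16-bit PCM WAV,
    the interchange format virtually every DAW accepts."""
    peak = max(np.max(np.abs(audio)), 1e-9)  # avoid dividing by zero on silence
    pcm = (audio / peak * 32767).astype(np.int16)
    wavfile.write(path, fs, pcm)

fs = 16000
t = np.arange(fs) / fs
isolated = 0.5 * np.sin(2 * np.pi * 440 * t)  # stand-in for a separated source
export_wav("isolated_source.wav", isolated, fs)
```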

A live demo showcases AudioSAM on real-world videos. In a clip featuring a podcast with background music and reverberant room tone, a single click isolates the host’s dialogue cleanly. Text prompts extract elements like “drums” from a live band performance or “birdsong” from nature footage, demonstrating robustness across speech, music, and environmental sound.

Technical Underpinnings and Training

At its core, AudioSAM employs a SAM 2-inspired image encoder-decoder pipeline, fine-tuned for spectrogram inputs. The model uses a masked autoencoder (MAE) pretraining strategy on large-scale audio-visual datasets, followed by supervised fine-tuning on labeled audio source separation benchmarks.
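The MAE pretraining objective mentioned above reduces to a simple recipe: split the spectrogram into patches, hide a random majority of them, and train the network to reconstruct the hidden ones from the visible ones. The sketch below shows only the masking split, with a made-up patch count; the encoder/decoder and loss are omitted.

```python
import numpy as np

def mae_mask(num_patches, mask_ratio=0.75, seed=0):
    """MAE-style pretraining hides a random subset of spectrogram
    patches; the encoder sees only the visible patches and the decoder
    must reconstruct the hidden ones.  This draws that random split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_patches)
    n_hidden = int(num_patches * mask_ratio)
    return perm[n_hidden:], perm[:n_hidden]  # (visible, hidden) indices

visible, hidden = mae_mask(196)  # e.g. a hypothetical 14x14 patch grid
```

The high mask ratio is the point: with 75% of the spectrogram hidden, the model can only succeed by learning the temporal and harmonic structure of sound, which is exactly the prior a segmentation model needs.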

Key innovations include:

  • Cross-Modal Alignment: By training on paired audio-visual data, AudioSAM learns to correlate visual motion (e.g., lip movements) with auditory cues, enhancing segmentation accuracy in videos.

  • Prompt Encoder Enhancements: The text encoder, powered by a lightweight CLIP-like model, maps natural language to embedding spaces compatible with spectrogram features, enabling zero-shot text-to-audio segmentation.

  • Inference Efficiency: Optimized for real-time performance, it processes 10-second audio clips in under a second on consumer GPUs, with support for longer videos via streaming.
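The text-prompt pathway described above can be pictured as nearest-neighbor search in a shared embedding space: embed the prompt, embed candidate spectrogram regions, and pick the region with the highest cosine similarity. The sketch below fakes the embeddings with hand-made 4-dimensional vectors — in a real system they would come from trained text and spectrogram encoders — but the matching step itself is just this:

```python
import numpy as np

def best_matching_region(text_emb, region_embs):
    """Return the index of the region embedding most similar to the
    text embedding (cosine similarity), plus all similarity scores."""
    text = text_emb / np.linalg.norm(text_emb)
    regions = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = regions @ text  # cosine similarity per region
    return int(np.argmax(sims)), sims

# Toy embeddings: region 2 is deliberately aligned with the prompt.
text_emb = np.array([1.0, 0.0, 0.0, 0.0])
region_embs = np.array([
    [0.10, 0.90, 0.00, 0.00],
    [0.00, 0.20, 0.90, 0.10],
    [0.95, 0.10, 0.00, 0.00],  # closest to the text embedding
])
idx, sims = best_matching_region(text_emb, region_embs)
```

Because both modalities land in one space, the same routine handles prompts it has never seen — which is what makes the segmentation zero-shot.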

The model was evaluated on datasets such as AudioSet and LibriMix, achieving state-of-the-art signal-to-distortion ratios (SDR) for source separation and outperforming dedicated separation models like Demucs and Spleeter in zero-shot scenarios.
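SDR, the metric cited above, is simple to state: the energy of the reference signal divided by the energy of the estimation error, expressed in decibels. Higher is better. The signals and noise level in this toy check are made up for illustration:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB:
    10 * log10( ||reference||^2 / ||reference - estimate||^2 )."""
    err = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / np.sum(err**2))

# Reference tone vs. a lightly noise-corrupted estimate of it.
fs = 16000
ref = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noisy = ref + 0.1 * np.random.default_rng(1).normal(size=fs)
score = sdr_db(ref, noisy)  # roughly 17 dB for this noise level
```

An estimate at half the reference amplitude scores 10·log10(4) ≈ 6.02 dB, which gives a quick sanity check on the formula.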

Implications for Video Production and Beyond

AudioSAM lowers the barrier to professional-grade audio post-production. Traditionally, isolating sounds required manual spectral editing, noise gates, or machine learning tools demanding clean training data. Now, editors can non-destructively “cut out” sounds directly from source videos, streamlining workflows in film, YouTube content creation, and podcasting.

Its open-source nature accelerates adoption: the codebase, pretrained weights, and demo are available on GitHub under a permissive license. Researchers can extend it for niche applications, such as forensic audio analysis or hearing aid enhancement.

While still in early research stages, AudioSAM hints at a future where multimodal AI seamlessly blends sight and sound. As video content proliferates, tools like this could transform how we manipulate and remix audio, fostering creativity without compromising quality.

Gnoppix is the leading open-source AI Linux distribution and service provider. Since implementing AI in 2022, it has offered a fast, powerful, secure, and privacy-respecting open-source OS with both local and remote AI capabilities. The local AI operates offline, ensuring no data ever leaves your computer. Based on Debian Linux, Gnoppix is available with numerous privacy- and anonymity-enabled services free of charge.

What are your thoughts on this? I’d love to hear about your own experiences in the comments below.