logo
RunComfy
  • ComfyUI
  • TrainerNew
  • Models
  • API
  • Pricing
discord logo
ComfyUI>Workflows>Woosh Sound Effect Generation | Text2Audio + VideoSync

Woosh Sound Effect Generation | Text2Audio + VideoSync

Workflow Name: RunComfy/Woosh-Sound-Effect-Generation
Workflow ID: 0000...1392
This workflow helps you craft perfectly-timed sound effects directly from text cues or video scenes. Featuring Sony AI’s Woosh model, it lets you switch quickly between speed and quality modes. You can build immersive Foley or generate synchronized audio without complex setups. Designed for creators seeking effortless sound design workflows. The system aligns generated audio precisely to visual motion for better storytelling. You gain full creative control with a single streamlined interface.

Woosh Sound Effect Generation: prompt and video‑conditioned audio in ComfyUI

Woosh Sound Effect Generation is a ComfyUI workflow that turns either text prompts or video clips into polished sound effects using Sony Research’s Woosh foundation model. It is built for creators who need one place for prompt‑based Foley, tightly video‑matched sound design, and quick switching between high‑quality and fast distilled variants.

The workflow exposes both families of Woosh models: Flow/DFlow for text‑to‑audio and VFlow/DVFlow for video‑to‑audio. A shared sampler drives generation in both paths, outputting audio for immediate preview and, in the video path, frame previews that are recombined for quick dailies. Under the hood it relies on the official ComfyUI Woosh nodes and VideoHelperSuite for seamless video IO, so Woosh Sound Effect Generation stays fast and simple while remaining flexible. References: SonyResearch/Woosh, drbaph/Woosh on Hugging Face, paper, ComfyUI-Woosh, ComfyUI-VideoHelperSuite.

Key models in Comfyui Woosh Sound Effect Generation workflow

  • Sony Research Woosh — Flow: core text‑to‑audio generator used for high‑fidelity Foley and ambience, trained with flow‑matching objectives. See SonyResearch/Woosh and the paper.
  • Sony Research Woosh — DFlow: distilled text‑to‑audio model optimized for speed with far fewer sampling steps, ideal for rapid iteration. Weights are available via drbaph/Woosh.
  • Sony Research Woosh — VFlow‑8s: video‑conditioned generator that synchronizes audio onsets and textures to visual motion cues for video‑to‑audio. See SonyResearch/Woosh.
  • Sony Research Woosh — DVFlow‑8s: distilled video‑to‑audio model for real‑time‑lean workflows and quick previews. Weights: drbaph/Woosh.
  • Woosh‑AE: the audio autoencoder used to reconstruct waveforms from model latents; required by all generators. Weights: drbaph/Woosh.
  • TextConditionerA and TextConditionerV: text conditioning modules that embed prompts appropriately for text‑to‑audio or video‑to‑audio runs. Details and usage are documented in ComfyUI-Woosh and the paper.

How to use Comfyui Woosh Sound Effect Generation workflow

This workflow has two parallel groups that you can run independently: Video to Audio for visual‑matched sound design and Text to Audio for pure prompt‑based Foley. Both converge on the same sampler logic and quick audio preview, making Woosh Sound Effect Generation consistent to operate regardless of input.

Video to Audio

The Video to Audio group loads a clip, aligns frames and conditioning, then generates synchronized sound. Start by feeding your clip into VHS_LoadVideo (#34); it extracts frames at your chosen rate so downstream nodes see a clean, bounded sequence. Those frames are packed as a video conditioning stream by WooshLoadVideo (#37), which standardizes duration so the generator receives steady windows.

Choose a video‑conditioned model in WooshLoadFlow (#7), typically VFlow for fidelity or DVFlow for speed. Provide a short descriptive prompt in the sampler (for style or intent) and set WooshTextEncode (#19) to V2A so text is embedded with the correct conditioning branch. Run WooshSample (#38) to synthesize audio; it outputs both audio for PreviewAudio (#9) and video_frames that flow into VHS_VideoCombine (#33) for a quick stitched preview, keeping Woosh Sound Effect Generation tight for editorial review.

Text to Audio

The Text to Audio group focuses on clean prompt‑driven generation. Select a model in WooshLoadFlow (#40), using Flow when you want maximum quality and DFlow when you need very fast, iterative passes. Set WooshTextEncode (#41) to T2A so your prompt is embedded for text‑only generation. Enter your description in WooshSample (#39) and execute; the result is sent to PreviewAudio (#43) for instant listening. This path keeps Woosh Sound Effect Generation lightweight when you are crafting libraries or layering effects without picture.

Key nodes in Comfyui Woosh Sound Effect Generation workflow

WooshSample (#38)

Central sampler for video‑conditioned generation. Adjust the prompt to steer style and onsets, then tune steps for the quality–speed tradeoff (use fewer steps when running DVFlow). cfg controls prompt adherence, and latent_frames determines output length so it matches or intentionally offsets the clip. Set seed to reproduce takes, and enable force_offload when you need to clear memory between long runs. Node implementation and behavior follow the official ComfyUI-Woosh.

WooshSample (#39)

Sampler for text‑to‑audio with the same controls and behavior, minus the video stream. For quick ideation pick DFlow and low steps; for finals switch to Flow and raise steps for detail. Keep cfg moderate for natural textures, nudge higher for stylized, prompt‑locked outcomes. Use latent_frames to set duration precisely when building assets for libraries or DAW timelines.

WooshLoadFlow (#7)

Model selector for the Video to Audio path. Choose VFlow for highest fidelity alignment to motion, or DVFlow when you need near‑real‑time previews. Ensure WooshTextEncode is set to V2A so embeddings match the chosen model family. See drbaph/Woosh for model variants.

WooshLoadFlow (#40)

Model selector for the Text to Audio path. Pick Flow for rich detail and wider texture variety, or DFlow for rapid iteration with minimal steps. Pair it with WooshTextEncode in T2A mode to avoid conditioning mismatches. Node behavior and options track the official ComfyUI-Woosh.

VHS_VideoCombine (#33)

Utility for assembling the generated audio with the video_frames preview from the sampler to produce a reviewable clip. Use it to spot sync, evaluate transitions, and share dailies without leaving ComfyUI. Part of ComfyUI-VideoHelperSuite.

Optional extras

  • Use DVFlow/DFlow for quick scouting passes, then switch to VFlow/Flow for finals when Woosh Sound Effect Generation must shine.
  • Keep your input clip within the selected model’s window (for example, the 8‑second VFlow variants) and process longer scenes in overlapping chunks you can crossfade.
  • Maintain a consistent frame rate from VHS_LoadVideo through VHS_VideoCombine to reduce drift between audio and picture.
  • For prompts, pair action words with texture and acoustic context (for example, “fast metallic whoosh in a concrete stairwell”) to get predictable results.
  • Turn on force_offload in the sampler between heavy runs if GPU memory is tight.

Acknowledgements

This workflow implements and builds upon the following works and resources. We gratefully acknowledge Sony Research for Woosh (project and paper), Saganaki22 for ComfyUI-Woosh (ComfyUI node), and Kosinkadink for ComfyUI-VideoHelperSuite for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.

Resources

  • Saganaki22/ComfyUI-Woosh
    • GitHub: Saganaki22/ComfyUI-Woosh
  • drbaph/Woosh
    • Hugging Face: drbaph/Woosh
  • SonyResearch/Woosh
    • GitHub: SonyResearch/Woosh
  • Sony Research/Woosh (paper)
    • arXiv: 2502.07359
  • Kosinkadink/ComfyUI-VideoHelperSuite
    • GitHub: Kosinkadink/ComfyUI-VideoHelperSuite

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.

Want More ComfyUI Workflows?

MMAudio | Video-to-Audio

MMAudio: Advanced video-to-audio model for high-quality audio generation.

Stable Audio Open 1.0 | Text-to-Music Tool

Turns text prompts into cinematic music seamlessly and fast.

ACE-Step Music Generation | AI Audio Creation

Generate studio-quality music 15× faster with breakthrough diffusion technology.

EchoMimic | Audio-driven Portrait Animations

Generate realistic talking heads and body gestures synced with the provided audio.

Fish Audio S2 TTS | Expressive Voice Generator

Make lifelike speech with emotions, styles, and cloned voices fast.

Qwen Image Edit | Precise AI Photo Editing

Edit photos fast with style, relighting, and object control precision.

Qwen Edit 2509 MultipleAngles | Multi-View Image Creator

Turn one photo into complete multi-angle visuals instantly.

DiffuEraser | Video Inpainting

Erase objects from videos with auto-masking and realistic reconstruction.

Follow us
  • LinkedIn
  • Facebook
  • Instagram
  • Twitter
Support
  • Discord
  • Email
  • System Status
  • Affiliate
Resources
  • Free ComfyUI Online
  • ComfyUI Guides
  • RunComfy API
  • ComfyUI Tutorials
  • ComfyUI Nodes
  • Learn More
Legal
  • Terms of Service
  • Privacy Policy
  • Cookie Policy
RunComfy
Copyright 2026 RunComfy. All Rights Reserved.

RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Models, enabling artists to harness the latest AI tools to create incredible art.