Install this extension via the ComfyUI Manager by searching for VibeVoice ComfyUI:
1. Click the Manager button in the main menu.
2. Select the Custom Nodes Manager button.
3. Enter VibeVoice ComfyUI in the search bar, then install the extension from the results.
After installation, click the Restart button to restart ComfyUI. Then manually refresh your browser to clear the cache and load the updated list of nodes.
VibeVoice ComfyUI is a ComfyUI wrapper for the Microsoft VibeVoice TTS model. It supports single- and multi-speaker text-to-speech generation and can load scripts from text files.
VibeVoice-ComfyUI Introduction
VibeVoice-ComfyUI is an extension designed to integrate Microsoft's VibeVoice text-to-speech (TTS) model into ComfyUI workflows. This extension allows you to generate high-quality, expressive speech from text, supporting both single and multi-speaker scenarios. Whether you're creating podcasts, voiceovers, or any other audio content, VibeVoice-ComfyUI provides a seamless way to synthesize natural-sounding speech directly within your creative projects. It addresses common challenges in TTS, such as maintaining speaker consistency and handling long-form content, making it an invaluable tool for AI artists looking to enhance their audio productions.
How VibeVoice-ComfyUI Works
At its core, VibeVoice-ComfyUI leverages the VibeVoice model, which uses advanced speech tokenizers and a diffusion framework to generate speech. Think of it as a sophisticated storyteller that can read your text and bring it to life with voices that sound natural and engaging. The model understands the context and flow of dialogue, allowing it to produce speech that feels coherent and dynamic. By using a combination of language models and acoustic processing, VibeVoice-ComfyUI can handle complex tasks like multi-speaker conversations and voice cloning, where it mimics the characteristics of a given voice sample.
VibeVoice-ComfyUI Features
Core Functionality
Single Speaker TTS: Generate speech from text using a single voice, with the option to clone a specific voice from an audio sample.
Multi-Speaker Conversations: Create dialogues with up to four distinct speakers, each with their own voice.
Voice Cloning: Capture the essence of a voice from an audio sample and use it to generate new speech.
LoRA Support: Fine-tune voices with custom Low-Rank Adaptation (LoRA) adapters for personalized voice characteristics.
Voice Speed Control: Adjust the speaking rate to match your desired pace.
Text File Loading: Easily load scripts from text files for processing.
Automatic Text Chunking: Seamlessly handle long texts by breaking them into manageable chunks.
Custom Pause Tags: Insert pauses in speech to control pacing and emphasis.
Node Chaining: Connect multiple nodes to create complex workflows.
Interruption Support: Cancel operations at any point during the generation process.
Flexible Configuration: Customize parameters like temperature, sampling, and guidance scale to suit your needs.
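To make the Automatic Text Chunking feature above concrete, here is a minimal Python sketch of sentence-boundary chunking. It is not the extension's actual implementation: the function name `chunk_text` and the `max_chars` limit are hypothetical, chosen only to illustrate how long text can be split without cutting sentences in half.

```python
import re


def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration; not part of VibeVoice-ComfyUI.
    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first sentence of a new chunk.
            current = candidate
        else:
            # Adding this sentence would overflow: close the chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("One. Two. Three.", max_chars=9):
        print(piece)
```

Breaking at sentence ends rather than at a fixed character offset keeps each chunk a natural unit of speech, which is the usual motivation for this kind of splitting.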
Performance & Optimization
Attention Mechanisms: Choose from various attention types to optimize performance.
Diffusion Steps: Balance quality and speed by adjusting the number of processing steps.
Memory Management: Efficiently manage VRAM usage with automatic cleanup options.
Apple Silicon Support: Enjoy native GPU acceleration on Apple devices with M1/M2/M3 chips.
Quantization Options: Reduce VRAM usage with 8-bit and 4-bit quantization, with little loss in audio quality.
VibeVoice-ComfyUI Models
VibeVoice-ComfyUI supports several models, each suited for different use cases:
VibeVoice-1.5B: A smaller model ideal for quick prototyping and single-speaker tasks, requiring around 6GB of VRAM.
VibeVoice-Large: Offers the highest quality for multi-speaker conversations, but requires more VRAM (~20GB).
VibeVoice-Large-Q8: Provides production-quality audio with reduced VRAM usage (~12GB), perfect for GPUs with 12GB VRAM.
VibeVoice-Large-Q4: Maximizes VRAM savings with minimal quality loss, suitable for lower-end GPUs.
Each model can be downloaded from HuggingFace, and they are automatically detected and managed within the ComfyUI environment.
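As a rough guide to choosing among these models, the following Python sketch picks the largest model whose approximate VRAM figure fits the available memory. The function `pick_model` and the table are hypothetical and use only the figures quoted above. VibeVoice-Large-Q4 is left out because no VRAM figure is given for it.

```python
from typing import Optional

# Approximate VRAM needs in GB, as quoted in the model list above.
# Ordered from largest to smallest. VibeVoice-Large-Q4 is omitted
# because the list states no figure for it.
MODEL_VRAM_GB = {
    "VibeVoice-Large": 20,
    "VibeVoice-Large-Q8": 12,
    "VibeVoice-1.5B": 6,
}


def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the first (largest) model that fits, or None if none fits.

    Hypothetical helper for illustration; not part of VibeVoice-ComfyUI.
    """
    # Dicts preserve insertion order, so larger models are tried first.
    for name, needed_gb in MODEL_VRAM_GB.items():
        if available_vram_gb >= needed_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (24, 12, 8, 4):
        print(vram, "GB ->", pick_model(vram))
```

The figures are approximate, so treat the result as a starting point and leave some headroom for the rest of the workflow.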
What's New with VibeVoice-ComfyUI
Recent updates have introduced several enhancements:
Version 1.8.1: Fixed a critical bug in the bitsandbytes library affecting the Q8 model.
Version 1.8.0: Introduced the VibeVoice-Large-Q8 model, offering production-quality audio with significant VRAM savings.
Version 1.7.0: Added dynamic 4-bit quantization for language models, improving speed and reducing VRAM usage.
Version 1.6.0: Removed automatic model downloading, giving users more control over model management.
These updates improve the extension's performance and flexibility, making it easier for AI artists to create high-quality audio content.
Troubleshooting VibeVoice-ComfyUI
If you encounter issues while using VibeVoice-ComfyUI, here are some common solutions:
Installation Issues: Ensure you're using the correct Python environment and restart ComfyUI after installation.
Generation Problems: For unstable voices, try using deterministic mode. Ensure multi-speaker text is formatted correctly with sequential speaker numbers.
Memory Constraints: Use smaller models like VibeVoice-1.5B for systems with limited VRAM.
For more detailed troubleshooting, refer to the ComfyUI logs and ensure all dependencies are correctly installed.
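The advice about sequential speaker numbers can be checked before generation with a small script. The Python sketch below assumes that each non-empty line of a multi-speaker script starts with a label of the form `[N]:`. That label syntax is an assumption made for illustration, so confirm the exact format in the extension's documentation. The function `check_speakers` is hypothetical, and the limit of four speakers comes from the feature list.

```python
import re

# Assumed label syntax, e.g. "[1]: Hello there". Verify against the
# extension's documentation before relying on it.
SPEAKER_LINE = re.compile(r"^\[(\d+)\]:\s*(.+)$")
MAX_SPEAKERS = 4  # "up to four distinct speakers" per the feature list


def check_speakers(script: str) -> list[str]:
    """Return a list of problems found in a multi-speaker script.

    Hypothetical helper for illustration; not part of VibeVoice-ComfyUI.
    """
    problems: list[str] = []
    seen: set[int] = set()
    for lineno, line in enumerate(script.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match is None:
            problems.append(f"line {lineno}: missing speaker label")
            continue
        seen.add(int(match.group(1)))
    if seen:
        if min(seen) < 1:
            problems.append("speaker numbers must start at 1")
        # Every number from 1 up to the highest one used should appear.
        missing = sorted(set(range(1, max(seen) + 1)) - seen)
        if missing:
            problems.append(f"speaker numbers skipped: {missing}")
        if max(seen) > MAX_SPEAKERS:
            problems.append(f"more than {MAX_SPEAKERS} speakers used")
    return problems


if __name__ == "__main__":
    print(check_speakers("[1]: Hi\n[3]: Hello"))
```

An empty result means the script uses speakers 1 through N with no gaps, under the assumed label format.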
Learn More about VibeVoice-ComfyUI
To further explore VibeVoice-ComfyUI, consider the following resources:
Video Demo: Watch a demonstration of the extension in action.
Community Forums: Join discussions and seek support from other AI artists and developers.
These resources provide valuable insights and support to help you make the most of VibeVoice-ComfyUI in your creative projects.