ComfyUI-OmniVoice-TTS Introduction
ComfyUI-OmniVoice-TTS is an advanced extension designed to bring the power of text-to-speech (TTS) technology to AI artists. This extension allows you to generate high-quality speech from text in over 600 languages, making it one of the most versatile TTS tools available. Whether you're looking to clone a voice from a short audio sample or design a completely new voice using textual descriptions, ComfyUI-OmniVoice-TTS has you covered. It supports voice cloning, voice design, and multi-speaker dialogues, providing a comprehensive solution for creating diverse and expressive audio content.
How ComfyUI-OmniVoice-TTS Works
At its core, ComfyUI-OmniVoice-TTS uses a diffusion language model architecture to convert text into speech. This model works by iteratively refining the audio output, similar to how an artist might start with a rough sketch and gradually add details to create a finished piece. The extension can clone voices by analyzing a short reference audio clip and then using that analysis to generate new speech in the same voice. For voice design, it allows you to specify attributes like gender, age, and accent to create a custom voice without needing a reference audio. The extension also supports non-verbal expressions and pronunciation adjustments, making it a flexible tool for creating nuanced audio content.
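The iterative refinement described above can be illustrated with a toy numerical sketch (purely conceptual: a real diffusion model predicts each update with a neural network conditioned on the text and reference voice, rather than blending toward a known target):

```python
def refine(audio, target, steps=10):
    """Toy 'denoising' loop: each pass moves the rough sketch a fraction
    closer to the target signal, mirroring how a diffusion model
    iteratively refines its output over many small steps."""
    for _ in range(steps):
        audio = [a + 0.5 * (t - a) for a, t in zip(audio, target)]
    return audio

noise = [1.0, -1.0, 0.3]        # initial rough "sketch"
target = [0.2, 0.1, -0.4]       # stand-in for the desired speech signal
result = refine(noise, target)  # converges close to target after 10 steps
```

After ten passes the residual error shrinks by a factor of about 2^10, which is why the final output is nearly indistinguishable from the target; real diffusion samplers trade step count against quality in the same way.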
ComfyUI-OmniVoice-TTS Features
- Multilingual Support: Generate speech in over 600 languages, making it ideal for global projects.
- Voice Cloning: Clone any voice using a 3-15 second audio sample, perfect for creating consistent character voices.
- Voice Design: Create unique voices by specifying attributes such as gender, age, pitch, and accent.
- Multi-Speaker Dialogues: Use `[Speaker_N]:` tags to generate conversations between multiple speakers.
- Fast Inference: Achieve real-time performance with a real-time factor as low as 0.025.
- Non-Verbal Expressions: Add expressions like laughter or sighs directly into the text for more dynamic audio.
- Automatic Model Download: Models are automatically downloaded from HuggingFace when first used, simplifying setup.
- Efficient Memory Usage: Features like automatic CPU offloading and smart caching help manage memory effectively.
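The `[Speaker_N]:` tag convention from the feature list can be sketched with a small parser (illustrative only; the exact input format expected by the extension's dialogue nodes may differ):

```python
import re

# A tagged script using the [Speaker_N]: convention, including a
# non-verbal expression ([laughs]) embedded directly in the text.
DIALOGUE = """\
[Speaker_1]: Hi there, welcome to the show.
[Speaker_2]: Thanks! Happy to be here. [laughs]
[Speaker_1]: Let's get started."""

def split_dialogue(text):
    """Split a tagged script into (speaker, line) pairs."""
    turns = []
    for line in text.splitlines():
        m = re.match(r"\[(Speaker_\d+)\]:\s*(.*)", line)
        if m:
            turns.append((m.group(1), m.group(2)))
    return turns

turns = split_dialogue(DIALOGUE)
```

Each turn can then be routed to the voice assigned to that speaker, which is how a single script yields a multi-voice conversation.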
ComfyUI-OmniVoice-TTS Models
ComfyUI-OmniVoice-TTS offers different models to suit various needs:
- OmniVoice: A full precision model (~4GB) supporting over 600 languages, ideal for high-quality output.
- OmniVoice-bf16: A bfloat16 quantized model (~2GB) that uses less memory, suitable for environments with limited resources.

Additionally, Whisper models are available for automatic speech recognition, which can be used to transcribe reference audio for voice cloning.
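A back-of-envelope calculation shows why the bf16 checkpoint is roughly half the size of the full-precision one (the parameter count below is inferred from the ~4GB fp32 size, not an official figure):

```python
BYTES_FP32 = 4  # float32: 4 bytes per parameter
BYTES_BF16 = 2  # bfloat16: 2 bytes per parameter

# A ~4 GB fp32 checkpoint implies roughly one billion parameters.
params = 4 * 1024**3 // BYTES_FP32

fp32_gb = params * BYTES_FP32 / 1024**3
bf16_gb = params * BYTES_BF16 / 1024**3
print(fp32_gb, bf16_gb)  # 4.0 2.0
```

The same halving applies to GPU memory at load time, which is why the bf16 model is the better fit for resource-limited environments.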
Troubleshooting ComfyUI-OmniVoice-TTS
Here are some common issues and solutions:
- Model Download Failures: If you're in China, set the HuggingFace mirror before starting ComfyUI: `export HF_ENDPOINT="https://hf-mirror.com"`.
- Whisper Model Re-downloads: Connect the `OmniVoice Whisper Loader` to the `whisper_model` input to cache the model.
- CUDA Memory Errors: Try setting `keep_model_loaded = False`, using `dtype = fp16` or `bf16`, or switching to `device = cpu`.
- Import Errors After Installation: Restart ComfyUI to reload Python modules.
- Transformers Version Issues: Ensure you have `transformers>=5.3.0`. Upgrade if necessary, but be cautious as it may affect other nodes.

For more detailed troubleshooting, refer to the troubleshooting guide.
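To check an installed version against that requirement, a minimal comparison helper might look like this (illustrative; it ignores pre-release suffixes such as `.dev0` and will fail on versions like `5.3rc1`):

```python
def parse_version(v):
    """Turn a dotted version string into a comparable tuple, e.g. '5.3.0' -> (5, 3, 0)."""
    return tuple(int(x) for x in v.split(".")[:3])

def meets_requirement(installed, required="5.3.0"):
    """True if the installed version satisfies transformers>=5.3.0."""
    return parse_version(installed) >= parse_version(required)

print(meets_requirement("5.4.1"))   # True
print(meets_requirement("4.44.2"))  # False
```

In practice `importlib.metadata.version("transformers")` supplies the installed version string; the `packaging` library offers a more robust comparison if it is available.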
Learn More about ComfyUI-OmniVoice-TTS
To further explore the capabilities of ComfyUI-OmniVoice-TTS, you can visit the Hugging Face Space for demos and additional resources. The GitHub repository is also a valuable resource for updates and community support. For a deeper dive into the technical aspects, the arXiv paper provides an in-depth look at the underlying technology.
