ComfyUI Node: Whisper STT πŸ‘‚

Class Name

WhisperSTT

Category
πŸ§ͺAILab/πŸ”ŠAudio
Author
1038lab (Account age: 774 days)
Extension
ComfyUI-EdgeTTS
Last Updated
2025-04-18
Github Stars
0.04K

How to Install ComfyUI-EdgeTTS

Install this extension via the ComfyUI Manager by searching for ComfyUI-EdgeTTS
  • 1. Click the Manager button in the main menu
  • 2. Select Custom Nodes Manager button
  • 3. Enter ComfyUI-EdgeTTS in the search bar
After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

Visit ComfyUI Online for ready-to-use ComfyUI environment

  • Free trial available
  • 16GB VRAM to 80GB VRAM GPU machines
  • 400+ preloaded models/nodes
  • Freedom to upload custom models/nodes
  • 200+ ready-to-run workflows
  • 100% private workspace with up to 200GB storage
  • Dedicated Support

Run ComfyUI Online

Whisper STT πŸ‘‚ Description

Performs speech-to-text conversion in ComfyUI using the Whisper model, with support for multiple languages and selectable model sizes to balance speed and accuracy.

Whisper STT πŸ‘‚:

WhisperSTT is a node for speech-to-text (STT) conversion built on Whisper, a state-of-the-art speech recognition system. Integrated into the ComfyUI environment, it provides a seamless way to transcribe audio inputs into accurate text, making it an invaluable tool for AI artists who need to turn spoken content into written form.

The node supports multiple languages and can automatically detect the language of the input audio, enhancing its versatility. By offering several model sizes, it lets you balance speed against accuracy to suit your performance needs and available resources. This is particularly useful for projects that convert audio data into text for further processing or analysis.

Whisper STT πŸ‘‚ Input Parameters:

audio

The audio parameter is the primary input for the WhisperSTT node, representing the audio data that you wish to transcribe. This parameter should be provided in a format that includes a waveform and a sample rate, which are essential for accurate transcription. The waveform is the actual audio signal, while the sample rate indicates how many samples per second are in the audio. This parameter is crucial as it directly affects the quality and accuracy of the transcription output.
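As a rough illustration of the waveform-plus-sample-rate pairing described above, the sketch below builds a minimal audio bundle. In ComfyUI the waveform is typically a tensor (batch Γ— channels Γ— samples); plain Python lists stand in for it here, and the `make_audio` helper is hypothetical, not part of the node.

```python
import math

def make_audio(samples, sample_rate):
    """Bundle raw samples with their sample rate, mirroring an audio dict.
    The nested lists stand in for a (batch x channels x samples) tensor."""
    return {"waveform": [[samples]], "sample_rate": sample_rate}

# One second of a 440 Hz sine tone sampled at 16 kHz.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
audio = make_audio(tone, rate)

# Duration in seconds = number of samples / sample rate.
duration = len(audio["waveform"][0][0]) / audio["sample_rate"]
print(duration)  # 1.0
```

This shows why both pieces matter: the same list of samples played back at a different sample rate would represent a different duration and pitch, which would distort the transcription.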

model_size

The model_size parameter allows you to select the size of the Whisper model used for transcription. Available options are "tiny", "base", "small", "medium", and "large". The default value is "base". Larger models generally provide more accurate transcriptions but require more computational resources and time. This parameter is important for balancing the trade-off between transcription accuracy and processing speed, depending on your specific needs and available resources.
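To make the trade-off concrete, the sketch below uses the approximate parameter counts the Whisper project publishes for each size. The `pick_model_size` helper is purely illustrative (not part of the node): it picks the largest model that fits a parameter budget.

```python
# Approximate parameter counts (millions) per Whisper model size, as
# published by the Whisper project. Larger models transcribe more
# accurately but run slower and need more memory.
WHISPER_SIZES = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

def pick_model_size(max_params_m):
    """Return the largest model whose parameter count fits the budget."""
    fitting = [size for size, params in WHISPER_SIZES.items()
               if params <= max_params_m]
    return fitting[-1] if fitting else "tiny"

print(pick_model_size(300))   # small
print(pick_model_size(2000))  # large
```

In practice, "base" (the default) is a reasonable middle ground, and "medium" or "large" pay off mainly on noisy or accented audio.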

language

The language parameter specifies the language of the audio input. You can choose from a list of supported languages or select "auto" for automatic language detection. The default value is "auto". This parameter is significant because it ensures that the transcription process is tailored to the correct language, which is essential for achieving high accuracy in the transcribed text.
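In the underlying openai-whisper API, passing `language=None` to `transcribe()` triggers automatic detection, so a dropdown value of "auto" plausibly maps to `None` before the call. The small helper below sketches that mapping; it is an assumption about the node's internals, not its documented code.

```python
def resolve_language(choice):
    """Map the node's language dropdown value to a Whisper language argument.
    Whisper auto-detects the language when it receives None."""
    return None if choice == "auto" else choice

print(resolve_language("auto"))  # None
print(resolve_language("en"))    # en
```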

Whisper STT πŸ‘‚ Output Parameters:

STRING

The output of the WhisperSTT node is a STRING, which contains the transcribed text from the input audio. This output is the result of the speech-to-text conversion process and represents the spoken content in a written format. The accuracy and quality of this output depend on the input parameters and the characteristics of the audio data. This transcribed text can be used for various applications, such as creating subtitles, generating text-based content, or further analysis.
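As one example of downstream use, the transcribed STRING can be wrapped into subtitle text. The hypothetical helper below formats a transcript as a single SRT cue spanning the clip's duration; it is a sketch of post-processing you might write yourself, not a feature of the node.

```python
def to_srt_cue(text, start_s, end_s, index=1):
    """Format a transcript as one SRT subtitle cue."""
    def stamp(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return f"{index}\n{stamp(start_s)} --> {stamp(end_s)}\n{text}\n"

print(to_srt_cue("Hello world", 0.0, 2.5))
```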

Whisper STT πŸ‘‚ Usage Tips:

  • For optimal accuracy, choose a larger model size if you have sufficient computational resources and time, especially for complex or noisy audio inputs.
  • Use the "auto" language detection feature if you are unsure of the audio's language, but specify the language manually if you know it to improve accuracy.
  • Ensure that the audio input is clear and has a high sample rate to enhance the quality of the transcription.

Whisper STT πŸ‘‚ Common Errors and Solutions:

[Whisper STT] Error: <error_message>

  • Explanation: This error message indicates that an exception occurred during the transcription process. The specific error message will provide more details about the nature of the problem.
  • Solution: Check the audio input to ensure it is correctly formatted and contains valid waveform and sample rate data. Verify that the selected model size and language are appropriate for the input. If the error persists, consult the detailed error message for further troubleshooting steps.

[Whisper STT] Detected: unknown (conf: 0.00)

  • Explanation: This message indicates that the language detection feature was unable to confidently identify the language of the audio input.
  • Solution: Manually specify the language of the audio input if known, or try using a clearer audio sample to improve language detection accuracy.

Whisper STT πŸ‘‚ Related Nodes

Go back to the extension to check out more related nodes.
ComfyUI-EdgeTTS
Copyright 2025 RunComfy. All Rights Reserved.