ComfyUI Node: Whisper STT πŸ‘‚

Class Name

WhisperSTT

Category
πŸ§ͺAILab/πŸ”ŠAudio
Author
1038lab (Account age: 774 days)
Extension
ComfyUI-EdgeTTS
Last Updated
2025-04-18
Github Stars
0.04K

How to Install ComfyUI-EdgeTTS

Install this extension via the ComfyUI Manager by searching for ComfyUI-EdgeTTS
  • 1. Click the Manager button in the main menu
  • 2. Select Custom Nodes Manager button
  • 3. Enter ComfyUI-EdgeTTS in the search bar
After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

Visit ComfyUI Online for ready-to-use ComfyUI environment

  • Free trial available
  • 16GB VRAM to 80GB VRAM GPU machines
  • 400+ preloaded models/nodes
  • Freedom to upload custom models/nodes
  • 200+ ready-to-run workflows
  • 100% private workspace with up to 200GB storage
  • Dedicated Support

Run ComfyUI Online

Whisper STT πŸ‘‚ Description

Performs speech-to-text conversion in ComfyUI using the Whisper model, with support for multiple languages and selectable model sizes to balance speed and accuracy.

Whisper STT πŸ‘‚:

WhisperSTT is a node for speech-to-text (STT) conversion built on Whisper, a state-of-the-art speech recognition system. Integrated into the ComfyUI environment, it provides a seamless way to transcribe audio inputs into accurate text, making it an invaluable tool for AI artists who need to turn spoken content into written form.

The node supports multiple languages and can automatically detect the language of the input audio, enhancing its versatility. By offering several model sizes, it lets you balance speed against accuracy to suit your performance needs and available resources. This is particularly useful for projects that convert audio data into text for further processing or analysis.

Whisper STT πŸ‘‚ Input Parameters:

audio

The audio parameter is the primary input for the WhisperSTT node, representing the audio data that you wish to transcribe. This parameter should be provided in a format that includes a waveform and a sample rate, which are essential for accurate transcription. The waveform is the actual audio signal, while the sample rate indicates how many samples per second are in the audio. This parameter is crucial as it directly affects the quality and accuracy of the transcription output.
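As a rough illustration of the waveform-plus-sample-rate pairing described above, the sketch below builds a minimal audio bundle. In ComfyUI the waveform is typically a tensor (batch Γ— channels Γ— samples); plain Python lists stand in for it here, and the `make_audio` helper is hypothetical, not part of the node.

```python
import math

def make_audio(samples, sample_rate):
    """Bundle raw samples with their sample rate, mirroring an audio dict.
    The nested lists stand in for a (batch x channels x samples) tensor."""
    return {"waveform": [[samples]], "sample_rate": sample_rate}

# One second of a 440 Hz sine tone sampled at 16 kHz.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
audio = make_audio(tone, rate)

# Duration in seconds = number of samples / sample rate.
duration = len(audio["waveform"][0][0]) / audio["sample_rate"]
print(duration)  # 1.0
```

This shows why both pieces matter: the same list of samples played back at a different sample rate would represent a different duration and pitch, which would distort the transcription.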

model_size

The model_size parameter allows you to select the size of the Whisper model used for transcription. Available options are "tiny", "base", "small", "medium", and "large". The default value is "base". Larger models generally provide more accurate transcriptions but require more computational resources and time. This parameter is important for balancing the trade-off between transcription accuracy and processing speed, depending on your specific needs and available resources.
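To make the trade-off concrete, the sketch below uses the approximate parameter counts the Whisper project publishes for each size. The `pick_model_size` helper is purely illustrative (not part of the node): it picks the largest model that fits a parameter budget.

```python
# Approximate parameter counts (millions) per Whisper model size, as
# published by the Whisper project. Larger models transcribe more
# accurately but run slower and need more memory.
WHISPER_SIZES = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

def pick_model_size(max_params_m):
    """Return the largest model whose parameter count fits the budget."""
    fitting = [size for size, params in WHISPER_SIZES.items()
               if params <= max_params_m]
    return fitting[-1] if fitting else "tiny"

print(pick_model_size(300))   # small
print(pick_model_size(2000))  # large
```

In practice, "base" (the default) is a reasonable middle ground, and "medium" or "large" pay off mainly on noisy or accented audio.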

language

The language parameter specifies the language of the audio input. You can choose from a list of supported languages or select "auto" for automatic language detection. The default value is "auto". This parameter is significant because it ensures that the transcription process is tailored to the correct language, which is essential for achieving high accuracy in the transcribed text.
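In the underlying openai-whisper API, passing `language=None` to `transcribe()` triggers automatic detection, so a dropdown value of "auto" plausibly maps to `None` before the call. The small helper below sketches that mapping; it is an assumption about the node's internals, not its documented code.

```python
def resolve_language(choice):
    """Map the node's language dropdown value to a Whisper language argument.
    Whisper auto-detects the language when it receives None."""
    return None if choice == "auto" else choice

print(resolve_language("auto"))  # None
print(resolve_language("en"))    # en
```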

Whisper STT πŸ‘‚ Output Parameters:

STRING

The output of the WhisperSTT node is a STRING, which contains the transcribed text from the input audio. This output is the result of the speech-to-text conversion process and represents the spoken content in a written format. The accuracy and quality of this output depend on the input parameters and the characteristics of the audio data. This transcribed text can be used for various applications, such as creating subtitles, generating text-based content, or further analysis.
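As one example of downstream use, the transcribed STRING can be wrapped into subtitle text. The hypothetical helper below formats a transcript as a single SRT cue spanning the clip's duration; it is a sketch of post-processing you might write yourself, not a feature of the node.

```python
def to_srt_cue(text, start_s, end_s, index=1):
    """Format a transcript as one SRT subtitle cue."""
    def stamp(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    return f"{index}\n{stamp(start_s)} --> {stamp(end_s)}\n{text}\n"

print(to_srt_cue("Hello world", 0.0, 2.5))
```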

Whisper STT πŸ‘‚ Usage Tips:

  • For optimal accuracy, choose a larger model size if you have sufficient computational resources and time, especially for complex or noisy audio inputs.
  • Use the "auto" language detection feature if you are unsure of the audio's language, but specify the language manually if you know it to improve accuracy.
  • Ensure that the audio input is clear and has a high sample rate to enhance the quality of the transcription.

Whisper STT πŸ‘‚ Common Errors and Solutions:

[Whisper STT] Error: <error_message>

  • Explanation: This error message indicates that an exception occurred during the transcription process. The specific error message will provide more details about the nature of the problem.
  • Solution: Check the audio input to ensure it is correctly formatted and contains valid waveform and sample rate data. Verify that the selected model size and language are appropriate for the input. If the error persists, consult the detailed error message for further troubleshooting steps.

[Whisper STT] Detected: unknown (conf: 0.00)

  • Explanation: This message indicates that the language detection feature was unable to confidently identify the language of the audio input.
  • Solution: Manually specify the language of the audio input if known, or try using a clearer audio sample to improve language detection accuracy.

Whisper STT πŸ‘‚ Related Nodes

Go back to the extension to check out more related nodes.
ComfyUI-EdgeTTS
Copyright 2025 RunComfy. All Rights Reserved.