Facilitates loading and configuring Whisper models for speech recognition tasks, optimizing performance and language-specific settings.
The TT-LoadWhisperModel node loads and configures Whisper models for automatic speech recognition. It builds on OpenAI's Whisper family, known for robust transcription of spoken language into text. The node loads a specified Whisper checkpoint and sets up the necessary processing pipeline, automatically selecting the appropriate computational resources (for example, using a GPU when available) to optimize performance. It also supports language-specific configuration, ensuring that the model is tailored to the desired transcription language. This makes it a useful tool for AI artists and developers who need to integrate speech-to-text functionality into their projects.
The `model_id` parameter specifies the identifier of the Whisper model to be loaded. It determines which version of the Whisper model will be used for transcription. The available options include various sizes and versions of the Whisper model, such as `openai/whisper-large-v3`, `openai/whisper-medium`, and `openai/whisper-tiny`, among others. Each model varies in size and capability, with larger models generally offering more accurate transcriptions at the cost of higher computational requirements. There are also language-specific models, such as `openai/whisper-medium.en`, which are optimized for English. Selecting the appropriate model depends on the specific needs of your transcription task, balancing accuracy and resource usage.
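As a minimal sketch of what loading such a checkpoint involves, the snippet below builds a speech-recognition pipeline with Hugging Face `transformers` (an assumed dependency) and picks a GPU when one is available. The `pick_device` and `load_whisper` helpers are illustrative names, not the node's actual source:

```python
import torch

def pick_device() -> int:
    """Return a transformers device index: 0 (first GPU) or -1 (CPU)."""
    return 0 if torch.cuda.is_available() else -1

def load_whisper(model_id: str = "openai/whisper-tiny"):
    """Build an ASR pipeline for the given Whisper checkpoint.

    Larger checkpoints such as openai/whisper-large-v3 transcribe more
    accurately but need far more memory and compute.
    """
    from transformers import pipeline  # deferred: heavy import
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        device=pick_device(),
    )
```

A loaded pipeline can then be called directly on an audio file path or waveform array to obtain a transcript.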
The `language` parameter defines the language setting for the transcription process. It can be set to `en` for English, `fr` for French, or `auto` to allow the model to automatically detect the language of the input audio. This parameter is crucial for ensuring that the transcription is accurate and that the model is configured correctly for the language of the audio content. If a language-specific model is chosen (e.g., a model ending in `.en`), the language parameter must match the model's language capability; otherwise, an error will be raised. This parameter allows for flexibility in handling multilingual audio content and ensures that the transcription output is as accurate as possible.
The `TRANSCRIPTION_PIPELINE` output parameter provides the configured pipeline ready for performing automatic speech recognition. This pipeline is a comprehensive setup that includes the loaded Whisper model, tokenizer, and feature extractor, all configured to process audio input and generate transcriptions. The pipeline is optimized for performance, utilizing GPU resources if available, and is capable of handling various audio lengths and batch sizes. It returns transcriptions along with timestamps, making it suitable for applications that require detailed analysis of audio content. This output is essential for integrating speech-to-text capabilities into your projects, providing a ready-to-use solution for audio transcription tasks.
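As a sketch of how such a pipeline might be invoked downstream, the helper below collects the call options matching the features described above (timestamps, chunked long audio, batching). `transcription_kwargs` is a hypothetical name; the keyword arguments follow the Hugging Face transformers ASR pipeline API:

```python
def transcription_kwargs(batch_size: int = 8, chunk_length_s: int = 30) -> dict:
    """Options for a transformers automatic-speech-recognition call."""
    return {
        "return_timestamps": True,         # emit timestamped segments
        "chunk_length_s": chunk_length_s,  # split long audio into chunks
        "batch_size": batch_size,          # process chunks in batches
    }

# Usage (requires a loaded pipeline `asr` and an audio file):
#   result = asr("speech.wav", **transcription_kwargs())
#   result["text"]    -> full transcript
#   result["chunks"]  -> list of segments with start/end timestamps
```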
Use the `auto` language setting for audio with unknown or mixed languages to let the model automatically detect and transcribe the content accurately.

If a language-specific model (e.g., one ending in `.en`) is used with a language setting that does not match its capabilities, an error such as "<model_id> only supports English" will be raised. Ensure the `language` parameter is set to `en` when using an English-specific model, or choose a model that supports the desired language.

If PyTorch is not installed, run `pip install torch` in your command line or terminal to resolve this issue.