
ComfyUI Node: Apply Whisper

Class Name: Apply Whisper
Author: yuvraj108c (Account age: 2153 days)
Extension: ComfyUI Whisper

How to Install ComfyUI Whisper

Install this extension via the ComfyUI Manager by searching for ComfyUI Whisper:

  1. Click the Manager button in the main menu
  2. Select the Custom Nodes Manager button
  3. Enter ComfyUI Whisper in the search bar

After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.


Apply Whisper Description

Transcribe audio to text with high accuracy and precise timing using the Whisper model, designed for AI artists.

Apply Whisper:

The Apply Whisper node is designed to transcribe audio files into text using the Whisper model, a state-of-the-art speech recognition system. This node is particularly useful for AI artists who need to convert spoken words into written text for further processing, such as adding subtitles to videos or creating text-based content from audio recordings. By leveraging the Whisper model, the node ensures high accuracy in transcription, capturing not only the text but also the precise timing of each word and segment. This detailed alignment information can be invaluable for synchronizing subtitles with audio or for any application requiring precise timing data.

Apply Whisper Input Parameters:


audio

The audio parameter expects an input of type VHS_AUDIO. This parameter represents the audio data that you want to transcribe. The audio data should be provided in a format that the node can process, typically as a byte stream. The quality and clarity of the audio can significantly impact the accuracy of the transcription, so it is advisable to use clear and noise-free recordings.


model

The model parameter allows you to select the specific Whisper model to use for transcription. The available options are base, tiny, small, medium, and large. Each model varies in size and accuracy, with larger models generally providing more accurate transcriptions but requiring more computational resources. The choice of model can affect the speed and accuracy of the transcription process, so you should select the model that best fits your needs and available resources.
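As a rough guide to that trade-off, the sketch below pairs each model name with its parameter count and approximate VRAM requirement as listed in the openai/whisper README (treat these as ballpark figures). The largest_model_for helper is hypothetical, not part of the node.

```python
# Whisper model trade-offs: name -> (parameters in millions, approx. VRAM in GB).
# Figures are approximate, taken from the openai/whisper README.
MODELS = {
    "tiny":   (39,   1),
    "base":   (74,   1),
    "small":  (244,  2),
    "medium": (769,  5),
    "large":  (1550, 10),
}

def largest_model_for(vram_gb):
    """Pick the most accurate model that fits the given VRAM budget.

    Relies on MODELS being ordered from smallest to largest (dicts
    preserve insertion order), so the last fitting entry is the biggest.
    """
    fitting = [name for name, (_, need) in MODELS.items() if need <= vram_gb]
    return fitting[-1] if fitting else None
```

For example, on a GPU with about 2 GB of free VRAM this would suggest the small model, while a 16 GB card can comfortably run large.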

Apply Whisper Output Parameters:


text

The text output parameter provides the transcribed text from the input audio. This is the main output of the node and contains the entire spoken content converted into written form. The text is stripped of any leading or trailing whitespace to ensure clean and accurate results.


segments_alignment

The segments_alignment output parameter is a list of dictionaries, each representing a segment of the transcribed text. Each dictionary contains the value (the transcribed text of the segment), start (the start time of the segment in the audio), and end (the end time of the segment). This detailed alignment information is useful for applications that require precise synchronization of text with audio, such as subtitle generation.
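To illustrate how this output can drive subtitle generation, here is a minimal sketch that turns a segments_alignment-style list into SRT subtitle text. The sample segments are invented, and segments_to_srt is a hypothetical helper, not part of the node.

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Convert a segments_alignment list ({'value','start','end'} dicts,
    times in seconds) into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n"
            f"{seg['value'].strip()}\n"
        )
    return "\n".join(blocks)

# Made-up sample data following the documented segment structure.
segments = [
    {"value": "Hello there.", "start": 0.0, "end": 1.2},
    {"value": "Welcome to the demo.", "start": 1.2, "end": 3.5},
]
print(segments_to_srt(segments))
```

The resulting string can be written to an .srt file and loaded by most video players alongside the original audio or video.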


words_alignment

The words_alignment output parameter is a list of dictionaries, each representing a word in the transcribed text. Each dictionary contains the value (the transcribed word), start (the start time of the word in the audio), and end (the end time of the word). This fine-grained alignment data is essential for tasks that need exact word-level timing, such as creating karaoke-style lyrics or detailed subtitle tracks.
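For karaoke-style highlighting, the key operation is finding which word is being spoken at a given playback time. Here is a minimal sketch over a words_alignment-style list; word_at and the sample data are illustrative, not part of the node.

```python
def word_at(words, t):
    """Return the word whose [start, end) interval contains time t,
    or None if no word is being spoken at that moment.

    words is a words_alignment-style list of {'value','start','end'}
    dicts with times in seconds.
    """
    for w in words:
        if w["start"] <= t < w["end"]:
            return w["value"]
    return None

# Made-up sample data following the documented word structure.
words = [
    {"value": "Hello", "start": 0.0, "end": 0.4},
    {"value": "there", "start": 0.4, "end": 0.9},
]
```

A player loop would call word_at with the current playback position on each frame and highlight the returned word; a binary search would be the natural optimization for long word lists.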

Apply Whisper Usage Tips:

  • For optimal transcription accuracy, ensure that your audio input is clear and free from background noise.
  • Choose the Whisper model that best fits your needs; larger models like large offer higher accuracy but require more computational power.
  • Utilize the segments_alignment and words_alignment outputs to create precisely timed subtitles or to analyze the timing of spoken words in your audio.

Apply Whisper Common Errors and Solutions:

"File not found" error

  • Explanation: This error occurs if the audio file cannot be saved to the temporary directory.
  • Solution: Ensure that the temporary directory is writable and that there is sufficient disk space.

"Model loading failed" error

  • Explanation: This error occurs if the specified Whisper model cannot be loaded.
  • Solution: Verify that the model name is correct and that the necessary model files are available and accessible.

"Transcription failed" error

  • Explanation: This error occurs if the Whisper model fails to transcribe the audio.
  • Solution: Check the quality of the input audio and ensure it is in a supported format. If the problem persists, try using a different Whisper model.

Apply Whisper Related Nodes

Go back to the extension to check out more related nodes.
ComfyUI Whisper

© Copyright 2024 RunComfy. All Rights Reserved.
