
ComfyUI Node: Audio Transcription (Real-time)

Class Name: AudioTranscriptionNode
Category: audio_utils
Author: livepeer (account age: 3364 days)
Extension: ComfyUI-Stream-Pack
Last Updated: 2025-09-25
GitHub Stars: 0.02K

How to Install ComfyUI-Stream-Pack

Install this extension through the ComfyUI Manager:
  1. Click the Manager button in the main menu.
  2. Click the Custom Nodes Manager button.
  3. Enter ComfyUI-Stream-Pack in the search bar and install the extension.
After installation, click the Restart button to restart ComfyUI, then manually refresh your browser to clear the cache and load the updated list of nodes.

Audio Transcription (Real-time) Description

A real-time audio transcription node that uses the faster-whisper model, with Voice Activity Detection (VAD) for efficient processing.

Audio Transcription (Real-time):

The AudioTranscriptionNode performs real-time audio transcription within ComfyUI. It buffers incoming audio segments, resamples them to 16 kHz, and transcribes them to text with the faster-whisper model. Output timing is throttled to prevent message flooding, which suits live audio scenarios that need prompt transcription feedback. Voice Activity Detection (VAD) skips silent segments so that only speech is processed. The node's parameters let you trade response speed against transcription quality.

Audio Transcription (Real-time) Input Parameters:

audio

The audio data to transcribe, and the node's primary input. It must be in a format the node can process, typically a NumPy array (converted first if necessary).
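As an illustration of the kind of conversion this implies, here is a minimal sketch that turns array-like audio into a one-dimensional float32 NumPy array. The helper name `to_mono_float32`, the (channels, samples) layout, and the signed-PCM scaling are assumptions made for the example; the node's own conversion code may differ.

```python
import numpy as np


def to_mono_float32(audio) -> np.ndarray:
    """Convert array-like audio to a 1-D float32 array (hypothetical helper)."""
    samples = np.asarray(audio)
    if np.issubdtype(samples.dtype, np.signedinteger):
        # Assumption: signed integer PCM; scale by the type's full-scale value
        # so the result lies in roughly [-1, 1].
        scale = float(np.iinfo(samples.dtype).max)
        samples = samples.astype(np.float32) / scale
    else:
        samples = samples.astype(np.float32)
    if samples.ndim == 2:
        # Assumption: layout is (channels, samples); average channels to mono.
        samples = samples.mean(axis=0)
    return samples.reshape(-1)
```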

transcription_interval

This parameter sets the minimum number of seconds between transcription outputs. The default is 2.0 seconds, and the value can range from 1.0 to 10.0 seconds in 0.5-second increments. It controls how often transcriptions are emitted, letting you balance responsiveness against processing load.
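One way to picture a minimum interval between outputs is a small throttle object. This is a sketch under assumptions, not the node's implementation: the class name `OutputThrottle` and the injectable `clock` argument are hypothetical and exist only to make the idea testable.

```python
import time


class OutputThrottle:
    """Allow an output at most once every `interval_s` seconds (hypothetical)."""

    def __init__(self, interval_s: float = 2.0, clock=time.monotonic):
        self.interval_s = interval_s
        self._clock = clock
        self._last_emit = None  # time of the most recent allowed output

    def ready(self) -> bool:
        """Return True and record the time if enough time has passed."""
        now = self._clock()
        if self._last_emit is None or now - self._last_emit >= self.interval_s:
            self._last_emit = now
            return True
        return False
```

A caller would check `ready()` before emitting a transcription and emit nothing for that cycle when it returns False.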

accumulation_duration

This parameter specifies the duration for which audio is accumulated before transcription, with a default of 3.0 seconds. It ranges from 2.0 to 10.0 seconds, adjustable in 0.5-second increments. A shorter duration results in faster output, while a longer duration can improve transcription quality by providing more context to the model.
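The accumulation step can be sketched as a buffer that collects chunks until it holds the requested duration of audio. The class `AudioAccumulator` is hypothetical, and it assumes 16 kHz mono samples and a buffer that is cleared after each window; whether the real node clears the buffer or keeps a sliding window is not stated here.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: audio has already been resampled to 16 kHz


class AudioAccumulator:
    """Collect chunks until `duration_s` seconds are buffered (hypothetical)."""

    def __init__(self, duration_s: float = 3.0, sample_rate: int = SAMPLE_RATE):
        self.target = int(duration_s * sample_rate)  # samples needed per window
        self._chunks = []
        self._count = 0

    def push(self, chunk: np.ndarray):
        """Add a chunk; return the full window once enough audio is buffered."""
        self._chunks.append(chunk)
        self._count += len(chunk)
        if self._count < self.target:
            return None
        window = np.concatenate(self._chunks)
        # Assumption: start from an empty buffer after each window.
        self._chunks.clear()
        self._count = 0
        return window
```

With the default of 3.0 seconds and one-second chunks, the first two calls to `push` return None and the third returns a 48,000-sample window.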

whisper_model

This parameter allows you to select the size of the Whisper model used for transcription, with options including "tiny", "base", "small", "medium", and "large-v2". The default is "base". Larger models offer more accurate transcriptions but require more computational resources and time.

language

This parameter sets the language for transcription, with options such as "auto", "en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", and "zh". The default is "auto", which enables automatic language detection. Specifying a language can improve transcription accuracy if the language is known in advance.

enable_vad

This boolean parameter toggles Voice Activity Detection, which is enabled by default. VAD helps filter out silence from the audio, ensuring that only segments with speech are processed, thereby improving efficiency and reducing unnecessary computation.
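The sketch below shows how the whisper_model, language, and enable_vad inputs could map onto the faster-whisper library. `WhisperModel`, `transcribe`, and the `language` and `vad_filter` arguments are part of faster-whisper; the helper names, the `device="cpu"` and `compute_type="int8"` choices, and the wiring are assumptions for illustration and are not taken from the node's source.

```python
WHISPER_MODELS = ("tiny", "base", "small", "medium", "large-v2")


def build_transcribe_kwargs(language: str = "auto", enable_vad: bool = True) -> dict:
    """Translate node-style inputs into faster-whisper keyword arguments."""
    return {
        # faster-whisper detects the language itself when language is None.
        "language": None if language == "auto" else language,
        # vad_filter enables faster-whisper's built-in voice activity filter.
        "vad_filter": enable_vad,
    }


def transcribe(samples, whisper_model="base", language="auto", enable_vad=True):
    """Transcribe 16 kHz mono float32 samples (hypothetical helper)."""
    if whisper_model not in WHISPER_MODELS:
        raise ValueError(f"unsupported whisper_model: {whisper_model!r}")
    # Imported lazily so the helper above works without the library installed.
    from faster_whisper import WhisperModel

    # Assumption: CPU inference with int8 weights; a real node would likely
    # load the model once and reuse it instead of reloading on every call.
    model = WhisperModel(whisper_model, device="cpu", compute_type="int8")
    segments, info = model.transcribe(
        samples, **build_transcribe_kwargs(language, enable_vad)
    )
    return [(s.start, s.end, s.text) for s in segments], info.language
```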

Audio Transcription (Real-time) Output Parameters:

STRING

The node outputs a string containing the text transcribed from the audio input. The string is formatted according to the output format in use, such as JSON segments, and can be processed further or displayed as needed.
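To make "JSON segments" concrete, here is a sketch that serializes (start, end, text) tuples into a JSON string. The field names `start`, `end`, and `text` and the function `segments_to_json` are assumptions; the node's actual schema is not documented here.

```python
import json


def segments_to_json(segments) -> str:
    """Serialize (start, end, text) tuples as a JSON array (hypothetical schema)."""
    payload = [
        {
            "start": round(float(start), 2),  # segment start time in seconds
            "end": round(float(end), 2),      # segment end time in seconds
            "text": text.strip(),
        }
        for start, end, text in segments
    ]
    # ensure_ascii=False keeps non-Latin transcriptions readable.
    return json.dumps(payload, ensure_ascii=False)
```

A downstream consumer can recover the segments with `json.loads`.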

Audio Transcription (Real-time) Usage Tips:

  • For the fastest response, set accumulation_duration to 2.0 seconds (its minimum) and audio_chunk_size_ms to 1000.0 milliseconds. Note that audio_chunk_size_ms is not among the inputs listed above.
  • For a balance between speed and quality, set accumulation_duration to 4.0 seconds and audio_chunk_size_ms to 2000.0 milliseconds.
  • If transcription quality is the priority, increase accumulation_duration to 8.0 seconds and audio_chunk_size_ms to 4000.0 milliseconds.
  • If the language of the audio is known, set the language parameter explicitly, as this can improve transcription accuracy.
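The three speed and quality settings from the tips can be collected into a small lookup table. The preset names "fast", "balanced", and "quality" and the helper `chunks_per_window` are labels invented for this sketch; the values are the ones given in the tips.

```python
PRESETS = {
    "fast": {"accumulation_duration": 2.0, "audio_chunk_size_ms": 1000.0},
    "balanced": {"accumulation_duration": 4.0, "audio_chunk_size_ms": 2000.0},
    "quality": {"accumulation_duration": 8.0, "audio_chunk_size_ms": 4000.0},
}


def chunks_per_window(preset: str) -> float:
    """Number of chunks that fit in one accumulation window for a preset."""
    p = PRESETS[preset]
    return p["accumulation_duration"] * 1000.0 / p["audio_chunk_size_ms"]
```

In all three settings the accumulation window is exactly two chunks long, so the settings scale latency and context together while keeping that ratio fixed.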

Audio Transcription (Real-time) Common Errors and Solutions:

"Received empty audio data, returning empty"

  • Explanation: This error occurs when the node receives audio data that is either None or has no size, indicating that there is no valid audio input to process.
  • Solution: Ensure that the audio input is correctly formatted and contains valid data before passing it to the node.
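A check along these lines can be run before audio reaches the node. This is a minimal sketch; `is_empty_audio` is a hypothetical helper that mirrors the "None or has no size" condition described above.

```python
import numpy as np


def is_empty_audio(audio) -> bool:
    """Return True if audio is None or contains no samples."""
    return audio is None or np.asarray(audio).size == 0
```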

"Audio resampling to 16Khz failed, returning empty"

  • Explanation: Resampling the audio data to 16 kHz failed. Whisper models expect 16 kHz input, so the node cannot transcribe audio it is unable to resample.
  • Solution: Check the format and integrity of the audio data to confirm it is compatible with resampling. In particular, verify the sample rate and data type of the input audio.
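To test input audio outside the node, resampling to 16 kHz can be approximated with linear interpolation in NumPy. This is a stand-in, not the node's resampler: `resample_linear` is a hypothetical helper, it applies no anti-aliasing filter, and it accepts only one-dimensional audio with a positive sample rate.

```python
import numpy as np

TARGET_RATE = 16_000


def resample_linear(samples, source_rate: int, target_rate: int = TARGET_RATE) -> np.ndarray:
    """Resample 1-D audio by linear interpolation (simple stand-in)."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim != 1:
        raise ValueError("expected 1-D mono audio")
    if source_rate <= 0:
        raise ValueError("source_rate must be positive")
    if source_rate == target_rate or samples.size == 0:
        return samples
    n_out = int(round(samples.size * target_rate / source_rate))
    src_t = np.arange(samples.size) / source_rate  # input sample times (s)
    dst_t = np.arange(n_out) / target_rate         # output sample times (s)
    return np.interp(dst_t, src_t, samples).astype(np.float32)
```

If this helper raises on your data, the sample rate or array shape is a likely cause of the node's resampling failure as well.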
