Audio Transcription (Real-time):
The AudioTranscriptionNode provides real-time audio transcription within the ComfyUI framework. It buffers incoming audio segments and transcribes them into text using the faster-whisper model, throttling output timing to prevent message flooding, which makes it well suited to live audio processing. Audio is resampled to 16kHz before transcription, and optional Voice Activity Detection (VAD) skips silent segments to avoid unnecessary processing. The node's parameters let you trade off latency against transcription quality, from fast responses to higher-quality outputs.
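The accumulate-then-transcribe flow described above can be sketched roughly as follows. This is illustrative only, not the node's actual code: `transcribe_fn` is a hypothetical stand-in for the faster-whisper model call, and chunk handling is simplified to plain Python lists.

```python
import time

def run_realtime_transcription(audio_chunks, transcribe_fn,
                               accumulation_duration=3.0,
                               transcription_interval=2.0,
                               sample_rate=16000):
    """Sketch of an accumulate-then-transcribe loop (assumed behavior).

    audio_chunks: iterable of lists of float samples at sample_rate.
    transcribe_fn: callable taking a list of samples and returning text
                   (stands in for the faster-whisper model call).
    """
    buffer = []
    last_output = 0.0
    results = []
    for chunk in audio_chunks:
        buffer.extend(chunk)
        buffered_seconds = len(buffer) / sample_rate
        now = time.monotonic()
        # Transcribe only once enough audio has accumulated AND the
        # minimum interval since the last output has elapsed.
        if (buffered_seconds >= accumulation_duration
                and now - last_output >= transcription_interval):
            results.append(transcribe_fn(buffer))
            buffer = []
            last_output = now
    return results
```

The two gates correspond directly to `accumulation_duration` (how much audio feeds each transcription) and `transcription_interval` (how often output is allowed).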
Audio Transcription (Real-time) Input Parameters:
audio
This parameter represents the audio data that needs to be transcribed. It is crucial as it serves as the primary input for the transcription process. The audio data should be in a format that can be processed by the node, typically as a numpy array after conversion if necessary.
transcription_interval
This parameter defines the minimum number of seconds between transcription outputs, with a default value of 2.0 seconds. It can range from 1.0 to 10.0 seconds, adjustable in 0.5-second increments. This setting is optimized for real-time transcription, letting you control how frequently transcriptions are emitted and balance responsiveness against processing load.
accumulation_duration
This parameter specifies the duration for which audio is accumulated before transcription, with a default of 3.0 seconds. It ranges from 2.0 to 10.0 seconds, adjustable in 0.5-second increments. A shorter duration results in faster output, while a longer duration can improve transcription quality by providing more context to the model.
whisper_model
This parameter allows you to select the size of the Whisper model used for transcription, with options including "tiny", "base", "small", "medium", and "large-v2". The default is "base". Larger models offer more accurate transcriptions but require more computational resources and time.
language
This parameter sets the language for transcription, with options such as "auto", "en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", and "zh". The default is "auto", which enables automatic language detection. Specifying a language can improve transcription accuracy if the language is known in advance.
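In Whisper-style backends, automatic detection is usually requested by passing no language at all, so the "auto" option can be mapped along these lines (a hypothetical helper, not the node's actual code):

```python
def resolve_language(language: str):
    """Map the node's language option to a Whisper-style argument.

    "auto" becomes None, which asks the backend to auto-detect the
    language; any explicit code (e.g. "en", "ja") passes through.
    """
    return None if language == "auto" else language
```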
enable_vad
This boolean parameter toggles Voice Activity Detection, which is enabled by default. VAD helps filter out silence from the audio, ensuring that only segments with speech are processed, thereby improving efficiency and reducing unnecessary computation.
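The VAD built into faster-whisper is model-based (Silero), but the idea of skipping silent segments can be illustrated with a crude energy gate. This simplified stand-in is not what the node uses; it only shows why filtering silence saves computation:

```python
import math

def has_speech(samples, threshold=0.01):
    """Crude energy-based speech gate (illustrative stand-in for VAD).

    Checks whether the RMS energy of a segment exceeds a silence
    threshold, so near-silent buffers can be skipped without ever
    invoking the transcription model.
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold
```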
Audio Transcription (Real-time) Output Parameters:
STRING
The output of the AudioTranscriptionNode is a string containing the transcribed text from the audio input. This output is crucial as it represents the final result of the transcription process, providing a textual representation of the spoken content in the audio. The transcription is formatted according to the specified output format, such as JSON segments, which can be further processed or displayed as needed.
Audio Transcription (Real-time) Usage Tips:
- To achieve fast response times, set the `accumulation_duration` to 2.0 seconds and adjust the `audio_chunk_size_ms` to 1000.0 milliseconds.
- For a balanced approach between speed and quality, consider setting the `accumulation_duration` to 4.0 seconds and `audio_chunk_size_ms` to 2000.0 milliseconds.
- If high transcription quality is a priority, increase the `accumulation_duration` to 8.0 seconds and `audio_chunk_size_ms` to 4000.0 milliseconds.
- Utilize the `language` parameter to specify the language of the audio if known, as this can enhance transcription accuracy.
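The three speed/quality presets from the tips above can be captured as a small configuration table (parameter names taken from the tips; the dictionary itself is just an illustration, not part of the node):

```python
# Presets from the usage tips: accumulation_duration in seconds,
# audio_chunk_size_ms in milliseconds.
PRESETS = {
    "fast":     {"accumulation_duration": 2.0, "audio_chunk_size_ms": 1000.0},
    "balanced": {"accumulation_duration": 4.0, "audio_chunk_size_ms": 2000.0},
    "quality":  {"accumulation_duration": 8.0, "audio_chunk_size_ms": 4000.0},
}
```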
Audio Transcription (Real-time) Common Errors and Solutions:
"Received empty audio data, returning empty"
- Explanation: This error occurs when the node receives audio data that is either `None` or has no size, indicating that there is no valid audio input to process.
- Solution: Ensure that the audio input is correctly formatted and contains valid data before passing it to the node.
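The check behind this message is essentially a guard clause on the input. A minimal sketch of such a guard (assumed behavior, not the node's actual code) might be:

```python
def validate_audio(audio):
    """Guard clause mirroring the node's empty-input check (sketch).

    Returns the audio unchanged when it is usable, or None when the
    input is missing or empty -- the case that produces the
    "Received empty audio data, returning empty" message.
    """
    if audio is None or len(audio) == 0:
        return None
    return audio
```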
"Audio resampling to 16Khz failed, returning empty"
- Explanation: This error indicates a failure in the process of resampling the audio data to 16kHz, which is necessary for optimal performance of the Whisper model.
- Solution: Check the format and integrity of the audio data to ensure it is compatible with the resampling process. Consider verifying the sample rate and data type of the input audio.
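To see what the 16kHz conversion involves, here is a simplified resampler using linear interpolation. Production code would normally use a proper polyphase resampler (e.g. from torchaudio or scipy); this sketch only illustrates the sample-rate conversion the node performs before handing audio to Whisper:

```python
def resample_to_16k(samples, src_rate, dst_rate=16000):
    """Resample via linear interpolation (simplified sketch).

    samples: list of float samples at src_rate.
    Returns a list of samples at dst_rate (16 kHz by default).
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1) # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```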
