
ComfyUI Node: Qwen3 ASR Transcriber

Class Name

Qwen3ASRTranscriber

Category
Qwen3-ASR
Author
kaushiknishchay (Account age: 3782days)
Extension
ComfyUI-Qwen3-ASR
Latest Updated
2026-03-05
Github Stars
0.01K

How to Install ComfyUI-Qwen3-ASR

Install this extension via the ComfyUI Manager by searching for ComfyUI-Qwen3-ASR:
  1. Click the Manager button in the main menu.
  2. Select the Custom Nodes Manager button.
  3. Enter ComfyUI-Qwen3-ASR in the search bar.
After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

Qwen3 ASR Transcriber Description

Qwen3ASRTranscriber converts audio to text, supports multiple languages, and provides timestamps.

Qwen3 ASR Transcriber:

The Qwen3ASRTranscriber node performs automatic speech recognition (ASR) with the Qwen3-ASR models, converting speech in an audio input into written text. It handles multiple languages and can detect the language of the audio automatically if needed. Long recordings can be processed in chunks, so the model works on one bounded segment at a time. When a forced aligner configuration is used, the node can also generate word-level timestamps, which is useful for synchronizing text with audio, for example in subtitling or detailed analysis of speech patterns.

Qwen3 ASR Transcriber Input Parameters:

audio

The audio parameter is the input audio that you want to transcribe. It must carry two things: a waveform with dimensions [Batch, Channels, Samples] (in ComfyUI this is normally a tensor) and a sample rate given as an integer. This is the primary data source for the transcription process.

model_name

The model_name parameter specifies which Qwen3 ASR model to use for the transcription task. This allows you to choose from a list of available models, each potentially optimized for different languages or types of audio content. Selecting the appropriate model can significantly impact the accuracy and quality of the transcription.

language

The language parameter determines the language of the audio content. You can set it to a specific language from the supported list or choose "auto" for automatic language detection. This flexibility ensures that the transcription process is tailored to the linguistic characteristics of the audio, enhancing the accuracy of the output.

device

The device parameter indicates the hardware on which the ASR model will run. You can choose between "cuda" for GPU processing or "cpu" for CPU processing. The default is "cuda," which is generally faster and more efficient for large-scale transcription tasks, provided you have a compatible GPU.

precision

The precision parameter defines the numerical precision used during the ASR model's execution. Options include "bf16," "fp16," and "fp32," with "bf16" as the default. This setting affects both speed and memory usage: "bf16" and "fp16" are 16-bit formats that need about half the memory of "fp32" and typically run faster, at the cost of some numerical accuracy.
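As a rough illustration of why precision matters for memory, the sketch below estimates the space taken by model weights alone at each precision. The bytes-per-parameter figures are standard for these formats; the parameter count is a hypothetical placeholder, not a figure for any particular Qwen3-ASR model, and the estimate ignores activations and other runtime buffers.

```python
# Bytes needed to store one parameter at each precision option.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_memory_gib(num_params, precision="bf16"):
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    hypothetical_params = 1_000_000_000  # placeholder, not a real model size
    for precision in ("bf16", "fp16", "fp32"):
        gib = weight_memory_gib(hypothetical_params, precision)
        print(f"{precision}: about {gib:.2f} GiB for weights")
```

Actual VRAM usage will be higher than this estimate, so treat it as a lower bound when comparing settings.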

max_new_tokens

The max_new_tokens parameter sets the maximum number of tokens that the model may generate for the transcription. It ranges from 1 to 4096, with a default value of 256. This is an upper limit on output length rather than a style control: if the value is too low for the amount of speech being decoded, the transcribed text is cut off, so raise it when output appears truncated.

flash_attention_2

The flash_attention_2 parameter is a boolean option that, when enabled, activates Flash Attention 2. This feature can speed up the inference process and reduce VRAM usage, making it beneficial for handling large audio files or when working with limited hardware resources.

chunk_size

The chunk_size parameter specifies the duration, in seconds, of each audio chunk to be processed. It ranges from 0 to 300 seconds, with a default of 30 seconds. Chunking is essential for managing long audio files, as it breaks them into smaller, more manageable segments, ensuring that the transcription process remains efficient and accurate.

overlap

The overlap parameter defines the overlap duration, in seconds, between consecutive audio chunks. It ranges from 0 to 10 seconds, with a default of 2 seconds. Overlapping helps maintain context between chunks, improving the coherence and continuity of the transcription, especially in cases where sentences or phrases span multiple chunks.
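To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of how overlapping chunk boundaries could be computed. It is an assumption about the general technique, not the node's actual implementation; the function name chunk_spans is hypothetical, and treating a chunk_size of 0 as "no chunking" is also an assumption.

```python
def chunk_spans(total_samples, sample_rate, chunk_size_s=30.0, overlap_s=2.0):
    """Return (start, end) sample indices for overlapping chunks."""
    if total_samples <= 0:
        return []
    chunk = int(round(chunk_size_s * sample_rate))
    overlap = int(round(overlap_s * sample_rate))
    # Assumption: chunk_size 0 (or audio shorter than one chunk) means one span.
    if chunk <= 0 or chunk >= total_samples:
        return [(0, total_samples)]
    if overlap >= chunk:
        raise ValueError("overlap must be shorter than chunk_size")
    step = chunk - overlap  # each chunk starts this far after the previous one
    spans = []
    start = 0
    while True:
        end = min(start + chunk, total_samples)
        spans.append((start, end))
        if end == total_samples:
            break
        start += step
    return spans


if __name__ == "__main__":
    sr = 16000
    for start, end in chunk_spans(70 * sr, sr):
        print(f"{start / sr:.0f}s to {end / sr:.0f}s")
```

With the default values, each chunk begins 28 seconds after the previous one, so the last 2 seconds of one chunk are heard again at the start of the next.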

Qwen3 ASR Transcriber Output Parameters:

text

The text output parameter provides the transcribed text from the input audio. This is the primary output of the node, delivering the spoken content in a written format. The accuracy and quality of this text depend on the model used and the clarity of the audio input.

timestamps

The timestamps output parameter contains the timing information for each word or phrase in the transcription, formatted as start and end times. Word-level timestamps are produced when a forced aligner configuration is used; they allow precise synchronization of the text with the audio for applications like subtitling or detailed speech analysis.
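For the subtitling use case, a sketch like the following could turn word timings into SRT cues. The exact structure of the timestamps output is not documented here, so the (word, start_seconds, end_seconds) tuples below are an assumed shape, and both function names are hypothetical.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        text = " ".join(word for word, _, _ in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    # Assumed shape of the timing data; adapt to the node's real output.
    sample = [("hello", 0.0, 0.4), ("world", 0.5, 1.0)]
    print(words_to_srt(sample))
```

The max_words limit is only a simple way to keep cues short; splitting on pauses or punctuation would usually read better.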

Qwen3 ASR Transcriber Usage Tips:

  • For optimal performance, ensure that your audio files are clear and free from excessive background noise, as this can significantly impact transcription accuracy.
  • When working with long audio files, set chunk_size to break the audio into smaller segments and keep a non-zero overlap. Chunking bounds the work done per model call, and the overlap preserves context across chunk boundaries so that words and phrases spanning two chunks are less likely to be lost.
  • If you have a compatible GPU, set the device parameter to "cuda" to leverage faster processing speeds, especially for large-scale transcription tasks.
  • Enable flash_attention_2 if you are working with limited VRAM resources, as it can reduce memory usage and speed up inference.
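The documented ranges, choices, and defaults for the input parameters can be collected into a small pre-flight check, sketched below. The values come from the parameter descriptions on this page; the helper name check_settings and the idea of validating outside the node are illustrative assumptions, not part of the extension.

```python
# Ranges, choices, and defaults as documented for this node.
RANGES = {"max_new_tokens": (1, 4096), "chunk_size": (0, 300), "overlap": (0, 10)}
CHOICES = {"device": ("cuda", "cpu"), "precision": ("bf16", "fp16", "fp32")}
DEFAULTS = {
    "device": "cuda",
    "precision": "bf16",
    "max_new_tokens": 256,
    "chunk_size": 30,
    "overlap": 2,
}


def check_settings(overrides=None):
    """Merge overrides into the defaults and list any out-of-range values."""
    settings = {**DEFAULTS, **(overrides or {})}
    errors = []
    for name, (low, high) in RANGES.items():
        if not low <= settings[name] <= high:
            errors.append(f"{name}={settings[name]} is outside {low}..{high}")
    for name, allowed in CHOICES.items():
        if settings[name] not in allowed:
            errors.append(f"{name}={settings[name]!r} is not one of {allowed}")
    return settings, errors


if __name__ == "__main__":
    settings, errors = check_settings({"device": "cpu", "overlap": 12})
    print(settings)
    print(errors)
```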

Qwen3 ASR Transcriber Common Errors and Solutions:

"Model not found in cache"

  • Explanation: This error occurs when the specified ASR model is not available in the cache, possibly due to an incorrect model_name or a cache clearing operation.
  • Solution: Verify that the model_name is correct and corresponds to an available model. If necessary, reload the model into the cache before attempting transcription again.

"Audio format not supported"

  • Explanation: This error indicates that the input audio does not meet the required format specifications, such as incorrect waveform dimensions or sample rate.
  • Solution: Ensure that the audio input is formatted correctly, with the waveform shaped [Batch, Channels, Samples] and a valid integer sample rate. Resample the audio to 16000Hz if needed.
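A quick structural check along these lines can help locate the problem before the audio reaches the node. This is a sketch under assumptions: it expects a dictionary with "waveform" and "sample_rate" keys, as ComfyUI audio inputs are commonly laid out, and it accepts either an object with a shape attribute (such as a tensor) or plain nested lists. The helper names are hypothetical.

```python
def waveform_shape(waveform):
    """Return the dimensions of a tensor-like object or of nested lists."""
    shape = getattr(waveform, "shape", None)
    if shape is not None:
        return tuple(int(d) for d in shape)
    dims = []
    level = waveform
    while isinstance(level, (list, tuple)):
        dims.append(len(level))
        if not level:
            break
        level = level[0]
    return tuple(dims)


def check_audio(audio):
    """Return a list of problems found in an audio input; empty means OK."""
    problems = []
    shape = waveform_shape(audio.get("waveform"))
    if len(shape) != 3:
        problems.append(f"waveform has {len(shape)} dimensions, expected 3")
    elif 0 in shape:
        problems.append(f"waveform has an empty dimension: {shape}")
    rate = audio.get("sample_rate")
    if isinstance(rate, bool) or not isinstance(rate, int) or rate <= 0:
        problems.append(f"sample_rate must be a positive integer, got {rate!r}")
    return problems


if __name__ == "__main__":
    good = {"waveform": [[[0.0, 0.1, 0.0]]], "sample_rate": 16000}
    bad = {"waveform": [[0.0, 0.1]], "sample_rate": 16000.0}
    print(check_audio(good))
    print(check_audio(bad))
```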

"Insufficient VRAM for processing"

  • Explanation: This error suggests that there is not enough VRAM available to process the audio with the current settings, particularly when using high precision or large models.
  • Solution: If precision is set to "fp32," reduce it to "fp16" or "bf16." You can also enable flash_attention_2 or lower chunk_size so that the audio is processed in smaller chunks, which decreases VRAM usage.

Qwen3 ASR Transcriber Related Nodes

Go back to the extension to check out more related nodes.
ComfyUI-Qwen3-ASR
RunComfy
Copyright 2025 RunComfy. All Rights Reserved.
