RunComfy

ComfyUI Node: ElevenLabs Speech to Text

Class Name

ElevenLabsSpeechToText

Category
api node/audio/ElevenLabs

Author
ComfyAnonymous (Account age: 763 days)

Extension
ComfyUI

Latest Updated
2026-05-13

Github Stars
112.77K

Table of Contents

Description
ElevenLabsSpeechToText:
ElevenLabsSpeechToText Input Parameters:
ElevenLabsSpeechToText Output Parameters:
ElevenLabsSpeechToText Usage Tips:
ElevenLabsSpeechToText Common Errors and Solutions:
Related Nodes

How to Install ComfyUI

Install this extension via the ComfyUI Manager by searching for ComfyUI:

1. Click the Manager button in the main menu.
2. Select the Custom Nodes Manager button.
3. Enter ComfyUI in the search bar.

After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

ElevenLabs Speech to Text Description

Transcribe audio to text with language detection, speaker diarization, and event tagging for versatile audio processing tasks.

ElevenLabs Speech to Text:

The ElevenLabsSpeechToText node transcribes audio into text. It supports automatic language detection, speaker diarization, and audio event tagging, so it can handle multilingual recordings and attribute speech to individual speakers. Typical uses include transcribing interviews and meetings and processing multimedia content, which makes the node useful to content creators, researchers, and developers who want to add speech-to-text to their projects.

ElevenLabs Speech to Text Input Parameters:

audio_input

The audio_input parameter is the primary input for the node, where you provide the audio file that needs to be transcribed. This parameter accepts audio data in various formats, ensuring compatibility with a wide range of audio sources. The quality and clarity of the audio input can significantly impact the accuracy of the transcription, so it is advisable to use clear recordings with minimal background noise for optimal results.

language_detection

The language_detection parameter enables the node to automatically detect the language spoken in the audio input. This feature is particularly useful when dealing with multilingual content, as it allows the node to adapt its transcription process to the detected language, ensuring more accurate results. The parameter can be toggled on or off, depending on whether automatic language detection is desired.

speaker_diarization

The speaker_diarization parameter allows the node to identify and differentiate between multiple speakers in the audio input. This is especially useful in scenarios such as meetings or interviews, where it is important to attribute spoken words to the correct speaker. Enabling this feature can enhance the clarity and usefulness of the transcription by providing speaker labels alongside the transcribed text.

audio_event_tagging

The audio_event_tagging parameter enables the node to tag specific audio events within the transcription. This can include identifying pauses, background noises, or other significant audio cues that may be relevant to the context of the transcription. This feature adds an additional layer of detail to the transcription, making it more informative and contextually rich.
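
For readers who drive ComfyUI through its HTTP API rather than the graph editor, the four inputs above could be wired roughly as follows. This is a minimal sketch in ComfyUI's API prompt format, a JSON object keyed by node id where each node has a `class_type` and `inputs`. The input names are taken from the parameter list on this page, and the boolean values are assumed from the "toggled on or off" wording. The node ids, the `LoadAudio` source node, and the file name `interview.wav` are illustrative assumptions, so check them against the node as it appears in your installation.

```python
import json

# Hypothetical node ids ("1", "2") and file name. Input names follow this
# page's parameter list and may differ from the node in your ComfyUI build.
workflow = {
    "1": {
        "class_type": "LoadAudio",
        "inputs": {"audio": "interview.wav"},
    },
    "2": {
        "class_type": "ElevenLabsSpeechToText",
        "inputs": {
            # A link to another node is written as [source_node_id, output_index].
            "audio_input": ["1", 0],
            "language_detection": True,
            "speaker_diarization": True,
            "audio_event_tagging": False,
        },
    },
}

# JSON body that would be POSTed to a running ComfyUI server's /prompt endpoint.
payload = json.dumps({"prompt": workflow}, indent=2)
print(payload)
```

Keeping the toggles explicit in the workflow makes it easy to see which optional outputs (speaker labels, event tags) a given run is expected to produce.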

ElevenLabs Speech to Text Output Parameters:

transcribed_text

The transcribed_text parameter is the primary output of the node, providing the text transcription of the audio input. This output is a string of text that represents the spoken words in the audio file, converted into written form. The accuracy and detail of the transcribed text depend on the quality of the audio input and the configuration of the input parameters, such as language detection and speaker diarization.

speaker_labels

The speaker_labels parameter provides information about the different speakers identified in the audio input, if speaker diarization is enabled. This output includes labels or identifiers for each speaker, allowing you to distinguish between different voices in the transcription. This can be particularly useful for creating detailed and organized transcripts of conversations or interviews.

audio_event_tags

The audio_event_tags parameter offers a list of tagged audio events detected in the transcription, if audio event tagging is enabled. These tags provide additional context about the audio, such as identifying pauses, background noises, or other significant events. This output can enhance the understanding of the transcription by highlighting important audio cues.
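
As an illustration of how speaker labels can be turned into a readable transcript, the sketch below merges consecutive utterances by the same speaker into one line. This page does not specify the data shape of `speaker_labels`, so the list of `{"speaker", "text"}` segments and the `format_transcript` helper are hypothetical; adapt them to whatever structure the node actually returns.

```python
from itertools import groupby

def format_transcript(segments):
    """Merge consecutive segments by the same speaker into 'speaker: text' lines."""
    lines = []
    # groupby only groups adjacent items, which preserves the order of turns.
    for speaker, group in groupby(segments, key=lambda seg: seg["speaker"]):
        text = " ".join(seg["text"].strip() for seg in group)
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

# Hypothetical diarized segments, used only to demonstrate the helper.
segments = [
    {"speaker": "speaker_0", "text": "Thanks for joining."},
    {"speaker": "speaker_0", "text": "Shall we start?"},
    {"speaker": "speaker_1", "text": "Yes, go ahead."},
]
print(format_transcript(segments))
```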

ElevenLabs Speech to Text Usage Tips:

Ensure that the audio input is clear and free from excessive background noise to improve transcription accuracy.
Utilize the language detection feature for multilingual audio content to automatically adapt the transcription process to the correct language.
Enable speaker diarization for audio with multiple speakers to obtain a more organized and informative transcription with speaker labels.
Use audio event tagging to gain additional insights into the audio content by identifying significant audio cues and events.

ElevenLabs Speech to Text Common Errors and Solutions:

"Audio format not supported"

Explanation: The audio input provided is in a format that is not supported by the node.
Solution: Convert the audio file to a supported format, such as WAV or MP3, and try again.
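
One way to do the conversion is with the `ffmpeg` command-line tool. The sketch below only builds the command, so it runs without `ffmpeg` installed; the helper name `build_wav_command`, the file name `meeting.m4a`, and the mono 16 kHz settings are assumptions chosen as a common speech-audio default, not requirements stated by this node.

```python
import shlex
from pathlib import Path

def build_wav_command(source, sample_rate=16000):
    """Return an ffmpeg argument list that converts `source` to a mono WAV file."""
    src = Path(source)
    dst = src.with_suffix(".wav")  # same name, .wav extension
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input file
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the requested rate
        str(dst),
    ]

cmd = build_wav_command("meeting.m4a")
print(shlex.join(cmd))
# To actually convert, pass the list to subprocess.run(cmd, check=True).
```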

"Language detection failed"

Explanation: The node was unable to automatically detect the language of the audio input.
Solution: Manually specify the language if known, or ensure the audio quality is sufficient for language detection.

"Speaker diarization error"

Explanation: The node encountered an issue while attempting to differentiate between speakers in the audio input.
Solution: Check the audio quality and ensure that the speakers are clearly distinguishable. Consider disabling speaker diarization if not needed.

"Audio event tagging not available"

Explanation: The node was unable to tag audio events due to limitations in the audio input or configuration.
Solution: Ensure that audio event tagging is enabled and that the audio input is suitable for event detection.

ElevenLabs Speech to Text Related Nodes

Go back to the extension to check out more related nodes.

ComfyUI


RunComfy

RunComfy is the premier ComfyUI platform, offering an online ComfyUI environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI models, enabling artists to use the latest AI tools to create art.
