ComfyUI Node: ElevenLabs Speech to Text

Class Name: ElevenLabsSpeechToText
Category: api node/audio/ElevenLabs
Author: ComfyAnonymous (account age: 763 days)
Extension: ComfyUI
Latest Updated: 2026-05-13
GitHub Stars: 112.77K

How to Install ComfyUI

Install this extension through the ComfyUI Manager by searching for ComfyUI:
  1. Click the Manager button in the main menu.
  2. Select the Custom Nodes Manager button.
  3. Enter ComfyUI in the search bar.
After installation, click the Restart button to restart ComfyUI, then manually refresh your browser to clear the cache and load the updated list of nodes.

ElevenLabs Speech to Text Description

Transcribe audio to text with language detection, speaker diarization, and event tagging for versatile audio processing tasks.

ElevenLabs Speech to Text:

The ElevenLabsSpeechToText node transcribes audio into text. It supports automatic language detection, speaker diarization, and audio event tagging, so a single node can handle multilingual recordings, attribute speech to individual speakers, and annotate non-speech sounds. Typical uses include transcribing interviews and meetings and processing multimedia content, which makes the node useful to content creators, researchers, and developers who want to add speech-to-text to their projects.

ElevenLabs Speech to Text Input Parameters:

audio_input

The audio_input parameter is the node's primary input: the audio to be transcribed. It accepts audio data in various formats. Transcription accuracy depends heavily on the quality of this audio, so use clear recordings with minimal background noise where possible.

language_detection

The language_detection parameter controls whether the node automatically detects the language spoken in the audio input. This is most useful for multilingual content, where the transcription has to adapt to the detected language. Toggle it on or off depending on whether automatic detection is wanted.

speaker_diarization

The speaker_diarization parameter allows the node to identify and differentiate between multiple speakers in the audio input. This is especially useful in scenarios such as meetings or interviews, where it is important to attribute spoken words to the correct speaker. Enabling this feature can enhance the clarity and usefulness of the transcription by providing speaker labels alongside the transcribed text.

audio_event_tagging

The audio_event_tagging parameter controls whether the node tags audio events, such as pauses, background noises, or other significant audio cues, within the transcription. Enabling it adds context that the transcribed words alone do not capture.
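
As a rough illustration of how the four inputs above fit together, the sketch below builds a ComfyUI API-format workflow (a mapping from node id to class_type and inputs) as a plain Python dictionary. The input names are copied from this page and may not match the node's actual widget names in your ComfyUI build; the LoadAudio source node, the node ids, and the file name interview.wav are assumptions made for the example.

```python
import json

# Input names below are taken from this page and are NOT verified against
# the node's real widgets; check them in your ComfyUI build before use.
def build_workflow(audio_file: str,
                   language_detection: bool = True,
                   speaker_diarization: bool = False,
                   audio_event_tagging: bool = False) -> dict:
    """Return an API-format workflow: node id -> {class_type, inputs}."""
    return {
        "1": {
            # Assumed upstream node that loads the audio file.
            "class_type": "LoadAudio",
            "inputs": {"audio": audio_file},
        },
        "2": {
            "class_type": "ElevenLabsSpeechToText",
            "inputs": {
                # ["1", 0] links to output slot 0 of node "1".
                "audio_input": ["1", 0],
                "language_detection": language_detection,
                "speaker_diarization": speaker_diarization,
                "audio_event_tagging": audio_event_tagging,
            },
        },
    }

workflow = build_workflow("interview.wav", speaker_diarization=True)
print(json.dumps(workflow, indent=2))
```

Keeping the toggles as keyword arguments with defaults makes it easy to enable only the features a given recording needs.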

ElevenLabs Speech to Text Output Parameters:

transcribed_text

The transcribed_text output is the node's primary result: a string containing the spoken words from the audio input. Its accuracy and detail depend on the quality of the audio and on how the input parameters, such as language detection and speaker diarization, are configured.

speaker_labels

The speaker_labels output identifies the speakers detected in the audio input and is provided when speaker diarization is enabled. Each label or identifier corresponds to one voice, so you can tell the speakers apart in the transcription. This is particularly useful for producing organized transcripts of conversations or interviews.

audio_event_tags

The audio_event_tags parameter offers a list of tagged audio events detected in the transcription, if audio event tagging is enabled. These tags provide additional context about the audio, such as identifying pauses, background noises, or other significant events. This output can enhance the understanding of the transcription by highlighting important audio cues.
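
This page does not specify the exact shape of these outputs, so the following is only a sketch of how speaker labels might be combined with transcribed text in a downstream step. It assumes a hypothetical list of (speaker_label, text) pairs, one per segment, and merges consecutive segments from the same speaker into one transcript line. The Segment type, the format_transcript helper, and the speaker_0 style labels are all illustrative names, not part of the node.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical shape: one (speaker_label, text) pair per transcribed segment.
Segment = Tuple[str, str]

def format_transcript(segments: List[Segment]) -> str:
    """Merge consecutive segments by the same speaker into single lines."""
    lines = []
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        text = " ".join(piece for _, piece in group)
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

segments = [
    ("speaker_0", "Thanks for joining."),
    ("speaker_0", "Shall we start?"),
    ("speaker_1", "Yes, go ahead."),
]
print(format_transcript(segments))
```

Grouping only consecutive segments preserves the order of the conversation, so a speaker who talks again later gets a new line rather than being merged into an earlier one.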

ElevenLabs Speech to Text Usage Tips:

  • Ensure that the audio input is clear and free from excessive background noise to improve transcription accuracy.
  • Use the language detection feature for multilingual audio so the transcription adapts automatically to the language being spoken.
  • Enable speaker diarization for audio with multiple speakers to obtain a more organized and informative transcription with speaker labels.
  • Use audio event tagging to gain additional insights into the audio content by identifying significant audio cues and events.

ElevenLabs Speech to Text Common Errors and Solutions:

"Audio format not supported"

  • Explanation: The audio input provided is in a format that is not supported by the node.
  • Solution: Convert the audio file to a supported format, such as WAV or MP3, and try again.
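
A simple pre-flight check can catch this error before a file reaches the node. The sketch below tests a file extension against the two formats named above; the full list of formats the node accepts is not given on this page, so ASSUMED_SUPPORTED is an assumption and may be incomplete. The helper name has_supported_extension is hypothetical.

```python
from pathlib import Path

# Assumption: only the two formats this page names. The node may accept
# more, so treat a False result as "check the format", not a hard failure.
ASSUMED_SUPPORTED = {".wav", ".mp3"}

def has_supported_extension(path: str) -> bool:
    """Return True if the file extension is in the assumed supported set."""
    return Path(path).suffix.lower() in ASSUMED_SUPPORTED

for name in ("meeting.WAV", "podcast.mp3", "notes.txt"):
    print(name, has_supported_extension(name))
```

The check looks only at the file name, not the file contents, so a mislabeled file can still be rejected later.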

"Language detection failed"

  • Explanation: The node was unable to automatically detect the language of the audio input.
  • Solution: Manually specify the language if known, or ensure the audio quality is sufficient for language detection.

"Speaker diarization error"

  • Explanation: The node encountered an issue while attempting to differentiate between speakers in the audio input.
  • Solution: Check the audio quality and ensure that the speakers are clearly distinguishable. Consider disabling speaker diarization if not needed.

"Audio event tagging not available"

  • Explanation: The node was unable to tag audio events due to limitations in the audio input or configuration.
  • Solution: Ensure that audio event tagging is enabled and that the audio input is suitable for event detection.
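
For scripted or batch use, the errors and solutions listed above can be kept in a small lookup table so that a failed run reports the suggested fix alongside the raw message. This is a sketch only: the message strings are copied from this page, and whether the node emits exactly this text is not confirmed. The suggest_fix helper is a hypothetical name.

```python
from typing import Optional

# Messages and fixes copied from the list above; the node's real error
# text is not confirmed, so matching is case-insensitive and by substring.
SUGGESTED_FIXES = {
    "Audio format not supported":
        "Convert the audio file to a supported format, such as WAV or MP3.",
    "Language detection failed":
        "Specify the language manually if known, or improve the audio quality.",
    "Speaker diarization error":
        "Check the audio quality, or disable speaker diarization if not needed.",
    "Audio event tagging not available":
        "Enable audio event tagging and check that the audio suits event detection.",
}

def suggest_fix(message: str) -> Optional[str]:
    """Return the suggested fix for a known error message, else None."""
    lowered = message.lower()
    for known, fix in SUGGESTED_FIXES.items():
        if known.lower() in lowered:
            return fix
    return None

print(suggest_fix("Error: speaker diarization error"))
print(suggest_fix("Connection timed out"))
```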

Copyright 2025 RunComfy. All Rights Reserved.