RunComfy

ComfyUI Node: Audio To Phonemes

Class Name: AudioToPhonemes
Category: Trent/LipSync
Author: TrentHunter82 (Account age: 0 days)
Extension: TrentNodes
Latest Updated: 2026-03-20
GitHub Stars: 0.03K

Table of Contents

Description
AudioToPhonemes:
AudioToPhonemes Input Parameters:
AudioToPhonemes Output Parameters:
AudioToPhonemes Usage Tips:
AudioToPhonemes Common Errors and Solutions:
Related Nodes

How to Install TrentNodes

Install this extension via the ComfyUI Manager by searching for TrentNodes:

1. Click the Manager button in the main menu.
2. Select the Custom Nodes Manager button.
3. Enter TrentNodes in the search bar.

After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

Audio To Phonemes Description

Converts audio to phonemes for precise lip-sync animation using Vosk speech recognition.

Audio To Phonemes:

The AudioToPhonemes node converts audio input into a timed sequence of phonemes, the distinct units of sound in speech. It uses the Vosk speech recognition system to extract phoneme-level timing information from the audio, which makes it useful for lip-sync animation, where mouth shapes must line up with speech sounds. The node supports different languages and model sizes, and its output can drive an animated character's mouth movements in sync with an audio track.

Audio To Phonemes Input Parameters:

audio

The audio parameter is a dictionary containing the waveform and sample rate of the audio input. The waveform is the raw audio signal that the node processes to extract phonemes, and the sample rate is the number of samples per second. The parameter has no specific minimum or maximum value, but low-quality audio or a very low sample rate reduces the accuracy of phoneme recognition.
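
As a rough illustration of the dictionary described above, the sketch below reads the channel count, sample count, and duration from an audio input. It assumes the waveform is laid out as [batch, channels, samples], which is a common ComfyUI convention but is not confirmed by this page; `describe_audio` is a hypothetical helper name. Nested lists stand in for the tensor here so the example has no dependencies.

```python
def describe_audio(audio):
    """Return (channels, samples, duration_seconds) for the first batch item."""
    waveform = audio["waveform"]        # assumed layout: [batch, channels, samples]
    sample_rate = audio["sample_rate"]  # samples per second, e.g. 16000
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    first_item = waveform[0]            # first (often only) batch item
    channels = len(first_item)
    samples = len(first_item[0])
    return channels, samples, samples / sample_rate


# Stand-in input: one batch item, two channels, 32000 samples of silence.
example_audio = {
    "waveform": [[[0.0] * 32000, [0.0] * 32000]],
    "sample_rate": 16000,
}
channels, samples, duration = describe_audio(example_audio)
```

The helper relies only on `len()` and indexing, so it should behave the same way on a tensor with that layout as on the nested lists used here.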

model_size

The model_size parameter selects which Vosk model to use for phoneme extraction, trading speed against accuracy. The available options are typically "small" and "large": the "small" model is faster but less accurate, while the "large" model gives more precise results at the cost of longer processing time. The default is "small", which suits most general purposes; choose "large" for tasks that need higher accuracy.

language

The language parameter determines the language of the audio input for recognition purposes. It ensures that the phoneme extraction process is tailored to the specific phonetic characteristics of the language being processed. The default value is "en" for English, but other languages can be specified if supported by the Vosk models. This parameter is essential for accurate phoneme extraction, as different languages have distinct phonetic structures.
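
One way to picture how language and model_size combine is a lookup from the pair to a Vosk model name. The table below is a hypothetical sketch, not the node's actual mapping: the two English model names are published Vosk models, but which models the node really downloads is an assumption, and `resolve_model_name` is an illustrative helper.

```python
# Hypothetical lookup table; the node's real mapping is not shown on this page.
VOSK_MODELS = {
    ("en", "small"): "vosk-model-small-en-us-0.15",
    ("en", "large"): "vosk-model-en-us-0.22",
}


def resolve_model_name(language="en", model_size="small"):
    """Map (language, model_size) to a model name, or fail with a clear message."""
    key = (language.lower(), model_size.lower())
    if key not in VOSK_MODELS:
        supported = sorted({lang for lang, _ in VOSK_MODELS})
        raise ValueError(
            f"No model for language={language!r}, model_size={model_size!r}; "
            f"known languages: {supported}"
        )
    return VOSK_MODELS[key]
```

Failing early with the list of known languages makes a mismatch between the setting and the installed models easier to spot than a recognition result that is silently wrong.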

Audio To Phonemes Output Parameters:

phoneme_data_list

The phoneme_data_list is a list of dictionaries, each containing information about individual phonemes extracted from the audio. Each dictionary includes the start and end times of the phoneme within the audio, as well as the phoneme itself. This output is crucial for applications like lip-sync animation, where precise timing of phonemes is needed to synchronize mouth movements with speech. The phoneme data provides a detailed breakdown of the audio's phonetic content, enabling more accurate and realistic animations.
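
To show how such a list might drive an animation, here is a minimal sketch that assigns one phoneme label to each video frame. The key names "phoneme", "start", and "end", the phoneme symbols, and the `phonemes_to_frames` helper are all assumptions for illustration; check the node's actual output for the real keys.

```python
import math


def phonemes_to_frames(phoneme_data_list, audio_duration, fps=24, rest="rest"):
    """Return one label per frame; frames with no phoneme get the rest label."""
    frame_count = math.ceil(audio_duration * fps)
    frames = [rest] * frame_count
    for entry in phoneme_data_list:
        first = max(0, int(entry["start"] * fps))               # first frame touched
        last = min(frame_count, math.ceil(entry["end"] * fps))  # exclusive bound
        for i in range(first, last):
            frames[i] = entry["phoneme"]  # a later phoneme wins on a shared frame
    return frames


# Hypothetical data in the assumed shape.
example = [
    {"phoneme": "HH", "start": 0.0, "end": 0.25},
    {"phoneme": "AY", "start": 0.25, "end": 0.5},
]
frames = phonemes_to_frames(example, audio_duration=1.0, fps=4)
# frames == ["HH", "AY", "rest", "rest"]
```

A real lip-sync setup would usually map each phoneme to a mouth shape (viseme) after this step; that mapping is outside what this page describes.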

audio_duration

The audio_duration output represents the total duration of the audio input in seconds. This value is important for understanding the length of the audio segment being processed and can be used to ensure that the phoneme data aligns correctly with the overall audio timeline. It provides context for the timing information in the phoneme_data_list, helping to maintain synchronization between audio and visual elements.
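
As a sketch of using audio_duration as a sanity check, the function below flags phoneme entries whose timing does not fit the audio timeline. The "start" and "end" key names, the tolerance value, and the `find_timing_problems` helper are assumptions for illustration.

```python
def find_timing_problems(phoneme_data_list, audio_duration, tolerance=0.05):
    """Return (index, reason) pairs for entries that do not fit the timeline."""
    problems = []
    previous_end = 0.0
    for index, entry in enumerate(phoneme_data_list):
        start, end = entry["start"], entry["end"]
        if start < 0 or end < start:
            problems.append((index, "invalid interval"))
        if end > audio_duration + tolerance:
            problems.append((index, "ends after audio"))
        if start < previous_end - tolerance:
            problems.append((index, "overlaps previous"))
        previous_end = max(previous_end, end)
    return problems


# Hypothetical data: the second phoneme runs past a 1.0 second clip.
issues = find_timing_problems(
    [
        {"phoneme": "HH", "start": 0.0, "end": 0.2},
        {"phoneme": "AY", "start": 0.2, "end": 1.5},
    ],
    audio_duration=1.0,
)
# issues == [(1, "ends after audio")]
```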

Audio To Phonemes Usage Tips:

Ensure that the audio input is clear and of high quality to improve the accuracy of phoneme extraction. Background noise or low-quality recordings can affect the results.
Choose the appropriate model_size based on your needs. Use the "small" model for faster processing and the "large" model for more accurate phoneme recognition, especially in complex audio scenarios.
Verify that the language parameter matches the language of the audio input to ensure accurate phoneme extraction. Using the wrong language setting can lead to incorrect phoneme data.

Audio To Phonemes Common Errors and Solutions:

Model not found

Explanation: This error occurs when the specified Vosk model is not available or cannot be found on the system.
Solution: Ensure that the correct model is downloaded and available in the expected directory. Check the model URL and download path for any discrepancies.
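
A quick way to rule out a missing or half-finished download is to check the model folder before running the workflow. This sketch assumes models are unpacked as one folder per model name under some root directory; the directory layout and the `model_is_installed` helper are hypothetical, since this page does not state where the node stores its models.

```python
from pathlib import Path


def model_is_installed(models_root, model_name):
    """True if the model folder exists and contains at least one entry."""
    model_dir = Path(models_root) / model_name
    return model_dir.is_dir() and any(model_dir.iterdir())
```

An empty folder is treated as not installed, which catches downloads that created the directory but failed before extracting any files.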

Unsupported language

Explanation: This error arises when the specified language is not supported by the available Vosk models.
Solution: Verify that the language parameter is set to a supported language. If necessary, download the appropriate language model from the Vosk website.

Audio processing error

Explanation: This error can occur if there is an issue with the audio input, such as an unsupported format or corrupted file.
Solution: Check the audio file for any issues and ensure it is in a supported format. Re-record or convert the audio if necessary to resolve the problem.
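
If the problem is the audio format, converting to 16-bit mono PCM is a common fix, since that is the format Vosk's recognizer examples expect. The sketch below downmixes float channels and packs them as little-endian 16-bit samples. It is an illustration under stated assumptions, not the node's own conversion code: `to_mono_pcm16` is a hypothetical helper, input samples are assumed to be floats in [-1.0, 1.0], and resampling is not covered.

```python
import struct


def to_mono_pcm16(channels):
    """Downmix equal-length float channels to little-endian 16-bit PCM bytes."""
    if not channels or not channels[0]:
        raise ValueError("audio is empty")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("channels have different lengths")
    ints = []
    for i in range(length):
        mixed = sum(ch[i] for ch in channels) / len(channels)  # average channels
        clipped = max(-1.0, min(1.0, mixed))                    # guard against overflow
        ints.append(int(round(clipped * 32767)))
    return struct.pack(f"<{length}h", *ints)


pcm = to_mono_pcm16([[0.0, 1.0, -1.0], [0.0, 1.0, -1.0]])
```

Each output sample takes two bytes, so the three-sample example yields six bytes.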

Audio To Phonemes Related Nodes

Go back to the extension to check out more related nodes.

TrentNodes
