RunComfy

ComfyUI Node: AudioX Multi-Modal Generation

Class Name: AudioXMultiModalGeneration
Category: AudioX/Generation
Author: lum3on (Account age: 314 days)
Extension: ComfyUI-AudioX
Last Updated: 2025-06-24
GitHub Stars: 0.04K

Table of Contents

Description
AudioXMultiModalGeneration:
AudioXMultiModalGeneration Input Parameters:
AudioXMultiModalGeneration Output Parameters:
AudioXMultiModalGeneration Usage Tips:
AudioXMultiModalGeneration Common Errors and Solutions:
Related Nodes

How to Install ComfyUI-AudioX

Install this extension via the ComfyUI Manager by searching for ComfyUI-AudioX:

1. Click the Manager button in the main menu.
2. Select the Custom Nodes Manager button.
3. Enter ComfyUI-AudioX in the search bar.

After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

AudioX Multi-Modal Generation Description

Facilitates multi-modal audio generation with the AudioX framework.

The AudioXMultiModalGeneration node generates audio with the AudioX framework. Generation is guided by a text prompt and can additionally be conditioned on optional video, image, or audio inputs, so the output can be matched to visual or sonic reference material. This makes the node useful for artists working at the intersection of audio and visual media.

AudioX Multi-Modal Generation Input Parameters:

model

The model parameter specifies the AudioX model to be used for audio generation. This model serves as the backbone of the audio synthesis process, determining the quality and characteristics of the generated audio. It is crucial to select an appropriate model that aligns with your creative goals.

text_prompt

The text_prompt parameter is a string input that describes the audio to generate. The prompt guides the model toward audio that matches the specified theme or concept. The default value is "Generate audio", and multiline input is supported for more detailed descriptions.

steps

The steps parameter defines the number of diffusion steps to be used in the audio generation process. It influences the refinement and quality of the generated audio, with a higher number of steps generally leading to more detailed outputs. The default value is 250, with a range from 1 to 1000.

cfg_scale

The cfg_scale parameter is a float that controls the guidance scale during audio generation. It affects how closely the generated audio adheres to the input prompt, with higher values resulting in outputs that more closely match the prompt. The default value is 7.0, with a range from 0.1 to 20.0.
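
For intuition, a guidance scale of this kind is commonly applied as classifier-free guidance, where the sampler blends an unconditional prediction with a prompt-conditioned one. The sketch below shows that standard formula on plain Python lists. Whether AudioX applies it in exactly this form is an assumption, and `apply_cfg` is a hypothetical helper, not part of the node.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Blend unconditional and conditional predictions element-wise.

    Assumption: this is the usual classifier-free guidance formula used by
    diffusion samplers, not code taken from AudioX.
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # prediction without the prompt
cond = [1.0, 3.0]    # prediction with the prompt

# A scale of 1.0 reproduces the conditional prediction unchanged.
print(apply_cfg(uncond, cond, 1.0))
# Larger scales, such as the default 7.0, push further in the prompt's direction.
print(apply_cfg(uncond, cond, 7.0))
```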

seed

The seed parameter is an integer used to initialize the random number generator for the audio generation process. It makes results reproducible: the same seed combined with the same other parameters yields the same audio output. The default value is -1, which indicates a random seed, with a range from -1 to 2^32.
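
One way to handle the -1 convention is to resolve it to a concrete seed before generation and record that value, so that a "random" run can still be reproduced later. The helper below is a minimal sketch under that assumption. `resolve_seed` and `MAX_SEED` are hypothetical names, and the node's actual seed handling may differ.

```python
import random

MAX_SEED = 2**32  # upper bound as stated in the parameter description


def resolve_seed(seed: int) -> int:
    """Return a concrete seed, drawing a random one when seed is -1."""
    if seed == -1:
        # Draw within the unsigned 32-bit range, which lies inside the documented bounds.
        return random.randint(0, MAX_SEED - 1)
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be -1 or in [0, {MAX_SEED}], got {seed}")
    return seed


used_seed = resolve_seed(-1)
print(f"seed used for this run: {used_seed}")  # log it to reproduce the output later
```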

duration_seconds

The duration_seconds parameter specifies the length of the generated audio in seconds. It determines the total duration of the audio output, allowing you to control the length of the generated content. The default value is 10.0 seconds, with a range from 1.0 to 30.0 seconds.

video

The video parameter is an optional input that allows you to provide a video file for conditioning the audio generation process. This input can be used to create audio that complements or enhances the visual content of the video.

image

The image parameter is an optional input that allows you to provide an image for conditioning the audio generation process. This input can be used to generate audio that is inspired by or related to the visual elements of the image.

audio

The audio parameter is an optional input that allows you to provide an existing audio file for conditioning the generation process. This input can be used to influence the style or characteristics of the generated audio based on the provided audio sample.
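
The parameters above map naturally onto the `INPUT_TYPES` convention that ComfyUI custom nodes use to declare their inputs. The following is a hypothetical reconstruction built only from the documented defaults and ranges. It is not the extension's source code: the `AUDIOX_MODEL` type name, the use of `IMAGE` for video frames, and the `generate` method name are all assumptions.

```python
class AudioXMultiModalGenerationSketch:
    """Interface sketch derived from the documented parameters (not the real node)."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model": ("AUDIOX_MODEL",),  # assumed custom type name
                "text_prompt": ("STRING", {"default": "Generate audio", "multiline": True}),
                "steps": ("INT", {"default": 250, "min": 1, "max": 1000}),
                "cfg_scale": ("FLOAT", {"default": 7.0, "min": 0.1, "max": 20.0}),
                "seed": ("INT", {"default": -1, "min": -1, "max": 2**32}),
                "duration_seconds": ("FLOAT", {"default": 10.0, "min": 1.0, "max": 30.0}),
            },
            "optional": {
                "video": ("IMAGE",),  # assumption: video supplied as a batch of frames
                "image": ("IMAGE",),
                "audio": ("AUDIO",),
            },
        }

    RETURN_TYPES = ("AUDIO",)
    RETURN_NAMES = ("audio",)
    FUNCTION = "generate"  # assumed method name
    CATEGORY = "AudioX/Generation"

    def generate(self, model, text_prompt, steps, cfg_scale, seed,
                 duration_seconds, video=None, image=None, audio=None):
        # Placeholder only: the real node runs the AudioX model here.
        raise NotImplementedError("interface sketch only")


spec = AudioXMultiModalGenerationSketch.INPUT_TYPES()
print(sorted(spec["required"]))
print(sorted(spec["optional"]))
```

Declaring the three conditioning inputs under "optional" matches the description: only the model and the generation settings have to be connected for the node to run.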

AudioX Multi-Modal Generation Output Parameters:

audio

The audio output parameter represents the generated audio content produced by the node. This audio output is the result of the multi-modal generation process, conditioned on the provided inputs such as text, video, image, or audio. It serves as the final product of the node's operation, ready for use in creative projects or further processing.

AudioX Multi-Modal Generation Usage Tips:

Experiment with different text_prompt inputs to explore a wide range of audio outputs. Detailed, descriptive prompts tend to produce more nuanced and contextually rich audio.
Adjust the steps and cfg_scale parameters to balance audio quality against adherence to the input prompt. More steps generally refine the output but increase processing time, while a higher cfg_scale makes the output follow the prompt more closely.
Use the seed parameter to reproduce specific audio outputs, which is useful for iterative creative processes or when sharing results with collaborators.
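
A simple way to combine these tips is to hold the seed fixed and vary only steps and cfg_scale, so that differences between outputs come from the settings rather than from randomness. The sketch below only builds the list of settings to try. `build_sweep` is a hypothetical helper, and how each entry is fed to the node (manually or through a scripted workflow) is left open.

```python
from itertools import product


def build_sweep(steps_values, cfg_values, seed=1234):
    """Return one settings dict per (steps, cfg_scale) pair, all sharing one seed."""
    return [
        {"steps": steps, "cfg_scale": cfg, "seed": seed}
        for steps, cfg in product(steps_values, cfg_values)
    ]


# Four combinations, each differing from the others only in steps or cfg_scale.
for settings in build_sweep([100, 250], [4.0, 7.0]):
    print(settings)
```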

AudioX Multi-Modal Generation Common Errors and Solutions:

Invalid model configuration

Explanation: This error occurs when the specified model configuration is not compatible with the node's requirements.
Solution: Ensure that the model provided is correctly configured and compatible with the AudioX framework. Verify that all necessary components and settings are in place.

Text prompt too vague

Explanation: A vague or insufficient text prompt can lead to unsatisfactory audio outputs.
Solution: Provide a more detailed and specific text prompt to guide the audio generation process effectively. Consider including descriptive elements or themes to enhance the prompt's clarity.

Duration exceeds maximum limit

Explanation: The specified duration for audio generation exceeds the maximum allowed limit.
Solution: Adjust the duration_seconds parameter to a value within the allowed range of 1.0 to 30.0 seconds.
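
When parameter values come from a script rather than the node's UI widgets, out-of-range values can be caught before they reach the node. The following is a minimal sketch that uses the ranges documented on this page. `RANGES`, `check_ranges`, and `clamp_duration` are hypothetical names and not part of ComfyUI-AudioX.

```python
# Ranges copied from the parameter descriptions on this page.
RANGES = {
    "steps": (1, 1000),
    "cfg_scale": (0.1, 20.0),
    "duration_seconds": (1.0, 30.0),
}


def check_ranges(**params):
    """Raise ValueError for the first parameter outside its documented range."""
    for name, value in params.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the allowed range {low} to {high}")


def clamp_duration(duration_seconds):
    """Pull a requested duration back into the allowed 1.0 to 30.0 second range."""
    low, high = RANGES["duration_seconds"]
    return max(low, min(high, duration_seconds))


check_ranges(steps=250, cfg_scale=7.0, duration_seconds=10.0)  # passes silently
print(clamp_duration(45.0))  # a 45-second request is reduced to the maximum
```

Raising an error is the safer default when a silent change would be surprising; clamping is convenient in batch scripts where an approximate duration is acceptable.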

AudioX Multi-Modal Generation Related Nodes

Go back to the extension to check out more related nodes.

ComfyUI-AudioX


