daVinci-MagiHuman Workflow in ComfyUI | Audio-Video Human Synthesis

ComfyUI daVinci-MagiHuman Workflow

Want to run this workflow?

Fully operational workflows
No missing nodes or models
No manual setups required
Features stunning visuals

ComfyUI daVinci-MagiHuman Examples

daVinci-MagiHuman talking digital human workflow for ComfyUI#

This ComfyUI workflow builds a full text-to-video pipeline around daVinci-MagiHuman to generate realistic talking digital humans with synchronized speech, lip movement, expression and body micro‑motion. It is designed for creators who want a fast, single-click path from a descriptive prompt to an MP4 with clean audio. The graph can animate a freshly generated portrait or any supplied reference image, then renders video and speech together, finishing with optional upscaling and automatic audio loudness normalization.

The daVinci-MagiHuman core uses a single-stream Transformer to co-generate video and audio from one prompt, which helps preserve timing and lip-sync fidelity even on short clips. This ComfyUI implementation keeps the controls simple: write an Image Prompt to define the look, a Video Prompt to define performance and dialogue, set the clip Duration, and run.

Key models in ComfyUI daVinci-MagiHuman workflow#

daVinci-MagiHuman (15B single‑stream audio‑video generator). Role: jointly produces video frames and speech from text while maintaining temporal consistency and lip‑sync. References: GitHub, arXiv, Hugging Face.
T5Gemma 9B encoder (UL2‑adapted). Role: encodes the Video Prompt into rich conditioning that steers motion, delivery and style for daVinci‑MagiHuman. Reference: Hugging Face.
Z‑Image Turbo diffusion model. Role: quickly generates a high‑quality still portrait from the Image Prompt to use as identity/reference for animation. References: Hugging Face (z_image_turbo), Hugging Face (z_image).
Qwen 3 4B text encoder for Z‑Image Turbo. Role: parses the Image Prompt to guide portrait generation. Reference: Hugging Face file.
Wan 2.2 VAE. Role: decodes MagiHuman video latents to RGB frames with strong temporal consistency. References: GitHub, Hugging Face example model.
Audio VAE (sd_audio). Role: decodes MagiHuman audio latents to a speech waveform for muxing with the final video. Reference: custom node bundle for MagiHuman GitHub.
RTX Video Super Resolution (optional). Role: post‑upscales decoded frames to increase perceived sharpness and reduce compression artifacts before final encode. Reference: ComfyUI wrapper GitHub.

How to use ComfyUI daVinci-MagiHuman workflow#

Overall flow: the Z‑Image Turbo group creates an identity portrait from your Image Prompt. The MagiHuman Models group loads the daVinci‑MagiHuman checkpoint, video VAE and audio VAE, and prepares the text encoder. The Text Prompt group turns your Video Prompt into conditioning. The Sampling group fuses the reference image and prompt into joint video and audio latents, then decodes both. Finally, the Outputs stage muxes frames with audio into MP4, with an optional upscaled version.

Inputs#

Use the Image Prompt and Video Prompt text boxes to describe look and performance. The Duration control sets the clip length in seconds. An audio loader is present for convenience if you plan to experiment with audio‑driven variants, but this template runs in text‑driven mode by default.

ZImage Turbo#

This stage renders a single reference portrait from the Image Prompt using the Z‑Image Turbo UNet with the Qwen 3 4B text encoder and its bundled VAE. It is optimized for fast, clean identity generation with cinematic looks. The result is previewed, then forwarded as the reference image for animation. If you already have a headshot, you can bypass this by wiring your image directly to the animation stage.

MagiHuman Models#

Here the graph loads the daVinci‑MagiHuman base or distilled checkpoint along with the Wan 2.2 video VAE, the audio VAE, and the T5Gemma encoder. This keeps text encoding, video latents and audio latents aligned for single‑stream sampling. You can swap weights if you have alternatives available in your environment.

Text Prompt#

Your Video Prompt is encoded into positive and negative conditioning. Positive text should describe camera distance, pose, language, delivery style and the exact dialogue content. Negative text can list visual or audio defects to avoid. The encoder feeds both sets of conditioning into the sampler to shape motion, lip dynamics and timbre.

Sampling#

The sampler builds an initial latent sequence from the reference image and the requested Duration, then performs denoising with daVinci‑MagiHuman to produce synchronized video and audio latents. A utility converts Duration to whole seconds for stable scheduling. When sampling completes, the video latents go to the video decoder and the audio latents go to the audio decoder.

Decode, loudness and export#

Video latents are decoded with the Wan 2.2 VAE to image frames. Audio latents are decoded to speech, then normalized to broadcast‑friendly loudness so the final MP4 plays back consistently across devices. Two exports are produced: a base render and an optional upscaled render using RTX Video Super Resolution. Both are muxed to MP4 with audio and saved with clear filename prefixes.

Key nodes in ComfyUI daVinci-MagiHuman workflow#

MagiHuman_LATENTS (#13)

Builds the joint latent canvas for video and optional audio, taking the reference image and clip length. Adjust seconds to set duration and ensure your reference image is well framed for the motion you describe. Higher base resolution helps facial fidelity but also increases VRAM and decode time.

MagiHuman_SM_ENCODER (#95)

Encodes the Video Prompt into positive and negative conditioning for the sampler. Put the exact spoken line in quotes and name the language to improve lip closure and timing. Use the negative field to suppress artifacts like “subtitles,” “static,” or “room echo.”

MagiHuman_SM_KSampler (#9)

Runs daVinci‑MagiHuman denoising to co‑generate video and speech latents. The seed controls reproducibility, while steps and the internal schedule trade speed for detail and motion stability. For variation without losing identity, change seed or lightly rephrase the performance portion of your prompt.

MagiHuman_EN_DECO_VIDEO (#5)

Decodes video latents with the Wan 2.2 VAE into RGB frames for export or upscaling. Use this path for the fastest end‑to‑end render; long clips or higher resolutions will linearly increase decode time.

MagiHuman_DECO_AUDIO (#6)

Decodes audio latents to waveform and sends them through loudness normalization for even playback. If you later switch to audio‑driven generation, route your external audio into the latent builder and keep this decode path for final muxing.

RTXVideoSuperResolution (#93)

Optional post‑upscaler that sharpens edges and reduces ringing. Use moderate strength to improve clarity without introducing temporal shimmer.

Optional extras#

Prompting pattern for reliable lip‑sync: include a speaker tag and language plus a quoted line, for example Dialogue: <Presenter, English>: "Welcome to the show." Add a brief note about delivery, shot size and camera stability.
Keep the reference portrait as a medium close‑up with the head fully inside frame. Tight crops leave little room for jaw and cheek dynamics.
If you need stricter timing, trim or extend your script to match the chosen Duration. Very long sentences in very short clips can force unnatural articulation.
This template runs in prompt‑only mode. For audio‑driven tests, connect an external audio file to the audio input on MagiHuman_LATENTS (#13) and adjust your Video Prompt to describe expression rather than spoken content.

Acknowledgements#

This workflow implements and builds upon the following works and resources. We gratefully acknowledge daVinci-MagiHuman for the daVinci-MagiHuman Workflow Source for their contributions and maintenance. For authoritative details, please refer to the original documentation and repositories linked below.

Resources#

daVinci-MagiHuman/Workflow Source
- Docs / Release Notes: daVinci-MagiHuman Workflow Source

Note: Use of the referenced models, datasets, and code is subject to the respective licenses and terms provided by their authors and maintainers.

Want More ComfyUI Workflows?

LTX 2.3 ID-LoRA | Talking Avatar Generator

Creates true-to-life talking avatars with synced voice and visuals.

DreamID-Omni | Photo to Talking Video Maker

Turns photos into ultra-real talking videos in seconds.

InfiniteTalk | Lip-Synced Avatar Generator

Photo + Voice = Perfectly Synced Talking Avatar in Minutes

MultiTalk | Photo to Talking Video

Millisecond lip sync + Wan2.1 = 15s ultra-detailed talking videos!

Sonic | Lip-Sync Portrait Animation

Sonic delivers advanced audio-driven lip-sync for portraits with high-quality animation.

Wan 2.1 LoRA

Enhance Wan 2.1 video generation with LoRA models for improved style and customization.

FLUX Img2Img | Merge Visuals and Prompts

Merge visuals and prompts for stunning, enhanced results.

Trellis | Image to 3D

Trellis is an advanced Image-to-3D model for high-quality 3D assets generation.

Support

Resources

Legal

RunComfy

RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Models, enabling artists to harness the latest AI tools to create incredible art.

daVinci-MagiHuman | Realistic Talking Human Generator