Omnihuman 1.5 transforms how you create realistic avatars through advanced multimodal generation.
Omnihuman 1.5 is a high-fidelity image-to-video and audio-to-video model that converts a single portrait and a voice track into a realistic, lip-synced avatar performance. Built by ByteDance (Omnihuman Lab), it fuses a multimodal LLM with a Diffusion Transformer to preserve facial structure, identity, and continuity while reacting to audio semantics. Components such as the “Pseudo Last Frame” and shared attention stabilize character features across frames and align motion with rhythm, prosody, and intent. Omnihuman 1.5 (also referred to as Omnihuman v1.5) adapts to narrative, dialogue, and singing, and can follow optional text guidance for camera and gesture control without sacrificing realism.
Start by providing a clear portrait image_url and an audio_url whose track is under 35 seconds; add an optional prompt to specify camera movement, gesture emphasis, or emotion. Use mask_url to mark the speaking subject if multiple faces are present. Keep guidance concise and unambiguous (e.g., “preserve identity, subtle head motion”). Omnihuman 1.5 interprets audio semantics first, then applies prompt-level refinements, yielding a coherent performance that aligns lip, gaze, and gesture with the track. Prompts can be written in Chinese, English, Japanese, Korean, Mexican Spanish, or Indonesian.
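As a concrete sketch of the inputs described above, the following Python call posts a portrait, an audio track, and an optional prompt to a hypothetical REST endpoint. The endpoint URL and auth header are placeholders (no official API URL is given here); only the parameter names (image_url, audio_url, prompt, mask_url) come from the description above.

```python
import os

import requests

# Minimal request sketch, assuming a generic HTTPS endpoint. The URL and
# auth scheme below are placeholders, not a documented API; the parameter
# names (image_url, audio_url, prompt, mask_url) follow the description above.
API_URL = "https://example.com/v1/omnihuman-1.5/generate"  # placeholder endpoint

payload = {
    "image_url": "https://example.com/portrait.png",  # clear single-face portrait
    "audio_url": "https://example.com/voice.mp3",     # keep the track under 35 seconds
    "prompt": "preserve identity, subtle head motion, slow push-in camera",
    # "mask_url": "https://example.com/mask.png",     # only needed when multiple faces appear
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ.get('API_KEY', '')}"},  # placeholder auth
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # response shape depends on the provider
```

For single-subject portraits, mask_url can simply be omitted; include it only to disambiguate the speaker when several faces are present.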
Omnihuman 1.5 is a multimodal avatar generation model that transforms a single image and an audio track into a realistic video using its image-to-video and audio-to-video capabilities. It produces expressive lip-sync, gestures, and motion aligned with the input speech and emotion.
Compared to its predecessor, Omnihuman 1.5 offers smoother motion transitions, longer-duration output, and higher emotional accuracy in image-to-video generation. Its improved architecture pairs a deliberative “System 2” planner with a fast “System 1” renderer, producing more natural results.
Omnihuman 1.5 is ideal for creators, educators, and developers needing digital humans for media, marketing, or storytelling. Its audio-to-video capabilities allow users to create talking avatars, virtual presenters, and multi-character scenes directly from simple inputs.
Omnihuman 1.5 is not entirely free. Users receive limited free credits upon registration in Runcomfy’s playground, after which additional usage requires paid credits. Generation costs roughly USD 0.14–0.16 per second of output, depending on the chosen plan.
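As a rough back-of-envelope check, the quoted per-second range translates directly into clip cost. The helper below is an illustrative sketch, not a billing API; the default rates are simply the figures quoted above.

```python
# Illustrative cost estimate based on the quoted USD 0.14-0.16 per generated second.
# These rates come from the pricing note above; actual billing depends on the plan.
def estimate_cost(duration_s: float,
                  rate_low: float = 0.14,
                  rate_high: float = 0.16) -> tuple[float, float]:
    """Return the (low, high) USD cost range for a clip of duration_s seconds."""
    return duration_s * rate_low, duration_s * rate_high

low, high = estimate_cost(30)         # a 30-second avatar clip
print(f"~${low:.2f} to ${high:.2f}")  # ~$4.20 to $4.80
```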
Omnihuman 1.5’s audio-to-video outputs reach film-grade realism, offering natural lip-syncing, expressive body language, and stable identity retention across long clips. High-quality image and audio inputs are recommended for the best results.
You can access Omnihuman 1.5 through Runcomfy’s AI Playground website. After signing up, users can upload an image and audio file to experience its real-time image-to-video and audio-to-video generation directly from a browser.
While Omnihuman 1.5 is powerful, it requires a stable internet connection and sufficient credits to run long video tasks. Image-to-video and audio-to-video generation demands high-quality source data to maintain realism, and custom fine-tuning options are not yet open-source.
Omnihuman 1.5 stands out for its cognitive dual-system design, which improves semantic alignment and emotional detail versus many competing image-to-video or audio-to-video models. It also supports multi-character and stylized scenes, offering broader creative versatility.
Omnihuman 1.5 accepts standard image formats such as JPG or PNG and audio tracks in MP3 or WAV for image-to-video and audio-to-video conversion. Optional text prompts can guide gestures and camera angles for more personalized results.
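A small pre-flight check can catch format mistakes before uploading. The sketch below mirrors the accepted formats listed above; the helper name and extension sets are illustrative, not part of any provider SDK.

```python
from pathlib import Path

# Illustrative pre-flight check mirroring the accepted formats listed above;
# the helper name and extension sets are not part of any provider SDK.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
AUDIO_EXTS = {".mp3", ".wav"}

def validate_inputs(image_path: str, audio_path: str) -> None:
    """Raise ValueError if either file uses an unsupported extension."""
    if Path(image_path).suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"Unsupported image format: {image_path} (use JPG or PNG)")
    if Path(audio_path).suffix.lower() not in AUDIO_EXTS:
        raise ValueError(f"Unsupported audio format: {audio_path} (use MP3 or WAV)")

validate_inputs("portrait.png", "voice.mp3")  # passes silently for valid formats
```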