Specialized node for analyzing video content to detect and time mouth movements for text-to-speech synchronization.
The MouthMovementAnalyzer node analyzes video content to detect and time mouth movements, which is crucial for synchronizing text-to-speech (TTS) output with on-screen speakers. It uses computer vision techniques to identify when and for how long a speaker's mouth is moving, so that generated audio can be aligned with the visual cues in the video. The node supports multiple computer vision providers, giving it flexibility across different hardware and software environments. By producing detailed timing data, the MouthMovementAnalyzer improves the realism and effectiveness of TTS pipelines, making it a valuable tool for AI artists and developers working on projects that require precise audio-visual synchronization.
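To make the approach concrete, here is a minimal, hypothetical Python sketch of how per-frame mouth openness can be measured with MediaPipe Face Mesh landmarks. This is not the node's actual implementation; the landmark indices and the openness ratio are standard MediaPipe conventions used purely for illustration.

```python
# Illustrative sketch only; the node's real detection pipeline may differ.
import cv2
import mediapipe as mp

def mouth_openness_per_frame(video_path: str):
    """Return (frame_index, openness_ratio) pairs for each frame with a detected face."""
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True)
    cap = cv2.VideoCapture(video_path)
    ratios, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_face_landmarks:
            lm = results.multi_face_landmarks[0].landmark
            # Inner-lip landmarks 13 (upper) and 14 (lower); mouth corners 61 and 291.
            vertical = abs(lm[13].y - lm[14].y)
            horizontal = max(abs(lm[61].x - lm[291].x), 1e-6)
            ratios.append((frame_idx, vertical / horizontal))
        frame_idx += 1
    cap.release()
    face_mesh.close()
    return ratios
```

A frame whose openness ratio rises above a threshold (derived from the sensitivity setting described below) would then be treated as a mouth-movement frame.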
The video parameter is the input video file that the node will analyze to detect mouth movements. This parameter is essential as it provides the visual data necessary for the analysis process. The video should be in a compatible format and of sufficient quality to ensure accurate detection of mouth movements.
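Because the timing outputs are expressed relative to the video's frame rate, it can help to confirm that the file opens and to note its basic properties before analysis. The helper below uses OpenCV and is an illustrative sketch, not part of the node.

```python
import cv2

def probe_video(video_path: str) -> dict:
    """Sanity-check that the input video opens and report basic properties."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Could not open video: {video_path}")
    info = {
        "fps": cap.get(cv2.CAP_PROP_FPS),
        "frames": int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
        "size": (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                 int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))),
    }
    cap.release()
    return info
```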
The provider parameter specifies the computer vision provider used for mouth movement detection. It offers several options, including MediaPipe, OpenSeeFace, and dlib. MediaPipe is preferred for its speed and accuracy, although it is incompatible with Python 3.13. OpenSeeFace is an alternative for newer Python versions but may be less accurate. dlib is a lightweight option that does not rely on machine learning dependencies and is expected to be available soon. The default provider is set based on the user's environment, and the choice of provider can significantly impact the performance and accuracy of the analysis.
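The node's actual auto-selection logic is internal, but a hypothetical fallback along the lines described above might look like the sketch below; the function name and the order of checks are assumptions made for illustration.

```python
import sys

def pick_provider(requested: str = "auto") -> str:
    """Hypothetical provider auto-selection mirroring the notes above."""
    if requested != "auto":
        return requested
    if sys.version_info < (3, 13):
        try:
            import mediapipe  # noqa: F401  # preferred: fastest and most accurate
            return "MediaPipe"
        except ImportError:
            pass
    # On Python 3.13+ (or without MediaPipe installed), fall back to OpenSeeFace;
    # dlib would become a further lightweight fallback once supported.
    return "OpenSeeFace"
```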
The sensitivity parameter controls the detection sensitivity of mouth movements, with a range from 0.05 to 1.0. This parameter uses exponential scaling to provide fine control, especially at higher values. Lower values (0.05-0.2) detect only obvious movements, while higher values (0.9-1.0) capture ultra-sensitive movements, including whispers and micro-movements. The default value is 1.0, and users are encouraged to start with 0.5 and fine-tune in 0.01 increments to achieve the desired balance between sensitivity and accuracy.
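As an illustration of what exponential scaling can mean here, the sketch below maps a sensitivity value to a hypothetical mouth-openness threshold, where higher sensitivity lowers the threshold exponentially so that small movements are still detected. The constants and function name are illustrative, not the node's actual values.

```python
import math

def sensitivity_to_threshold(sensitivity: float,
                             base_threshold: float = 0.12,
                             min_threshold: float = 0.005) -> float:
    """Map sensitivity (0.05-1.0) to an openness threshold, exponentially.

    base_threshold applies at the least sensitive setting (0.05) and
    min_threshold at full sensitivity (1.0); both are illustrative numbers.
    """
    sensitivity = max(0.05, min(1.0, sensitivity))
    t = (sensitivity - 0.05) / 0.95  # normalize to [0, 1]
    # Exponential interpolation: t=0 -> base_threshold, t=1 -> min_threshold.
    return base_threshold * math.exp(t * math.log(min_threshold / base_threshold))
```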
The timing_data output provides detailed information about the timing of detected mouth movements within the video. This data is crucial for synchronizing TTS audio with the visual content, ensuring that speech aligns with the speaker's mouth movements.
The movement_frames output lists the specific frames in the video where mouth movements were detected. This information helps in pinpointing the exact moments of speech, allowing for precise editing and synchronization tasks.
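To show how the timing data and movement frames relate, the sketch below groups consecutive detected frame indices into (start, end) segments in seconds using the video's frame rate. The actual structure of timing_data produced by the node may differ; this is only a plausible reconstruction.

```python
def frames_to_segments(movement_frames, fps):
    """Group consecutive movement frame indices into (start_sec, end_sec) segments."""
    segments, start, prev = [], None, None
    for f in sorted(movement_frames):
        if start is None:
            start = prev = f
        elif f == prev + 1:
            prev = f
        else:
            segments.append((start / fps, (prev + 1) / fps))
            start = prev = f
    if start is not None:
        segments.append((start / fps, (prev + 1) / fps))
    return segments

# Example: frames 10-14 and 30-31 at 25 fps -> [(0.4, 0.6), (1.2, 1.28)]
print(frames_to_segments([10, 11, 12, 13, 14, 30, 31], fps=25))
```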
The confidence_scores output offers a measure of the confidence level for each detected mouth movement. These scores help users assess the reliability of the detection results and make informed decisions about further processing or adjustments.
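A straightforward use of the confidence scores is to drop low-confidence detections before synchronization; the 0.6 cutoff below is an illustrative value, not a recommendation from the node's documentation.

```python
def filter_by_confidence(movement_frames, confidence_scores, min_confidence=0.6):
    """Keep only the detections whose confidence meets the chosen cutoff."""
    return [frame for frame, score in zip(movement_frames, confidence_scores)
            if score >= min_confidence]
```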
The preview_path output provides a path to a preview of the analyzed video, highlighting the detected mouth movements. This visual representation aids in verifying the accuracy of the analysis and making any necessary adjustments.