Qwen3-TTS Audio Compare:
The Qwen3AudioCompare node compares two audio samples on three measures: speaker similarity, mel spectrogram distance, and speaking rate. It is aimed at AI artists and developers working with text-to-speech (TTS) systems who need to check how closely a generated sample matches a reference in voice characteristics and pacing. Using a speaker encoder model, the node computes a speaker similarity score that indicates how well the generated voice matches the reference voice. It also computes the mel spectrogram distance to assess acoustic similarity, and compares speaking rates to check that the generated audio keeps a natural pace. The node outputs a text report containing these metrics.
Qwen3-TTS Audio Compare Input Parameters:
reference_audio
The reference_audio parameter is the original audio sample that serves as the benchmark for comparison. It is crucial for determining the baseline characteristics of the speaker's voice, which the generated audio will be compared against. This parameter should be provided in the ComfyUI audio format, which includes a waveform and a sample rate. The quality and clarity of the reference audio can significantly impact the accuracy of the comparison results.
generated_audio
The generated_audio parameter is the audio sample produced by a TTS system that you wish to evaluate. Like the reference audio, it should be in the ComfyUI audio format. This parameter is analyzed against the reference audio to determine how closely it matches in terms of speaker similarity, acoustic features, and speaking rate. Ensuring that the generated audio is of high quality will help in obtaining more reliable comparison results.
speaker_encoder_model
The speaker_encoder_model parameter specifies the model used to extract speaker embeddings from the audio samples. This model plays a critical role in calculating the speaker similarity score, which measures how closely the generated voice matches the reference voice. The choice of speaker encoder model can affect the sensitivity and accuracy of the similarity assessment.
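To make the idea of a speaker similarity score concrete, here is a minimal sketch that compares two embedding vectors with cosine similarity. This is an assumption for illustration: the source does not state which metric the node uses, and the function name and the toy 3-dimensional vectors are hypothetical stand-ins for real speaker embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: cosine similarity is a common way to compare speaker
    embeddings; the node's actual metric is not documented here.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings from a speaker encoder.
reference_embedding = [0.2, 0.7, 0.1]
generated_embedding = [0.25, 0.65, 0.05]
score = cosine_similarity(reference_embedding, generated_embedding)
print(f"speaker similarity (cosine): {score:.3f}")
```

With this metric, values near 1 mean the two embeddings point in nearly the same direction, and values near 0 mean they are unrelated. How the node maps its own score to a quality rating is described in its report, not here.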
local_model_path
The local_model_path parameter is an optional path to a locally stored speaker encoder model. If provided, the node will use this model instead of a default or pre-loaded model. This allows for flexibility in using custom or specialized models that may better suit specific use cases or provide improved performance for certain types of voices.
Qwen3-TTS Audio Compare Output Parameters:
report
The report output parameter is a comprehensive text report that summarizes the results of the audio comparison. It includes the speaker similarity score, mel spectrogram distance, speaking rate ratio, and an overall rating of the voice match quality. The report also provides an interpretation guide to help you understand the significance of the scores and metrics, making it easier to assess the performance of TTS systems and make informed decisions about potential improvements.
Qwen3-TTS Audio Compare Usage Tips:
- Ensure that both the reference and generated audio samples are of high quality and free from noise to improve the accuracy of the comparison results.
- Use a speaker encoder model that is well-suited to the type of voices you are working with, as this can significantly impact the speaker similarity score.
- Consider using the local_model_path parameter to experiment with different speaker encoder models and find the one that provides the best results for your specific application.
Qwen3-TTS Audio Compare Common Errors and Solutions:
"Speaker encoder model not found"
- Explanation: This error occurs when the specified speaker encoder model cannot be located or loaded.
- Solution: Ensure that the speaker_encoder_model parameter is correctly specified and that the model file is accessible. If using a local model, verify that the local_model_path is correct.
"Mismatch in sample rates"
- Explanation: This error indicates that the reference and generated audio samples have different sample rates, which can affect the comparison.
- Solution: Ensure both audio samples have the same sample rate before inputting them into the node. You may need to resample one of the audio files to match the other's sample rate.
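The resampling step mentioned in the solution above can be sketched as follows. This is a naive linear-interpolation resampler over a plain list of samples, written only to show the idea; the function name is hypothetical, and it applies no low-pass filtering, so downsampling with it can introduce aliasing. For real audio, a dedicated resampling library is the better choice.

```python
from typing import List


def resample_linear(samples: List[float], orig_rate: int, target_rate: int) -> List[float]:
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sample rates must be positive")
    if orig_rate == target_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples needed to cover the same duration.
    out_len = max(1, round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(out_len):
        pos = min(i * step, last)   # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Upsample a short ramp from 4 Hz to 8 Hz so both inputs share one rate.
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], orig_rate=4, target_rate=8)
```

Resample whichever sample is more convenient, then pass both to the node at the same rate.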
"Invalid audio format"
- Explanation: This error suggests that the audio inputs are not in the expected ComfyUI format.
- Solution: Verify that both reference_audio and generated_audio are provided in the correct format, including a waveform tensor and a sample rate integer.
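A quick pre-check of the audio inputs might look like the sketch below. It assumes the usual ComfyUI audio convention of a dict with a "waveform" tensor shaped (batch, channels, samples) and an integer "sample_rate"; the key names and the 3-dimensional shape are assumptions not spelled out in this page, and the helper name is hypothetical. The check is duck-typed on a `.shape` attribute so it does not depend on any tensor library.

```python
from typing import Any, List


def audio_format_problems(audio: Any) -> List[str]:
    """Return a list of problems found in a ComfyUI-style audio input.

    An empty list means the input looks well formed under the assumed
    convention: {"waveform": <3-D tensor>, "sample_rate": <positive int>}.
    """
    if not isinstance(audio, dict):
        return ["audio is not a dict"]

    problems = []

    waveform = audio.get("waveform")
    if waveform is None:
        problems.append("missing 'waveform'")
    else:
        shape = getattr(waveform, "shape", None)
        if shape is None or len(shape) != 3:
            problems.append("'waveform' should have shape (batch, channels, samples)")

    sample_rate = audio.get("sample_rate")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sample_rate, bool) or not isinstance(sample_rate, int) or sample_rate <= 0:
        problems.append("'sample_rate' should be a positive integer")

    return problems
```

Running this on both reference_audio and generated_audio before the comparison gives a readable message about which field is malformed.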
