Dots TTS Whisper Transcribe:
The DotsTTSWhisperTranscribe node transcribes audio into text using a Whisper automatic speech recognition (ASR) model, specifically for Dots TTS voice cloning applications. It converts the speech in a reference audio clip into written text, which can then be used as the reference transcript for voice cloning tasks. The node handles various languages and can automatically detect the language of the audio.
Dots TTS Whisper Transcribe Input Parameters:
audio
This parameter represents the reference audio that you want to transcribe. The audio is provided as a dictionary, and it serves as the primary input for the transcription process. The quality and clarity of the audio can significantly impact the accuracy of the transcription.
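As a rough illustration of working with such a dictionary, the sketch below assumes it follows the usual ComfyUI AUDIO layout: a `waveform` tensor shaped (batch, channels, samples) plus an integer `sample_rate`. That layout is an assumption, since the description above only says the audio is a dictionary, and `audio_duration_seconds` is a hypothetical helper, not part of the node.

```python
from types import SimpleNamespace


def audio_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds for an AUDIO-style dictionary.

    Assumed layout: audio["waveform"] has a .shape of
    (batch, channels, samples) and audio["sample_rate"] is a positive int.
    """
    for key in ("waveform", "sample_rate"):
        if key not in audio:
            raise KeyError(f"audio dictionary is missing '{key}'")
    sample_rate = int(audio["sample_rate"])
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    # The last dimension holds the samples of each channel.
    num_samples = audio["waveform"].shape[-1]
    return num_samples / sample_rate


# Stand-in for a tensor: only the .shape attribute is needed here.
fake_waveform = SimpleNamespace(shape=(1, 2, 96000))
duration = audio_duration_seconds({"waveform": fake_waveform, "sample_rate": 48000})
```

A check like this can be useful before transcription, for example to decide whether a clip is long enough to need chunking.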
model
This parameter specifies the Whisper ASR model to be used for transcription. You can choose from several models, such as whisper-large-v3-turbo, whisper-medium, or whisper-small, among others. The default model is whisper-large-v3-turbo. Smaller models generally run faster and use less memory, while larger models tend to be more accurate.
dtype
This parameter determines the precision of the Whisper model during transcription. Options include auto, bf16, and fp32. The default setting is auto, which automatically selects bf16 on supported CUDA/XPU devices and fp32 otherwise. The choice of precision can influence the performance and resource usage of the transcription process.
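The auto rule described above can be sketched as a small function. This is only an illustration of the stated behavior, not the node's actual code: `resolve_dtype` and its arguments are hypothetical names, and the caller is assumed to know whether the device supports bf16.

```python
def resolve_dtype(choice: str, device_type: str, bf16_supported: bool) -> str:
    """Map the dtype option to a concrete precision name.

    choice: "auto", "bf16", or "fp32" (the options listed above).
    device_type: e.g. "cuda", "xpu", or "cpu" (hypothetical argument).
    bf16_supported: whether the device reports bf16 support (hypothetical argument).
    """
    choice = choice.lower()
    if choice not in ("auto", "bf16", "fp32"):
        raise ValueError(f"unknown dtype option: {choice!r}")
    if choice != "auto":
        # An explicit choice is passed through unchanged.
        return choice
    # auto: bf16 only on supported CUDA/XPU devices, fp32 otherwise.
    if device_type in ("cuda", "xpu") and bf16_supported:
        return "bf16"
    return "fp32"
```

With PyTorch on a CUDA device, `torch.cuda.is_bf16_supported()` is one way the `bf16_supported` flag could be obtained.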
language
This parameter indicates the language of the reference audio. Options include auto, english, chinese, japanese, and several others. The default is auto, which allows the model to automatically detect the language. Specifying the language can improve transcription accuracy, especially for non-English audio.
task
This parameter defines the task to be performed by the Whisper model. Options are transcribe and translate. The default is transcribe, which produces text in the original language of the audio. Choosing translate produces an English translation instead, regardless of the original language.
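To show how the language and task settings might combine, here is a minimal sketch that assumes the node forwards them as generation options to a Hugging Face transformers ASR pipeline. That backend is an assumption, and `build_generate_kwargs` is a hypothetical helper.

```python
def build_generate_kwargs(language: str = "auto", task: str = "transcribe") -> dict:
    """Build generation options from the language and task settings.

    When language is "auto", no language is passed, leaving detection
    to the model (assumed behavior).
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"task must be 'transcribe' or 'translate', got {task!r}")
    kwargs = {"task": task}
    if language != "auto":
        kwargs["language"] = language
    return kwargs


defaults = build_generate_kwargs()
japanese_to_english = build_generate_kwargs("japanese", "translate")
```

With transformers, a dictionary of this shape is typically passed as `generate_kwargs` when calling an "automatic-speech-recognition" pipeline.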
chunk_length_s
This parameter sets the length of audio chunks to be processed at a time, measured in seconds. The default value is 30 seconds, with a minimum of 0 and a maximum of 120 seconds. Setting this to 0 allows the model to automatically determine the chunk length. Adjusting this parameter can help manage memory usage and processing time for longer audio files.
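The effect of the chunk length on the amount of work can be estimated with simple arithmetic. The sketch below is a simplification: it ignores any overlap between neighboring chunks, and it treats a value of 0 as a single pass, which is an assumption about how automatic chunking behaves. `count_chunks` is a hypothetical helper.

```python
import math


def count_chunks(duration_s: float, chunk_length_s: float = 30.0) -> int:
    """Estimate how many chunks a clip is split into (overlap ignored)."""
    if not 0 <= chunk_length_s <= 120:
        raise ValueError("chunk_length_s must be between 0 and 120 seconds")
    if duration_s <= 0:
        return 0
    if chunk_length_s == 0:
        # Chunk length left to the model; counted as one pass here (assumption).
        return 1
    return math.ceil(duration_s / chunk_length_s)


chunks_default = count_chunks(95)        # 95 s clip, 30 s chunks
chunks_long = count_chunks(95, 120)      # same clip, 120 s chunks
```

Shorter chunks mean more passes over smaller inputs, which tends to lower peak memory use at the cost of more processing steps.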
download_if_missing
This boolean parameter determines whether the Whisper model should be automatically downloaded if it is not already available. Setting this to True ensures that the necessary model files are retrieved, facilitating seamless transcription without manual intervention.
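The check-then-download behavior described above might look roughly like the following. This is a sketch under stated assumptions: the model is assumed to live in a local directory, and both `ensure_model` and the `download` callable are hypothetical stand-ins for whatever the node actually uses.

```python
from pathlib import Path
from typing import Callable


def ensure_model(
    model_dir: Path,
    download_if_missing: bool,
    download: Callable[[Path], None],
) -> Path:
    """Return model_dir if it holds files, downloading first when allowed."""
    if model_dir.is_dir() and any(model_dir.iterdir()):
        # Model files are already present locally.
        return model_dir
    if not download_if_missing:
        raise FileNotFoundError(
            f"Model not found at {model_dir}; enable download_if_missing to fetch it."
        )
    # 'download' is a caller-supplied function (hypothetical) that fills model_dir.
    download(model_dir)
    return model_dir
```

Passing the downloader in as a parameter keeps the sketch independent of any particular model hub and makes the behavior easy to test offline.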
Dots TTS Whisper Transcribe Output Parameters:
transcript
The output of this node is a string containing the transcribed text from the reference audio. This transcript serves as a crucial component for voice cloning tasks, providing the textual reference needed to replicate the original voice accurately. The quality of the transcript can significantly impact the effectiveness of the voice cloning process.
Dots TTS Whisper Transcribe Usage Tips:
- Ensure that the reference audio is clear and free from background noise to improve transcription accuracy.
- Select the appropriate Whisper model based on your needs for speed and accuracy; larger models may offer better accuracy but require more computational resources.
- If you are working with non-English audio, specify the language to enhance transcription precision.
- Adjust the chunk_length_s parameter to optimize performance for longer audio files, balancing memory usage against processing time.
Dots TTS Whisper Transcribe Common Errors and Solutions:
Model not found
- Explanation: This error occurs when the specified Whisper model is not available locally.
- Solution: Ensure that download_if_missing is set to True to automatically download the required model.
Unsupported language
- Explanation: The language specified is not supported by the Whisper model.
- Solution: Verify that the language is included in WHISPER_LANGUAGE_OPTIONS and adjust the parameter accordingly.
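A language check of this kind can be sketched as follows. The option tuple here is only an illustrative subset built from the languages named earlier, not the node's real WHISPER_LANGUAGE_OPTIONS, and `check_language` is a hypothetical helper.

```python
import difflib

# Illustrative subset only; the node's real list contains more languages.
LANGUAGE_OPTIONS_SUBSET = ("auto", "english", "chinese", "japanese")


def check_language(language: str, options=LANGUAGE_OPTIONS_SUBSET) -> str:
    """Return the normalized language name or raise with a suggestion."""
    normalized = language.strip().lower()
    if normalized in options:
        return normalized
    # Offer the closest known option, if any is similar enough.
    suggestions = difflib.get_close_matches(normalized, options, n=1)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unsupported language '{language}'.{hint}")
```

Normalizing case and whitespace first avoids rejecting values such as " English " that differ from a valid option only in formatting.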
Audio format error
- Explanation: The provided audio file is in an unsupported format or is corrupted.
- Solution: Ensure the audio file is in a compatible format and is not corrupted before attempting transcription.
