LTXV Reference Audio (ID-LoRA):
LTXVReferenceAudio is a node for speaker identity transfer using ID-LoRA. It encodes a reference audio clip into the conditioning so that the clip can guide the speaker identity in audio synthesis, letting you carry the vocal characteristics of a reference speaker over to the generated audio. Identity guidance works through an additional forward pass without the reference audio, which amplifies the speaker identity effect in the output. The node is most useful where maintaining or transferring speaker identity matters, such as voice cloning or personalized text-to-speech systems.
LTXV Reference Audio (ID-LoRA) Input Parameters:
model
The model parameter is the audio synthesis model used to process the reference audio. It defines the framework within which the speaker identity transfer occurs. Because it is a model input rather than a numeric setting, it has no minimum, maximum, or default value.
positive
This parameter is the positive conditioning, which guides the model toward the desired outcome. It sets the context in which the reference audio is applied and typically describes the attributes or features that the model should emphasize.
negative
The negative parameter is the counterpart to the positive conditioning. It specifies attributes or features that the model should avoid or minimize in the output, which helps refine the speaker identity transfer by balancing the conditioning context.
reference_audio
The reference_audio parameter is the core input for this node, containing the audio clip that serves as the reference for speaker identity transfer. It must be at least 1.8 seconds long and no longer than 15.1 seconds. The audio is encoded into latents and patchified for integration into the model.
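The duration limits above can be checked before a clip reaches the node. The following is a minimal sketch, not the node's own code: the function names are hypothetical, and it assumes you know the clip's sample count and sample rate. The limits and message wording are taken from this page.

```python
# Duration limits as quoted in this document.
MIN_SECONDS = 1.8
MAX_SECONDS = 15.1


def clip_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a clip in seconds: sample count divided by sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def check_reference_duration(num_samples: int, sample_rate: int) -> float:
    """Hypothetical pre-check mirroring the duration errors listed on this page."""
    duration = clip_duration_seconds(num_samples, sample_rate)
    if duration < MIN_SECONDS:
        raise ValueError(
            f"Reference audio is too short: {duration:.2f}s. "
            f"Minimum duration is {MIN_SECONDS} seconds."
        )
    if duration > MAX_SECONDS:
        raise ValueError(
            f"Total reference audio duration is {duration:.2f}s. "
            f"Maximum is {MAX_SECONDS} seconds."
        )
    return duration
```

For example, a clip of 240,000 samples at 48,000 Hz lasts 5 seconds and passes the check, while a 1-second clip raises an error.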
audio_vae
This parameter specifies the Audio Variational Autoencoder (VAE) model used for encoding the reference audio into a latent representation. The VAE model is crucial for transforming the audio waveform into a format that can be processed by the node.
identity_guidance_scale
The identity_guidance_scale parameter controls the strength of the identity guidance applied during speaker identity transfer, that is, how prominently the reference speaker's identity is reflected in the output. Values start at 0, and higher values increase the identity effect.
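To make the role of the scale concrete, here is a sketch of one plausible way to combine the two forward passes. The formula is an assumption modeled on classifier-free guidance; this page does not confirm that the node uses it, and the function name is hypothetical.

```python
def apply_identity_guidance(with_ref, without_ref, scale):
    """Push the prediction away from the no-reference result.

    ASSUMPTION: a classifier-free-guidance-style combination, used here
    only to illustrate what a guidance scale does. with_ref and
    without_ref are equal-length lists of floats standing in for the
    model's predictions with and without the reference audio.
    """
    if len(with_ref) != len(without_ref):
        raise ValueError("predictions must have the same length")
    # scale = 0 returns the with-reference prediction unchanged;
    # larger scales exaggerate the difference the reference makes.
    return [w + scale * (w - wo) for w, wo in zip(with_ref, without_ref)]
```

Under this assumption, a scale of 0 leaves the result unchanged, and any positive scale moves the output further in the direction the reference audio contributes.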
start_percent
This parameter defines where identity guidance begins, as a percentage of the way through the sampling process. Use it to control when the identity transfer starts during audio synthesis.
end_percent
Similar to start_percent, this parameter specifies where identity guidance ends, as a percentage of the way through the sampling process. Together with start_percent, it determines the portion of the synthesis over which the speaker identity transfer is applied.
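The start and end parameters together define a window. The sketch below shows how such a window could select sampling steps. It rests on two assumptions: the values are fractions from 0.0 to 1.0, and progress is the step index divided by the total step count. The node itself may map the window differently (for example via the noise schedule), and the function names are hypothetical.

```python
def guidance_active(step, total_steps, start_percent, end_percent):
    """Return True if a step falls inside the guidance window.

    ASSUMPTION: progress is step / total_steps and the percent values
    are fractions in [0.0, 1.0]; this is a simplification for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = step / total_steps
    return start_percent <= progress <= end_percent


def active_steps(total_steps, start_percent, end_percent):
    """List the step indices at which guidance would be applied."""
    return [
        s for s in range(total_steps)
        if guidance_active(s, total_steps, start_percent, end_percent)
    ]
```

Narrowing the window reduces the number of steps that need the extra forward pass, at the cost of a weaker identity effect on the steps outside it.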
LTXV Reference Audio (ID-LoRA) Output Parameters:
waveform
The waveform output is the processed audio waveform that incorporates the speaker identity transfer. It reflects the unique vocal characteristics of the reference speaker as applied to the target audio.
sample_rate
This output parameter indicates the sample rate of the processed audio waveform. It is crucial for ensuring that the audio is played back at the correct speed and quality.
LTXV Reference Audio (ID-LoRA) Usage Tips:
- Ensure that your reference audio is between 1.8 and 15.1 seconds long to avoid errors related to audio duration.
- Experiment with the identity_guidance_scale to find the optimal balance between maintaining the reference speaker's identity and achieving natural-sounding audio.
LTXV Reference Audio (ID-LoRA) Common Errors and Solutions:
Reference audio is too short: <duration>s. Minimum duration is 1.8 seconds.
- Explanation: The reference audio provided is shorter than the required minimum duration of 1.8 seconds.
- Solution: Use a longer reference audio clip that meets the minimum duration requirement.
Total reference audio duration is <duration>s. Maximum is 15.1 seconds.
- Explanation: The combined duration of all reference audio clips exceeds the maximum allowed duration of 15.1 seconds.
- Solution: Reduce the total duration of the reference audio clips to comply with the maximum limit.
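One way to apply the solution above is to trim the clip before passing it to the node. This is a minimal sketch that assumes the audio is available as a flat sequence of samples; the function name is hypothetical, and a real waveform tensor would be sliced along its time axis instead.

```python
def trim_to_max_duration(samples, sample_rate, max_seconds=15.1):
    """Keep only the leading samples that fit within max_seconds.

    The default limit is the maximum quoted on this page. Clips that are
    already short enough are returned unchanged in length.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    # Round down so the trimmed clip never exceeds the limit.
    max_samples = int(max_seconds * sample_rate)
    return samples[:max_samples]
```

Trimming only addresses clips that are too long. A clip shorter than the minimum needs a longer recording.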
