F5-TTS

Local zero-shot voice cloning from .wav files.

Device

Auto-detects the best available device (GPU preferred).
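
As an illustration of what this kind of detection can look like, here is a minimal sketch. It assumes a PyTorch backend, and the helper name pick_device is hypothetical; the actual detection logic used here may differ.

```python
def pick_device() -> str:
    """Return the name of the preferred compute device, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so assume CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon GPUs are exposed through the MPS backend in recent PyTorch.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```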

Model

F5TTS_v1_Base (default, most recent model)
F5TTS_Base
E2TTS_Base

NFE Step

Number of sampling steps used to generate the audio. Higher values generally improve quality at the cost of slower generation.
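
NFE is commonly short for "number of function evaluations": how many times a sampler calls the model while stepping from noise toward the final audio. The toy sketch below is not F5-TTS code. It integrates the simple equation dx/dt = -x with Euler steps, only to illustrate why more steps follow the exact solution more closely while costing proportionally more work.

```python
import math


def euler_integrate(n_steps: int) -> float:
    """Integrate dx/dt = -x from t=0 to t=1, starting at x=1, in n_steps Euler steps."""
    x = 1.0
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x += dt * (-x)  # one function evaluation per step
    return x


def euler_error(n_steps: int) -> float:
    """Absolute error against the exact solution exp(-1)."""
    return abs(euler_integrate(n_steps) - math.exp(-1.0))


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, euler_error(n))
```

In this toy case each doubling of the step count roughly halves the error, so the gains shrink in absolute terms while the cost keeps growing.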

Chunk size

Splits the text into chunks of this size. Smaller values improve responsiveness, but context is lost between chunks, which can affect things like inflection. 0 = no chunking.
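
The following sketch shows one way such chunking could work. It assumes the chunk size is measured in characters and that chunks break at sentence boundaries; both are assumptions for illustration, and chunk_text is a hypothetical helper, not the function used here.

```python
import re


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Group whole sentences into chunks of at most chunk_size characters.

    chunk_size == 0 disables chunking. A single sentence longer than
    chunk_size is kept whole rather than cut mid-sentence.
    """
    text = text.strip()
    if not text:
        return []
    if chunk_size <= 0:
        return [text]
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_text("Hello there. How are you? I am fine.", 25))
```

Breaking at sentence boundaries keeps each chunk self-contained, which limits the inflection problems that come from losing context.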

Replace exclamation marks

If checked, exclamation marks are replaced with periods. This is recommended for F5TTS_v1_Base, which tends to overemphasize exclamation marks.
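
A minimal sketch of this replacement is shown below. The helper name is hypothetical, and collapsing a run such as "!!!" into a single period is an assumption made here, not confirmed behavior.

```python
import re


def replace_exclamation_marks(text: str) -> str:
    """Replace each run of exclamation marks with a single period."""
    return re.sub(r"!+", ".", text)


if __name__ == "__main__":
    print(replace_exclamation_marks("Stop! Wait!!!"))
```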

Adding F5-TTS Voices

Voice Requirements

F5-TTS voices require:

Reference audio file (.wav format, 6-10 seconds)
Clear speech with minimal background noise
Single speaker throughout the sample
Reference text (optional but recommended)
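
Some of these requirements can be checked before uploading. The sketch below uses Python's standard wave module to confirm that a file is a readable PCM .wav and that its duration falls within bounds supplied by the caller. The helper name check_reference_wav is hypothetical, and background noise or the number of speakers cannot be verified this way.

```python
import wave


def check_reference_wav(path: str, min_seconds: float, max_seconds: float) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        with wave.open(path, "rb") as wav:
            frame_count = wav.getnframes()
            sample_rate = wav.getframerate()
    except (wave.Error, EOFError, OSError) as exc:
        return [f"cannot read as PCM .wav: {exc}"]

    problems = []
    duration = frame_count / sample_rate if sample_rate else 0.0
    if duration < min_seconds:
        problems.append(f"too short: {duration:.1f}s < {min_seconds}s")
    elif duration > max_seconds:
        problems.append(f"too long: {duration:.1f}s > {max_seconds}s")
    return problems
```

Calling check_reference_wav with the path to a sample and the desired duration range returns an empty list when the file passes.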

Creating a Voice

Open the Voice Library
Click "Add Voice"
Select "F5-TTS" as the provider
Configure the voice:

Label: Descriptive name (e.g., "Emma - Calm Female")

Voice ID / Upload File: Upload a .wav file containing the reference voice sample. The uploaded file also serves as the voice ID.

Use 6-10 second samples (longer samples don't improve quality)
Ensure clear speech with minimal background noise
Record at natural speaking pace

Reference Text: Enter exactly what is spoken in the reference audio. This significantly improves voice cloning accuracy.

Include proper punctuation and capitalization

Speed: Adjust playback speed (0.5 to 2.0, default 1.0)

Tags: Add descriptive tags (gender, age, style) for organization
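
To summarize the fields above in one place, here is a sketch of a voice entry as a Python dataclass. The class name, field names, and the file name emma.wav are hypothetical and do not reflect how voices are actually stored; only the .wav requirement and the speed range come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class F5VoiceConfig:
    label: str
    reference_wav: str            # also acts as the voice ID
    reference_text: str = ""      # optional but recommended
    speed: float = 1.0
    tags: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.reference_wav.lower().endswith(".wav"):
            raise ValueError("reference audio must be a .wav file")
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")


voice = F5VoiceConfig(
    label="Emma - Calm Female",
    reference_wav="emma.wav",
    reference_text="The quick brown fox jumps over the lazy dog.",
    tags=["female", "calm"],
)
```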

Extra voice parameters

The following optional parameters can be set per voice.

Speed

Adjusts the speaking speed of the voice (0.5 to 2.0, default 1.0).

CFG Strength

CFG (classifier-free guidance) strength controls how strongly generation is steered by the input text and reference audio. A higher value generally reproduces the input text more faithfully, while a lower value can produce more varied speech, potentially at the cost of accuracy.
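
In a common formulation of classifier-free guidance, the model is evaluated twice per step, once with the conditioning and once without, and the conditioned prediction is pushed further away from the unconditioned one by the guidance strength. The sketch below shows that blend on plain lists of numbers. The name apply_cfg is hypothetical, and this illustrates the general technique rather than the exact computation performed here.

```python
def apply_cfg(cond: list[float], uncond: list[float], strength: float) -> list[float]:
    """Blend predictions: cond + strength * (cond - uncond), element by element."""
    return [c + strength * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # With strength 0 the conditioned prediction is returned unchanged.
    print(apply_cfg([1.0, 2.0], [0.0, 2.0], 0.0))
    print(apply_cfg([1.0, 2.0], [0.0, 2.0], 2.0))
```

Larger strengths amplify whatever distinguishes the conditioned prediction, which is why they tend to track the text more closely and leave less room for variation.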