A powerful, local text-to-speech system powered by Llasa TTS models. This project offers a modern, interactive dashboard that supports multiple model sizes (1B, 3B, and 8B) and introduces a new Podcast Mode for multi-speaker conversation synthesis.
The Llasa TTS Dashboard turns a traditional text-to-speech pipeline into a robust, user-friendly application. With efficient GPU utilization and flexible generation controls, the dashboard is designed for both developers and end users who need high-quality speech synthesis.
- Multi-Model Support
Switch easily between 1B, 3B, and 8B models.
- Standard TTS Mode
Generate natural-sounding speech either from plain text or with a reference audio prompt.
- Podcast Mode
Create multi-speaker podcasts from transcripts. Configure reference audio and seeds for each speaker to produce consistent character voices.
- Advanced Generation Controls
Fine-tune parameters such as max length, temperature, and top-p. Use random or fixed seeds for reproducibility.
- Clean & Modern UI
A sleek, two-panel interface built with Gradio. Enjoy a dark theme that enhances readability.
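To illustrate what the temperature, top-p, and seed controls do, here is a minimal pure-Python sketch of temperature-scaled nucleus (top-p) sampling over a list of logits. It is not the project's generation code; the function name `sample_top_p` and the example logits are hypothetical, and the real models sample over a much larger token vocabulary.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one index from `logits` with temperature and top-p filtering.

    Illustrative helper only (not part of the project). `temperature`
    must be greater than zero.
    """
    rng = rng or random.Random()

    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely indices whose cumulative
    # probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.Random.choices treats the weights as relative, so the kept
    # probabilities do not need to be renormalized by hand.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# A fixed seed makes the sequence of draws repeatable.
rng = random.Random(1234)
draws = [sample_top_p([1.0, 3.0, 2.0], temperature=0.8, top_p=0.9, rng=rng)
         for _ in range(5)]
```

Re-creating `random.Random(1234)` and repeating the loop yields the same `draws`, which is the idea behind using a fixed seed for reproducibility; a random seed gives a different sequence on each run.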
- Python 3.10+
- CUDA-capable NVIDIA GPU
- VRAM Requirements:
  - 8.5 GB+ VRAM: When running with Whisper Large Turbo in 4-bit mode.
  - 6.5 GB+ VRAM: When running without Whisper and using the LLM in 4-bit mode.
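The VRAM thresholds above can be expressed as a small lookup, shown here as a sketch. The thresholds come from the list above; the configuration names and the function `supported_configs` are made up for illustration and do not exist in the project.

```python
# Minimum VRAM per configuration, in GB, taken from the requirements above.
# The dictionary keys are illustrative labels, not project settings.
VRAM_REQUIREMENTS_GB = {
    "llm_4bit_with_whisper": 8.5,
    "llm_4bit_without_whisper": 6.5,
}


def supported_configs(available_vram_gb: float) -> list[str]:
    """Return the configurations whose minimum VRAM fits the given budget."""
    return [
        name
        for name, required in VRAM_REQUIREMENTS_GB.items()
        if available_vram_gb >= required
    ]


print(supported_configs(8.0))  # → ['llm_4bit_without_whisper']
```

With PyTorch installed, the free and total memory of the current GPU can be read in bytes with `torch.cuda.mem_get_info()` and converted to GB before calling a helper like this one.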
git clone https://github.com/nivibilla/local_llasa_tts.git
cd local_llasa_tts
On Windows, the project works best under WSL2. Install the necessary dependencies:
pip install -r requirements_base.txt
pip install -r requirements_native_hf.txt
You can start the application in several ways:
From the project root directory, execute:
python -m src.main
- Unix/Linux/Mac:
Make sure run.sh is executable and run it:
chmod +x run.sh
./run.sh
- Windows:
Double-click run.bat or run it from the command prompt:
run.bat
- Model Selection: Choose between 1B, 3B, or 8B.
- Generation Mode: Select "Text only" or "Reference audio."
- Advanced Settings: Adjust max length, temperature, and top-p.
- Output: Listen to the synthesized speech and review previous generations.
- Transcript Input: Enter a conversation transcript with each line formatted as "Speaker Name: message".
- Speaker Configuration: Optionally provide reference audio and seeds for each speaker.
- Advanced Settings: Configure generation parameters similar to Standard TTS.
- Output: Generate a complete podcast audio file with seamless transitions.
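The transcript format described above can be parsed along these lines. This is a sketch under the assumption that each non-empty line holds exactly one turn and that the first colon separates the speaker from the message; `parse_transcript` is a hypothetical helper, and the dashboard's own parser may behave differently.

```python
def parse_transcript(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, message) turns.

    Hypothetical helper: blank lines are skipped, and a line without a
    speaker, a colon, or a message raises ValueError.
    """
    turns = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so messages may contain colons.
        speaker, sep, message = line.partition(":")
        if not sep or not speaker.strip() or not message.strip():
            raise ValueError(
                f"line {line_number}: expected 'Speaker Name: message'"
            )
        turns.append((speaker.strip(), message.strip()))
    return turns


example = """Alice: Welcome to the show.
Bob: Thanks, it is 10:30 and I am ready.

Alice: Let's begin."""

turns = parse_transcript(example)
# Unique speakers in order of first appearance, e.g. to attach one
# reference audio clip and one seed to each of them.
speakers = list(dict.fromkeys(name for name, _ in turns))
```

Here `speakers` is `['Alice', 'Bob']`, which is the natural key for per-speaker settings such as reference audio and seed.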
- Long Text Inference:
Refer to llasa_vllm_longtext_inference.ipynb for handling long text inputs using vLLM and chunking.
- Google Colab:
If you do not have a suitable local GPU, try our Colab Notebook.
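As a rough idea of what chunking long input can look like, the sketch below groups sentences into pieces under a character budget. It is an assumption for illustration, not the method used in the notebook; the function name `chunk_text`, the sentence-splitting rule, and the default `max_chars` value are all hypothetical.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group sentences into chunks of at most `max_chars` characters.

    Illustrative only. A single sentence longer than `max_chars` is kept
    whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The next sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pieces = chunk_text("One. Two. Three.", max_chars=9)
```

Each chunk can then be synthesized separately and the resulting audio concatenated in order. Splitting on sentence boundaries keeps pauses natural at the joins.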
- Original LLaSA Training Repository: Inspired by zhenye234/LLaSA_training.
- Gradio Demo Inspiration: UI concepts adapted from mrfakename/E2-F5-TTS.