Coqui TTS with GPU Acceleration for a custom Home Assistant Integration
A deep dive into building a GPU-powered, fully local Coqui TTS server with Home Assistant integration.
Many text-to-speech solutions rely on cloud services, which introduces privacy concerns. Self-hosting a TTS engine like Coqui TTS gives you full control over your voice pipeline, allows offline operation, and integrates with the rest of your AI stack.
GPU acceleration significantly improves TTS performance, enabling real-time synthesis with lower latency. In this guide, you'll learn how to install Coqui TTS on a GPU-passthrough VM using Proxmox, configure it for high-performance synthesis, and integrate it into Home Assistant as a custom TTS provider for real-time smart home voice feedback. This setup pairs well with self-hosted LLMs like Ollama.
What Is This?
This is a fully self-hosted, privacy-respecting, GPU-accelerated text-to-speech server using Coqui TTS. We're using it alongside an AI chatbot (Ollama) and Home Assistant Voice for a powerful offline voice assistant.
GPU Passthrough with Proxmox
GPU passthrough is a technique that allows a virtual machine to directly access a physical GPU on the host system. This is crucial for machine learning and TTS applications like Coqui, which benefit significantly from the parallel processing power of GPUs. Without passthrough, the VM would rely on emulated or software rendering, which is too slow for real-time voice synthesis.
Step 1: Configure Your Proxmox VM
Hardware:
- RAM: 4GB
- CPU: 2 cores (1 socket, 2 cores), type set to x86-64-v2-AES
- BIOS: UEFI (OVMF)
- TPM: Enabled
- EFI Disk: Present
- GPU: PCI passthrough enabled (using Device ID)
Step 2: Enable IOMMU and Passthrough
On your Proxmox host:
- Enable IOMMU in BIOS.
- Set kernel parameters (see the sketch after this list):
  - For Intel: intel_iommu=on
  - For AMD: amd_iommu=on
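If the Proxmox host boots with GRUB, the parameter typically goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (hosts using systemd-boot edit /etc/kernel/cmdline instead). A minimal sketch for an Intel host:
# /etc/default/grub (Intel host; use amd_iommu=on on AMD)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
# Apply the change and reboot the host
update-grub
reboot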
Verify that your /etc/pve/qemu-server/<vmid>.conf includes something like:
hostpci0: 0000:01:00.0,pcie=1
Step 3: Unbind Host Drivers
Check with:
lspci -nnk | grep -A 2 -i nvidia
Ensure the GPU uses vfio-pci and not nouveau or nvidia. If it doesn't, blacklist nouveau and bind your GPU to vfio-pci using files in /etc/modprobe.d/.
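As a sketch, those files can look like the following; the vendor:device IDs are placeholders and must be replaced with the ones reported by lspci -nn for your GPU (and its audio function):
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
# /etc/modprobe.d/vfio.conf (placeholder IDs -- substitute your GPU's vendor:device IDs)
options vfio-pci ids=10de:1b80,10de:10f0
# Rebuild the initramfs on the host and reboot
update-initramfs -u -k all
reboot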
Install Coqui TTS with GPU Acceleration
Step 1: Install Ubuntu Server 24.04 LTS (with third-party drivers checked)
Once booted:
Step 2: Verify GPU Visibility
lspci | grep -i vga
Should return something like:
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
Then run:
lshw -c video
It should show:
configuration: driver=nvidia latency=0
If you see nouveau, the NVIDIA driver is not installed yet.
Step 3: Install NVIDIA Driver
sudo apt update
sudo ubuntu-drivers devices
sudo apt install nvidia-driver-550
sudo reboot
After reboot:
nvidia-smi
You should see your GPU listed.
Step 4: Disable Secure Boot (If Necessary)
If NVIDIA drivers still don’t load:
- Boot the VM and press ESC to access UEFI
- Disable Secure Boot
Reboot again and confirm with nvidia-smi.
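To check whether Secure Boot is actually the culprit, you can query its state from inside the VM (assuming the mokutil package is installed):
mokutil --sb-state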
Install Python and Coqui TTS
Step 1: Install Python 3.11 via pyenv
sudo apt install -y git curl build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev python3-openssl
curl https://pyenv.run | bash
Add pyenv to your shell startup file (e.g. ~/.bashrc or ~/.zshrc), then restart your shell.
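The lines to add are roughly the following, matching the pyenv installer's own instructions and assuming the default ~/.pyenv install location:
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"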
pyenv install 3.11.9
pyenv virtualenv 3.11.9 coqui-tts
pyenv activate coqui-tts
Step 2: Install Coqui TTS
pip install --upgrade pip
pip install TTS[server]
Run Coqui TTS with Preferred Model
Coqui TTS supports a wide range of models, but for general-purpose English synthesis with high quality and decent speed, the Tacotron2-DDC model paired with HiFi-GAN v2 vocoder is a reliable choice.
- Tacotron2-DDC is a variation of Tacotron2 with "double decoder consistency" that improves robustness and naturalness in speech generation.
- HiFi-GAN v2 is a fast neural vocoder that converts spectrograms into realistic audio waveforms in real-time.
Together, they offer a good balance between audio quality and inference speed, especially when GPU acceleration is available.
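If you'd like to explore alternatives, the Coqui CLI can list all available models and vocoders:
tts --list_models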
Run the server using the following command:
tts-server \
--model_name tts_models/en/ljspeech/tacotron2-DDC \
--vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
--use_cuda true
Then visit:
http://<your-server-ip>:5002
Or test via curl:
curl "http://<your-server-ip>:5002/process?INPUT_TEXT=this+is+a+test" > test.wav
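A quick sanity check on the result (assuming the file utility is available) should report a RIFF/WAVE file:
file test.wav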
Auto-Start with systemd
Create a systemd service file:
sudo nano /etc/systemd/system/tts-server.service
[Unit]
Description=Coqui TTS Server
After=network.target
[Service]
User=youruser
WorkingDirectory=/home/youruser/coqui-tts
ExecStart=/home/youruser/.pyenv/versions/coqui-tts/bin/tts-server \
--model_name tts_models/en/ljspeech/tacotron2-DDC \
--vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
--use_cuda true
Restart=always
[Install]
WantedBy=multi-user.target
Then run:
sudo systemctl daemon-reload
sudo systemctl enable tts-server
sudo systemctl start tts-server
sudo systemctl status tts-server
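If the service fails to start, follow its logs with standard systemd tooling:
sudo journalctl -u tts-server -f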
Home Assistant Integration
Home Assistant supports multiple TTS platforms out of the box, like Google Cloud, Amazon Polly, and Microsoft Azure. However, these cloud-based services may introduce privacy concerns. By creating a custom TTS integration, you can route all voice synthesis through your own Coqui TTS server, keeping everything local, fast, and secure.
This custom component fetches generated audio directly from your self-hosted Coqui TTS server and feeds it into Home Assistant's media pipeline. It behaves like a native integration and can be used in automations, scripts, or notifications just like any built-in provider.
MaryTTS, which was previously supported by Home Assistant, is now discontinued and no longer maintained. Unfortunately, there are currently no other officially supported local TTS integrations for Home Assistant that offer modern performance. As a result, building a custom integration like this one is currently the only reliable way to add high-quality, GPU-accelerated TTS support using Coqui.
Step 1: Create a Custom Component
In config/custom_components/coqui_tts/
, add:
manifest.json
{
  "domain": "coqui_tts",
  "name": "Coqui TTS",
  "version": "1.0.0",
  "requirements": ["requests"],
  "codeowners": ["@yourgithub"]
}
tts.py
import logging
from urllib.parse import quote_plus

import requests

from homeassistant.components.tts import Provider

_LOGGER = logging.getLogger(__name__)


def get_engine(hass, config, discovery_info=None):
    """Set up the Coqui TTS provider from the YAML config."""
    return CoquiTTSProvider(config)


class CoquiTTSProvider(Provider):
    """TTS provider that fetches audio from a self-hosted Coqui TTS server."""

    def __init__(self, config):
        self._lang = "en"
        self._base_url = config.get("base_url")
        self._name = "CoquiTTS"

    @property
    def default_language(self):
        return self._lang

    @property
    def supported_languages(self):
        return ["en"]

    @property
    def name(self):
        return self._name

    def get_tts_audio(self, message, language, options=None):
        """Request synthesized speech from the Coqui TTS server and return WAV bytes."""
        try:
            encoded_text = quote_plus(message)
            url = f"{self._base_url}/process?INPUT_TEXT={encoded_text}"
            resp = requests.get(url, timeout=30)
            if resp.status_code != 200:
                _LOGGER.error("Coqui TTS request failed: %s", resp.text)
                return (None, None)
            return ("wav", resp.content)
        except Exception as e:
            _LOGGER.error("Error connecting to Coqui TTS: %s", e)
            return (None, None)
__init__.py
# Empty file
Step 2: Update configuration.yaml
tts:
  - platform: coqui_tts
    base_url: http://[IP]:5002
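Restart Home Assistant so it picks up the custom component and the new tts entry. On Home Assistant OS this can be done from the CLI (assuming the ha command is available); otherwise use Settings → System → Restart:
ha core restart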
Test It!
Use a script or automation in Home Assistant:
service: tts.coqui_tts_say
data:
  entity_id: media_player.kitchen_speaker
  message: "The oven has reached 200 degrees."
Success!
You now have a GPU-accelerated local TTS with full control of your voice assistant stack and a simple Home Assistant integration.
From here, you can do even more:
- Try different voice models, including multilingual ones.
- Integrate with voice cloning tools for personalization (results may vary).
- Expand with Whisper or Ollama to create a full speech-based assistant pipeline. I set this up in Home Assistant using Home Assistant Voice, linking my Ollama instance and the new TTS server.