Coqui TTS with GPU Acceleration for a Custom Home Assistant Integration

A deep dive into building a GPU-powered, fully local Coqui TTS server with Home Assistant integration.

Many text-to-speech solutions rely on cloud services, which introduces privacy concerns. Self-hosting a TTS engine like Coqui TTS not only gives you full control over your voice pipeline but also allows for offline operation and integration with your own AI stack.

Using GPU acceleration significantly improves TTS performance, enabling real-time synthesis with lower latency and better quality.

In this guide, you'll learn how to install Coqui TTS on a GPU-passthrough VM using Proxmox, configure it for high-performance synthesis, and integrate it into Home Assistant as a custom TTS provider for real-time smart home voice feedback. This setup pairs well with self-hosted LLMs like Ollama and gives you full control over your voice assistant stack.


What Is This?

This is a fully self-hosted, privacy-respecting, GPU-accelerated text-to-speech server using Coqui TTS. We're using it alongside an AI chatbot (Ollama) and Home Assistant Voice for a powerful offline voice assistant.


GPU Passthrough with Proxmox

GPU passthrough is a technique that allows a virtual machine to directly access a physical GPU on the host system. This is crucial for machine learning and TTS applications like Coqui, which benefit significantly from the parallel processing power of GPUs. Without passthrough, the VM would rely on emulated or software rendering, which is too slow for real-time voice synthesis.

Step 1: Configure Your Proxmox VM

Hardware:

  • RAM: 4GB
  • CPU: 2 cores (1 socket, 2 cores), type set to x86-64-v2-AES
  • BIOS: UEFI (OVMF)
  • TPM: Enabled
  • EFI Disk: Present
  • GPU: PCI passthrough enabled (using Device ID)

Step 2: Enable IOMMU and Passthrough

On your Proxmox host:

  • Enable IOMMU in the BIOS/UEFI (Intel VT-d or AMD-Vi).
  • Set kernel params:
    • For Intel: intel_iommu=on
    • For AMD: amd_iommu=on
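
If your Proxmox host boots via GRUB, a minimal sketch of the kernel parameter change (hosts using systemd-boot edit /etc/kernel/cmdline instead):

# In /etc/default/grub, append the flag for your CPU vendor:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

update-grub
reboot

# After the reboot, confirm IOMMU is active:
dmesg | grep -e DMAR -e IOMMU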

Verify your /etc/pve/qemu-server/<vmid>.conf includes something like:

hostpci0: 0000:01:00.0,pcie=1

Step 3: Unbind Host Drivers

Check with:

lspci -nnk | grep -A 2 -i nvidia

Ensure the GPU uses vfio-pci and not nouveau or nvidia. If it doesn't, blacklist nouveau and bind your GPU to vfio-pci using /etc/modprobe.d/.
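
A minimal sketch of those /etc/modprobe.d/ entries, using the vendor:device IDs of a GTX 1080 and its HDMI audio function as example values (substitute the IDs reported by lspci -nn for your card):

echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options vfio-pci ids=10de:1b80,10de:10f0" > /etc/modprobe.d/vfio.conf
update-initramfs -u
reboot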


Install Coqui TTS with GPU Acceleration

Step 1: Install Ubuntu Server 24.04 LTS (with third-party drivers checked)

Once booted:

Step 2: Verify GPU Visibility

lspci | grep -i vga

Should return something like:

01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

Then run:

lshw -c video

It should show:

configuration: driver=nvidia latency=0

If you see nouveau, the NVIDIA driver is not installed yet.

Step 3: Install NVIDIA Driver

sudo apt update
sudo ubuntu-drivers devices
sudo apt install nvidia-driver-550
sudo reboot

After reboot:

nvidia-smi

You should see your GPU listed.

Step 4: Disable Secure Boot (If Necessary)

If NVIDIA drivers still don’t load:

  • Boot the VM and press ESC to access UEFI
  • Disable Secure Boot

Reboot again and confirm with nvidia-smi.
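
To check the current Secure Boot state from inside the VM (assuming the mokutil package is installed):

mokutil --sb-state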


Install Python and Coqui TTS

Step 1: Install Python 3.11 via pyenv

sudo apt install -y git curl build-essential libssl-dev zlib1g-dev libbz2-dev \
     libreadline-dev libsqlite3-dev wget llvm libncurses5-dev libncursesw5-dev \
     xz-utils tk-dev libffi-dev liblzma-dev python3-openssl

curl https://pyenv.run | bash

Add pyenv to your shell startup file (e.g. ~/.bashrc or ~/.zshrc), then restart your shell.
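
For bash, that means appending the standard pyenv init lines to ~/.bashrc:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"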

pyenv install 3.11.9
pyenv virtualenv 3.11.9 coqui-tts
pyenv activate coqui-tts

Step 2: Install Coqui TTS

pip install --upgrade pip
pip install "TTS[server]"

Run Coqui TTS with Preferred Model

Coqui TTS supports a wide range of models, but for general-purpose English synthesis with high quality and decent speed, the Tacotron2-DDC model paired with the HiFi-GAN v2 vocoder is a reliable choice.

  • Tacotron2-DDC is a variation of Tacotron2 with "double decoder consistency" that improves robustness and naturalness in speech generation.
  • HiFi-GAN v2 is a fast neural vocoder that converts spectrograms into realistic audio waveforms in real-time.

Together, they offer a good balance between audio quality and inference speed, especially when GPU acceleration is available.

Run the server using the following command:

tts-server \
  --model_name tts_models/en/ljspeech/tacotron2-DDC \
  --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
  --use_cuda true

Then visit:

http://<your-server-ip>:5002

Test via curl:

curl "http://<your-server-ip>:5002/process?INPUT_TEXT=this+is+a+test" > test.wav
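
If synthesis works, test.wav should be a playable WAV file; a quick sanity check on the server:

file test.wav
aplay test.wav   # optional playback, if the machine has audio output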

Auto-Start with systemd

Create a systemd service file:

sudo nano /etc/systemd/system/tts-server.service

Paste the following (adjust the user and the pyenv paths to match your setup):

[Unit]
Description=Coqui TTS Server
After=network.target

[Service]
User=youruser
WorkingDirectory=/home/youruser/coqui-tts
ExecStart=/home/youruser/.pyenv/versions/coqui-tts/bin/tts-server \
  --model_name tts_models/en/ljspeech/tacotron2-DDC \
  --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 \
  --use_cuda true
Restart=always

[Install]
WantedBy=multi-user.target

Then run:

sudo systemctl daemon-reload
sudo systemctl enable tts-server
sudo systemctl start tts-server
sudo systemctl status tts-server
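
If the service fails to start, the journal usually tells you why:

journalctl -u tts-server -f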

Home Assistant Integration

Home Assistant supports multiple TTS platforms out of the box, like Google Cloud, Amazon Polly, and Microsoft Azure. However, these cloud-based services may introduce privacy concerns. By creating a custom TTS integration, you can route all voice synthesis through your own Coqui TTS server, keeping everything local, fast, and secure.

This custom component fetches generated audio directly from your self-hosted Coqui TTS server and feeds it into Home Assistant's media pipeline. It behaves like a native integration and can be used in automations, scripts, or notifications just like any built-in provider.

MaryTTS, which Home Assistant previously supported as a local TTS platform, is discontinued and no longer maintained, and at the time of writing none of the remaining built-in local options offered the performance I was after. As a result, building a custom integration like this one is the most reliable way to add high-quality, GPU-accelerated TTS support using Coqui.

Step 1: Create a Custom Component

In config/custom_components/coqui_tts/, add:

manifest.json

{
  "domain": "coqui_tts",
  "name": "Coqui TTS",
  "version": "1.0.0",
  "requirements": ["requests"],
  "codeowners": ["@yourgithub"]
}

tts.py

import requests
import logging
from homeassistant.components.tts import Provider
from urllib.parse import quote_plus

_LOGGER = logging.getLogger(__name__)

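# Home Assistant calls get_engine() when it sets up a legacy TTS platform.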
def get_engine(hass, config, discovery_info=None):
    return CoquiTTSProvider(config)

class CoquiTTSProvider(Provider):
    def __init__(self, config):
        self._lang = "en"
        self._base_url = config.get("base_url")
        self._name = "CoquiTTS"

    @property
    def default_language(self):
        return self._lang

    @property
    def supported_languages(self):
        return ["en"]

    @property
    def name(self):
        return self._name

    def get_tts_audio(self, message, language, options=None):
        try:
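            # Query the server's MaryTTS-compatible /process endpoint.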
            encoded_text = quote_plus(message)
            url = f"{self._base_url}/process?INPUT_TEXT={encoded_text}"
            resp = requests.get(url, timeout=30)

            if resp.status_code != 200:
                _LOGGER.error("Coqui TTS request failed: %s", resp.text)
                return (None, None)

            return ("wav", resp.content)

        except Exception as e:
            _LOGGER.error("Error connecting to Coqui TTS: %s", e)
            return (None, None)

__init__.py

# Empty file

Step 2: Update configuration.yaml

tts:
  - platform: coqui_tts
    base_url: http://[IP]:5002
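
Restart Home Assistant so the custom component is loaded. On Home Assistant OS, one way is via the CLI:

ha core restart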

Test It!

Use a script or automation in Home Assistant:

service: tts.coqui_tts_say
data:
  entity_id: media_player.kitchen_speaker
  message: "The oven has reached 200 degrees."
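
For an end-to-end test from outside Home Assistant, you can call the same service through the REST API; a sketch assuming a long-lived access token and the default port:

curl -X POST \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{"entity_id": "media_player.kitchen_speaker", "message": "Hello from Coqui"}' \
  http://<ha-ip>:8123/api/services/tts/coqui_tts_say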

Success!

You now have a GPU-accelerated local TTS with full control of your voice assistant stack and a simple Home Assistant integration.

(Audio sample: a short clip synthesized by the new Coqui TTS server.)

From here, you can do even more:

  • Try different voice models, including multilingual ones.
  • Integrate with voice cloning tools for personalization (results may vary).
  • Expand with Whisper or Ollama to create a full speech-based assistant pipeline. I set this up in Home Assistant using Home Assistant Voice, linking my Ollama instance and the new TTS server.