By Zsolt Bizderi — 08 Jan 2025

Home Assistant Voice - Software & Hardware

Home Assistant Voice Hardware: A neat ESP32 device revolutionizing smart home voice control.

In December, Home Assistant dropped a long-awaited piece of hardware that many of us in the smart home community wanted to get our hands on – their dedicated voice assistant hardware. Naturally, I ordered a few right away.

For years, my Home Assistant instance has been the backbone of my home, controlling almost all electronics. But one glaring pain point remained – the absence of a reliable voice assistant that could rival Google Assistant or Siri. The 2023 voice control revamp was a step in the right direction, but without dedicated hardware, controlling the house meant reaching for my phone or computer, or dealing with third-party ESP32 builds that had poor microphones, weak speakers, or clunky designs.

The Home Assistant Voice Preview is an ESP32-powered box that runs on ESPHome. It has a beautiful design that fits naturally into any smart home setup.

In this post, I'll walk through my experience with the device, why it stands out from other ESP32 solutions, and how it integrates with Home Assistant to deliver a polished and responsive voice assistant experience.

The setup process for Home Assistant's new voice hardware is refreshingly simple. The kit includes clear documentation guiding you through connecting it to your existing Home Assistant instance. Below, I cover both the software workflow behind the hardware and the physical box itself. I assume you already have a basic understanding of how Home Assistant’s voice assistant operates.

The Voice Assistant System

The Home Assistant voice assistant operates through three core components:

Speech to Text (STT) – Converting voice input into text.
Context Analysis – Interpreting the text and determining the appropriate response or action.
Text to Speech (TTS) – Converting the response back into spoken words.
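
To make the flow concrete, here is a minimal Python sketch of how the three stages chain together. It is not Home Assistant's actual implementation: the `VoicePipeline` class and the `fake_stt`, `fake_agent`, and `fake_tts` functions are hypothetical stand-ins I made up for illustration, and the "audio" is just UTF-8 text so the example runs without any real speech engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages: audio -> text -> reply text -> audio."""

    speech_to_text: Callable[[bytes], str]
    handle_text: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)   # 1. Speech to Text
        reply = self.handle_text(text)      # 2. Context Analysis
        return self.text_to_speech(reply)   # 3. Text to Speech


# Hypothetical stand-in stages; a real setup would call an STT engine,
# a conversation agent, and a TTS engine here instead.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    if "light" in text.lower():
        return "Turning on the light."
    return "Sorry, I did not understand that."


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
print(pipeline.run(b"turn on the kitchen light").decode("utf-8"))
```

Because each stage is just a function passed in, swapping one engine for another (which is what the rest of this section is about) only changes a single argument.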

1. Speech to Text (STT)

I experimented with a few STT solutions before settling on Vosk. While Whisper is the official recommendation, I found Vosk to be faster and more responsive, which is essential for quick interactions. Vosk is also available as a Wyoming protocol server, making integration simple.

2. Context Analysis

For context analysis, I explored a few options:

Home Assistant's Native Tool – Great for basic device control, but lacks general conversational features. You have to be very specific with the structure of your sentences, so I abandoned it fairly quickly.
Local LLM (Ollama) – I initially tried using my local LLM running on Ollama. However, responses were slow (~10 seconds), making it impractical for daily use.
OpenAI API – Ultimately, I opted for OpenAI's API via the OpenAI Conversation integration. It’s fast, responsive, and can handle general questions as well as home control. The OpenAI API uses credits that are separate from regular ChatGPT subscriptions, but for my usage, credits last a long time. To control devices, ensure entities are properly exposed to the conversation agent.
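
A quick way to check how a conversation agent handles a sentence, without speaking to the device, is to send text to Home Assistant's REST conversation endpoint. The sketch below only builds the request. The host name and token are placeholders I made up, the helper name `build_conversation_request` is hypothetical, and which agent answers depends on your own configuration.

```python
import json
import urllib.request


def build_conversation_request(
    base_url: str, token: str, text: str, language: str = "en"
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/conversation/process."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/conversation/process",
        data=payload,
        headers={
            # A long-lived access token created in your Home Assistant profile.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token: replace both with your own values.
request = build_conversation_request(
    "http://homeassistant.local:8123",
    "YOUR_LONG_LIVED_TOKEN",
    "turn on the kitchen light",
)
print(request.get_method(), request.full_url)

# To actually send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

If the reply says the device could not be found, the entity is most likely not exposed to the assistant yet.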

3. Text to Speech (TTS)

Here’s where things got interesting. I tested several TTS options to find the most natural and pleasing voice:

Piper – The default Wyoming TTS. Functional but too robotic (think Microsoft Sam). Quickly abandoned.
Google Translate – Surprisingly effective, with more human-like voices. A solid mid-tier option.
ElevenLabs – The gold standard. Extremely realistic voices, but expensive. Performance-wise, nothing else compares. The ElevenLabs integration handles this beautifully.

My Ideal Stack: Vosk (STT) + OpenAI (conversational agent) + ElevenLabs (TTS). The combination delivers near-instantaneous, human-like responses that surpass commercial assistants like Alexa. I would go as far as saying it is actually like talking to a human.

The Hardware – Design, Functionality, and Daily Use

The design of the Home Assistant voice hardware is clearly inspired by the original iPod, complete with the iconic click wheel. This design choice adds a touch of nostalgia while remaining practical and intuitive.

What Stands Out:

Microphones – High quality, sensitive enough to reliably catch the wake word from across the room. (Currently, custom wake words aren’t supported, but I hope that changes in future updates.)
LED Ring – The RGB LED circle is bright, vivid, and responsive. It adds a great visual cue when interacting with the assistant.
Rotary Disk & Button – The center button and rotating wheel aren’t just for show – they’re fully functional Home Assistant entities. The rotary action can adjust volume, while the button can trigger automations or control entities.

Where It Could Be Better:

Speakers – This is the one area where the device falls short. At max volume, the audio can sound a little boxy and lacks the fullness you’d expect from an Amazon Echo or Google Nest. However, the slim profile and sleek form factor make up for this. It feels like a design trade-off: compact and stylish, but not quite an audio powerhouse. Then again, I won’t be using it to play music anyway.

I’m adding one to every room. Being able to talk to Home Assistant directly has completely changed how I interact with my smart home. It’s the missing piece that bridges automation and voice control in a way that finally feels complete. Honestly, I couldn’t be happier with it.