Using AI Vision and Voice with Home Assistant
I gave my home vision and a voice of its own.
TL;DR: This post covers two ways of adding AI vision to Home Assistant: first with a proxy that pipes snapshots through Ollama, then with Frigate 16’s new built-in GenAI and face recognition. I also show how to use Home Assistant’s new AI Task to make it task-driven and voice-enabled.
Last year I wrote a guide on setting up Frigate with Home Assistant to create a privacy-friendly local CCTV system. That setup handled recognition well, but it had way too many components (Frigate, Double-Take, DeepStack).
This time I wanted to push it further. The idea was to have the CCTV actually describe what it sees in real time, and to do it entirely on my own hardware. No cloud services, no footage sent away to third parties, just Frigate, Ollama, and Home Assistant all talking to each other.
When I started building this, Frigate 0.16 wasn't out yet. I moved away from the original Frigate/Double-Take/DeepStack setup and I put together a custom proxy that Home Assistant could call, which in turn sent images to Ollama and got a description back. Later, when 0.16 beta landed, I found that Frigate had grown facial recognition and a Generative AI integration. At that point I could throw away my proxy and just let Frigate handle the whole thing. Much cleaner, fewer moving parts.
Then, in August, Home Assistant's release updated the Ollama integration and added AI Task: services that handle jobs where simple automations just won't cut it, and that can connect a vision model directly to the live CCTV feeds, skipping Frigate's recognition model altogether.
So, here are all three methods. If you're on an older Frigate or want to micro-manage the prompts from inside Home Assistant, the DIY proxy works. If you're running Frigate 16 or newer, use the built-in GenAI. And finally, since Home Assistant added AI Task in the August update, I’ll show how I tied the whole lot into a voice routine.
What can this be used for?
A setup like this adds real intelligence to your home. Some examples:
- Message you if an unfamiliar car parks outside your house and stays longer than 5 minutes.
- Your house recognizes a stranger lingering by the gate and politely asks if they need help.
- Ask Home Assistant "Who’s at the gate?" and have it reply with a live description, including names and whether they are family, friends or strangers.
- For visually impaired users, voice queries like "What’s happening in the backyard?" could return a live description ("Two people are sitting by the patio table with drinks").
- Detect someone carrying a package towards your door and trigger a delivery routine before they even reach the doorbell. Similarly, your home "knows" you arrived with grocery bags and automatically unlocks the door and turns on the hallway lights.
- Count how many people are in the living room and adjust the thermostat or ventilation accordingly.

Vision - Method 1 — Custom Ollama Proxy
This method is good if you're still on Frigate < 0.16, or if you like keeping all the prompt logic on the Home Assistant side. I originally built it to ditch Double-Take for facial recognition.
How it works is straightforward:
- Home Assistant detects a person via Frigate.
- HA sends the snapshot URL to a small containerised FastAPI proxy.
- The proxy fetches the image, encodes it in Base64, and passes it to Ollama.
- Ollama runs a vision model locally and returns a short description.
- HA includes the description and the image in a mobile notification.
Step 1 — Proxy
A minimal FastAPI app in Docker.
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests, base64
OLLAMA_URL = "http://OLLAMA-HOST:11434/api/generate" # Ollama endpoint
app = FastAPI()
class DescribeIn(BaseModel):
image_url: str
model: str = "gemma3:4b"
prompt: str = ("Describe the person succinctly: clothing, approx age, "
"accessories, action. One short sentence.")
stream: bool = False
timeout: int = 20
@app.post("/describe")
def describe(body: DescribeIn):
try:
r = requests.get(body.image_url, timeout=max(5, body.timeout // 2))
r.raise_for_status()
img_b64 = base64.b64encode(r.content).decode("utf-8")
payload = {
"model": body.model,
"prompt": body.prompt,
"images": [img_b64],
"stream": body.stream
}
resp = requests.post(OLLAMA_URL, json=payload, timeout=body.timeout)
resp.raise_for_status()
j = resp.json()
text = (j.get("response") or "").strip()
return {"response": text or "No description returned by model."}
except requests.exceptions.RequestException as e:
raise HTTPException(status_code=502, detail=f"Ollama/proxy HTTP error: {e}")
except Exception as e:
raise HTTPException(status_code=500, detail=f"Proxy error: {e}")
Dependencies:
# requirements.txt
fastapi
uvicorn
requests
And the Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 5005
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5005"]
Build and run it:
docker build -t ollama-proxy .
docker run -d --name ollama-proxy -p 5005:5005 --restart unless-stopped ollama-proxy
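If you'd rather manage it with Docker Compose, an equivalent service definition is below (a minimal sketch mirroring the docker run command above; adjust the name and port to your setup):
services:
  ollama-proxy:
    build: .
    container_name: ollama-proxy
    ports:
      - "5005:5005"
    restart: unless-stopped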
Test with curl:
curl -s -X POST http://PROXY-IP:5005/describe \
-H 'Content-Type: application/json' \
-d '{"image_url":"https://images.pexels.com/photos/220453/pexels-photo-220453.jpeg"}'
You should get something like:
{"response":"A man wearing a black shirt looks at the camera."}
Step 2 — Home Assistant integration
In configuration.yaml:
rest_command:
describe_person:
url: "http://PROXY-IP:5005/describe"
method: POST
content_type: "application/json"
timeout: 25
payload: >-
{
"image_url": "{{ image_url }}",
"model": "gemma3:4b",
"prompt": "Describe the person succinctly: clothing, approx age, accessories, action. One short sentence.",
"stream": false
}
Restart HA.
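Before building the automation, you can sanity-check the new command from Developer Tools → Actions using the same test image as before; you should see the request land in the proxy's logs (docker logs ollama-proxy):
action: rest_command.describe_person
data:
  image_url: https://images.pexels.com/photos/220453/pexels-photo-220453.jpeg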
Step 3 — HA Automation
When a person is detected:
alias: Alarm - CCTV Unknown Person (REST Ollama via Proxy)
mode: parallel
trigger:
- entity_id: sensor.double_take_unknown
platform: state
action:
- variables:
filename: "{{ trigger.to_state.attributes.unknown.filename | default('') }}"
token: "{{ trigger.to_state.attributes.token | default('') }}"
person_image_url: >-
{% set base = 'http://PROXY-IP:3010/api/storage/matches/' %}
{% if filename %}
{{ base ~ filename ~ '?box=true' ~ ( '&token=' ~ (token | urlencode) if token else '' ) }}
{% else %}
''
{% endif %}
- data:
image_url: "{{ person_image_url }}"
response_variable: ollama
action: rest_command.describe_person
- variables:
raw: "{{ ollama | default({}) }}"
ai_msg: >-
{% set raw_str = raw | tojson if raw is mapping else (raw | string) %}
{% set m_json = raw_str | regex_findall('"response"\\s*:\\s*"([^"]*)"') %}
{% set m_py = raw_str | regex_findall("\\'response\\'\\s*:\\s*\\'(.*?)\\'") %}
{% set text = (m_json[0] if m_json else (m_py[0] if m_py else '')) %}
{{ text | replace('\n',' ') | regex_replace('\\s+',' ') | trim | capitalize | regex_replace('[.!?]+$','.') | default('Person detected while the alarm system is armed.', true) }}
- service: notify.mobile_app
data:
title: CCTV
message: "{{ ai_msg }}"
data:
image: "{{ person_image_url }}"
Vision - Method 2 — Frigate 16 with GenAI
From 0.16 onwards, Frigate gained facial recognition and GenAI integration. This means you can drop Double Take and the proxy. Frigate itself can recognise faces, send frames to Ollama (or OpenAI, if you wanted), and publish the description over MQTT.
Step 1: Enable face recognition in your Frigate config:
face_recognition:
enabled: true
Train faces directly in the Frigate UI. You don't need to pre-train a model: when a person is detected, Frigate saves the face snapshots, which you can then select and assign to a person. Training a few faces up front does help with proactive recognition, though.
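The minimal config above is enough to get going; if you want to tweak accuracy or resource use, the block takes additional options. A sketch, assuming the option names from the Frigate 0.16 reference (verify against the docs before copying):
face_recognition:
  enabled: true
  # optional: 'small' runs on CPU, 'large' uses a bigger embedding model (GPU recommended)
  model_size: small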
Step 2: Enable GenAI with Ollama in your Frigate config:
semantic_search:
enabled: true
genai:
enabled: true
provider: ollama
base_url: http://OLLAMA-IP:11434
model: llava:7b
prompt: "Describe the person succinctly: clothing, approx age, accessories, action. One short sentence."
cameras:
front_door:
genai:
enabled: true
objects: [person]
Frigate supports template variables like {camera} and {sub_label}, so if a recognised face is matched, you can pass the name into the prompt.
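As a sketch of what that can look like per camera (the prompt wording is just an example, and Frigate substitutes the variables at runtime):
cameras:
  front_door:
    genai:
      enabled: true
      objects: [person]
      prompt: "Describe {sub_label} on the {camera} camera succinctly: clothing, approx age, accessories, action. One short sentence."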
Step 3 — Home Assistant automation via MQTT:
alias: Alarm - CCTV Person (Frigate GenAI)
mode: parallel
trigger:
- platform: mqtt
topic: frigate/tracked_object_update
value_template: "{{ value_json.type }}"
payload: "description"
variables:
event_id: "{{ trigger.payload_json.id }}"
ai_msg: "{{ trigger.payload_json.description | default('Person detected.') }}"
img_url: "http://FRIGATE-IP:5000/api/events/{{ event_id }}/snapshot.jpg?bbox=1"
action:
- service: notify.mobile_app
data:
title: CCTV
message: "{{ ai_msg }}"
data:
image: "{{ img_url }}"
That’s it. Much simpler.
The only downside of this setup, compared to the HA + Proxy route, is that Frigate doesn’t natively support schedules or conditions for when detections are passed to the LLM for description. For example, I’d prefer to only generate descriptions when I'm not home. While you can of course handle the notification logic inside Home Assistant automations, Frigate will still run the descriptions in the background regardless, meaning your GPU is burning cycles on results you'll never use.
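For the notification side, at least, a condition in the Method 2 automation keeps alerts quiet while someone is home; a sketch assuming your person entities have device trackers (zone.home reports how many people are currently home):
condition:
  - condition: numeric_state
    entity_id: zone.home
    below: 1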
So which should you use?
If you’re already on Frigate 16, just use the GenAI integration. It's built in, has face recognition, and makes everything cleaner. If you’re on an older release, or you like writing your own prompts in HA, the proxy method still works and gives you more control, since you define the exact pipeline yourself.
Home Assistant AI Task
In August, Home Assistant rolled out AI Task with Ollama. You can now run not just conversational agents, but also automations that pass specific tasks to Ollama and get structured JSON back.
You could push image recognition through it, but I'm keeping Frigate's face recognition for regular CCTV tasks, since it has face libraries and training built in, which HA doesn't. Where AI Task does shine for vision is narrow, specific jobs like counting objects, as shown on the example page and in the sketch below.
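A counting task along those lines could look roughly like the step below. The attachment syntax and the camera entity are assumptions based on the AI Task documentation, so check the exact schema for your HA version:
- action: ai_task.generate_data
  data:
    task_name: count people
    instructions: >-
      Count the number of people visible in this image. Respond with the
      number only.
    entity_id: ai_task.ollama_ai_task
    structure:
      people:
        description: Number of people in the frame
        required: true
        selector:
          text: null
    # attachments lets AI Task look at an image; camera entity name is an example
    attachments:
      media_content_id: media-source://camera/camera.living_room
      media_content_type: image/jpeg
  response_variable: people_count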
Instead, I used AI Task for a voice routine. The idea is simple: the first time I walk into my office in the morning, HA greets me, tells me the weather (inside and out), highlights a few tech news headlines, and reads me my next calendar event. For voice I'm using ElevenLabs for nicer quality in this example; my other AI Task automations use Coqui TTS.
Helper
Since I only want this to fire on the first office entry of the day, the automation needs to compare today's date against the date it last ran. To do this, I created a helper in configuration.yaml that records the last entry:
input_datetime:
last_office_entry:
name: Last Office Entry
has_date: true
has_time: false
Reload helpers / restart HA.
Automation
alias: AI Morning Routine (first office entry 06:00–20:00)
description: Generates a daily greeting using an AI Task on first office entry after 06:00.
triggers:
- entity_id:
- binary_sensor.office_motion_sensor_occupancy
to: "on"
trigger: state
conditions:
- condition: time
after: "06:00:00"
before: "20:00:00"
- condition: template
value_template: >
{{ states('input_datetime.last_office_entry') !=
now().strftime('%Y-%m-%d') }}
actions:
- target:
entity_id: input_datetime.last_office_entry
data:
date: "{{ now().strftime('%Y-%m-%d') }}"
action: input_datetime.set_datetime
- data:
task_name: greeting
instructions: >-
Given the current hour in 24-hour format, generate a friendly,
context-appropriate, creative greeting for <Name> in the office. Respond
only with the greeting—no explanations, no pretext.
entity_id: ai_task.ollama_ai_task
response_variable: greeting
action: ai_task.generate_data
- data:
task_name: weather report
instructions: >+
Tell me what the weather is, jokingly. Keep it to maximum 2 sentences,
no pretext, straight to the weather report, mentioning the temperature
and humidity:
- Outdoor temperature: {{ states('sensor.openweathermap_temperature')
}}°C
- Weather condition: {{ states('weather.openweathermap') }}
- Office temperature: {{
states('sensor.office_temperature_and_humidity_temperature') }}°C
- Office humidity: {{
states('sensor.office_temperature_and_humidity_humidity') }}%
structure:
weather_description:
description: Short description of the outdoor weather and temperature
required: true
selector:
text: null
entity_id: ai_task.ollama_ai_task
response_variable: comfort_report
action: ai_task.generate_data
- data:
task_name: tech news
instructions: >-
From the list of news headlines below, select the three most important ones for <User>. Focus on wide impact and cybersecurity, privacy and important news. Use each headline only once. Write the result as two short, natural sentences, using only the article titles. Do not add explanations or prefaces.
Headlines:
{% for e in state_attr('sensor.bleepingcomputer', 'entries') or [] -%}
- {{ e.title }}
{% endfor %}
{% for e in state_attr('sensor.arstechnica', 'entries') or [] -%}
- {{ e.title }}
{% endfor %}
structure:
news:
required: true
selector:
text: null
entity_id: ai_task.ollama_ai_task
response_variable: tech_news
action: ai_task.generate_data
enabled: true
- data:
task_name: calendar
structure:
calendar:
required: true
selector:
text: null
entity_id: ai_task.ollama_ai_task
instructions: >-
Summarize the next calendar event for today in a short, natural-sounding sentence. Use the event title and the start time. Do not add explanations or prefaces.
Calendar entries:
{% set cal = states.calendar.user_work_calendar %}
{% if cal.attributes.start_time is defined %}
- {{ cal.attributes.message }} at {{ cal.attributes.start_time }}
{% else %}
- No events
{% endif %}
response_variable: calendar
action: ai_task.generate_data
enabled: true
- action: tts.clear_cache
metadata: {}
data: {}
- action: tts.speak
metadata: {}
data:
cache: false
media_player_entity_id: media_player.home_assistant_voice_media_player
message: |
{{ greeting.data }}
{{ comfort_report.data.weather_description
| replace("°C", " degrees")
| replace("%", " percent") }}
Now to the news headlines:
{{ tech_news.data.news
| regex_replace(', (?=[A-Z])', '. ')
| regex_replace('[^a-zA-Z0-9 .!?]', '')
| regex_replace('\s+', ' ')
| trim }}
Your next meeting is: {{ calendar.data.calendar }}
enabled: true
target:
entity_id: tts.elevenlabs
mode: single
Feedparser is used for the RSS news. MS365-Calendar is used for the calendar integration.
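For reference, a Feedparser sensor for one of the feeds looks roughly like this (the feed URL and interval are illustrative; its entries attribute is what the news task above loops over):
sensor:
  - platform: feedparser
    name: bleepingcomputer
    feed_url: https://www.bleepingcomputer.com/feed/
    scan_interval:
      hours: 1
    inclusions:
      - title
      - link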
So that's where things are now: from simple person detection last year, to actual descriptions this year, and now a smart voice routine layered on top. All local, no cloud, and honestly, a lot more fun to walk into the office in the morning.
This shift opens the door to a new generation of smart homes. Ones that are not only private and local-first, but also conversational, adaptive, and genuinely helpful. It's time for LLMs to make Home Assistant human-friendly.