I let a local LLM take control of my video doorbell, and it's probably the future of smart cameras



Some Ring doorbells use AI features to interact with visitors when you’re not there. I ditched my Ring doorbell and bought a Reolink doorbell, which runs entirely locally, and wondered if I could recreate similar functionality using a local LLM. The attempt was partially successful.

What I wanted the doorbell to do

An AI-powered concierge

A woman and a man ringing the doorbell. Credit: Ring

This idea seemed quite plausible. When someone rings the doorbell and Home Assistant detects that no one is home, the doorbell explains to the caller that everyone is out and asks for their name and the reason for the call. It then needs to listen to the response, process its content, and reply accordingly.

With a cloud-based LLM, this is a realistic goal. Text-to-speech and speech-to-text conversion are easy with cloud services, and the centrally hosted LLM takes what the caller says as input and generates the response the doorbell speaks.

I knew this would be more difficult with a local LLM. My relatively weak hardware can only run small models, and those models may not be up to the job. Still, I thought it was worth a try to see if I could do everything locally.

Reolink Wi-Fi video doorbell.

Resolution: 2K

Power supply: Battery

Reolink’s battery-powered Wi-Fi video doorbell is the perfect way to know who’s outside. With 2K resolution and a 150°x150° head-to-toe view, this video doorbell can be powered by battery or wired depending on your existing setup.


The setup

TTS out, Whisper in, Ollama in the middle

There were three main components needed to make this work. I needed a text-to-speech (TTS) method so the doorbell could speak aloud to the caller. I needed a speech-to-text (STT) method so that anything the caller said could be converted to written text and passed to the LLM. And I needed a way to run a local LLM, which would be the brains of the entire operation.
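Wired together, the three components form a simple loop. Here is a minimal Python sketch of that loop; the three callables are hypothetical stand-ins, since in practice each step is a Home Assistant service call rather than a direct function:

```python
# Hypothetical sketch of the doorbell pipeline: Piper speaks, Whisper
# listens, and the LLM decides what to say. The tts/stt/llm callables
# stand in for the real Home Assistant service calls.

def handle_ring(tts, stt, llm):
    """Run one concierge exchange when the doorbell is pressed."""
    greeting = ("Nobody is home right now. "
                "Please say your name and the reason for your visit.")
    tts(greeting)            # Piper: text -> speech on the doorbell speaker
    caller_text = stt()      # Whisper: doorbell audio -> text
    reply = llm(
        "You are answering a doorbell for an absent homeowner. "
        f"The visitor said: '{caller_text}'. Reply in one short sentence."
    )
    tts(reply)               # speak the LLM's reply back to the visitor
    return caller_text, reply
```

With stub functions substituted for the real services, you can dry-run the flow: the greeting plays, the answer is transcribed, and the model's reply is spoken.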

Thankfully, Home Assistant has some great options for each of these components. Piper is a local TTS engine that converts written text into speech that can be played from your doorbell. It runs completely locally and is lightweight enough to run on a Raspberry Pi 4.


Whisper provides the equivalent local STT component. It takes the audio the doorbell records while the caller is speaking and converts it to text that can be passed to the LLM. Again, it runs entirely locally, which was the whole point of this project.
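As a rough illustration (assuming the openai-whisper Python package; the model size and file path are placeholders), transcription boils down to a couple of calls, after which the per-segment output can be flattened into a single message string:

```python
# Hypothetical sketch: transcribe the doorbell clip with the
# openai-whisper package, then flatten the result into one string.
#
#   import whisper
#   model = whisper.load_model("tiny")        # small enough for weak hardware
#   result = model.transcribe("doorbell_clip.wav")

def flatten_transcript(result):
    """Join Whisper's per-segment text into one clean message string."""
    segments = result.get("segments") or []
    parts = [seg["text"].strip() for seg in segments if seg["text"].strip()]
    # fall back to the top-level "text" field if there are no segments
    return " ".join(parts) or result.get("text", "").strip()
```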

The final piece of the puzzle is Ollama, a tool that lets you run large language models on your own hardware. There is a Home Assistant integration that connects Ollama to Home Assistant.
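Ollama exposes a small HTTP API (by default on port 11434), and the Home Assistant integration talks to it for you; you can also hit it directly. A hedged sketch, where the model name is just whatever small model fits your hardware:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3.2:1b"):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # stream=False asks for one JSON object instead of a line-delimited stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt, model="llama3.2:1b"):
    """POST the prompt to a local Ollama server and return its reply text."""
    body = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```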

The bottleneck is which LLM your hardware can actually run. Weaker hardware can only manage smaller, less capable models, and the larger the model, the slower it responds. I had to use a fairly small model so that generating a response didn’t take too long.

Reality didn’t match my hopes

A good concept, but poor execution

Reolink video doorbell in the rain. Credit: Reolink

It took a while to set everything up. As always with Home Assistant, most of the hard work had already been done by other people. There’s a handy GitHub Gist explaining how to play audio and TTS through a Reolink doorbell that I found very helpful.

I had an issue where the audio capture would start while the audio greeting from the doorbell was still playing, which messed things up, but I eventually found a way around it.
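My workaround was, in essence, to delay the start of recording until the greeting should have finished playing, so the doorbell doesn’t transcribe its own voice. A crude sketch of the idea (the words-per-second rate is an assumption you’d tune for your Piper voice, not a measured value):

```python
def greeting_delay(text, words_per_second=2.5, padding=1.0):
    """Estimate how long to wait (in seconds) before starting audio
    capture, so the doorbell's own greeting isn't recorded and fed
    back into the transcription."""
    words = len(text.split())
    return words / words_per_second + padding
```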

The first part of my idea worked. When the doorbell is pressed, the LLM generates an audio greeting and plays it through the doorbell’s speaker, explaining that everyone is out and asking for the caller’s name and the purpose of the call.

The doorbell then records the voice response and STT converts it to text. So far, so good.

The problem came when I tried to have a two-way conversation with the AI-powered doorbell: it didn’t work. The small LLM got confused, started talking nonsense, and took too long to respond.

This concept could work much better with a strong enough LLM running the show. But until I win the lottery, I’ll stick with what I have.

I built a viable alternative

It’s actually a pretty solid setup.

A notification that forwards messages left on your video doorbell.

The main problem was trying to converse with the caller, so I simply cut that part out of the process. Instead, the caller gives their name and reason for calling, STT converts it to text, and that text is sent to my phone as a notification. The doorbell then announces that it will pass on the message and ends the exchange.
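The notification step itself is little more than string assembly handed to a notify service. A sketch of that assembly (the title and fallback wording are my own placeholders; in Home Assistant the resulting dict would go to something like a `notify.mobile_app_*` service call):

```python
def doorbell_notification(transcript, max_len=200):
    """Turn the Whisper transcript into a short phone notification."""
    message = transcript.strip() or "Someone rang, but no message was captured."
    if len(message) > max_len:
        # keep notifications short enough to read on a lock screen
        message = message[: max_len - 1].rstrip() + "…"
    return {"title": "Doorbell message", "message": message}
```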

This means that every time someone rings the doorbell while I’m out, I get a notification telling me who it is and why they’re calling. It works reasonably well most of the time, though occasionally a slightly hilarious notification arrives when something goes wrong. In most cases, it’s a genuinely useful feature.


This is the direction the world is heading

The current trend is that AI is being introduced into everything, and it doesn’t seem like it’s going to slow down anytime soon. Ring’s AI-powered concierge is convenient, but it doesn’t have a great reputation when it comes to privacy. The good news is that you can recreate at least some of these features completely locally with little effort.


