Anam avatars join Stream's Vision Agents


You can now drop an Anam avatar into a Stream Vision Agents pipeline in two lines of Python and have it join a live video call with a face, a voice, and its own visual world. Picture an assistant who steps into a kitchen when you ask about dinner and into a studio when you ask for the weather. That's what this integration is for.

We worked with the Stream team to land the AnamAvatarPublisher inside their Vision Agents framework, and then built a cookbook recipe to show off what the pair can do together. The short version: Vision Agents gives you the agent loop and the call, Anam gives you the interactive avatar, and the two compose cleanly.

Your face, your agent, one image

The thing people tend to underestimate about Anam is that the avatar is yours. You don't pick from a fixed library and hope one of them fits the brand. You upload a single photo, or describe the person you want, and Anam generates a custom avatar driven by the Cara model in Anam Lab. The same one-shot flow is what powered this recipe: we wanted a character that could stand in front of a chroma-key background, so we asked Lab for a time traveller on a pure green backdrop and had a usable avatar on the first try.

During the call, every frame of that face is generated live from the agent's speech. Not playback of pre-rendered clips. Lip sync, head movement, expressions, all at sub-second latency, so the exchange feels like a conversation rather than a turn-taking demo. 50+ languages ship in the box.

That matters here because the moment you plug Anam into a Vision Agents pipeline, the agent stops being a voice and starts being a presence. Users talk to a face. The independently verified realism of the Cara model is part of why we think the experience lands.

Two lines into a Stream call

One line to declare the avatar, one line to plug it into the Agent. That is the whole integration:

avatar = AnamAvatarPublisher()
agent = Agent(..., processors=[avatar])

Instantiate the publisher, list it as a processor on the Agent, pick your favourite LLM and speech providers, and Stream handles the transport. Run the example and a video call opens in your browser with the avatar already in it. No WebRTC wiring, no signalling code, no audio sync to fight with. The Stream examples are genuinely some of the cleanest "first run and it works" onboarding we've seen.

From there, everything about the agent is pluggable. Swap Gemini for Claude or GPT, Deepgram for ElevenLabs, the face for a different persona you built in the Lab. The shape of the code stays the same.
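
As a rough sketch of what that swap looks like in code (the llm, stt, and tts keyword arguments below are illustrative assumptions rather than the exact Vision Agents signature; the only parts taken from above are AnamAvatarPublisher and processors=[avatar]):

# Illustrative only: provider objects and keyword names are placeholders.
avatar = AnamAvatarPublisher()
agent = Agent(
    llm=...,              # e.g. a Gemini, Claude, or GPT provider object
    stt=...,              # e.g. Deepgram
    tts=...,              # e.g. ElevenLabs
    processors=[avatar],  # the avatar is just another processor in the pipeline
)

Changing a provider changes one argument; the processors list, and everything the avatar does downstream, stays put.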

A dynamic background, steered by the conversation

To show what's possible once you're past the baseline, we put a recipe in the cookbook: Vision Agents + Anam dynamic background switching. The runnable code lives in anam-cookbook/examples/vision-agents-anam-dynamic-background.

The experience: you ask the agent for cooking help, it slips into a kitchen and walks you through a recipe. You ask about the weather, it's suddenly in a studio. When you start a new turn, it resets to a neutral office. The scene changes are never announced; the backdrop just matches the topic.

There are two hooks pulling this off, and both come straight from what Vision Agents already exposes.

The first is tool calling. Vision Agents lets the LLM register functions with a decorator, and the tool body can do anything before returning. So a provide_cooking_instructions tool calls avatar.set_scene("kitchen") as a side effect, then returns the recipe. The LLM never needs to know the avatar has scenes. It just picks the tool that answers the question, and the visual context follows.
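
A minimal sketch of that pattern, assuming a tool-registration decorator along these lines (the decorator name is a placeholder for whatever Vision Agents exposes; set_scene is the method on the avatar subclass described below):

@llm.register_function()  # placeholder name for the Vision Agents tool decorator
def provide_cooking_instructions(dish: str) -> str:
    """Answer a cooking question; the scene switch is a side effect."""
    avatar.set_scene("kitchen")  # move the avatar in front of the kitchen backdrop
    return f"Here's a simple way to make {dish}: ..."

From the LLM's point of view this is just the tool that answers cooking questions; the backdrop change rides along for free.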

The second is transcript and turn events. A subscriber on STTTranscriptEvent can peek at what the user just said and switch scenes based on keywords before the LLM has even replied. A subscriber on TurnStartedEvent resets the background to neutral the moment the user starts a new turn. You end up with predictable transitions without writing a state machine.
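
In sketch form, with the subscription mechanism and event fields as assumptions (STTTranscriptEvent and TurnStartedEvent are the real event names; how handlers attach and which attributes the events carry is down to the Vision Agents docs):

SCENE_KEYWORDS = {"cook": "kitchen", "recipe": "kitchen", "weather": "studio"}

def on_transcript(event):              # subscribed to STTTranscriptEvent
    text = event.text.lower()          # assumed attribute holding the transcript
    for keyword, scene in SCENE_KEYWORDS.items():
        if keyword in text:
            avatar.set_scene(scene)    # switch before the LLM has even replied
            break

def on_turn_started(event):            # subscribed to TurnStartedEvent
    avatar.set_scene("office")         # back to neutral as the user starts a new turn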

The background replacement itself is a small custom processor. We subclass AnamAvatarPublisher, override the video receiver, and run a chroma-key mask that swaps the green pixels for the current scene image before the frame is written back to the call. It's deliberately short, 30 or so lines, so it's easy to swap out for a smarter compositor if the product calls for one.
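
The masking step itself is plain NumPy. A minimal sketch of the idea, with illustrative thresholds (the recipe runs something like this per frame inside the AnamAvatarPublisher subclass before handing the frame back to the call):

import numpy as np

def chroma_key_composite(frame: np.ndarray, scene: np.ndarray) -> np.ndarray:
    # frame and scene are same-shape RGB uint8 arrays.
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    # A pixel counts as green screen when green clearly dominates red and blue.
    mask = (g > 100) & (g > r + 40) & (g > b + 40)
    out = frame.copy()
    out[mask] = scene[mask]            # drop the scene image into the keyed region
    return out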

What this is actually useful for

The cooking demo is fun. The pattern behind it is what ships products.

Once the avatar is a regular processor in a pipeline, per-frame media processing becomes a seam you own. A retail kiosk that changes the backdrop to the product the shopper is asking about. A tutor whose scene shifts with the subject on screen. A concierge that adjusts the lighting of the room to the time of day. A demo booth whose avatar wears different outfits for different demos. None of that requires a fork of Vision Agents or Anam. It's the same two-line integration plus whatever your product needs layered on top.

We are most excited about what partners and customers will build on this. Stream's pipeline ergonomics plus Anam's custom avatars is a good combination of raw materials.

Try it

git clone https://github.com/anam-org/anam-cookbook.git
cd anam-cookbook/examples/vision-agents-anam-dynamic-background
uv sync
cp .env.example .env
uv run python main.py run

Fill in the .env with keys from Anam Lab, Stream, Google AI Studio, and Deepgram. Open the call URL the CLI prints, and ask for a quick pasta recipe. Then ask for the weather in Amsterdam. Watch the scene follow the conversation.

Anam pricing is per minute of avatar streaming. If you want to build a custom face for your brand, the API covers persona creation and the one-shot avatar flow end to end.


Frequently asked questions

What is Vision Agents?

Vision Agents is Stream's open-source Python framework for building real-time voice and video AI agents. It handles the agent loop, call transport, and a pluggable pipeline of STT, LLM, TTS, and media processors so you can swap components without rewiring the call.

Why add a face to an AI voice agent?

A live face turns an assistant into a presence a user can read and trust. Lip sync, expression, and turn-taking cues carry meaning that voice alone cannot, which raises comprehension and engagement in live conversations.

How do you add an Anam avatar to a Stream Vision Agents pipeline?

Instantiate AnamAvatarPublisher() and pass it in the processors list when you create the Vision Agents Agent. That is the full baseline integration, and the avatar joins the Stream call automatically with the agent's voice driving the face.

Can you create a custom AI agent avatar from a single photo?

Yes. Anam Lab generates a fully custom avatar from a single uploaded photo or a text prompt using the Cara model, and that avatar can be referenced by ID in any Vision Agents or Stream pipeline.

Can the avatar's background change during a conversation?

Yes. The Anam cookbook recipe shows the avatar switching to a kitchen scene for cooking questions and a studio scene for weather questions using Vision Agents tool calls and transcript events, then resetting on the next user turn.

What does it cost to run an Anam avatar on Stream?

Anam charges per minute of avatar streaming, which is detailed on the Anam pricing page. Stream, the LLM, STT, and TTS providers you choose each bill separately through their own accounts.

