Interactive avatar sessions now start more predictably

We shipped a set of updates that make an interactive avatar session feel more predictable from the first moment.

The biggest change is around the first message. Builders can decide whether the persona opens the conversation at all, and whether that opening greeting can be interrupted by the user.

That sounds small until you are building a real product around it.

Some sessions should begin with a warm greeting. Some should wait for the user. Some greetings should be interruptible because the user may already know what they want. Others should play to completion because the opening contains instructions, consent language, or setup context.

This release brings five practical changes:

  1. The first message can be disabled, requiring the user to start the conversation.

  2. The opening greeting can be interruptible or play to completion before user input is accepted.

  3. Tool-driven turns can now be protected from interruption while your app finishes the action.

  4. Sessions can start at high video quality with sessionOptions.videoQuality.

  5. One-shot avatar refinement now preserves flat and near-solid backgrounds more reliably in both Lab and the /v1 avatar creation flow.

An interactive avatar session opens more predictably when the builder controls who speaks first and whether the first greeting can be interrupted. Disabling the first message makes the user initiate the conversation; disabling interruption makes the greeting play to completion before the session accepts input.

Protected tool turns extend the same idea beyond the greeting. They stop interruptions while an app is still handling a tool-driven action, which is useful when the avatar needs to complete a multi-step flow before speaking again.

sessionOptions.videoQuality lets developers choose whether an avatar session starts at auto quality or is pinned to high video quality from the beginning. Pinning to high helps sessions reach their intended bitrate faster when visual quality matters from the first frame.

Why do interactive avatar session openings matter?

The first few seconds of an interactive avatar session set the tone for the whole experience.

The question is not only "does the avatar speak?" It is "should the avatar speak first, and should the user be able to interrupt?"

For a website concierge, a short greeting may help. For a kiosk, the user may need to tap or speak first. For a training simulation, the avatar may need to deliver the scenario before the learner responds. For a compliance-heavy workflow, the opening may need to play all the way through.

These are product decisions, not model decisions. The builder should own them.

The same is true for tool calls and media startup. If the first frame is slow to sharpen, the session feels less polished. If a user interrupts while the app is still resolving an action, the conversation can feel unstable. If an avatar created with a clean background suddenly picks up invented scenery during refinement, the builder loses trust in the creation flow.

We see this most clearly in live product use cases:

  1. Sales coaching agents that need to fetch CRM context before responding

  2. Training simulations that need to set the next scenario cleanly

  3. Support agents that need to check account state or ticket history

  4. Product onboarding agents that should either greet first or wait for a user action

  5. Live video agents that need to look sharp as soon as the session starts

The more interactive the product becomes, the more these timing details matter.

What changed for the first message?

Builders now have clearer control over the first message behavior in session setup.

There are two controls that matter:

| Control | What it does | When to use it |
| --- | --- | --- |
| Disable | Disables the first message, requiring the user to initiate the conversation | Kiosks, embedded product assistants, forms, and flows where the user action should come first |
| Interruptible | Lets the user interrupt the greeting by speaking | Open-ended assistants, support flows, and demos where the user may already know what they want |

When Interruptible is off, the greeting plays to completion before the session accepts input. That is useful when the first message contains instructions, context, or a required opening.

When Disable is on, there is no first message. The persona waits for the user.

This gives builders a cleaner way to shape the opening moment. You can make the session feel proactive, user-led, or instruction-led without building awkward workarounds around the first turn.
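The two controls combine into three distinct opening behaviors. Here is a minimal TypeScript sketch of that decision. The field names `disabled` and `interruptible` are hypothetical stand-ins for the Disable and Interruptible controls, not names taken from the Anam API.

```typescript
// Hypothetical config shape: field names are illustrative stand-ins for the
// Disable and Interruptible controls, not the actual Anam API.
interface FirstMessageConfig {
  disabled: boolean; // true: no greeting, the user speaks first
  interruptible: boolean; // true: the user can cut the greeting short
}

type OpeningBehavior =
  | "user-led" // persona waits for the user
  | "interruptible-greeting" // persona greets, user may interrupt
  | "uninterruptible-greeting"; // greeting plays to completion first

// Maps the two controls to the opening behavior described above.
// When the first message is disabled, the interruptible flag has no effect.
function describeOpening(config: FirstMessageConfig): OpeningBehavior {
  if (config.disabled) {
    return "user-led";
  }
  return config.interruptible
    ? "interruptible-greeting"
    : "uninterruptible-greeting";
}

// Example: a consent-heavy flow that must play its opening in full.
const consentFlow: FirstMessageConfig = { disabled: false, interruptible: false };
console.log(describeOpening(consentFlow)); // "uninterruptible-greeting"
```

Writing the mapping down this way makes one detail explicit: once the first message is disabled, the interruptible setting no longer matters.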

What changed for tool-driven turns?

Tool-driven turns can now optionally suppress interruptions while your app is still handling the action.

That matters when a tool is doing more than returning a short answer. A tool may be:

  1. Calling an external API

  2. Updating a CRM record

  3. Switching a scene or visual state

  4. Writing to a database

  5. Waiting for a human-owned system to confirm the next step

Before this release, a user interruption could arrive while that action was still in progress. In some flows that is fine. In others, it can create a messy state: the avatar has moved on, the app has not finished, and the end user hears a response that no longer matches what is happening.

Protected tool turns give builders a cleaner option. When a tool action needs to run without being interrupted, the session can hold that protection until the turn is done.

We also tightened the cleanup path. If a greeting or tool turn finishes without spoken output, interrupt protection is now released cleanly. That reduces the chance of a session getting stuck in a protected state.

This is a small product detail with a very real user-facing effect. Longer and multi-step tool flows now behave in a way builders can reason about.
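To make the protect-then-release behavior concrete, here is a rough app-side model in TypeScript. It is a sketch of the idea only: `InterruptGuard` and `runProtectedTurn` are hypothetical names, and the Anam SDK may expose protection differently. The action is synchronous to keep the example short; a real tool handler would usually be async, with the same try/finally shape.

```typescript
// Hypothetical app-side model of interrupt protection. The class and method
// names are illustrative; the Anam SDK may expose this differently.
class InterruptGuard {
  private protectedTurns = new Set<string>();

  protect(turnId: string): void {
    this.protectedTurns.add(turnId);
  }

  // Safe to call more than once for the same turn.
  release(turnId: string): void {
    this.protectedTurns.delete(turnId);
  }

  // Interruptions are accepted only when no turn holds protection.
  canInterrupt(): boolean {
    return this.protectedTurns.size === 0;
  }
}

// Runs a tool action under protection. The action returns the text to speak,
// or null when the turn produces no spoken output.
function runProtectedTurn(
  guard: InterruptGuard,
  turnId: string,
  action: () => string | null,
): string | null {
  guard.protect(turnId);
  try {
    return action();
  } finally {
    // Released on success, on a silent turn, and on error alike.
    guard.release(turnId);
  }
}

const guard = new InterruptGuard();
const spoken = runProtectedTurn(guard, "crm-update", () => null); // silent turn
console.log(spoken, guard.canInterrupt()); // null true
```

Releasing in `finally` is what keeps a silent or failed turn from leaving the session stuck in a protected state.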

What changed for initial video quality?

Developers can now set sessionOptions.videoQuality to high or auto.

The default auto mode still lets the session adapt. The new high option starts the session at the maximum video bitrate instead of ramping up from the default profile.

This is useful when the opening visual quality matters:

  1. A sales or onboarding agent joining a live call

  2. A product demo where the avatar is the first thing the user sees

  3. A kiosk or lobby screen where the session starts in front of a customer

  4. A training simulation where the avatar needs to feel credible immediately

Anam sessions run over real-time media, so the first seconds involve the same practical constraints that matter in any browser video experience. The WebRTC API is designed for live audio and video in the browser, and Anam builds on that kind of real-time streaming foundation so avatars can respond without pre-rendered playback.

For builders, the point is not to tune media internals by hand. The point is having one clean option when you know the session should start sharp.
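As a small illustration, the option can be modeled as a two-value setting. Only `sessionOptions.videoQuality` and its `auto` and `high` values come from this release. The type names and the `buildSessionOptions` helper are hypothetical, and where the object is passed depends on how your app creates sessions.

```typescript
// The videoQuality field and its "auto" / "high" values come from this
// release. Everything else here (type names, the helper) is hypothetical.
type VideoQuality = "auto" | "high";

interface SessionOptions {
  videoQuality: VideoQuality;
}

// Picks "high" for visual-first sessions and leaves others adaptive.
function buildSessionOptions(visualFirst: boolean): SessionOptions {
  return { videoQuality: visualFirst ? "high" : "auto" };
}

const sessionOptions = buildSessionOptions(true);
console.log(JSON.stringify({ sessionOptions }));
// {"sessionOptions":{"videoQuality":"high"}}
```

Typing the value as a union means a typo such as "hi" fails at compile time rather than at session start.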

What changed in Lab?

New personas and built-in agent templates now default to GPT OSS 120B instead of GPT OSS 20B.

This improves reasoning quality and tool use out of the box. Builders starting from a template should get a stronger baseline before they have written any custom instructions or changed model settings.

We also fixed a one-shot avatar refinement issue in Lab. When a builder created an avatar with a plain or near-solid background, Gemini refinement could sometimes add scenery, textures, or objects that were never requested.

That is the wrong kind of surprise.

One-shot creation is most useful when the builder can trust the input. If the background is clean, it should stay clean. This fix now preserves those backgrounds more reliably during refinement.

If you want the broader story behind one-shot creation, we wrote about turning one photo into a live interactive AI persona and the model work behind Cara 3 interactive avatars.

What changed in the SDK and API?

The same background-preservation fix now applies to the /v1 avatar creation flow. API-created avatars are less likely to pick up hallucinated scenery during refinement when the input has a plain or near-solid background.

That matters for teams creating avatars programmatically. If your product lets users upload profile photos, generate brand characters, or build training personas at scale, predictable refinement is part of the product contract.

The media update also lands in the SDK/API layer through sessionOptions.videoQuality.

At a high level, the choice is:

| Option | What it does | When to use it |
| --- | --- | --- |
| auto | Lets the session use the default adaptive profile | General sessions where network conditions vary |
| high | Starts the session at maximum video bitrate | Demo, onboarding, kiosk, sales, and other visual-first sessions |

For teams already building with LiveKit, Pipecat, or custom WebRTC flows, this gives more control over the visual start state while keeping Anam responsible for the avatar media layer.

How does this affect builders?

The release is mostly about giving builders more control over the beginning of a live session.

For app builders, that means:

  1. The persona can greet first, or wait for the user.

  2. The greeting can be interruptible, or play to completion.

  3. Tool flows can run without being interrupted mid-action when that is the right behavior.

  4. Protected turns release cleanly when there is no spoken output.

  5. Sessions can start at high video quality when first impression matters.

  6. New Lab templates start from a stronger default model.

  7. One-shot avatars with clean backgrounds are more likely to stay clean.

None of these changes asks builders to rethink their architecture. They make the existing architecture easier to trust.

That has been an ongoing theme for us. The hard part of real-time interactive avatars is not just generating a face that looks good. It is making the full session behave predictably while speech, tools, media, app state, and user interruptions are all happening at once.

For related engineering patterns, see our posts on adding a face to an ElevenLabs voice agent, LiveKit avatar agents, adaptive bitrate streaming for avatars, and what interactive avatars mean for businesses.

What should you try first?

Start with the first message.

Ask which opening behavior your product needs:

  1. Should the avatar greet first, or should the user start?

  2. If the avatar greets first, can the user interrupt?

  3. Does the greeting contain instructions or consent language that needs to play to completion?

  4. Does your app need to fetch context before the conversation begins?

  5. Is the avatar the first visual element the user sees?

If you are building a tool-heavy avatar experience, test protected turns on the longest action in your flow.

Good candidates:

  1. Account lookups

  2. Booking flows

  3. Scene changes

  4. CRM updates

  5. Payment or eligibility checks

  6. Any action where the avatar should not move on until the app is done

If your first frame matters, try sessionOptions.videoQuality: "high" and compare the opening few seconds against auto.
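One way to make that comparison less subjective is to record bitrate samples for each run and check how long each takes to reach a target. The TypeScript helper below is a hypothetical sketch. How you collect the samples depends on your own setup, it assumes samples are ordered by time, and the numbers in the example are placeholders rather than measurements.

```typescript
// Hypothetical measurement helper: how you collect bitrate samples depends
// on your own setup and is not covered here.
interface BitrateSample {
  elapsedMs: number; // time since session start
  kbps: number; // observed video bitrate
}

// Returns the elapsed time of the first sample at or above the target
// bitrate, or null if the target was never reached.
// Assumes samples are ordered by elapsedMs.
function timeToTargetMs(
  samples: BitrateSample[],
  targetKbps: number,
): number | null {
  for (const sample of samples) {
    if (sample.kbps >= targetKbps) {
      return sample.elapsedMs;
    }
  }
  return null;
}

// Placeholder values for illustration only, not real measurements.
const exampleSamples: BitrateSample[] = [
  { elapsedMs: 0, kbps: 300 },
  { elapsedMs: 1000, kbps: 900 },
  { elapsedMs: 2000, kbps: 2000 },
];
console.log(timeToTargetMs(exampleSamples, 1500)); // 2000
```

Running the same helper over a high run and an auto run gives two numbers you can compare directly.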

And if you are generating avatars from clean input images, run a few one-shot refinements again. The background behavior should now be much closer to what you intended.

The goal is straightforward: builders should be able to create interactive AI avatars that start cleanly, speak at the right time, and keep the visual identity they were given.

Frequently asked questions

What does disabling the first message do?

Disabling the first message means the persona does not generate or play an opening greeting. The user must initiate the conversation before the avatar responds.

What does the Interruptible setting do?

The Interruptible setting controls whether a user can interrupt the opening greeting by speaking. When it is disabled, the greeting plays to completion before the session accepts user input.

What are protected tool turns?

Protected tool turns let a session suppress interruptions while an app is still handling a tool-driven action. They are useful for longer or multi-step flows where the avatar should wait until the action is complete before moving on.

Why does initial video quality matter for interactive avatars?

Initial video quality matters because the first few seconds shape how polished and trustworthy a live avatar session feels. sessionOptions.videoQuality: "high" helps a session start at the intended bitrate faster when visual quality matters immediately.

What does sessionOptions.videoQuality do?

sessionOptions.videoQuality controls whether an Anam session starts in auto video quality or pins the session to high quality. Use high for visual-first sessions such as demos, onboarding, kiosks, and sales conversations.

What changed in Anam Lab?

New personas and built-in agent templates now default to GPT OSS 120B, giving builders stronger reasoning and tool use from the start. Lab also preserves plain and near-solid one-shot avatar backgrounds more reliably during refinement.

Does the avatar background fix apply to the API?

Yes. The same background-preservation fix now applies to the /v1 avatar creation flow, so API-created avatars are less likely to pick up hallucinated scenery during refinement.

How do these updates help Anam developers?

These updates make Anam sessions easier to reason about during tool calls, session startup, and one-shot avatar creation. Developers get more control over the opening experience without adding extra state handling in their own apps.
