Building an interactive avatar that understands emotion
Most voice agents are still blind. They hear words, convert them to text, run that text through an LLM, and read the answer back. An interactive avatar can make the agent visible, but the next useful step is making it aware of what is happening on the user's side of the call.
Stream built a demo with Vision Agents and Inworld that does exactly that, using Anam as the live avatar layer: emotional-support.visionagents.ai.
The agent watches the user's face in real time, classifies emotion, gaze, and engagement from the video feed, and uses that state to decide what to say and how to say it. If you look sad, the voice slows down and softens. If you look excited, the delivery picks up. If you drift off-camera for a while, it nudges you back instead of sitting there doing nothing.
That sounds like a small UX detail until you try it. Then the standard voice-agent loop starts to feel very flat.
How does an interactive avatar use emotion and context?
The stack has seven moving parts:
Vision Agents for orchestration
MediaPipe for face tracking
Gemini for the LLM
Deepgram for speech-to-text
Inworld TTS-2 for expressive speech
Anam for the lip-synced avatar
Stream for the real-time media transport
The nice bit is that it still reads like one agent:
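Something like the following, where the module and constructor names are illustrative rather than the exact Vision Agents API (the providers are the ones listed above, AnamAvatarPublisher is the integration named below, and the Stream call setup is omitted):

# Illustrative composition sketch, not the exact Vision Agents constructor.
agent = Agent(
    llm=GeminiLLM(),                          # reasoning and responses
    stt=DeepgramSTT(),                        # user speech -> text
    tts=InworldTTS(model="inworld-tts-2"),    # directed, expressive speech
    processors=[
        FaceStateProcessor(fps=8),            # MediaPipe face tracking (below)
        AnamAvatarPublisher(),                # lip-synced avatar video track
    ],
)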
That is the point of Vision Agents. The framework treats speech, video, and processors as first-class parts of the same runtime. You can plug in any LLM, STT, TTS, or vision model, then run per-frame processing alongside the conversation loop.
Anam sits in that same processor list. The Anam Vision Agents integration receives the agent's audio output, generates lip-synced video, and publishes the avatar back into the Stream call. In the baseline setup, adding a face is one processor. In this demo, that face is also responding to live perception context.
This is the same general pattern behind building an AI voice agent with a face, but with a vision loop added on the user's side.
When should you use Stream versus Anam?
This demo is a good example of where the boundary sits.
Use Stream and Vision Agents when the hard part is the real-time agent runtime: getting audio and video into the call, running processors over the user's camera feed, routing events, handling the agent loop, and keeping the session low-latency over WebRTC.
Use Anam when the agent needs a live face. Anam takes the agent's speech output and generates a real-time, lip-synced avatar video track that can be published back into the call.
So in this demo, Stream gives the agent eyes and transport. Vision Agents gives the agent a composable runtime. Inworld gives the agent an expressive voice. Anam gives the agent a visible body in the conversation.
Those are complementary layers, not competing choices. If you are building a vision-heavy agent with object detection, face tracking, pose detection, or live video processing, start with Stream's Vision Agents. If that agent should appear as a real-time interactive AI avatar, add Anam as the avatar publisher.
What changes when the voice can be directed?
The most obvious difference is the voice.
Inworld TTS-2 supports natural-language steering. Instead of choosing from a tiny fixed set of emotional labels, the LLM can write a short director's note before the line:
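The exact wording is up to the model; the shape is a short instruction in front of the line it is about to deliver, along these illustrative lines:

Director's note: say this gently, at a slower pace, with a warm and steady tone.
Line: "That sounds like a lot to carry. Take your time, I'm not going anywhere."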
Those notes are not UI labels. They are instructions to the speech model. The more specific the note, the more control you get over mood, pace, pitch, and vocal texture.
Inworld's model page lists Realtime TTS-2 as inworld-tts-2, with natural-language steering, real-time optimization, and 100+ language support in research preview. That matters here because the LLM is not just producing words. It is producing a spoken performance.
The model can also render inline non-verbal sounds, and it picks up emphasis from capitalization.
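An illustrative line combining both; the exact tag names come from Inworld's documentation, so treat these as placeholders:

[sigh] Okay. Let's take this ONE step at a time, not all of it at once.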
The trick is not sprinkling tags randomly. The tag should come from the user's current state. Sad and looking down should not get the same delivery as happy and looking into the camera. The face tracker gives the LLM that missing signal.
How does the face tracker feed the agent?
MediaPipe runs as a Vision Agents video processor.
Google's Face Landmarker can process live video frames and output face landmarks, 52 blendshape scores, and facial transformation matrices. That is plenty for this demo. It does not need clinical emotion recognition. It needs coarse, stable labels that an LLM can act on.
The processor extends VideoProcessor, attaches to the shared video forwarder, and runs at 8 fps:
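The MediaPipe side of that uses the standard Face Landmarker task API. The VideoProcessor wrapper and the 8 fps sampling are Vision Agents-specific, so only the per-frame detection is sketched here:

import mediapipe as mp

BaseOptions = mp.tasks.BaseOptions
FaceLandmarker = mp.tasks.vision.FaceLandmarker
FaceLandmarkerOptions = mp.tasks.vision.FaceLandmarkerOptions
RunningMode = mp.tasks.vision.RunningMode

options = FaceLandmarkerOptions(
    base_options=BaseOptions(model_asset_path="face_landmarker.task"),
    running_mode=RunningMode.VIDEO,
    output_face_blendshapes=True,                # 52 blendshape scores
    output_facial_transformation_matrixes=True,  # head pose for gaze
    num_faces=1,
)
landmarker = FaceLandmarker.create_from_options(options)

def analyze_frame(rgb_frame, timestamp_ms):
    # rgb_frame: an RGB numpy array pulled from the shared video forwarder.
    image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
    result = landmarker.detect_for_video(image, timestamp_ms)
    if not result.face_blendshapes:
        return None  # no face in this frame
    # Map blendshape name -> score, e.g. {"mouthSmileLeft": 0.62, ...}
    return {bs.category_name: bs.score for bs in result.face_blendshapes[0]}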
Each frame goes through the same small pipeline:
Convert the frame to RGB
Run Face Landmarker
Extract blendshape coefficients
Classify emotion, gaze, and engagement
Store the current facial state
Emit an event when that state changes
The output is deliberately small:
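The exact field names are not the point; the shape is roughly:

{
    "emotion": "happy",        # neutral, happy, sad, surprised
    "gaze": "at camera",       # at camera, away, down
    "engagement": "engaged",   # engaged, distracted, off-camera
}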
The emotion classifier is threshold-based. A smile is (mouthSmileLeft + mouthSmileRight) / 2 above 0.45. Surprise fires when browInnerUp or jawOpen crosses 0.55. Gaze combines head-pose yaw and pitch with eye-look blendshapes. Engagement is derived from gaze and time.
There are two boring details that make this usable.
First, the system smooths the blendshape stream. Raw facial signals flicker. A single frame where the mouth twitches should not make the agent switch from neutral to happy and back again.
Second, the classifier uses hysteresis. A smile has to cross 0.45 to enter happy, but it only has to fall below 0.30 to leave it. A new emotion also needs to persist for four consecutive frames, which is about half a second at 8 fps.
Without that, the LLM would see a different user state every turn and the voice would bounce all over the place. This is one of those little pieces that is invisible when it works and very visible when it does not.
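Here is a minimal, self-contained sketch of that smoothing, hysteresis, and persistence logic, using the thresholds quoted above (the structure is illustrative, not the demo's exact code):

class EmotionClassifier:
    ENTER_HAPPY, EXIT_HAPPY = 0.45, 0.30  # hysteresis band for "happy"
    SURPRISE = 0.55                       # browInnerUp or jawOpen threshold
    PERSIST_FRAMES = 4                    # ~0.5 s at 8 fps

    def __init__(self, alpha=0.4):
        self.alpha = alpha                # EMA smoothing factor
        self.smoothed = {}                # blendshape name -> smoothed score
        self.current = "neutral"
        self.candidate, self.streak = "neutral", 0

    def update(self, blendshapes):
        # 1. Smooth the raw blendshape stream so single-frame twitches vanish.
        for name, score in blendshapes.items():
            prev = self.smoothed.get(name, score)
            self.smoothed[name] = self.alpha * score + (1 - self.alpha) * prev

        smile = (self.smoothed.get("mouthSmileLeft", 0.0)
                 + self.smoothed.get("mouthSmileRight", 0.0)) / 2

        # 2. Hysteresis: harder to enter "happy" than to stay in it.
        threshold = self.EXIT_HAPPY if self.current == "happy" else self.ENTER_HAPPY
        label = "happy" if smile > threshold else "neutral"
        if max(self.smoothed.get("browInnerUp", 0.0),
               self.smoothed.get("jawOpen", 0.0)) > self.SURPRISE:
            label = "surprised"
        # Sad and the other labels follow the same pattern.

        # 3. Persistence: a new label must hold for four consecutive frames.
        if label == self.candidate:
            self.streak += 1
        else:
            self.candidate, self.streak = label, 1
        if label != self.current and self.streak >= self.PERSIST_FRAMES:
            self.current = label
        return self.current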
How does facial state reach the LLM?
The agent subclasses Agent and overrides simple_response.
Before the user text reaches the LLM, the agent reads the latest facial state and prefixes the transcript with a short context fragment:
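Roughly this shape; the simple_response signature and the state accessor are assumptions, and only the override-and-prefix pattern comes from the demo:

class EmotionalSupportAgent(Agent):
    async def simple_response(self, text, *args, **kwargs):
        # Prefix the transcript with the latest facial state so the LLM
        # sees how the user looks, not just what they said.
        state = self.face_processor.current_state  # accessor name illustrative
        if state is not None:
            text = f"{state.describe()} {text}"    # describe() sketched below
        return await super().simple_response(text, *args, **kwargs)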
So the model does not just see the bare transcript of what the user said. It sees that transcript prefixed with a short line describing how the user currently looks and how engaged they are.
That one line changes the response. The system prompt teaches the agent how to use the signal without being creepy about it. If the user looks sad and is looking down, answer briefly and gently. If the user looks happy and engaged, match the energy. If the user looks distracted, keep it short.
The state object converts itself into natural language:
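A plausible version of that conversion, assuming the coarse labels described above:

from dataclasses import dataclass

@dataclass
class FacialState:
    emotion: str      # e.g. "sad"
    gaze: str         # e.g. "down"
    engagement: str   # e.g. "disengaged"

    def describe(self) -> str:
        # One short sentence the agent can prepend to the transcript.
        return (f"[The user currently looks {self.emotion}, is looking "
                f"{self.gaze}, and seems {self.engagement}.]")

So a withdrawn user turns into something like "[The user currently looks sad, is looking down, and seems disengaged.]" sitting in front of their own words.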
This is basically context injection, but coming from the camera instead of the product UI. The agent should know what matters without forcing the user to narrate everything manually.
What makes proactive re-engagement feel less awkward?
The agent also reacts to silence.
If the user has been distracted or off-camera for five seconds and the agent is not already speaking, it fires a proactive cue:
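The check itself can be a small periodic task along these lines; the agent methods and timing fields here are illustrative, not the demo's exact API:

import time

REENGAGE_AFTER_SECONDS = 5.0

async def maybe_reengage(agent, face_state):
    # Nudge only when the user has been distracted or off-camera for a while
    # and the agent is not already mid-sentence.
    away_for = time.monotonic() - face_state.last_engaged_at
    if away_for >= REENGAGE_AFTER_SECONDS and not agent.is_speaking:
        await agent.respond_proactively(
            "The user seems distracted or away. Gently check in or offer a "
            "light change of topic. Do not mention that they looked away."
        )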
The important instruction is what the agent should not do.
It should not say, "I notice you looking away." That feels like surveillance. The camera signal is for the model, not for the model to repeat back to the user.
The better version is softer: a gentle check-in or a light change of topic. Or, if the user looks distracted, a short, easy question that invites them back in without commenting on where their attention went.
This is the line to keep in your head when building emotion-aware agents: perception should shape behavior, not become dialogue.
That is also why the visual layer matters. A voice-only agent can say the right words and still feel disconnected. A real-time interactive AI avatar gives the response a face, lip sync, micro-movement, and presence inside the call.
Why does Anam fit this stack?
Anam is the avatar renderer in the loop.
The agent produces audio. Anam's Cara model turns that audio into live video: synchronized lips, head motion, expression, and eye behavior. The generated track is published back into the call so the user sees the agent speaking in real time.
In Vision Agents, that is just another processor:
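In the composition sketch from earlier, that is one more entry in the processors list. AnamAvatarPublisher comes from the Anam integration; the surrounding names remain illustrative:

agent = Agent(
    # ... llm, stt, tts as before ...
    processors=[
        FaceStateProcessor(fps=8),   # perception: the user's face
        AnamAvatarPublisher(),       # presence: the agent's face
    ],
)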
This is the same reason Anam works cleanly with other voice-agent stacks. The face layer does not need to own the brain. It can sit on top of Gemini, OpenAI, Mistral, Deepgram, Inworld, ElevenLabs, LiveKit, Pipecat, or whatever your agent already uses.
If you have a working voice agent, the avatar should be an interface layer, not a rewrite. Related Anam posts cover the same pattern for ElevenLabs voice agents, LiveKit avatar agents, and Pipecat avatar agents.
The new thing in this demo is that the user-facing video feed is no longer just transport. It becomes context.
Where does this pattern actually help?
The demo is framed as an emotional-support character. That is a good way to test the loop because tone matters immediately. If the agent misses the mood, you feel it.
But the pattern is bigger than companionship.
Sales coaching. An avatar watches a practice pitch, notices nervousness or disengagement, and gives feedback in a tone the learner can actually take in. If someone looks anxious, start with what worked before moving into corrections.
Recruitment and mock interviews. A candidate can practice with an interviewer who adapts to visible stress, asks follow-ups naturally, and gives feedback grounded in the session.
Education and tutoring. A tutor can see confusion before the student says "I'm confused." The furrowed brow and gaze drift are signals to slow down, change examples, or check understanding. The post on interactive avatars in learning and development covers why this kind of presence matters in training.
Customer support. A support agent can treat frustration as a real-time signal instead of waiting for a sentiment classifier after the conversation. The point is not to fake empathy. It is to route, shorten, or soften the interaction before it gets worse. That connects directly to the work around AI avatars for customer success.
Product assistants. An agent embedded inside software can combine product context with user state. The app can say "the user is on the billing page"; the camera can say "the user looks confused"; the LLM can answer with fewer assumptions.
Each use case has different privacy and safety boundaries. A coaching app and a healthcare app should not treat camera-derived signals the same way. But the architecture is reusable: video processor, state classifier, context injection, expressive voice, avatar renderer.
What should builders be careful about?
Do not overstate what the perception model knows.
MediaPipe blendshapes can tell you that a mouth shape looks like a smile or that a head is angled down. They cannot tell you what a person is feeling with certainty. "Looks sad" is a prompt signal, not a diagnosis.
That should show up in the product design.
Use camera state to adapt tone. Use it to choose shorter answers, gentler pacing, or a re-engagement nudge. Do not use it to make hard claims about the user. Do not expose raw labels in the UI unless the user expects that. Do not store more than you need. Make camera use obvious and consent-based.
The best implementation constraint in this demo is also the most important one: the agent never narrates the surveillance. It does not say, "I can see you're sad." It just responds with more care.
That is the difference between an agent that feels aware and an agent that feels invasive.
What does this mean for video agents?
The next phase of agents is bigger than text plus voice. It is models that can operate inside live video.
Most voice frameworks started with audio and added video later. Vision Agents starts with video as a primitive. That means face tracking can run next to pose detection, object detection, scene understanding, or any other processor, each at its own frame rate.
Face state is one version of context. A tutoring agent might read the whiteboard. A repair agent might see the broken part. A sales coach might track posture and eye contact. A healthcare intake agent might use visual state only for accessibility and human handoff.
Once the real-time avatar is part of the same media loop, the agent can see, speak, and be seen. That is the shape this work points toward.
For more context on the product side of this shift, the posts on conversational video AI and Cara 3 interactive avatars are the best next reads.
Frequently asked questions
What is an emotion-aware AI agent?
An emotion-aware AI agent uses signals such as tone, facial expression, gaze, or engagement to adapt its behavior during a conversation. Those signals should guide the response, not be treated as certain proof of what a user feels.
How does an interactive avatar make a voice agent feel more present?
An interactive avatar gives a voice agent a live visual interface with lip sync, facial motion, and turn-taking cues. That makes the exchange feel closer to a video call than a speaker attached to a chatbot.
What does MediaPipe do in an emotion-aware agent?
MediaPipe Face Landmarker processes the user's video feed and outputs face landmarks, blendshape scores, and transformation matrices. A lightweight classifier can turn those signals into coarse labels such as happy, sad, distracted, or looking at camera.
Why use Inworld TTS-2 for this kind of demo?
Inworld TTS-2 supports natural-language steering, so the LLM can direct the voice with notes like "say gently with a quiet pace." That lets the agent adapt delivery as well as wording.
Where does Anam fit in the Vision Agents pipeline?
Anam provides the real-time avatar renderer. In Vision Agents, AnamAvatarPublisher() receives the agent's speech output and publishes synchronized avatar video back into the call.
When should builders use Stream instead of Anam?
Use Stream and Vision Agents for real-time call transport, video processing, event routing, and agent orchestration. Use Anam when that agent needs to appear as a live, lip-synced avatar in the session.
Can Anam work with different LLM, STT, and TTS providers?
Yes. Anam is designed as the visual interface layer for live agents, so it can sit on top of different language models, speech-to-text systems, and text-to-speech engines.
Should an agent tell users it is reading their facial expressions?
The product should be transparent that camera input is being used, but the agent should not constantly narrate inferred states back to the user. A good pattern is to let perception shape tone and timing while keeping consent, privacy, and escalation rules explicit in the product.