How to Build an AI Voice Agent That Has a Face
Your voice agent works. Users can talk to it, it understands them, it responds. But there's a version of that same experience that performs measurably better, and it doesn't require rebuilding anything.
Add a face.
This is a practical guide to doing exactly that. We'll cover the architecture, the ElevenLabs integration, what works with Retell and Vapi, and when it's actually worth the effort.
Why a face changes the outcome
A January 2026 benchmark from Mabyduck tested real users across the leading interactive avatar platforms and measured overall experience. Responsiveness was the strongest predictor of positive ratings, with a Spearman rho of 0.697 (p < 0.001). But visual presence was the second-strongest factor. Not the voice quality. Not the LLM intelligence. The face.
That benchmark matters because it's the first large-scale comparison across platforms with proper statistical controls. And Anam ranked first.
The practical implication: in contexts where trust or emotional connection is relevant, a voice agent with no visual presence is leaving measurable gains on the table. Users are less likely to complete the interaction, less likely to rate it positively, and less likely to return.
What "a face" actually means
There's an important distinction to make here. Pre-recorded video tools like HeyGen or Synthesia are useful for content production. You give them a script; they return a video file. That's not what we're talking about.
A real-time interactive AI avatar is different. It:
listens and reacts as the user speaks, with no pre-scripted content
responds with sub-200ms latency (fast enough that the exchange feels live rather than turn-based)
handles interruptions and turn-taking the way a human does
runs natively over WebRTC in any modern browser, no plugins required
Anam's CARA III model achieves sub-200ms server-side latency and topped the Mabyduck benchmark across all metrics. The latency figure is the one that matters most. Above one second, users perceive lag. Below it, the conversation feels natural. This is well-established in human-computer interaction research, and the benchmark results confirm it holds for AI avatars too.
Architecture: how it works
The reason adding a face doesn't require rebuilding your existing voice agent is that the two concerns are decoupled.
Your agent already handles the hard parts: speech-to-text, the LLM call, text-to-speech output. Anam consumes that audio output and renders a face that speaks it, in real time. Your ElevenLabs pipeline, your Retell setup, your custom WebSocket server, none of that changes.
The relevant API mode is `CUSTOMER_CLIENT_V1`, which is Anam's bring-your-own-LLM configuration. You supply the audio stream; Anam handles the visual layer. This means you can keep your existing LLM, voice, and business logic entirely intact. The avatar is a rendering layer, not a replacement for anything you've already built.
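The decoupling can be pictured as two narrow interfaces: your existing stack produces audio, and the avatar layer consumes it. The interfaces below are illustrative stand-ins for explanation, not the actual Anam SDK types:

```typescript
// Illustrative sketch of the decoupled design. These interfaces are
// simplified stand-ins, not the real SDK surface.

// Your existing stack: STT -> LLM -> TTS, ending in a stream of audio bytes.
interface TtsSource {
  onAudioChunk(handler: (chunk: Uint8Array) => void): void;
}

// The avatar layer: consumes audio, renders a talking face.
interface AvatarLayer {
  speak(chunk: Uint8Array): void;
}

// Wiring them together is the whole integration: the agent logic never
// changes, only where its audio output goes.
function connect(source: TtsSource, avatar: AvatarLayer): void {
  source.onAudioChunk((chunk) => avatar.speak(chunk));
}
```

Because the only contract between the two sides is "a stream of audio bytes", either side can be replaced without touching the other.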
Full documentation is at docs.anam.ai/quickstart. For a worked open-source example of this pattern, the clawd-face project shows how we added Anam to an existing Claude Code agent; a companion blog post walks through that build in narrative form.
Step-by-step: adding a face to your ElevenLabs agent
This takes about 30 minutes the first time.
1. Get an API key
Sign up at lab.anam.ai and grab your API key from the dashboard. You can test the interactive demo there too, no code required.
2. Install the SDK
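The package name below is the SDK's npm distribution; check docs.anam.ai/quickstart in case it has changed:

```shell
npm install @anam-ai/js-sdk
```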
3. Initialise the avatar session
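A minimal sketch of session startup, assuming the common pattern of exchanging your API key (kept server-side) for a short-lived session token that the browser can use. The endpoint path, header names, and SDK calls below are assumptions for illustration; the quickstart at docs.anam.ai has the real ones:

```typescript
// Sketch only: the endpoint path and SDK call names below are assumptions;
// consult docs.anam.ai/quickstart for the actual API.
type HttpRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
};

// Build the token-exchange request. Keep the API key server-side; only the
// short-lived session token should ever reach the browser.
function buildSessionTokenRequest(apiKey: string): HttpRequest {
  return {
    url: "https://api.anam.ai/v1/auth/session-token", // assumed path
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
  };
}

// Then, in the browser, roughly (names assumed):
//   const { sessionToken } = await fetchSessionToken();  // calls your server
//   const client = createClient(sessionToken);           // from the Anam SDK
//   await client.streamToVideoElement("avatar-video");   // <video id="avatar-video">
```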
At this point the avatar is live in your UI and waiting for audio input.
4. Route your ElevenLabs audio to Anam
Instead of playing the ElevenLabs TTS output directly, pipe it through the Anam session using `CUSTOMER_CLIENT_V1` mode. The SDK accepts the audio stream and drives the avatar's lip sync and expression from it. Your ElevenLabs voice, your LLM, unchanged.
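ElevenLabs' streaming endpoints deliver audio either as raw bytes (HTTP) or as base64-encoded chunks inside JSON messages (WebSocket). The re-routing then looks roughly like this, where `AvatarAudioSink` is a hypothetical stand-in for however the Anam SDK accepts audio in `CUSTOMER_CLIENT_V1` mode:

```typescript
// Instead of feeding decoded TTS audio to a player, hand it to the avatar.
// AvatarAudioSink is a hypothetical stand-in for the SDK's audio input.
interface AvatarAudioSink {
  write(pcm: Uint8Array): void;
}

// ElevenLabs' WebSocket TTS sends each chunk as a base64 string: decode it
// and forward the bytes. Returns the number of bytes forwarded.
function routeChunkToAvatar(base64Audio: string, sink: AvatarAudioSink): number {
  const pcm = Uint8Array.from(Buffer.from(base64Audio, "base64"));
  sink.write(pcm);
  return pcm.length;
}
```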
5. That's it
The avatar speaks with your ElevenLabs voice agent's output, reacts to the user in real time, and handles interruptions naturally. Full API reference is at anam.ai/docs.
Works with Retell, Vapi, and Pipecat too
The same pattern applies to any voice stack, because the Anam SDK accepts any audio stream regardless of what generated it.
Retell: connect Anam to Retell's audio output via WebSocket
Vapi: pipe Vapi's TTS stream into the Anam SDK
Pipecat: install pipecat-anam and configure it in your pipeline
Custom stacks: anything that produces an audio stream works with `CUSTOMER_CLIENT_V1` mode
This also means you can swap out your LLM or TTS provider later without touching the avatar layer. The decoupled design keeps both sides independently upgradeable.
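Because the avatar only consumes audio bytes, a provider swap amounts to a different implementation of the same small interface. A sketch, with hypothetical names:

```typescript
// Hypothetical provider adapters: only the audio-byte contract is shared,
// so Retell, Vapi, or a custom stack can sit behind the same interface.
type AudioHandler = (chunk: Uint8Array) => void;

interface VoiceProvider {
  start(onAudio: AudioHandler): void;
}

// A stand-in provider that replays pre-captured chunks; a real adapter
// would wire onAudio to the provider's WebSocket or TTS stream.
function makeStaticProvider(chunks: Uint8Array[]): VoiceProvider {
  return { start: (onAudio) => chunks.forEach(onAudio) };
}

// The avatar wiring does not change when the provider does.
function drive(provider: VoiceProvider, avatarWrite: AudioHandler): void {
  provider.start(avatarWrite);
}
```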
When it matters vs when it doesn't
A face is not always the right call. Here's a simple framework.
Worth it when:
the conversation involves trust, and a visual presence changes what the user shares (healthcare intake, financial advice, HR)
you're running a product demo or sales qualification flow, where the avatar signals that someone is genuinely "there"
education or onboarding flows, where users retain more when they can see a face alongside the audio
Not worth it when:
it's phone or IVR, where the user has no screen
the interaction is transactional and speed is the goal (balance check, booking a slot), because the avatar adds setup without adding value
it's an internal tool where your team uses voice for quick lookups and doesn't care about rapport
The honest test: would a human in that role build trust visually with a user? If yes, giving your voice agent a face helps. If not, it won't.
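The framework above condenses into a short checklist. The function below simply encodes the criteria listed in this section as a sketch:

```typescript
// Encode the worth-it / not-worth-it checklist from this section.
interface UseCase {
  involvesTrust: boolean;           // healthcare intake, financial advice, HR
  isDemoOrSales: boolean;           // signals someone is genuinely "there"
  isEducationOrOnboarding: boolean; // retention benefits from a visible face
  hasScreen: boolean;               // false for phone / IVR
  isTransactional: boolean;         // balance check, booking a slot
}

function faceIsWorthIt(u: UseCase): boolean {
  // No screen, or pure speed-focused transactions: the avatar adds nothing.
  if (!u.hasScreen || u.isTransactional) return false;
  // Otherwise, any trust / sales / learning signal tips the balance.
  return u.involvesTrust || u.isDemoOrSales || u.isEducationOrOnboarding;
}
```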
Frequently asked questions
Does adding a face increase latency?
In most integrations, no. The avatar renders on the client side, and the audio pipeline runs unchanged. CARA III's server-side latency is sub-200ms, and the Mabyduck benchmark found Anam the fastest of the platforms it tested.
Can I use my existing ElevenLabs agent?
Yes. Anam's CUSTOMER_CLIENT_V1 mode is designed exactly for this. You keep your ElevenLabs voice, your LLM calls, your prompt engineering. Anam receives the audio output and adds the visual layer. Nothing in your existing pipeline changes.
What avatars are available?
You can choose from a library of built-in avatars or upload a custom one. Custom avatars are available on enterprise plans. The full list of interactive avatars is on the website.
How much does it cost?
Pricing is usage-based. You can start for free at lab.anam.ai and the free tier covers enough to build and test an integration. For production pricing, book a call and we'll walk through options based on your expected volume.
Does this work on mobile?
Yes. The Anam SDK uses WebRTC, which is supported natively in mobile browsers on both iOS and Android. No app install required.
What happens if the user interrupts the avatar?
Interruption handling is built into the SDK. When the user starts speaking, the avatar stops mid-sentence and the session hands back to the user, the same way a human would respond to being interrupted. This is one of the areas where sub-200ms latency makes a real difference to perceived naturalness.
Is there a free trial?
Yes. lab.anam.ai lets you test the avatar experience in your browser immediately, no signup needed. For API access, sign up and the free tier covers integration and testing.
Try it
If you've built a voice agent and want to see what a visual layer adds:
lab.anam.ai — try it in your browser right now
docs.anam.ai/quickstart — the integration guide
github.com/anam-org/clawd-face — open-source example of the full pattern
For production use cases, book a demo and we'll walk through the architecture together.
Adding a face to your voice agent is a one-time integration that takes an afternoon. The benchmark data is now clear that it changes measurable outcomes in the right contexts. The question is just whether those contexts match what you're building.
What does your voice agent do, and would a face help?
© 2026 Anam Labs
HIPAA & SOC 2 Certified