How to Build an AI Voice Agent That Has a Face
Your voice agent works. Users can talk to it, it understands them, it responds. But there's a version of that same experience that performs measurably better, and it doesn't require rebuilding anything.
Add a face.
This is a practical guide to doing exactly that. We'll cover the architecture, the ElevenLabs integration, what works with Retell and Vapi, and when it's actually worth the effort.
Why a face changes the outcome
A January 2026 benchmark from Mabyduck tested real users across the leading interactive avatar platforms and measured overall experience. Responsiveness was the strongest predictor of positive ratings, with a Spearman rho of 0.697 (p < 0.001). But visual presence was the second-strongest factor. Not the voice quality. Not the LLM intelligence. The face.
That benchmark matters because it's the first large-scale comparison across platforms with proper statistical controls. And Anam ranked first.
The practical implication: in contexts where trust or emotional connection is relevant, a voice agent with no visual presence is leaving measurable gains on the table. Users are less likely to complete the interaction, less likely to rate it positively, and less likely to return.
What "a face" actually means
There's an important distinction to make here. Pre-recorded video tools like HeyGen or Synthesia are useful for content production. You give them a script; they return a video file. That's not what we're talking about.
A real-time interactive AI avatar is different. It:
listens and reacts as the user speaks, with no pre-scripted content
responds with sub-200ms latency (fast enough that the exchange feels live rather than turn-based)
handles interruptions and turn-taking the way a human does
runs natively over WebRTC in any modern browser, no plugins required
Anam's CARA III model achieves sub-200ms server-side latency and topped the Mabyduck benchmark across all metrics. The latency figure is the one that matters most. Above one second, users perceive lag. Below it, the conversation feels natural. This is well-established in human-computer interaction research, and the benchmark results confirm it holds for AI avatars too.
Architecture: how it works
The reason adding a face doesn't require rebuilding your existing voice agent is that the two concerns are decoupled.
Your agent already handles the hard parts: speech-to-text, the LLM call, text-to-speech output. Anam consumes that audio output and renders a face that speaks it, in real time. Your ElevenLabs pipeline, your Retell setup, your custom WebSocket server, none of that changes.
The relevant API mode is `CUSTOMER_CLIENT_V1`, which is Anam's bring-your-own-LLM configuration. You supply the audio stream; Anam handles the visual layer. This means you can keep your existing LLM, voice, and business logic entirely intact. The avatar is a rendering layer, not a replacement for anything you've already built.
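The decoupling can be pictured as two narrow interfaces: your existing stack produces audio, and the avatar layer consumes it. The interfaces below are illustrative stand-ins for explanation, not the actual Anam SDK types:

```typescript
// Illustrative sketch of the decoupled design. These interfaces are
// simplified stand-ins, not the real SDK surface.

// Your existing stack: STT -> LLM -> TTS, ending in a stream of audio bytes.
interface TtsSource {
  onAudioChunk(handler: (chunk: Uint8Array) => void): void;
}

// The avatar layer: consumes audio, renders a talking face.
interface AvatarLayer {
  speak(chunk: Uint8Array): void;
}

// Wiring them together is the whole integration: the agent logic never
// changes, only where its audio output goes.
function connect(source: TtsSource, avatar: AvatarLayer): void {
  source.onAudioChunk((chunk) => avatar.speak(chunk));
}
```

Because the only contract between the two sides is "a stream of audio bytes", either side can be replaced without touching the other.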
Full documentation is at docs.anam.ai/quickstart. For a worked open-source example of this pattern, the clawd-face project shows how we added Anam to an existing Claude Code agent; a companion blog post walks through that build in narrative form.
Step-by-step: adding a face to your ElevenLabs agent
This takes about 30 minutes the first time.
1. Get an API key
Sign up at lab.anam.ai and grab your API key from the dashboard. You can test the interactive demo there too, no code required.
2. Install the SDK
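The package name below is the SDK's npm distribution; check docs.anam.ai/quickstart in case it has changed:

```shell
npm install @anam-ai/js-sdk
```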
3. Initialise the avatar session
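A minimal sketch of session startup, assuming the common pattern of exchanging your API key (kept server-side) for a short-lived session token that the browser can use. The endpoint path, header names, and SDK calls below are assumptions for illustration; the quickstart at docs.anam.ai has the real ones:

```typescript
// Sketch only: the endpoint path and SDK call names below are assumptions;
// consult docs.anam.ai/quickstart for the actual API.
type HttpRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
};

// Build the token-exchange request. Keep the API key server-side; only the
// short-lived session token should ever reach the browser.
function buildSessionTokenRequest(apiKey: string): HttpRequest {
  return {
    url: "https://api.anam.ai/v1/auth/session-token", // assumed path
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
  };
}

// Then, in the browser, roughly (names assumed):
//   const { sessionToken } = await fetchSessionToken();  // calls your server
//   const client = createClient(sessionToken);           // from the Anam SDK
//   await client.streamToVideoElement("avatar-video");   // <video id="avatar-video">
```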
At this point the avatar is live in your UI and waiting for audio input.
4. Route your ElevenLabs audio to Anam
Instead of playing the ElevenLabs TTS output directly, pipe it through the Anam session using `CUSTOMER_CLIENT_V1` mode. The SDK accepts the audio stream and drives the avatar's lip sync and expression from it. Your ElevenLabs voice, your LLM, unchanged.
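ElevenLabs' streaming endpoints deliver audio either as raw bytes (HTTP) or as base64-encoded chunks inside JSON messages (WebSocket). The re-routing then looks roughly like this, where `AvatarAudioSink` is a hypothetical stand-in for however the Anam SDK accepts audio in `CUSTOMER_CLIENT_V1` mode:

```typescript
// Instead of feeding decoded TTS audio to a player, hand it to the avatar.
// AvatarAudioSink is a hypothetical stand-in for the SDK's audio input.
interface AvatarAudioSink {
  write(pcm: Uint8Array): void;
}

// ElevenLabs' WebSocket TTS sends each chunk as a base64 string: decode it
// and forward the bytes. Returns the number of bytes forwarded.
function routeChunkToAvatar(base64Audio: string, sink: AvatarAudioSink): number {
  const pcm = Uint8Array.from(Buffer.from(base64Audio, "base64"));
  sink.write(pcm);
  return pcm.length;
}
```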
5. That's it
The avatar speaks with your ElevenLabs voice agent's output, reacts to the user in real time, and handles interruptions naturally. Full API reference is at anam.ai/docs.
Works with Retell, Vapi, and Pipecat too
The same pattern applies to any voice stack, because the Anam SDK accepts any audio stream regardless of what generated it.
Retell: connect Anam to Retell's audio output via WebSocket
Vapi: pipe Vapi's TTS stream into the Anam SDK
Pipecat: install pipecat-anam and configure it in your pipeline
Custom stacks: anything that produces an audio stream works with `CUSTOMER_CLIENT_V1` mode
This also means you can swap out your LLM or TTS provider later without touching the avatar layer. The decoupled design keeps both sides independently upgradeable.
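Because the avatar only consumes audio bytes, a provider swap amounts to a different implementation of the same small interface. A sketch, with hypothetical names:

```typescript
// Hypothetical provider adapters: only the audio-byte contract is shared,
// so Retell, Vapi, or a custom stack can sit behind the same interface.
type AudioHandler = (chunk: Uint8Array) => void;

interface VoiceProvider {
  start(onAudio: AudioHandler): void;
}

// A stand-in provider that replays pre-captured chunks; a real adapter
// would wire onAudio to the provider's WebSocket or TTS stream.
function makeStaticProvider(chunks: Uint8Array[]): VoiceProvider {
  return { start: (onAudio) => chunks.forEach(onAudio) };
}

// The avatar wiring does not change when the provider does.
function drive(provider: VoiceProvider, avatarWrite: AudioHandler): void {
  provider.start(avatarWrite);
}
```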
When it matters vs when it doesn't
A face is not always the right call. Here's a simple framework.
Worth it when:
the conversation involves trust, and a visual presence changes what the user shares (healthcare intake, financial advice, HR)
you're running a product demo or sales qualification flow, where the avatar signals that someone is genuinely "there"
education or onboarding flows, where users retain more when they can see a face alongside the audio
Not worth it when:
it's phone or IVR, where the user has no screen
the interaction is transactional and speed is the goal (balance check, booking a slot), because the avatar adds setup without adding value
it's an internal tool where your team uses voice for quick lookups and doesn't care about rapport
The honest test: would a human in that role build trust visually with a user? If yes, giving your voice agent a face helps. If not, it won't.
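The framework above condenses into a short checklist. The function below simply encodes the criteria listed in this section as a sketch:

```typescript
// Encode the worth-it / not-worth-it checklist from this section.
interface UseCase {
  involvesTrust: boolean;           // healthcare intake, financial advice, HR
  isDemoOrSales: boolean;           // signals someone is genuinely "there"
  isEducationOrOnboarding: boolean; // retention benefits from a visible face
  hasScreen: boolean;               // false for phone / IVR
  isTransactional: boolean;         // balance check, booking a slot
}

function faceIsWorthIt(u: UseCase): boolean {
  // No screen, or pure speed-focused transactions: the avatar adds nothing.
  if (!u.hasScreen || u.isTransactional) return false;
  // Otherwise, any trust / sales / learning signal tips the balance.
  return u.involvesTrust || u.isDemoOrSales || u.isEducationOrOnboarding;
}
```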
Frequently asked questions
Does adding a face increase latency?
In most integrations, no. The avatar renders on the client side, and the audio pipeline runs unchanged. CARA III's server-side latency is sub-200ms, and the Mabyduck benchmark found Anam the fastest of the platforms it tested.
Can I use my existing ElevenLabs agent?
Yes. Anam's CUSTOMER_CLIENT_V1 mode is designed exactly for this. You keep your ElevenLabs voice, your LLM calls, your prompt engineering. Anam receives the audio output and adds the visual layer. Nothing in your existing pipeline changes.
What avatars are available?
You can choose from a library of built-in avatars or upload a custom one. Custom avatars are available on enterprise plans. The full list of interactive avatars is on the website.
How much does it cost?
Pricing is usage-based. You can start for free at lab.anam.ai and the free tier covers enough to build and test an integration. For production pricing, book a call and we'll walk through options based on your expected volume.
Does this work on mobile?
Yes. The Anam SDK uses WebRTC, which is supported natively in mobile browsers on both iOS and Android. No app install required.
What happens if the user interrupts the avatar?
Interruption handling is built into the SDK. When the user starts speaking, the avatar stops mid-sentence and the session hands back to the user, the same way a human would respond to being interrupted. This is one of the areas where sub-200ms latency makes a real difference to perceived naturalness.
Is there a free trial?
Yes. lab.anam.ai lets you test the avatar experience in your browser immediately, no signup needed. For API access, sign up and the free tier covers integration and testing.
Try it
If you've built a voice agent and want to see what a visual layer adds:
lab.anam.ai — try it in your browser right now
docs.anam.ai/quickstart — the integration guide
github.com/anam-org/clawd-face — open-source example of the full pattern
For production use cases, book a demo and we'll walk through the architecture together.
Adding a face to your voice agent is a one-time integration that takes an afternoon. The benchmark data is now clear that it changes measurable outcomes in the right contexts. The question is just whether those contexts match what you're building.
What does your voice agent do, and would a face help?
© 2026 Anam Labs
HIPAA & SOC 2 Certified