January 19, 2025
ElevenLabs Conversational AI with Anam avatars
ElevenLabs Conversational AI gives you voice agents with natural speech recognition and synthesis. But voice-only agents can feel disembodied. By adding an Anam avatar, you give your agent a face that moves in sync with its speech.
This integration uses Anam's audio passthrough mode. Instead of using Anam's built-in STT/LLM/TTS pipeline, you send audio directly from ElevenLabs to the avatar for lip-syncing:
User voice → ElevenLabs agent → Audio response → Anam avatar → User sees talking avatar

ElevenLabs handles the conversation intelligence and voice synthesis. Anam renders the visual avatar synchronized to the audio.
The complete code is at anam-org/elevenlabs_agent_demo.
What you'll build
A web application that:
- Connects to an ElevenLabs Conversational AI agent via WebSocket
- Displays an Anam avatar that lip-syncs to the agent's responses
- Captures microphone input and sends it to ElevenLabs
- Handles interruptions when the user speaks over the agent
Prerequisites
- Bun runtime (or Node.js 18+)
- An ElevenLabs account with a Conversational AI agent configured
- An Anam account with API key from lab.anam.ai
Project setup
Clone the demo repository:
git clone https://github.com/anam-org/elevenlabs_agent_demo.git
cd elevenlabs_agent_demo
bun install

Create a .dev.vars file with your credentials:
ANAM_API_KEY=your_anam_api_key
ANAM_AVATAR_ID=your_avatar_id
ELEVENLABS_AGENT_ID=your_agent_id

You can find avatar IDs at lab.anam.ai/avatars. For the ElevenLabs agent ID, go to your agent in the ElevenLabs dashboard and copy the ID from the URL or settings.
Start the development server:
bun run dev

Open http://localhost:5173 and click "Start Conversation" to try it out.
Project structure
The demo has a simple structure:
src/
├── client.ts # Main orchestration logic
├── elevenlabs.ts # WebSocket and microphone handling
├── index.ts # Hono server entry point
├── renderer.tsx # HTML template
└── routes/
    ├── index.tsx     # UI page
    └── api/config.ts # Session token endpoint

The server creates Anam session tokens so API keys stay server-side. The client orchestrates the connection between ElevenLabs and Anam.
Server-side: creating session tokens
The /api/config endpoint creates an Anam session token with audio passthrough enabled:
// src/routes/api/config.ts
const response = await fetch("https://api.anam.ai/v1/auth/session-token", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${env.ANAM_API_KEY}`,
},
body: JSON.stringify({
avatarId: env.ANAM_AVATAR_ID,
enableAudioPassthrough: true,
}),
});
const { sessionToken } = await response.json();
return c.json({
anamSessionToken: sessionToken,
elevenLabsAgentId: env.ELEVENLABS_AGENT_ID,
});

The enableAudioPassthrough: true flag is required. Without it, the avatar won't accept external audio input.
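The snippet above doesn't check the response status. A defensive variant (a sketch, not the demo's actual code) could surface Anam API errors before parsing the body:

// Sketch: surface Anam API errors instead of failing later on JSON parsing
if (!response.ok) {
  const detail = await response.text();
  return c.json({ error: `Anam session token request failed: ${detail}` }, 500);
}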
Client-side: the main orchestration
The client connects both services and wires them together. Let's walk through src/client.ts.
Initializing the Anam client
When the user clicks "Start Conversation", we fetch the config and create the Anam client:
import { createClient } from "@anam-ai/js-sdk";
import { connectElevenLabs, stopElevenLabs } from "./elevenlabs";
async function start() {
// Fetch config from server
const res = await fetch("/api/config");
const config = await res.json();
// Initialize Anam avatar
anamClient = createClient(config.anamSessionToken, {
disableInputAudio: true, // ElevenLabs handles the microphone
});
await anamClient.streamToVideoElement("anam-video");

The disableInputAudio: true option tells Anam not to capture microphone input. ElevenLabs handles speech recognition, so we don't want Anam listening too.
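The "anam-video" argument is the id of a video element on the page. The actual markup lives in routes/index.tsx and isn't shown here; something along these lines (an assumption, not the demo's exact JSX) is all the SDK needs:

{/* routes/index.tsx (sketch): the id must match the streamToVideoElement() argument */}
<video id="anam-video" autoplay playsinline />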
Setting up the audio stream
Next, we create the audio input stream that will receive ElevenLabs audio:
const agentAudioInputStream = anamClient.createAgentAudioInputStream({
encoding: "pcm_s16le",
sampleRate: 16000,
channels: 1,
});

The audio format must match what ElevenLabs sends: PCM 16-bit signed little-endian at 16kHz mono. If these don't match, the lip-sync will be wrong or won't work at all.
Connecting to ElevenLabs
Now we connect to ElevenLabs and wire up the callbacks:
await connectElevenLabs(config.elevenLabsAgentId, {
onReady: () => {
setConnected(true);
addMessage("system", "Connected. Start speaking...");
},
onAudio: (audio) => {
agentAudioInputStream.sendAudioChunk(audio);
},
onUserTranscript: (text) => addMessage("user", text),
onAgentResponse: (text) => {
agentAudioInputStream.endSequence();
addMessage("agent", text);
},
onInterrupt: () => {
anamClient?.interruptPersona();
agentAudioInputStream.endSequence();
},
onDisconnect: () => setConnected(false),
onError: () => showError("Connection error"),
});
}

Each callback handles a different event:
- onAudio receives base64-encoded audio chunks from ElevenLabs and forwards them to Anam for lip-sync
- onAgentResponse fires when the agent finishes speaking, so we call endSequence() to signal completion
- onInterrupt fires when the user speaks over the agent (barge-in), so we stop the avatar mid-speech
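For reference, the callbacks object has a small shape that can be inferred from how client.ts uses it. The interface below is a reconstruction, not necessarily the demo's exact definition:

// Inferred from usage in client.ts; names and signatures are assumptions
interface ElevenLabsCallbacks {
  onReady: () => void;
  onAudio: (audioBase64: string) => void; // base64-encoded PCM chunk
  onUserTranscript: (text: string) => void;
  onAgentResponse: (text: string) => void;
  onInterrupt: () => void;
  onDisconnect: () => void;
  onError: () => void;
}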
The ElevenLabs module
The src/elevenlabs.ts module handles the WebSocket connection and microphone capture. Let's look at the key parts.
Connecting to the WebSocket
ElevenLabs Conversational AI uses a WebSocket for bidirectional audio streaming:
import { MicrophoneCapture } from "chatdio";
let ws: WebSocket | null = null;
let microphone: MicrophoneCapture | null = null;
export async function connectElevenLabs(
agentId: string,
callbacks: ElevenLabsCallbacks
) {
ws = new WebSocket(
`wss://api.elevenlabs.io/v1/convai/conversation?agent_id=${agentId}`
);
ws.onopen = async () => {
// Start microphone capture once connected
microphone = new MicrophoneCapture({
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true,
sampleRate: 16000,
});
microphone.on("audio", (audioData: ArrayBuffer) => {
if (ws?.readyState === WebSocket.OPEN) {
const base64 = btoa(String.fromCharCode(...new Uint8Array(audioData)));
ws.send(JSON.stringify({ user_audio_chunk: base64 }));
}
});
await microphone.start();
};

The chatdio library provides MicrophoneCapture with echo cancellation built in. This prevents the avatar's audio from feeding back into the microphone.
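One caveat: spreading a Uint8Array into String.fromCharCode can hit argument-count limits on very large buffers. Microphone chunks are small enough that the demo's approach works, but if you adapt this code to larger buffers, a chunked encoder (a sketch, not part of the demo) is safer:

// Sketch: base64-encode an ArrayBuffer in 32 KiB chunks to avoid call-stack limits
function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  const chunkSize = 0x8000;
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}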
Handling messages
The WebSocket receives different message types from ElevenLabs:
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
switch (message.type) {
case "conversation_initiation_metadata":
callbacks.onReady();
break;
case "audio":
// Forward audio to Anam for lip-sync
callbacks.onAudio(message.audio_event.audio_base_64);
break;
case "agent_response":
callbacks.onAgentResponse(message.agent_response_event.agent_response);
break;
case "user_transcript":
callbacks.onUserTranscript(message.user_transcription_event.user_transcript);
break;
case "interruption":
callbacks.onInterrupt();
break;
case "ping":
ws?.send(JSON.stringify({ type: "pong" }));
break;
}
};

The audio messages contain base64-encoded PCM chunks. We pass these directly to Anam via the onAudio callback.
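The onDisconnect and onError callbacks from client.ts map naturally onto the socket's close and error handlers. A minimal sketch of that wiring (the demo's version may also update UI state):

// Sketch: propagate socket lifecycle events back to the caller
ws.onclose = () => {
  microphone?.stop();
  microphone = null;
  callbacks.onDisconnect();
};

ws.onerror = () => {
  callbacks.onError();
};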
Cleaning up
When the conversation ends, we close everything:
export function stopElevenLabs() {
microphone?.stop();
microphone = null;
if (ws?.readyState === WebSocket.OPEN) {
ws.close();
}
ws = null;
}

Handling interruptions
When users speak while the agent is talking (barge-in), ElevenLabs sends an interruption event. The client handles this by stopping the avatar immediately:
onInterrupt: () => {
anamClient?.interruptPersona();
agentAudioInputStream.endSequence();
},

The interruptPersona() method stops any queued audio and resets the avatar to its idle state. Without this, the avatar would continue lip-syncing to audio that's no longer playing.
Audio format requirements
ElevenLabs outputs PCM 16-bit signed little-endian audio at 16kHz mono. The Anam audio stream must be configured to match:
anamClient.createAgentAudioInputStream({
encoding: "pcm_s16le", // PCM 16-bit signed little-endian
sampleRate: 16000, // 16kHz
channels: 1, // Mono
});

If you're adapting this for a different voice provider, check their audio output format and adjust accordingly.
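For example, if a hypothetical provider streamed 24kHz mono PCM instead, only the sample rate would change (assuming the Anam stream accepts that rate):

// Hypothetical: match a provider that outputs 24kHz mono PCM 16-bit little-endian
anamClient.createAgentAudioInputStream({
  encoding: "pcm_s16le",
  sampleRate: 24000,
  channels: 1,
});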
Deploying to Cloudflare Workers
The demo is set up for Cloudflare Workers deployment:
bun run deploy

Before deploying, set your environment variables in Cloudflare:
wrangler secret put ANAM_API_KEY
wrangler secret put ANAM_AVATAR_ID
wrangler secret put ELEVENLABS_AGENT_ID

Troubleshooting
Avatar lips not syncing:
- Verify the audio format matches: pcm_s16le, 16kHz, mono
- Check that enableAudioPassthrough: true was set when creating the session token
- Make sure createAgentAudioInputStream() is called before sending audio
No audio from ElevenLabs:
- Verify your ElevenLabs agent ID is correct
- Check that the WebSocket is connected before sending microphone data
- Confirm your ElevenLabs account has Conversational AI access
Echo or feedback:
- The chatdio library should handle echo cancellation automatically
- Make sure you're using disableInputAudio: true on the Anam client so it doesn't also capture the microphone