Speech detection events for better conversational AI design


There's a gap in most conversational AI design that nobody talks about. It's the dead air between when a user starts speaking and when the system acknowledges it.

You say something. The avatar stares at you. A second passes, maybe two. Then suddenly a transcript appears and the system snaps to attention. That delay isn't a processing problem. The system heard you immediately. It just didn't tell you.

The Anam JS SDK (v4.12.0+) and Python SDK now emit userSpeechStarted and userSpeechEnded events the instant server-side voice activity detection fires, before any transcription happens. Your UI can react the moment someone opens their mouth, not when the transcript arrives.


Why this gap breaks conversational AI design

Human conversation runs on real-time feedback loops. When you talk to someone, they nod, shift their gaze, lean in. These micro-signals happen within 200ms of hearing speech. Remove them and the conversation feels off, like talking to someone wearing headphones.

Most avatar SDKs only expose transcript events. The transcript arrives after the speech-to-text model finishes processing, which takes anywhere from 500ms to 2 seconds depending on the utterance length. During that entire window, your UI does nothing. The user has no idea if the system heard them, if their mic is working, or if they should speak louder.

Speech detection events close that gap. They fire the moment Anam's server-side Deepgram VAD detects voice energy, typically within 100-200ms of the user starting to speak. Use them to show a "listening" indicator, animate the avatar's attention, suppress other UI elements, or do anything else that signals acknowledgment.


The events

Both events pass a correlationId string that links a speech segment to its eventual transcript.


JavaScript

import { AnamClient, AnamEvent } from "@anam-ai/js-sdk";

const client = new AnamClient("your-session-token");

client.addListener(
  AnamEvent.USER_SPEECH_STARTED,
  (correlationId: string) => {
    console.log("User speaking:", correlationId);
  }
);

client.addListener(
  AnamEvent.USER_SPEECH_ENDED,
  (correlationId: string) => {
    console.log("User stopped:", correlationId);
  }
);


Python

from anam import AnamClient, AnamEvent

client = AnamClient(api_key="your-api-key", persona_id="your-persona-id")

@client.on(AnamEvent.USER_SPEECH_STARTED)
async def on_speech_started(correlation_id: str):
    print(f"User speaking: {correlation_id}")

@client.on(AnamEvent.USER_SPEECH_ENDED)
async def on_speech_ended(correlation_id: str):
    print(f"User stopped: {correlation_id}")


The correlationId connects the speech detection to the transcript that arrives later via MESSAGE_STREAM_EVENT_RECEIVED. Match the IDs and you can track the full lifecycle: user starts speaking, user stops, transcript arrives, avatar responds. This same Python event system works if you're building with Pipecat or other agent frameworks.
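One way to track that lifecycle is a small map keyed by correlationId. This is a hedged sketch of hypothetical glue code, not part of the SDK: wire `onSpeechStarted`, `onSpeechEnded`, and `onTranscript` to the corresponding event listeners, and the names and shape here are illustrative only.

```typescript
// Tracks each speech segment from detection through transcript arrival.
type SpeechSegment = {
  startedAt: number;
  endedAt?: number;
  transcript?: string;
};

class SpeechLifecycle {
  private segments = new Map<string, SpeechSegment>();

  onSpeechStarted(correlationId: string): void {
    this.segments.set(correlationId, { startedAt: Date.now() });
  }

  onSpeechEnded(correlationId: string): void {
    const seg = this.segments.get(correlationId);
    if (seg) seg.endedAt = Date.now();
  }

  onTranscript(correlationId: string, text: string): void {
    const seg = this.segments.get(correlationId);
    if (seg) seg.transcript = text;
  }

  // A segment is complete once speech ended AND its transcript arrived.
  isComplete(correlationId: string): boolean {
    const seg = this.segments.get(correlationId);
    return !!seg && seg.endedAt !== undefined && seg.transcript !== undefined;
  }
}
```

Because transcripts can lag speech by a second or more, a structure like this also gives you per-segment latency numbers for free: the gap between `endedAt` and the transcript's arrival is your STT delay.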


Building a listening indicator

The most immediate use is a visual cue that appears the instant the user speaks. Here's a React implementation:

import { useEffect, useState } from "react";
import { AnamClient, AnamEvent } from "@anam-ai/js-sdk";

function ListeningIndicator({ client }: { client: AnamClient }) {
  const [isListening, setIsListening] = useState(false);

  useEffect(() => {
    const onStart = () => setIsListening(true);
    const onEnd = () => setIsListening(false);

    client.addListener(AnamEvent.USER_SPEECH_STARTED, onStart);
    client.addListener(AnamEvent.USER_SPEECH_ENDED, onEnd);

    return () => {
      client.removeListener(AnamEvent.USER_SPEECH_STARTED, onStart);
      client.removeListener(AnamEvent.USER_SPEECH_ENDED, onEnd);
    };
  }, [client]);

  if (!isListening) return null;

  return (
    <div className="listening-indicator">
      <span className="pulse" />
      Listening...
    </div>
  );
}


The indicator appears the moment the user starts talking and disappears when they stop. The cleanup in the useEffect return prevents memory leaks if the component unmounts mid-conversation.


Three patterns worth building

A boolean "listening or not" state is a start. Here are three patterns that make the interaction feel more considered.


Conversation state machine

Track where you are in the full conversation cycle:

type ConversationState = "idle" | "listening" | "processing" | "responding";

let state: ConversationState = "idle";

client.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  state = "listening";
});

client.addListener(AnamEvent.USER_SPEECH_ENDED, () => {
  state = "processing";
});

client.addListener(AnamEvent.MESSAGE_STREAM_EVENT_RECEIVED, () => {
  state = "responding";
});

client.addListener(AnamEvent.TALK_STREAM_ENDED, () => {
  state = "idle";
});


Each state drives different UI. "Listening" shows an active mic indicator. "Processing" shows a thinking animation. "Responding" highlights the interactive avatar. "Idle" returns to the default state. Four states, four distinct visual treatments, and the transitions happen at the right moment instead of all being lumped together when the transcript arrives.


Interruption feedback

When the user speaks while the avatar is talking, that's an interruption. The speech detection events let you catch this before the system processes the interrupt:

let avatarIsSpeaking = false;

client.addListener(AnamEvent.TALK_STREAM_STARTED, () => {
  avatarIsSpeaking = true;
});

client.addListener(AnamEvent.TALK_STREAM_ENDED, () => {
  avatarIsSpeaking = false;
});

client.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  if (avatarIsSpeaking) {
    showInterruptFeedback();
  }
});


Anam handles the actual interruption on the backend. The avatar stops, the pipeline resets, and the new user input gets processed. But without client-side events, the user sees nothing during the 200-400ms it takes for the interrupt to propagate. A quick visual flash ("switching to you") makes that transition feel intentional rather than glitchy.


Timeout safety net

If USER_SPEECH_ENDED never arrives because the connection dropped mid-speech, you don't want the UI stuck in a "listening" state:

let speechTimeout: ReturnType<typeof setTimeout>;

client.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  clearTimeout(speechTimeout);
  setIsListening(true);

  speechTimeout = setTimeout(() => {
    setIsListening(false);
    showMicWarning();
  }, 10_000);
});

client.addListener(AnamEvent.USER_SPEECH_ENDED, () => {
  clearTimeout(speechTimeout);
  setIsListening(false);
});

The 10-second timeout is based on a practical edge case noted in the docs: if the WebRTC connection drops mid-speech, the server-side VAD stops firing but the client never receives the end event. Reset the UI and optionally warn the user about their mic.


How detection works under the hood

These events don't use the browser's Web Audio API or a local VAD model. The detection runs on Anam's server using Deepgram's voice activity detection. Your microphone audio travels over the WebRTC connection to Anam's backend, the VAD fires there, and the event comes back over the signaling channel.

This means the events reflect what the server actually received, not what the browser thinks it heard. If the network drops, the events stop because the server genuinely isn't getting audio. A local VAD would keep firing on ambient noise with no way to know the server was disconnected.

The tradeoff is latency. A local VAD fires in under 50ms. The server-side round-trip puts you at 100-200ms depending on connection quality. For most interactive avatar applications, that's fast enough. The difference between 50ms and 150ms isn't perceptible to users. What is perceptible is the 1-2 second wait for a transcript that you'd have without these events.


What not to do

Don't use these events to gate user input or suppress the microphone. They're signals for UI feedback, not audio controls. The server is always receiving audio regardless of your indicator state.

Don't build your own endpointing logic on top of them. The VAD fires on voice energy, not on semantic boundaries. A user pausing for breath triggers USER_SPEECH_ENDED followed by another USER_SPEECH_STARTED a moment later. Deciding when the user is actually done talking is a separate problem that Anam handles server-side with its endpointing model.

Don't assume a 1:1 mapping between speech segments and transcripts. A user might speak, pause, and resume before the STT model returns a single merged transcript. The correlationId helps you trace which segment produced which transcript, but the relationship isn't always one start-end pair per transcript.


FAQ

What's the difference between speech detection events and transcript events?

Speech detection events (USER_SPEECH_STARTED, USER_SPEECH_ENDED) fire the moment voice activity is detected, typically within 100-200ms. Transcript events arrive after the speech-to-text model finishes processing, which can take 500ms to 2 seconds. Use speech detection for instant UI feedback. Use transcripts when you need the actual words.

Do these events work with Custom LLM and Custom STT setups?

Yes. The speech detection runs on Anam's server-side VAD regardless of your pipeline configuration. Whether you're using Anam's full turnkey pipeline, bringing your own LLM, or bringing your own STT, the USER_SPEECH_STARTED and USER_SPEECH_ENDED events fire the same way. The VAD operates on the raw audio before it reaches any STT model.

Can I use these events without an interactive avatar?

The events are part of Anam's session infrastructure, so they require an active avatar session. They're designed for building responsive UI around real-time avatar conversations. If you're building a voice-only agent without a visual component, a client-side VAD library would be a better fit.

How do I handle rapid speech starts and stops?

The VAD fires on voice energy, so brief pauses mid-sentence can trigger a quick USER_SPEECH_ENDED followed by USER_SPEECH_STARTED. For UI indicators, consider adding a small debounce (200-300ms) before hiding the listening state. This prevents the indicator from flickering during natural speech pauses without noticeably delaying the initial appearance.
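That debounce can be sketched as a small factory that shows the indicator immediately but delays the hide. `createDebouncedListening` is a hypothetical helper, not an SDK API; wire its handlers to the two speech events:

```typescript
function createDebouncedListening(
  setIsListening: (value: boolean) => void,
  hideDelayMs = 250
) {
  let hideTimer: ReturnType<typeof setTimeout> | undefined;

  return {
    onStart(): void {
      clearTimeout(hideTimer); // cancel a pending hide from a brief pause
      setIsListening(true); // show immediately — no debounce on the way in
    },
    onEnd(): void {
      // Hide only if no new speech starts within the delay window.
      hideTimer = setTimeout(() => setIsListening(false), hideDelayMs);
    },
  };
}
```

Pass `onStart` to USER_SPEECH_STARTED and `onEnd` to USER_SPEECH_ENDED: the indicator still appears within the VAD's 100-200ms, but a mid-sentence pause shorter than the delay never hides it.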

Which SDK versions support these events?

The JavaScript SDK supports them from v4.12.0 onward. The Python SDK supports them in the current release. Both use the same underlying server-side VAD, so the behavior is identical across languages.


Getting started

Install the JS SDK at v4.12.0 or later:
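Assuming npm as the package manager and the package name used in the imports throughout this post:

```shell
npm install @anam-ai/js-sdk
```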


Or the Python SDK:


Create a persona at lab.anam.ai, grab an API key, and start listening for events. The React component above is a working starting point you can drop into any project.

The conversation state machine pattern is what most of our customers end up building. Four states, four visual treatments, and the whole interaction feels like the avatar is actually paying attention.
