Speech detection events for better conversational AI design
There's a gap in most conversational AI design that nobody talks about. It's the dead air between when a user starts speaking and when the system acknowledges it.
You say something. The avatar stares at you. A second passes, maybe two. Then suddenly a transcript appears and the system snaps to attention. That delay isn't a processing problem. The system heard you immediately. It just didn't tell you.
The Anam JS SDK (v4.12.0+) and Python SDK now emit userSpeechStarted and userSpeechEnded events the instant server-side voice activity detection fires, before any transcription happens. Your UI can react the moment someone opens their mouth, not when the transcript arrives.
Why this gap breaks conversational AI design
Human conversation runs on real-time feedback loops. When you talk to someone, they nod, shift their gaze, lean in. These micro-signals happen within 200ms of hearing speech. Remove them and the conversation feels off, like talking to someone wearing headphones.
Most avatar SDKs only expose transcript events. The transcript arrives after the speech-to-text model finishes processing, which takes anywhere from 500ms to 2 seconds depending on the utterance length. During that entire window, your UI does nothing. The user has no idea if the system heard them, if their mic is working, or if they should speak louder.
Speech detection events close that gap. They fire the moment Anam's server-side Deepgram VAD detects voice energy, typically within 100-200ms of the user starting to speak. Use them to show a "listening" indicator, animate the avatar's attention, suppress other UI elements, or do anything else that signals acknowledgment.
The events
Both events pass a correlationId string that links a speech segment to its eventual transcript.
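A JavaScript sketch of subscribing to both events. It assumes the SDK exposes them through its AnamEvent enum and addListener method, the same way it exposes MESSAGE_STREAM_EVENT_RECEIVED; the payload shape beyond correlationId is illustrative.

```javascript
import { AnamEvent } from "@anam-ai/js-sdk"; // import path may vary by SDK version

// anamClient: a client with an active session, created elsewhere in your app.
anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, ({ correlationId }) => {
  // Fires as soon as the server-side VAD detects voice energy.
  console.log("user started speaking", correlationId);
});

anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, ({ correlationId }) => {
  // Fires when the VAD stops detecting voice energy for this segment.
  console.log("user stopped speaking", correlationId);
});
```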
The correlationId connects the speech detection to the transcript that arrives later via MESSAGE_STREAM_EVENT_RECEIVED. Match the IDs and you can track the full lifecycle: user starts speaking, user stops, transcript arrives, avatar responds. This same Python event system works if you're building with Pipecat or other agent frameworks.
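If you want to trace that lifecycle explicitly, a small Map keyed by correlationId is enough. This is a sketch; the field names on the transcript payload are assumptions, so check them against your SDK version.

```javascript
const segments = new Map(); // correlationId -> timing and transcript info

anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, ({ correlationId }) => {
  segments.set(correlationId, { startedAt: Date.now() });
});

anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, ({ correlationId }) => {
  const segment = segments.get(correlationId);
  if (segment) segment.endedAt = Date.now();
});

anamClient.addListener(AnamEvent.MESSAGE_STREAM_EVENT_RECEIVED, (message) => {
  // correlationId and content on the transcript payload are assumed field names.
  const segment = segments.get(message.correlationId);
  if (segment) segment.transcript = message.content;
});
```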
Building a listening indicator
The most immediate use is a visual cue that appears the instant the user speaks. Here's a React implementation:
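A sketch under the same assumptions as above (addListener / removeListener and the AnamEvent constants); the component and class names are placeholders.

```jsx
import { useEffect, useState } from "react";
import { AnamEvent } from "@anam-ai/js-sdk"; // import path may vary by SDK version

// anamClient is assumed to be created elsewhere and passed in as a prop.
export function ListeningIndicator({ anamClient }) {
  const [isListening, setIsListening] = useState(false);

  useEffect(() => {
    const onStart = () => setIsListening(true);
    const onEnd = () => setIsListening(false);

    anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, onStart);
    anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, onEnd);

    // Remove the listeners if the component unmounts mid-conversation.
    return () => {
      anamClient.removeListener(AnamEvent.USER_SPEECH_STARTED, onStart);
      anamClient.removeListener(AnamEvent.USER_SPEECH_ENDED, onEnd);
    };
  }, [anamClient]);

  return isListening ? <div className="listening-indicator">Listening…</div> : null;
}
```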
The indicator appears the moment the user starts talking and disappears when they stop. The cleanup in the useEffect return prevents memory leaks if the component unmounts mid-conversation.
Three patterns worth building
A boolean "listening or not" state is a start. Here are three patterns that make the interaction feel more considered.
Conversation state machine
Track where you are in the full conversation cycle:
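Here's a JavaScript sketch of that machine. The two speech events drive the listening and processing transitions; the responding transition leans on an assumed role field in the MESSAGE_STREAM_EVENT_RECEIVED payload, so verify it against your SDK version.

```javascript
import { AnamEvent } from "@anam-ai/js-sdk";

// anamClient: the same client instance used in the earlier examples.
// States: "idle" | "listening" | "processing" | "responding"
let conversationState = "idle";

function setState(next) {
  conversationState = next;
  renderStateUi(next); // hypothetical hook into your rendering layer
}

// User starts speaking: show the active mic indicator.
anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, () => setState("listening"));

// User stops speaking: show the thinking animation while STT and the LLM run.
anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, () => setState("processing"));

// Persona output starts streaming back: highlight the avatar.
// The role check is an assumption about the payload shape.
anamClient.addListener(AnamEvent.MESSAGE_STREAM_EVENT_RECEIVED, (event) => {
  if (event.role === "persona") setState("responding");
});

// Return to "idle" from whatever end-of-reply signal your integration already
// uses (for example, the end of the persona's message stream).
```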
Each state drives different UI. "Listening" shows an active mic indicator. "Processing" shows a thinking animation. "Responding" highlights the interactive avatar. "Idle" returns to the default state. Four states, four distinct visual treatments, and the transitions happen at the right moment instead of all being lumped together when the transcript arrives.
Interruption feedback
When the user speaks while the avatar is talking, that's an interruption. The speech detection events let you catch this before the system processes the interrupt:
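A sketch building on the state machine above; flashBanner is a hypothetical UI helper.

```javascript
anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  // conversationState comes from the state machine above.
  if (conversationState === "responding") {
    // Anam handles the barge-in server-side; this is purely visual feedback
    // while the interrupt propagates.
    flashBanner("Switching to you…"); // hypothetical UI helper
  }
  setState("listening");
});
```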
Anam handles the actual interruption on the backend. The avatar stops, the pipeline resets, and the new user input gets processed. But without client-side events, the user sees nothing during the 200-400ms it takes for the interrupt to propagate. A quick visual flash ("switching to you") makes that transition feel intentional rather than glitchy.
Timeout safety net
If USER_SPEECH_ENDED never arrives because the connection dropped mid-speech, you don't want the UI stuck in a "listening" state:
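A sketch of that safety net; warnAboutMic stands in for whatever mic-troubleshooting hint your UI shows.

```javascript
let speechTimeout = null;

anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  setState("listening");
  clearTimeout(speechTimeout);
  // If USER_SPEECH_ENDED never arrives (e.g. the connection drops mid-speech),
  // reset the UI instead of leaving it stuck on "listening".
  speechTimeout = setTimeout(() => {
    setState("idle");
    warnAboutMic(); // hypothetical: surface a "we lost your audio" hint
  }, 10_000);
});

anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, () => {
  clearTimeout(speechTimeout);
  setState("processing");
});
```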
The 10-second timeout is based on a practical edge case noted in the docs: if the WebRTC connection drops mid-speech, the server-side VAD stops firing but the client never receives the end event. Reset the UI and optionally warn the user about their mic.
How detection works under the hood
These events don't use the browser's Web Audio API or a local VAD model. The detection runs on Anam's server using Deepgram's voice activity detection. Your microphone audio travels over the WebRTC connection to Anam's backend, the VAD fires there, and the event comes back over the signaling channel.
This means the events reflect what the server actually received, not what the browser thinks it heard. If the network drops, the events stop because the server genuinely isn't getting audio. A local VAD would keep firing on ambient noise with no way to know the server was disconnected.
The tradeoff is latency. A local VAD fires in under 50ms. The server-side round-trip puts you at 100-200ms depending on connection quality. For most interactive avatar applications, that's fast enough. The difference between 50ms and 150ms isn't perceptible to users. What is perceptible is the 1-2 second wait for a transcript that you'd have without these events.
What not to do
Don't use these events to gate user input or suppress the microphone. They're signals for UI feedback, not audio controls. The server is always receiving audio regardless of your indicator state.
Don't build your own endpointing logic on top of them. The VAD fires on voice energy, not on semantic boundaries. A user pausing for breath triggers USER_SPEECH_ENDED followed by another USER_SPEECH_STARTED a moment later. Deciding when the user is actually done talking is a separate problem that Anam handles server-side with its endpointing model.
Don't assume a 1:1 mapping between speech segments and transcripts. A user might speak, pause, and resume before the STT model returns a single merged transcript. The correlationId helps you trace which segment produced which transcript, but the relationship isn't always one start-end pair per transcript.
FAQ
What's the difference between speech detection events and transcript events?
Speech detection events (USER_SPEECH_STARTED, USER_SPEECH_ENDED) fire the moment voice activity is detected, typically within 100-200ms. Transcript events arrive after the speech-to-text model finishes processing, which can take 500ms to 2 seconds. Use speech detection for instant UI feedback. Use transcripts when you need the actual words.
Do these events work with Custom LLM and Custom STT setups?
Yes. The speech detection runs on Anam's server-side VAD regardless of your pipeline configuration. Whether you're using Anam's full turnkey pipeline, bringing your own LLM, or bringing your own STT, the USER_SPEECH_STARTED and USER_SPEECH_ENDED events fire the same way. The VAD operates on the raw audio before it reaches any STT model.
Can I use these events without an interactive avatar?
The events are part of Anam's session infrastructure, so they require an active avatar session. They're designed for building responsive UI around real-time avatar conversations. If you're building a voice-only agent without a visual component, a client-side VAD library would be a better fit.
How do I handle rapid speech starts and stops?
The VAD fires on voice energy, so brief pauses mid-sentence can trigger a quick USER_SPEECH_ENDED followed by USER_SPEECH_STARTED. For UI indicators, consider adding a small debounce (200-300ms) before hiding the listening state. This prevents the indicator from flickering during natural speech pauses without noticeably delaying the initial appearance.
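A minimal sketch of that debounce, reusing the setIsListening state from the React example above:

```javascript
let hideTimer = null;

anamClient.addListener(AnamEvent.USER_SPEECH_STARTED, () => {
  clearTimeout(hideTimer); // cancel a pending hide from a brief pause
  setIsListening(true);    // show immediately; no delay on the start side
});

anamClient.addListener(AnamEvent.USER_SPEECH_ENDED, () => {
  // Wait ~250ms before hiding so mid-sentence pauses don't cause flicker.
  hideTimer = setTimeout(() => setIsListening(false), 250);
});
```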
Which SDK versions support these events?
The JavaScript SDK supports them from v4.12.0 onward. The Python SDK supports them in the current release. Both use the same underlying server-side VAD, so the behavior is identical across languages.
Getting started
Install the JS SDK at v4.12.0 or later, or the Python SDK.
Create a persona at lab.anam.ai, grab an API key, and start listening for events. The React component above is a working starting point you can drop into any project.
The conversation state machine pattern is what most of our customers end up building. Four states, four visual treatments, and the whole interaction feels like the avatar is actually paying attention.