Voxtral Voices from Mistral with an Anam Face

Harry Smaje

Apr 2, 2026

Mistral just released Voxtral TTS. It's a 4-billion-parameter text-to-speech model that does zero-shot voice cloning from as little as three seconds of audio. Nine languages, $0.016 per thousand characters, and benchmarks that put it alongside ElevenLabs v3 on naturalness while running faster than Flash v2.5.
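To put the pricing in context, here is a quick back-of-the-envelope calculation using the $0.016 per thousand characters figure quoted above. It is a rough sketch only: it assumes billing is proportional to the character count as Python's `len()` sees it, which may not match how Mistral actually meters usage, and `estimate_tts_cost` is a name made up for this example.

```python
from decimal import Decimal

# Rate quoted in this post: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.016")


def estimate_tts_cost(text: str) -> Decimal:
    """Estimate the USD cost of synthesizing `text` at the quoted rate.

    Assumption: cost scales linearly with len(text). Real billing may
    count characters differently or round per request.
    """
    return PRICE_PER_1K_CHARS * Decimal(len(text)) / Decimal(1000)


# A typical short agent reply, used here purely as sample input.
reply = "Sure, I can help with that. What would you like to know?"
print(f"{len(reply)} characters, about ${estimate_tts_cost(reply)}")
```

Using `Decimal` avoids floating-point noise in the cents, which matters more once you multiply by thousands of conversation turns.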

The voice cloning is the standout. Mistral calls it "voice-as-instruction." You give it a short reference clip and the model picks up your intonation, rhythm, accent, and emotional tone. No prosody tags, no SSML. I tested it with several different accents and speaking styles, and the output was hard to distinguish from the reference in most cases.

Between Voxtral TTS and Mistral's LLM lineup, you can now build a voice agent that's almost entirely Mistral-powered. The brain reasons, Voxtral speaks. What you don't get is a face.

I built a demo that adds one. Try it right here:

Demo

Mistral handles the reasoning and voice synthesis. Anam's Cara 3 model takes the Voxtral audio output and generates an interactive avatar in real time, with lip sync, head movement, and expressions all derived from the audio signal.

How the pipeline works

Four models run in sequence, one per conversation stage.

Deepgram Nova 3 transcribes the user's speech into text. Mistral doesn't offer a dedicated STT model yet, so Deepgram fills that role. Mistral Small takes the transcript and generates a response. Voxtral Mini TTS converts that response into speech, either using a preset voice or a cloned one. Then Anam Cara 3 receives the audio stream and renders a face that matches it, which gets published back to the user as a live video feed.
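The four stages above can be sketched as a simple chain of functions. This is not the demo's code: the real agent streams audio through LiveKit rather than passing whole buffers around, and every name below (`run_turn`, `Turn`, the `fake_*` stand-ins) is hypothetical. The stand-ins replace the Deepgram, Mistral, Voxtral, and Anam calls so the data flow is visible without any API keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    """Everything produced during one conversational turn."""
    transcript: str    # STT output
    reply: str         # LLM output
    audio: bytes       # TTS output
    frames: list[str]  # avatar output (placeholder for video frames)


def run_turn(
    mic_audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    avatar: Callable[[bytes], list[str]],
) -> Turn:
    """Run the four stages in order, feeding each output to the next stage."""
    transcript = stt(mic_audio)  # speech -> text
    reply = llm(transcript)      # text -> response text
    audio = tts(reply)           # response text -> speech
    frames = avatar(audio)       # speech -> face frames
    return Turn(transcript, reply, audio, frames)


# Stand-in stages so the sketch runs offline. Real stages would call
# the respective services instead.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_avatar(audio: bytes) -> list[str]:
    # Pretend every 4 bytes of audio yields one video frame.
    return [f"frame-{i}" for i in range(len(audio) // 4)]


turn = run_turn(b"hello", fake_stt, fake_llm, fake_tts, fake_avatar)
print(turn.transcript, "->", turn.reply, "->", len(turn.frames), "frames")
```

Passing the stages in as callables mirrors the point made later in the post: the face layer only needs audio, so any stage can be swapped without touching the others.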

LiveKit connects all of this. It handles the WebRTC transport, moving microphone audio from the user to the agent and sending avatar video back. The Anam LiveKit plugin sits between the TTS output and the video track, routing audio through Anam's face generation pipeline. If you've read the LiveKit avatar agent guide on our blog, the pattern is the same.

The entire agent backend is a single Python file. Rather than paste it here, I've put the full source on GitHub, and the Anam docs cover the integration patterns.

Voice cloning in the demo

In the demo you switch to a "Clone My Voice" tab, record a short clip of yourself talking, and then the avatar speaks back in your voice. The whole flow takes about ten seconds.

Voxtral doesn't need fine-tuning or pre-processing. It takes your reference audio and captures your speaking style immediately. I tried it with a few different team members and it handled the variation in accent and pace each time. Combine that with Anam's avatar rendering and you get an agent that both looks and sounds like a specific person. We've already been thinking about how this applies to sales coaching and training simulations where you want a consistent persona.
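If you build a recording step like this yourself, it can help to check the clip length before sending it off. The sketch below uses the three-second minimum mentioned at the top of this post as its threshold. It assumes the recording arrives as uncompressed WAV bytes, which is an assumption on my part for illustration (browsers often record in other formats), and the function names are hypothetical rather than anything from the demo or Mistral's API.

```python
import io
import wave

# Threshold taken from the "as little as three seconds" figure in this post.
MIN_REFERENCE_SECONDS = 3.0


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(wav_bytes: bytes) -> bool:
    """True if the clip meets the assumed minimum reference length."""
    return wav_duration_seconds(wav_bytes) >= MIN_REFERENCE_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV clip, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


print(is_long_enough(make_silent_wav(1)))
print(is_long_enough(make_silent_wav(4)))
```

A check like this only measures length. It says nothing about whether the clip contains clear speech, so treat it as a cheap first filter.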

For the more technical details on Voxtral, including language support, streaming latency, and the API reference, Mistral's TTS documentation covers it.

What the demo includes

The live demo lets you pick from four Anam avatars, choose a Voxtral voice preset or clone your own, and have a real-time conversation with a lip-synced avatar. The UI shows pipeline badges that light up as each model activates: Deepgram when listening, Mistral when thinking, Voxtral and Anam when speaking. I added that because it makes the data flow visible, which helps people understand what's actually happening at each stage.
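The badge logic boils down to a lookup from agent state to the models active in that state. The demo's frontend is Next.js, so the Python below is only an illustration of the mapping described above, and the names `AgentState` and `badges_for` are invented for this sketch.

```python
from enum import Enum


class AgentState(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Which pipeline badges light up in each state, as described in the post.
ACTIVE_BADGES: dict[AgentState, tuple[str, ...]] = {
    AgentState.LISTENING: ("Deepgram",),
    AgentState.THINKING: ("Mistral",),
    AgentState.SPEAKING: ("Voxtral", "Anam"),
}


def badges_for(state: AgentState) -> tuple[str, ...]:
    """Return the badges that should be lit for the given agent state."""
    return ACTIVE_BADGES[state]


for state in AgentState:
    print(f"{state.value}: {', '.join(badges_for(state))}")
```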

The full source is open. The agent is one Python file, the frontend is Next.js on Vercel, and the README covers setup.

Building your own

You need API keys from Mistral for the LLM and TTS, Deepgram for STT, and an Anam account for the avatar. LiveKit Cloud has a free tier for development. Clone the repo, add your keys, and it runs.
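Before starting the agent, a small preflight check can tell you which keys are still missing. The environment variable names below are guesses for illustration, not the names the repo uses, so check the README for the real ones.

```python
import os
from typing import Mapping

# Hypothetical variable names; the repo's README is the authority here.
REQUIRED_KEYS = (
    "MISTRAL_API_KEY",     # LLM and TTS
    "DEEPGRAM_API_KEY",    # STT
    "ANAM_API_KEY",        # avatar
    "LIVEKIT_URL",         # LiveKit Cloud project URL
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
)


def missing_keys(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


missing = missing_keys(os.environ)
print("Missing:", ", ".join(missing) if missing else "none")
```

Taking the environment as a parameter instead of reading `os.environ` inside the function keeps the check easy to test with a plain dictionary.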

If you already have a voice agent and want to add a face without changing your existing setup, Anam's API works with any audio source. It's the same approach as the ElevenLabs integration: the face layer is independent of the voice and brain behind it. Check our pricing for the free tier limits, or book a call if you're building something production-grade.

What would you build with this stack?
