D-ID Just Launched V4. Here's What It Means for Real-Time AI Avatars.

D-ID announced V4 Expressive Visual Agents on March 16, 2026. It's a significant release. New rendering architecture, lower latency, sentiment-aware expressions, and an aggressive price point. The avatar space just got more competitive.

We think that's a good thing. More competition means faster progress for everyone. But it also means buyers need to understand the differences. Here's our honest breakdown.

What D-ID V4 actually brings

The headline feature is a new diffusion-based rendering model, trained on performances captured from real actors. This is a meaningful architectural shift from D-ID's previous approach and it shows in the output quality. Avatars can now render at up to 4K resolution.

Sentiment-aware expressions are the other standout. The avatar dynamically aligns its facial expressions with the tone of the conversation. Empathy looks empathetic. Urgency feels urgent. There's also an optional camera layer that reads the user's nonverbal cues in real time, which is a genuinely interesting feature for use cases like coaching or training.

D-ID also introduced what they call "MCP Apps": the ability to surface interactive UI elements (images, charts, forms, quizzes) inline during a conversation. This moves the avatar from a talking head toward something closer to a full conversational interface.

The pricing starts at $5.90/month, which D-ID says is 70x cheaper than Google VEO 3 Fast. That's aggressive, and it signals that D-ID wants volume, not just enterprise contracts.

Credit where it's due: this is real progress. D-ID has built a large platform with 800,000+ visual agents and 300M+ non-interactive avatars created on their previous models. V4 is a serious step forward.

Where things get interesting: latency

D-ID claims sub-500ms conversational latency for V4. That's a big improvement over their previous generation, and it puts them in the ballpark for real-time interaction.

But ballpark and best-in-class are different things.

Anam's CARA III architecture delivers sub-200ms latency. Comparing the two stated ceilings, that's a 2.5x difference. In raw numbers, 300 milliseconds doesn't sound like much. In conversation, it's the gap between a response that feels natural and one where you notice the delay. Human turn-taking in conversation happens in roughly 200ms on average. Every millisecond above that threshold chips away at the feeling that you're talking to someone rather than waiting for something.
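
To make the arithmetic concrete, here is a minimal Python sketch. It assumes the two figures are taken at face value as upper bounds (500ms and 200ms) and uses the roughly 200ms turn-taking gap cited above. The constant and function names are ours, chosen for illustration, and the numbers are vendor claims rather than measurements.

```python
# Stated latency ceilings (vendor claims, treated here as upper bounds).
DID_V4_CEILING_MS = 500
ANAM_CARA3_CEILING_MS = 200
# Approximate average gap between turns in human conversation, as cited above.
HUMAN_TURN_GAP_MS = 200


def compare(ceiling_a_ms: float, ceiling_b_ms: float) -> tuple[float, float]:
    """Return (ratio, absolute difference in ms) of ceiling A relative to ceiling B."""
    return ceiling_a_ms / ceiling_b_ms, ceiling_a_ms - ceiling_b_ms


def excess_over_turn_gap(latency_ms: float, turn_gap_ms: float = HUMAN_TURN_GAP_MS) -> float:
    """Milliseconds by which a response latency exceeds the turn-taking gap."""
    return max(0.0, latency_ms - turn_gap_ms)


ratio, diff_ms = compare(DID_V4_CEILING_MS, ANAM_CARA3_CEILING_MS)
print(f"ratio: {ratio:.1f}x, difference: {diff_ms:.0f} ms")
print(f"500 ms ceiling exceeds the turn gap by {excess_over_turn_gap(DID_V4_CEILING_MS):.0f} ms")
print(f"200 ms ceiling exceeds the turn gap by {excess_over_turn_gap(ANAM_CARA3_CEILING_MS):.0f} ms")
```

The point of the second function is that the comparison that matters is not platform against platform but each platform against the pace of human conversation.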

This is why latency matters more than resolution, expression variety, or any other spec on the sheet. You can have the most photorealistic avatar in the world, but if there's a perceptible pause after every sentence, the illusion breaks.

The independent data

In January 2026, research firm Mabyduck published an independent evaluation of interactive avatar platforms. They tested Anam, D-ID, Tavus, and HeyGen with 178 participants across multiple metrics.

Anam ranked #1 across all measured dimensions, with statistical significance at p < 0.001. The full results are available at avatarbenchmark.com.

The most telling finding: responsiveness was the single strongest predictor of overall experience quality. Not visual fidelity. Not expression range. Responsiveness.

It's worth noting that during the October 2025 evaluation period, D-ID required external speech-to-text integration (Cartesia) because they didn't support native voice interactions at the time. V4 may change this, and we'd welcome an updated benchmark that tests the new architecture.

Different origins, different architectures

D-ID started as an image-to-talking-video company. Their Creative Reality Studio let you upload a photo and generate a video of it speaking. It was clever, and it found a massive audience.

Real-time conversational interaction is a different problem. When you're generating pre-recorded video, you can take seconds (or minutes) to render. When you're in a live conversation, every frame has to be generated and delivered in real time, with no perceptible buffering. The architecture you need for one doesn't naturally extend to the other.

D-ID's V4 is optimised to serve both use cases: pre-recorded video generation and real-time interaction. That's a valid strategic choice, especially for a platform with millions of existing users who want both.

Anam took a different path. We built CARA (Conversational Autonomous Realtime Avatar) from scratch for one purpose: real-time interactive conversation. No pre-recorded video mode. No batch rendering. Every architectural decision, from the rendering pipeline to the streaming protocol, is optimised for the lowest possible latency in a live interaction.

Neither approach is wrong. But the trade-offs are real. When you optimise for two things, you compromise on both. When you optimise for one thing, you can push it further.

What this means for the market

D-ID's V4 launch validates something we've been saying for a while: the market for real-time interactive avatars is real, it's growing, and enterprises are ready to deploy. When a company with 1,500 enterprise customers makes real-time conversation a headline feature, that's a signal.

It also raises the bar for what buyers should expect. Sentiment-aware expressions should be table stakes. Sub-second latency should be the minimum, not the target. And independent benchmarks should be part of every evaluation process.

We think the next 12 months will separate the platforms that treat real-time interaction as a feature from the ones that treat it as the product. Both can succeed. But they'll serve different customers with different priorities.

If you're evaluating avatar platforms

A few things worth considering:

Test latency yourself. Spec sheets are marketing. Run a conversation. Count the pauses. That's the number that matters.
Check the benchmark. Mabyduck's evaluation is the most rigorous independent comparison published to date. Use it as a starting point.
Understand the architecture. Ask whether the platform was built for real-time or adapted for it. The answer tells you where the engineering focus is.
Look at the developer experience. How fast can you go from API key to working prototype? Anam's documentation and Lab are designed to get you there in minutes, not weeks.
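
If you want something more repeatable than counting pauses by ear, the first suggestion above can be roughly sketched as a small timing harness. This is a minimal sketch under stated assumptions: `ask` is a hypothetical stand-in for whatever call sends one user turn and returns when the avatar's first response arrives, and `fake_avatar` is a placeholder, not any vendor's API. Swap in a real client call for the platform you are testing.

```python
import statistics
import time
from typing import Callable


def measure_latencies_ms(ask: Callable[[str], object], prompts: list[str]) -> list[float]:
    """Time each call to `ask` and return the durations in milliseconds.

    `ask` is hypothetical: it should send one user turn and return as soon
    as the avatar's first audio or video arrives.
    """
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        ask(prompt)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Report the median and the slowest turn; the slow turns are the ones users notice."""
    return {
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_avatar(prompt: str) -> str:
    """Placeholder that simulates a response after about 50 ms."""
    time.sleep(0.05)
    return "ok"


if __name__ == "__main__":
    samples = measure_latencies_ms(fake_avatar, ["Hello", "What can you do?", "Thanks"])
    print(summarize(samples))
```

Run the same prompts against each platform from the same network, and look at the slowest turns as well as the median.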

Where we go from here

D-ID has pushed the conversation forward with V4. We respect the engineering work behind it. The avatar market is better when multiple teams are solving hard problems.

Our focus hasn't changed. We're building the fastest, most responsive real-time avatar platform available. CARA III is the result of that focus, and we're not done.

If you want to see how Anam compares, the best way is to try it. Book a demo or explore the interactive playground. The numbers speak for themselves, but the experience speaks louder.

For a deeper technical comparison with D-ID's platform, check out our D-ID API review.

Frequently Asked Questions

How does Anam's latency compare to D-ID V4?

D-ID V4 claims sub-500ms latency. Anam's CARA III delivers sub-200ms, a ceiling 2.5x lower. In conversation, that difference is noticeable. Mabyduck's independent benchmark found responsiveness to be the strongest predictor of experience quality, ahead of visual fidelity.

What's new in D-ID V4?

A new diffusion-based rendering model, sentiment-aware expressions, and an optional camera layer that reads user nonverbal cues. Real progress. For a full breakdown, see our D-ID API review and direct comparison.

Is there independent data comparing the two platforms?

Yes. Mabyduck tested Anam, D-ID, Tavus, and HeyGen with 178 participants and published the results in January 2026. Anam ranked first across all metrics. Full results at avatarbenchmark.com. Note: the evaluation ran in October 2025, so it predates V4. See also our 2026 avatar platform guide.

How should I evaluate avatar platforms?

Test latency yourself: run a conversation and count the pauses. Check independent benchmarks. Ask whether the platform was built for real-time or adapted from pre-recorded video. Our conversational AI API guide covers the key questions to ask, and you can compare pricing directly.

What does D-ID V4 mean for businesses evaluating avatars now?

It confirms the market is ready. It also raises the bar: sub-second latency and expressive avatars are now the minimum expectation. For sales, healthcare, recruitment, or L&D use cases, architecture differences have real impact. Book a demo and test it yourself.
