January 19, 2025
Using Gemini Vision with Anam in LiveKit
LiveKit agents can do more than just talk—they can see. By combining Gemini's vision capabilities with Anam avatars, you can build assistants that watch the user's screen and take action on what they see.
In this cookbook, we'll build an HR onboarding assistant. The user shares their screen showing an employee form, and the assistant guides them through filling it out. When the user provides information verbally, the assistant uses function tools to fill in the form fields automatically.
The complete code is at anam-org/anam-livekit-demo.
What you'll build
An onboarding assistant that:
- Displays an Anam avatar as the visual interface
- Uses Gemini Live for voice conversation and screen understanding
- Watches the user's screen share to see form fields
- Fills out forms automatically using function tools
- Runs as a LiveKit agent that joins rooms on demand
How the pieces fit together
This demo combines three services:
- Gemini Live handles the conversation by listening to the user's voice, processing their screen share, and deciding what to say or do
- Anam generates the avatar video, synchronized to the agent's speech
- LiveKit ties it all together, routing audio and video between the user, the agent, and the avatar
When the user speaks, their audio goes to Gemini. Gemini also receives frames from the user's screen share. Based on what it hears and sees, Gemini responds with speech, which Anam turns into synchronized avatar video. Gemini can also call function tools to interact with the page.
Prerequisites
- Python 3.9+
- A LiveKit Cloud account
- A Gemini API key
- An Anam API key from lab.anam.ai
Project setup
Clone the demo repository:
git clone https://github.com/anam-org/anam-livekit-demo.git
cd anam-livekit-demo/agent
Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
The key dependencies are:
- livekit-agents - The LiveKit agent framework
- livekit-plugins-google - Gemini Live integration
- livekit-plugins-anam - Anam avatar integration
Create a .env file with your credentials:
# LiveKit Cloud credentials
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_api_key
LIVEKIT_API_SECRET=your_api_secret
# Anam
ANAM_API_KEY=your_anam_key
ANAM_AVATAR_ID=your_avatar_id
# Google Gemini
GEMINI_API_KEY=your_gemini_key
You can find avatar IDs at lab.anam.ai/avatars.
Understanding the agent code
Let's walk through agent.py. We'll start with the imports and setup:
import asyncio
import json
import logging
import os
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv
load_dotenv(Path(__file__).parent / ".env")
from livekit import rtc
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    function_tool,
)
from livekit.agents.voice import VoiceActivityVideoSampler, room_io
from livekit.plugins import anam, google
We import the LiveKit agent framework, the Anam and Google plugins, and the function_tool decorator for creating tools the agent can call.
Function tools for browser control
The agent needs a way to interact with the frontend. We use LiveKit's data channel to send commands:
_current_room: Optional[rtc.Room] = None
async def send_control_command(command: str, data: dict) -> None:
    """Send a control command to the frontend via data channel."""
    if _current_room is None:
        return
    message = json.dumps({"type": command, **data})
    await _current_room.local_participant.publish_data(
        message.encode("utf-8"),
        reliable=True,
        topic="browser-control",
    )
This sends JSON messages to the frontend, which listens on the browser-control topic and executes the commands.
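To make the command flow concrete, here is a minimal sketch of a listener for the browser-control topic using the livekit rtc Python SDK's data_received event; the event name and DataPacket fields are assumptions to verify against your SDK version, and the demo's real frontend is the Next.js app, which does the equivalent with the LiveKit JS SDK.
import json
from livekit import rtc

def register_browser_control_handler(room: rtc.Room) -> None:
    # Assumed API: the Python SDK emits "data_received" with an rtc.DataPacket.
    @room.on("data_received")
    def on_data(packet: rtc.DataPacket) -> None:
        if packet.topic != "browser-control":
            return
        command = json.loads(packet.data.decode("utf-8"))
        if command["type"] == "fill_field":
            # The real frontend writes the value into the matching form input.
            print(f"fill {command['field']!r} with {command['value']!r}")
        elif command["type"] == "click":
            print(f"click {command['element']!r}")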
Now we define the tools themselves. The @function_tool decorator exposes these to Gemini:
@function_tool
async def fill_form_field(field_identifier: str, value: str) -> str:
    """Fill in a form field on the current page.

    Args:
        field_identifier: The field to fill (e.g. "Full Name", "Email Address")
        value: The value to enter into the field

    Returns:
        A confirmation message
    """
    await send_control_command(
        "fill_field", {"field": field_identifier, "value": value}
    )
    return "ok"

@function_tool
async def click_element(element_description: str) -> str:
    """Click a button or link on the page.

    Args:
        element_description: Button/element text (e.g. "Submit", "Next")

    Returns:
        A confirmation message
    """
    await send_control_command("click", {"element": element_description})
    return "ok"
The docstrings are important. Gemini uses them to understand when and how to call each tool. When the user says "My name is John Smith", Gemini sees the form on screen, understands it needs to fill the name field, and calls fill_form_field("Full Name", "John Smith").
The agent entry point
The entrypoint function runs when the agent joins a room:
async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    global _current_room
    _current_room = ctx.room
We connect to the room and subscribe to all tracks. The SUBSCRIBE_ALL option means we'll receive the user's screen share video, which is essential for vision.
Agent instructions
The instructions tell Gemini how to behave and what tools are available:
instructions = (
    "You are Maya, a friendly HR onboarding assistant. "
    "You can see the user's screen share.\n\n"
    "THE FORM HAS THESE 6 FIELDS (fill ALL before submitting):\n"
    "1. Full Name\n"
    "2. Email Address\n"
    "3. Phone Number\n"
    "4. Department\n"
    "5. Job Title\n"
    "6. Start Date\n\n"
    "Tools:\n"
    "- fill_form_field(field_name, value) - use EXACT field names above\n"
    "- click_element('Submit') - ONLY after ALL 6 fields are filled\n\n"
    "IMPORTANT: You MUST fill ALL 6 fields before clicking Submit."
)
Being explicit about field names helps Gemini use the tools correctly. The instructions also prevent premature form submission.
Creating the models
Now we set up Gemini and Anam:
# Create Gemini Live realtime model
llm = google.realtime.RealtimeModel(
    api_key=os.environ.get("GEMINI_API_KEY"),
    voice="Aoede",
    instructions=instructions,
)

# Create Anam Avatar session
avatar = anam.AvatarSession(
    persona_config=anam.PersonaConfig(
        name="Maya",
        avatarId=os.environ.get("ANAM_AVATAR_ID")
    ),
    api_key=os.environ.get("ANAM_API_KEY"),
)
Gemini handles the conversation logic and voice output. The voice parameter sets Gemini's TTS voice. Anam takes that audio and generates synchronized avatar video.
Video sampling for vision
For screen share analysis, we configure how often to send frames to Gemini:
session = AgentSession(
    llm=llm,
    video_sampler=VoiceActivityVideoSampler(
        speaking_fps=0.2,  # 1 frame every 5 seconds while speaking
        silent_fps=0.1,    # 1 frame every 10 seconds while silent
    ),
    tools=[fill_form_field, click_element],
)
The VoiceActivityVideoSampler is efficient: it samples more frequently during active conversation and less during silence. This keeps Gemini aware of screen changes without overwhelming it with frames.
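If your screen content changes quickly and you want Gemini to notice sooner, you can raise the sampling rates. The values below are illustrative rather than taken from the demo, and higher rates mean more frames (and tokens) sent to Gemini:
video_sampler = VoiceActivityVideoSampler(
    speaking_fps=0.5,  # illustrative: one frame every 2 seconds while the user is speaking
    silent_fps=0.2,    # illustrative: one frame every 5 seconds during silence
)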
Starting everything
Finally, we start the avatar and agent:
await avatar.start(session, room=ctx.room)
await session.start(
    agent=Agent(instructions=instructions),
    room=ctx.room,
    room_input_options=room_io.RoomInputOptions(video_enabled=True),
)
The video_enabled=True option tells the agent to accept video input (the screen share). The avatar starts first so it's ready to display when the agent begins speaking.
Running the agent
The main block starts the agent worker:
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
Running the demo
Start the agent in development mode:
python agent.py dev
The agent connects to LiveKit Cloud and waits for rooms to be created.
For the frontend, go back to the repository root and start the Next.js app:
cd ..
pnpm install
pnpm dev
Open http://localhost:3000. You'll see a demo onboarding form. Click to connect, then share your screen. The avatar will greet you and guide you through filling out the form.
Try saying things like:
- "My name is John Smith"
- "My email is john@example.com"
- "I'm starting in the Engineering department as a Senior Developer"
The assistant will fill in the fields as you provide information.
Adapting for your use case
The onboarding form is just one example. The same pattern works for:
- Technical support - Watch the user's screen and guide them through troubleshooting
- Education - See what the student is working on and provide contextual help
- Data entry - Fill out complex forms based on verbal input
- Accessibility - Help users who have difficulty using a keyboard
To adapt the demo:
- Update the instructions to describe your use case and available fields
- Modify the function tools to match your frontend's expectations (see the sketch below)
- Update the frontend to handle the control commands appropriately
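For example, if your page also has dropdowns, you could add a third tool that follows the same pattern as fill_form_field and click_element. The tool name select_dropdown_option and the select_option command are hypothetical; your frontend would need a matching handler on the browser-control topic, and the tool must be added to the AgentSession's tools list and mentioned in the instructions so Gemini knows when to use it.
@function_tool
async def select_dropdown_option(dropdown_label: str, option: str) -> str:
    """Select an option from a dropdown on the current page.

    Args:
        dropdown_label: The dropdown to change (e.g. "Department")
        option: The option text to select

    Returns:
        A confirmation message
    """
    await send_control_command(
        "select_option", {"dropdown": dropdown_label, "option": option}
    )
    return "ok"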
Deploying to production
For production, use the start subcommand instead of dev:
python agent.py start
The repository includes a Dockerfile for containerized deployments:
docker build -t onboarding-agent .
docker run --env-file .env onboarding-agent
See the LiveKit deployment docs for Kubernetes and cloud platform guides.