Building a Real-Time AI Avatar Assistant with OpenAI Realtime + HeyGen

Written by

Debakshi B.

Post Date

Nov 27, 2025


Real-time voice interfaces once required a patchwork of APIs, sockets, and custom audio pipelines. Today, we can build fully interactive, streaming multimodal experiences — complete with live transcription, dynamic responses, and photorealistic avatars — using only a few well-structured components.

In this post, we’ll walk through how to build an end-to-end conversational system that uses:

1. OpenAI’s Realtime API for two-way speech interaction

2. HeyGen’s Streaming Avatar for natural, synchronized video responses

3. A vector database (Pinecone) for retrieval-augmented answers

4. A small backend to broker authentication and embeddings

5. A browser client for audio capture, VAD, and the live avatar

What Can We Make With This?

Customer Support Digital Agent

Medical or Enterprise Training Simulators

AI Concierge Experiences

Hands-free Productivity Tools

Why Real-Time Matters

Traditional chat UIs wait for turn-based messages. Real-time agents behave differently: they listen continuously, transcribe speech as you talk, and respond in streaming fragments. It’s a completely different UX — and much more natural.

When we add a talking avatar, the goal shifts again: responses need to arrive fast, appear human, and stay adaptable to context. That requires a pipeline where transcription, streaming text generation, and avatar playback are tightly coupled rather than chained one after another.

Let’s break down how the pieces connect.

Key components:

Backend (Node + Express)

Browser Client

OpenAI Realtime Session

HeyGen Avatar

The glue between these pieces is a set of tight event loops that respond to speech start/end, text deltas, and avatar playback.
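Because the browser side communicates through window-level CustomEvents, it helps to keep their names and payload shapes in one place. The registry below is illustrative housekeeping, not part of the original code; the event names are the ones used throughout the client code in this post:

```typescript
// Names and payload shapes of the window-level CustomEvents the client
// dispatches. Centralizing them avoids typo'd event names across modules.
export const RT_EVENTS = {
  userTranscript: "userTranscript",             // detail: { transcript: string }
  transcriptDelta: "realtimeTranscriptDelta",   // detail: { delta: string }
  transcriptDone: "realtimeTranscriptComplete", // detail: { transcript: string }
  error: "realtimeError",                       // detail: { error: unknown }
} as const;

export type RtEventName = (typeof RT_EVENTS)[keyof typeof RT_EVENTS];
```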

How the Realtime API Fits In

The Realtime API handles two critical tasks:

1. Live Transcription

As raw PCM audio streams in, events like:

conversation.item.input_audio_transcription.completed

fire in real time. These events carry the user’s spoken text, which we surface in the UI and store in the agent’s context.

2. Streaming Responses

Your code listens for both:

transport.on("response.output_text.delta", (event) => { ... });
transport.on("response.output_text.done", (event) => { ... });

This allows the UI (and avatar) to receive feedback immediately, even before a sentence finishes — an essential detail for natural UX.

How we configure the session

session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: {
    inputAudioFormat: "pcm16",
    outputModalities: ["text"], // important
    audio: { input: { format: "pcm16" } },
  },
});

We intentionally set text-only output because HeyGen handles speech synthesis on its own.

function setupRealtimeEventListeners(
  session: RealtimeSession<AgentContextData>
) {
  const transport = session.transport;
  if (!transport) {
    console.error("No transport available on session");
    return;
  }

  // Force text-only output mode on session creation
  transport.on("session.created", () => {
    session?.transport.sendEvent({
      type: "session.update",
      session: { type: "realtime", output_modalities: ["text"] },
    });
  });

  // Streaming assistant text deltas
  transport.on("response.output_text.delta", (event: any) => {
    const delta = event.delta || "";
    if (delta) {
      window.dispatchEvent(
        new CustomEvent("realtimeTranscriptDelta", {
          detail: { delta },
        })
      );
    }
  });

  // Complete assistant response text
  transport.on("response.output_text.done", (event: any) => {
    const transcript = event.text || "";
    if (transcript) {
      console.log("Complete assistant transcript:", transcript);
      window.dispatchEvent(
        new CustomEvent("realtimeTranscriptComplete", {
          detail: { transcript },
        })
      );
    }
  });

  // User speech transcription
  transport.on(
    "conversation.item.input_audio_transcription.completed",
    (event: any) => {
      const userTranscript = event.transcript || "";
      if (userTranscript) {
        console.log("User said:", userTranscript);
        window.dispatchEvent(
          new CustomEvent("userTranscript", {
            detail: { transcript: userTranscript },
          })
        );
      }
    }
  );

  // Errors
  transport.on("error", (event: any) => {
    console.error("Realtime session error:", event);
    window.dispatchEvent(
      new CustomEvent("realtimeError", {
        detail: { error: event },
      })
    );
  });
}

Fixing Output Modalities

One tricky part of the Realtime API is getting text-only mode to actually stick. The docs mention output modalities, but they don’t explain that the setting you pass when creating the session doesn’t last. The session quietly resets to audio+text, so the model keeps producing audio even if you disable every TTS feature on your side.

The fix is simple but easy to miss:

you must send a session.update event right after the session is created.

That update locks the modality to ["text"]. Without it, the API keeps sending audio packets and your text-only pipeline behaves strangely, especially when you plug it into something like a HeyGen avatar flow.

Once you add the update call, the Realtime assistant finally stops generating audio and the rest of the system works as expected.

Adding Domain Knowledge with Vector Search

Most real-world assistants need access to private knowledge:

documentation, transcripts, policies, FAQs, or project data.

Here we integrate a Pinecone index via a custom tool:

export const knowledgeBaseTool = tool({
  name: "knowledge_base",
  execute: async ({ query, topK }) => {
    // 1. Generate embedding
    const embeddingVector = await fetch("/api/generate-embedding", ...);

    // 2. Query Pinecone
    const searchResults = pineconeClient.query({
      vector: embeddingVector,
      topK,
      includeMetadata: true,
    });

    return { results: formattedResults };
  },
});

The agent’s system prompt explicitly instructs:

When the user asks about information that may exist in the knowledge base, call the "knowledge_base" tool to search the vector database...

This means the model will automatically decide when a question needs outside knowledge, call the tool with a focused query, and fold the retrieved snippets into its answer.

This is retrieval-augmented generation without extra orchestration code.
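The formattedResults value in the tool above is left unspecified. One hedged sketch of that step, assuming Pinecone's usual match shape ({ id, score, metadata }) and that each chunk's text and source live in metadata, is a helper that flattens matches into numbered, attributable snippets:

```typescript
// Hypothetical helper: turn Pinecone matches into numbered snippets the
// model can cite. Assumes each match carries its text/source in metadata.
interface Match {
  id: string;
  score: number;
  metadata?: { text?: string; source?: string };
}

export function formatMatches(matches: Match[]): string {
  return matches
    .filter((m) => m.metadata?.text) // drop matches with no usable text
    .map(
      (m, i) =>
        `[${i + 1}] (${m.metadata?.source ?? m.id}, score ${m.score.toFixed(2)}): ` +
        m.metadata!.text
    )
    .join("\n");
}
```

Returning the source alongside each snippet is what lets the model attribute its answers instead of hallucinating provenance.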

Connecting HeyGen to the Loop

Once the Realtime API finalizes a response, we pass the transcript to the HeyGen Streaming Avatar.

const speakResult = await avatar.speak({
  text: transcript,
  taskType: TaskType.REPEAT,
});

A few important details from the implementation:

1. Avatar session bootstrapping

Your backend issues a token:

app.get("/api/getToken", async (_req, res) => {
  const response = await fetch(
    "https://api.heygen.com/v1/streaming.create_token",
    { ... }
  );
  res.json({ token: data.data.token });
});

Then the browser spins up the avatar:

avatar = new StreamingAvatar({ token });
sessionData = await avatar.createStartAvatar({
  quality: AvatarQuality.High,
  avatarName: AVATAR_CONFIG.avatarName,
  voice: { voiceId: AVATAR_CONFIG.voiceId },
});

2. Handling streaming video

HeyGen streams a MediaStream; your client assigns it directly:

videoElement.srcObject = event.detail;

The result is a live, speaking avatar that feels reactive and believable.

3. Syncing with the response pipeline

The client waits for response.output_text.done or realtimeTranscriptComplete before triggering speech, so the avatar speaks complete thoughts.

function setupRealtimeListeners() {
  console.log("Setting up Realtime listeners for HeyGen integration");

  // User transcripts
  window.addEventListener(
    "userTranscript",
    ((event: CustomEvent) => {
      const { transcript } = event.detail;
      console.log("User transcript event received:", transcript);
      try {
        addToChat("user", transcript);
      } catch (e) {
        console.warn("Could not add user message to chat:", e);
      }
    }) as EventListener
  );

  // Assistant transcript deltas
  window.addEventListener(
    "realtimeTranscriptDelta",
    ((event: CustomEvent) => {
      const { delta } = event.detail;
      currentTranscriptBuffer += delta;
      console.log(
        "Transcript delta received, buffer now:",
        currentTranscriptBuffer.substring(0, 50) + "..."
      );
    }) as EventListener
  );

  // Complete assistant transcript
  window.addEventListener(
    "realtimeTranscriptComplete",
    ((event: CustomEvent) => {
      const { transcript } = event.detail;
      handleRealtimeTranscript(transcript);
    }) as EventListener
  );

  // Errors
  window.addEventListener(
    "realtimeError",
    ((event: CustomEvent) => {
      const { error } = event.detail;
      console.error("Realtime error received:", error);
      updateStatus("Error in voice session", "error");
    }) as EventListener
  );

  console.log("Realtime listeners setup complete");
}
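The handleRealtimeTranscript function called above is not shown in this post. A minimal sketch, assuming the avatar exposes the speak method used earlier (with the string "repeat" standing in for the SDK's TaskType.REPEAT) and that we reset the delta buffer once the full transcript arrives:

```typescript
// Minimal sketch of the glue between a completed Realtime transcript and
// HeyGen playback. `Speaker` stands in for the StreamingAvatar instance.
interface Speaker {
  speak(opts: { text: string; taskType: string }): Promise<unknown>;
}

let currentTranscriptBuffer = "";

export async function handleRealtimeTranscript(
  transcript: string,
  avatar: Speaker | null
): Promise<boolean> {
  const text = transcript.trim();
  currentTranscriptBuffer = ""; // full response arrived; drop buffered deltas
  if (!text || !avatar) return false;
  // "repeat" makes the avatar voice our text verbatim rather than
  // generating its own reply.
  await avatar.speak({ text, taskType: "repeat" });
  return true;
}
```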
  

Voice Activity Detection for a Natural UX

Instead of recording nonstop, the browser uses MicVAD:

vad = await MicVAD.new({
  onSpeechStart: () => clearInactivityTimer(),
  onSpeechEnd: () => startInactivityTimer(),
});

This lets us stream audio only while the user is actually speaking, stop cleanly at the end of an utterance, and wind down idle sessions via the inactivity timer.

Users simply talk; the system handles turn-taking automatically.
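The clearInactivityTimer/startInactivityTimer pair wired into MicVAD above is not shown. A deterministic sketch of the underlying logic (a hypothetical helper, with timestamps injected so it is easy to test, rather than the post's actual setTimeout-based code):

```typescript
// Illustrative turn-taking tracker: the session is considered idle once the
// user has been silent longer than `timeoutMs`. Timestamps are passed in
// explicitly so the logic is deterministic and unit-testable.
export class InactivityTracker {
  private lastSpeechEnd: number | null = null;

  constructor(private timeoutMs: number) {}

  onSpeechStart(): void {
    this.lastSpeechEnd = null; // user is talking; cancel the countdown
  }

  onSpeechEnd(now: number): void {
    this.lastSpeechEnd = now; // silence begins; start counting down
  }

  shouldEndSession(now: number): boolean {
    return (
      this.lastSpeechEnd !== null && now - this.lastSpeechEnd >= this.timeoutMs
    );
  }
}
```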

Putting It All Together: Flow and Practical Tips

1. User starts a session

Frontend requests /api/session → receives OpenAI client secret → Realtime session connects.
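The /api/session endpoint just proxies a server-side call to OpenAI and relays the ephemeral client secret, keeping the API key off the browser. The exact minting endpoint and payload vary by SDK version, so treat the request builder below as an assumption-laden sketch (URL and body shape are assumptions; only the pure construction step is shown):

```typescript
// Hypothetical request builder for minting a Realtime client secret.
// The URL and body shape are assumptions -- verify against current
// OpenAI Realtime documentation before use.
export function buildSessionMintRequest(apiKey: string, model: string) {
  return {
    url: "https://api.openai.com/v1/realtime/client_secrets",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`, // server-side secret, never shipped to the browser
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ session: { type: "realtime", model } }),
    },
  };
}
```

The backend would pass this to fetch and return only the resulting client secret to the frontend.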

2. Use VAD to prevent false starts

Without VAD, background noise triggers streaming and wastes tokens.

3. Keep the agent prompt simple and declarative

The model follows tool instructions more reliably when the expectations are explicit and unambiguous.

4. Proxy all secrets through a backend

Your /api/session and /api/getToken endpoints ensure no API keys leak to browsers.

5. Pinecone queries should return metadata at minimum

Metadata allows the model to attribute information and avoid hallucination.

Final Thoughts

What makes this stack compelling is how seamlessly the pieces fit together. The Realtime API handles transcription and reasoning, Pinecone grounds responses in private knowledge, and HeyGen transforms text into a lifelike presence. The result feels less like a chatbot and more like a real interactive companion.

If you’re building the next generation of conversational interfaces — products where people speak naturally, get grounded answers, and interact through expressive avatars — this architecture is an excellent starting point.