ElevenLabs + Twilio: Create an AI That Responds in Real-Time (Part 2)

In Part 1 we built a Node.js server that receives real-time audio from phone calls. But right now the AI just sits there silently. It can hear you, but it cannot respond.

Today we make it talk back. We connect ElevenLabs Conversational AI to our Twilio stream. By the end of this video, you will have an AI that can actually have a real phone conversation with you.

What You Get By the End

Call your Twilio number
Hear a natural-sounding voice greet you
Speak and get a real-time response from the AI
Have a back-and-forth conversation that feels close to talking to a person

What You Need

The Node.js + Twilio server from Part 1 (working outbound and inbound calls, WebSocket stream)
An ElevenLabs account (elevenlabs.io)
An ElevenLabs API key
An ElevenLabs Conversational AI agent (we will create one)

Step 1: Create an ElevenLabs Account

Sign up at elevenlabs.io. The free tier includes some credits to test with. Go to Profile > API Key and save your key in the .env file:

ELEVENLABS_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxx

Step 2: Create a Conversational AI Agent

In ElevenLabs go to Conversational AI > Agents. Click Create Agent. The agent has three important fields:

Voice: pick any voice from the library. I used a calm professional voice.
System prompt: describe how the agent should behave. Something like "You are a friendly customer support agent for an internet service provider. Help the customer troubleshoot connectivity issues. Keep responses short and conversational."
First message: what the agent says when the call starts. For example: "Hi, this is customer support. How can I help you today?"

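Before saving, match the agent's audio format to Twilio. Twilio Media Streams carry 8 kHz μ-law audio, so both the agent's input format and its voice output format should be set to μ-law 8000 Hz (look in the agent's voice and advanced settings; the exact menu names may differ). If the formats do not match, the call connects but the audio comes through as noise or not at all.

Save the agent. Copy the Agent ID from the agent page and add it to your .env: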

ELEVENLABS_AGENT_ID=agent_xxxxxxxxxxxxx
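
Both values are read from process.env when the server runs, and a missing one otherwise only shows up later as a failed request. A minimal startup check, assuming the variables are already loaded into process.env the way Part 1 does it. The helper name missingEnv is made up for this sketch:

```javascript
// Hypothetical helper: returns the names of required variables that are unset or blank.
function missingEnv(names, env = process.env) {
  return names.filter((name) => !env[name] || env[name].trim() === "");
}

// Warn at startup instead of failing on the first phone call
const missing = missingEnv(["ELEVENLABS_API_KEY", "ELEVENLABS_AGENT_ID"]);
if (missing.length > 0) {
  console.warn(`Missing in .env: ${missing.join(", ")}`);
}
```

The second argument exists only so the check can be exercised against a plain object, for example missingEnv(["A"], { A: "" }).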

Step 3: Wire ElevenLabs Into the WebSocket Server

Open server.js from Part 1. We are going to modify the WebSocket handler to bridge Twilio's audio stream with ElevenLabs.

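const WebSocket = require("ws");

// Assumes `wss` is the WebSocket server created in Part 1 and Node 18+ (global fetch).
wss.on("connection", async (twilioWs) => {
  console.log("Twilio stream connected");
  let streamSid = null; // Twilio requires this on every message we send back
  let elevenWs = null;

  // 1. Twilio -> ElevenLabs (forward incoming audio).
  //    Attached before any await so the "start" event carrying the streamSid is not missed.
  twilioWs.on("message", (msg) => {
    const data = JSON.parse(msg);
    if (data.event === "start") {
      streamSid = data.start.streamSid;
    } else if (data.event === "media" && elevenWs && elevenWs.readyState === WebSocket.OPEN) {
      elevenWs.send(JSON.stringify({
        user_audio_chunk: data.media.payload
      }));
    }
  });
  twilioWs.on("close", () => elevenWs && elevenWs.close());

  try {
    // 2. Get a signed URL for the ElevenLabs websocket
    const res = await fetch(
      `https://api.elevenlabs.io/v1/convai/conversation/get_signed_url?agent_id=${process.env.ELEVENLABS_AGENT_ID}`,
      { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY } }
    );
    if (!res.ok) throw new Error(`Signed URL request failed: ${res.status}`);
    const { signed_url: signedUrl } = await res.json();

    // The caller may have hung up while we were waiting
    if (twilioWs.readyState !== WebSocket.OPEN) return;

    // 3. Open a WebSocket to ElevenLabs
    elevenWs = new WebSocket(signedUrl);
  } catch (err) {
    console.error(err);
    twilioWs.close();
    return;
  }

  // 4. ElevenLabs -> Twilio (forward AI audio back to the call)
  elevenWs.on("message", (msg) => {
    const data = JSON.parse(msg);
    if (data.type === "audio" && streamSid) {
      twilioWs.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: data.audio_event.audio_base_64 }
      }));
    } else if (data.type === "ping") {
      // Answer keepalive pings so ElevenLabs keeps the conversation open
      elevenWs.send(JSON.stringify({ type: "pong", event_id: data.ping_event.event_id }));
    }
  });

  // 5. Clean up on disconnect
  elevenWs.on("error", (err) => console.error("ElevenLabs socket error:", err));
  elevenWs.on("close", () => twilioWs.close());
});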

That is the whole bridge. Two WebSocket connections, audio flowing both ways.
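
ElevenLabs sends more than audio over its socket, and Twilio expects the call's streamSid on every message sent back to it. One way to keep that logic easy to test is a pure function that maps each ElevenLabs event to the messages it should trigger. This is a sketch, not the official integration: the function name and return shape are invented here, and the ping and interruption field names are assumptions based on the ElevenLabs WebSocket documentation, so check them against the current docs.

```javascript
// Hypothetical helper: maps one parsed ElevenLabs event to outgoing messages.
// Returns { toTwilio: [...], toEleven: [...] } so the caller just sends each list.
function routeElevenEvent(event, streamSid) {
  const out = { toTwilio: [], toEleven: [] };
  switch (event.type) {
    case "audio":
      // Twilio only plays media messages that carry the call's streamSid
      out.toTwilio.push({
        event: "media",
        streamSid,
        media: { payload: event.audio_event.audio_base_64 },
      });
      break;
    case "interruption":
      // Caller spoke over the agent: drop audio Twilio has buffered but not yet played
      out.toTwilio.push({ event: "clear", streamSid });
      break;
    case "ping":
      // Keepalive: reply with the same event_id (field names assumed from the docs)
      out.toEleven.push({ type: "pong", event_id: event.ping_event.event_id });
      break;
    default:
      break; // transcripts and other events are ignored in this sketch
  }
  return out;
}
```

Inside the elevenWs message handler you would call routeElevenEvent(JSON.parse(msg), streamSid) and send each entry with JSON.stringify on the matching socket. Handling "interruption" is what lets the caller cut the agent off mid-sentence instead of waiting for buffered audio to finish.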

Step 4: Restart the Server and Make the Call

node server.js

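From another terminal, trigger an outbound call to your verified number using the /call endpoint from Part 1. Note the --data-urlencode flag: in a form-encoded body a literal + is decoded as a space, so the leading + of the phone number has to be percent-encoded.

curl -X POST http://localhost:3000/call \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode "to=+1xxxxxxxxxx"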
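
The same request can be sent from Node instead of curl. A minimal sketch, assuming Node 18+ for the global fetch and the /call endpoint from Part 1; buildCallRequest is a name invented here. URLSearchParams takes care of encoding the leading + in the number:

```javascript
// Hypothetical helper: builds the form-encoded POST that the curl command sends.
function buildCallRequest(to, baseUrl = "http://localhost:3000") {
  return {
    url: `${baseUrl}/call`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      // URLSearchParams percent-encodes "+" as %2B so it is not read as a space
      body: new URLSearchParams({ to }).toString(),
    },
  };
}

// Usage (needs the server from Part 1 running):
// const { url, options } = buildCallRequest("+1xxxxxxxxxx");
// const res = await fetch(url, options);
// console.log(res.status);
```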

Answer the phone. After about half a second, you will hear your agent's first message ("Hi, this is customer support..."). Speak, and the agent responds. You can hold a real conversation.
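
To check the greeting delay on your own setup, you can time the gap between the Twilio stream starting and the first agent audio arriving. A rough sketch with invented names; it measures only the server-side portion, not the phone network leg:

```javascript
// Hypothetical helper: records the delay between stream start and first agent audio.
function createGreetingTimer(now = Date.now) {
  let startedAt = null;
  let firstAudioMs = null;
  return {
    // Call when Twilio's "start" event arrives
    markStart() {
      startedAt = now();
    },
    // Call on every ElevenLabs audio event; only the first one is recorded.
    // Returns the recorded delay in ms, or null if nothing has been measured yet.
    markAudio() {
      if (startedAt !== null && firstAudioMs === null) {
        firstAudioMs = now() - startedAt;
      }
      return firstAudioMs;
    },
  };
}
```

In the bridge, markStart() would go in the Twilio "start" branch and markAudio() in the ElevenLabs audio branch, logging the value the first time it is not null. The now parameter is there so the timer can be driven by a fake clock.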

What This Actually Sounds Like

The ElevenLabs voice is good. Better than most pre-2024 TTS. The latency is the bigger story: end-to-end (you speak, the AI hears, decides, generates a response, sends it back through Twilio to your phone) is around 500ms to 1.5 seconds depending on network conditions. Fast enough that a conversation feels natural with minor pauses.

When I tested the customer support agent, the back-and-forth went something like:

Me: Hello, I am calling about my internet service. It hasn't been working since Tuesday.
AI: I understand. I can help you troubleshoot that. Let's start by checking the modem lights. Can you tell me what color the power light is showing?

Convincing enough that on a quick call you might not immediately realize you are talking to an AI.

Real Use Cases (Honest)

This stack is great for:

Appointment reminders ("Hi, this is a reminder about your appointment tomorrow at 3pm")
Lead qualification ("Hi, I'm calling to confirm your interest in our service")
Customer support level 1 (FAQ-type queries that do not need a human)
Survey collection

Not great for:

Anything requiring nuanced empathy or emotional intelligence
Complicated multi-system workflows
Calls that need a human to authorize or commit to anything significant
Regulatory contexts where disclosure of AI is mandatory (always check local law)

Cost Notes

Twilio: per-minute call rate (varies by country, typically 1 to 5 cents/min for US outbound)
ElevenLabs: Conversational AI is metered in credits. Several minutes of conversation per dollar on the standard plan.
Combined: a 5-minute conversation runs around 10 to 30 cents. Cheap for the value, but it adds up if you are doing volume.
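
The estimate is just call length times the sum of the two per-minute rates. A small sketch for plugging in your own numbers; the rates in the example are placeholders, not quoted prices, so take real ones from the Twilio and ElevenLabs pricing pages:

```javascript
// Hypothetical helper: estimated cost of one call, in cents.
// Rates are cents per minute; working in whole cents avoids floating-point drift.
function estimateCallCostCents(minutes, twilioCentsPerMin, elevenCentsPerMin) {
  return minutes * (twilioCentsPerMin + elevenCentsPerMin);
}

// Placeholder rates for illustration only
const perCall = estimateCallCostCents(5, 2, 3);
console.log(`${perCall} cents per call, ${(perCall * 1000) / 100} dollars per 1000 calls`);
```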

Part 1 Recap

If you have not seen Part 1, build the foundation first: Twilio Phone Calls with Node.js: AI Voice Agent Part 1

Subscribe to AyyazTech for more AI voice agent tutorials. Part 3 will cover inbound calls and saving conversation history to a database.