
Building Real-Time AI Voice Agents with Twilio and OpenAI

Loquent Team

One of the core pieces of Loquent is the AI voice agent — a system that can answer phone calls, hold natural conversations, and take actions in real time. In this post, we’ll walk through the architecture behind it.

The Architecture at a Glance

At the highest level, the flow looks like this:

  1. An inbound call hits Twilio, which opens a WebSocket media stream
  2. Our server receives raw audio frames and pipes them to OpenAI’s Realtime API
  3. The AI generates a spoken response, which we stream back through the same WebSocket
  4. The caller hears the response with sub-second latency

The key insight: there’s no transcribe-then-generate-then-synthesize pipeline. The Realtime API handles speech-to-speech directly, which is what makes the experience feel natural rather than stilted.

Twilio Media Streams

When a call comes in, we respond with TwiML that opens a bidirectional media stream:

<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream" />
  </Connect>
</Response>

Twilio sends us audio as base64-encoded mulaw frames at 8 kHz. We need to decode the mulaw, resample to 24 kHz PCM16, and forward the result to OpenAI.

// Simplified audio pipeline
ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.event === 'media') {
    // Twilio's payload is base64-encoded 8 kHz mulaw
    const audio = Buffer.from(msg.media.payload, 'base64');
    // Decode and upsample to the 24 kHz PCM16 the Realtime API expects
    const resampled = resample(audio, 8000, 24000);
    // WebSocket messages are JSON strings; the API wants base64 audio
    openai.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: resampled.toString('base64'),
    }));
  }
});
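The resample helper is left abstract above. A minimal sketch of what it has to do — G.711 mulaw decode plus a crude 3x upsample by sample repetition (a production system would use a proper interpolating resampler):

```javascript
// Decode one G.711 mulaw byte to a signed 16-bit PCM sample.
function mulawToPcm16(byte) {
  const u = ~byte & 0xff;               // mulaw bytes are stored inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

// Decode an 8 kHz mulaw buffer and upsample to 24 kHz PCM16 by
// repeating each sample (crude, but serviceable for a sketch).
function resample(mulawBuffer, fromRate = 8000, toRate = 24000) {
  const factor = toRate / fromRate;     // 3 for 8 kHz -> 24 kHz
  const out = Buffer.alloc(mulawBuffer.length * factor * 2); // 2 bytes/sample
  for (let i = 0; i < mulawBuffer.length; i++) {
    const pcm = mulawToPcm16(mulawBuffer[i]);
    for (let j = 0; j < factor; j++) {
      out.writeInt16LE(pcm, (i * factor + j) * 2);
    }
  }
  return out;
}
```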

OpenAI Realtime API

The Realtime API is a WebSocket-based interface that accepts audio input and produces audio output — all in streaming fashion. We configure it with a system prompt that defines the agent’s personality and capabilities:

const session = {
  type: 'session.update',
  session: {
    modalities: ['text', 'audio'],
    instructions: 'You are a helpful receptionist for Acme Corp...',
    voice: 'alloy',
    input_audio_format: 'pcm16',
    output_audio_format: 'pcm16',
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      silence_duration_ms: 500,
    },
  },
};

The server_vad turn detection is crucial — it lets the API detect when the caller has stopped speaking, so the agent knows when to respond. Getting the threshold and silence_duration_ms right is key to natural conversation flow.
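These values don't have to be fixed for the life of the call — the session accepts repeated session.update messages, so tuning can be adjusted mid-call. A sketch of loosening the VAD for a noisy line (the specific values here are illustrative, not recommendations):

```javascript
// Hypothetical runtime adjustment: require a stronger speech signal and a
// longer pause before the agent takes its turn.
openai.send(JSON.stringify({
  type: 'session.update',
  session: {
    turn_detection: {
      type: 'server_vad',
      threshold: 0.7,
      silence_duration_ms: 700,
    },
  },
}));
```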

Handling Tool Calls

The real power comes from giving the AI tools — functions it can call mid-conversation. For example, booking an appointment:

const tools = [
  {
    type: 'function',
    name: 'book_appointment',
    description: 'Book an appointment for the caller',
    parameters: {
      type: 'object',
      properties: {
        date: { type: 'string', description: 'ISO 8601 date' },
        name: { type: 'string' },
        reason: { type: 'string' },
      },
      required: ['date', 'name'],
    },
  },
];

When the AI decides to call a tool, we get a response.function_call_arguments.done event, execute the function, and send the result back. The AI then incorporates the result into its next spoken response seamlessly.
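Wiring that up looks roughly like the sketch below. The handlers table and the injected send function are our own names, not part of the API; the event and item shapes follow the Realtime API's function-calling flow. Real tool handlers would typically be async — this sketch keeps them synchronous for brevity:

```javascript
// Local implementations of the tools declared above.
const handlers = {
  book_appointment: ({ date, name, reason }) => {
    // ...call the booking backend here...
    return { confirmed: true, date, name };
  },
};

// Called for each event from the OpenAI WebSocket; `send` transmits a
// JSON-serializable message back to the Realtime API.
function handleEvent(event, send) {
  if (event.type !== 'response.function_call_arguments.done') return;
  const args = JSON.parse(event.arguments);
  const result = handlers[event.name](args);
  // Return the tool result as a function_call_output item...
  send({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,
      output: JSON.stringify(result),
    },
  });
  // ...then ask the model to continue, now with the result in context.
  send({ type: 'response.create' });
}
```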

Latency Optimization

For voice conversations, latency is everything. Here’s what we do to keep response times under 800 ms:

  • Pre-warm connections — Keep the OpenAI WebSocket alive between calls
  • Stream-first — Start playing audio as soon as the first chunk arrives, don’t wait for the full response
  • Edge deployment — Run the media server close to Twilio’s infrastructure
  • Interrupt handling — If the caller starts speaking mid-response, cancel the current output immediately
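The last point deserves a sketch. When server VAD reports the caller speaking, we cancel the in-flight response on the OpenAI side and flush whatever audio Twilio has already buffered for playback. The sendToOpenAI/sendToTwilio wrappers are our names for the two WebSocket connections; the message shapes follow the Realtime API's response.cancel event and Twilio Media Streams' clear message:

```javascript
// Barge-in handling: stop the model mid-sentence when the caller interrupts.
// Returns true if the event triggered an interrupt, false otherwise.
function handleBargeIn(event, streamSid, sendToOpenAI, sendToTwilio) {
  if (event.type !== 'input_audio_buffer.speech_started') return false;
  sendToOpenAI({ type: 'response.cancel' });     // stop generating audio
  sendToTwilio({ event: 'clear', streamSid });   // drop queued playback
  return true;
}
```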

What’s Next

We’re actively working on:

  • Multi-language support with automatic language detection
  • Call transfer — seamlessly hand off from AI to human agents
  • Custom voice cloning for brand-consistent agent voices
  • Conversation memory across multiple calls with the same contact

Voice AI is moving incredibly fast, and we’re building Loquent to stay on the cutting edge. If you’re interested in building on top of our platform, get in touch.

