Building a Voice AI Agent That Actually Sounds Human
The technical architecture behind a conversational voice bot that handles inbound scheduling, outbound follow-ups, and dynamic objection handling at scale.
Introduction
Voice AI has a perception problem. Most people's mental model is still the IVR hell of 'press 1 for billing': rigid, robotic, and infuriating. Modern voice AI, built correctly with agentic LLM reasoning and low-latency TTS, is a genuinely different experience.
Our Conversational Voice Automation Agent handles three use cases for clients: inbound appointment scheduling, outbound reminder calls, and post-service follow-up surveys. This post covers the architecture decisions that make it feel natural rather than mechanical.
The Tech Stack
The core pipeline is: telephony (Twilio) → ASR (Deepgram Nova-2) → LLM reasoning (GPT-4o with function calling) → TTS (ElevenLabs with a cloned client voice) → telephony. The entire round-trip, from the user finishing speaking to the bot beginning to respond, targets under 800ms.
We use Deepgram over Whisper for ASR because of its streaming API: we start processing partial transcriptions while the user is still speaking, which allows us to begin LLM inference earlier and shave 200–400ms off perceived latency. ElevenLabs with voice cloning lets clients use a custom voice that matches their brand, which dramatically improves caller trust metrics.
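The shape of that pipeline can be sketched as a chain of async stages. The stubs below stand in for the real Twilio, Deepgram, GPT-4o, and ElevenLabs clients (none of the actual SDK calls appear here); the point is the structure: streaming partial transcripts arrive while audio is still flowing, and the final transcript feeds LLM inference and then TTS.

```python
import asyncio

# Stub stages standing in for the real telephony/ASR/LLM/TTS clients.
async def asr_stream(audio_chunks):
    # Deepgram-style streaming: yield a growing partial transcript per chunk.
    words = []
    for chunk in audio_chunks:
        words.append(chunk)
        yield " ".join(words)

async def llm_respond(transcript):
    await asyncio.sleep(0)  # stand-in for model inference
    return f"Reply to: {transcript}"

async def tts_synthesize(text):
    return f"<audio:{text}>"  # stand-in for synthesized audio bytes

async def handle_turn(audio_chunks):
    final_transcript = ""
    async for partial in asr_stream(audio_chunks):
        # Partials can be used to warm the LLM prompt before the user finishes.
        final_transcript = partial
    reply = await llm_respond(final_transcript)
    return await tts_synthesize(reply)

result = asyncio.run(handle_turn(["book", "an", "appointment"]))
print(result)  # <audio:Reply to: book an appointment>
```

In a real deployment each stage runs concurrently over websockets rather than sequentially, which is where the 200–400ms of perceived-latency savings comes from.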
Solving for Latency
Latency is the enemy of natural conversation. Humans expect a response within 300–500ms of finishing a sentence. Our baseline pipeline without optimization sits around 1.2s, which is noticeably awkward. We get it below 800ms through three techniques.
First, we use streaming for both LLM output and TTS synthesis: we begin synthesizing audio as soon as the first complete sentence arrives, rather than waiting for the full LLM response. Second, we use 'filler phrase' injection: if LLM inference will take more than 600ms, we synthesize a short filler phrase ('Let me check on that for you...') to fill the silence while the real response generates. Third, we pre-cache the 20 most common response openings so the TTS synthesis for those phrases is instant.
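The first two techniques can be sketched in a few lines. This is a minimal illustration, not our production code: a sentence splitter that emits TTS-ready sentences from an LLM token stream, plus the filler-phrase decision against the 600ms budget mentioned above.

```python
import re

# A sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]\s")

def stream_sentences(token_iter):
    """Yield complete sentences from an LLM token stream so TTS can start early."""
    buf = ""
    for tok in token_iter:
        buf += tok
        m = SENTENCE_END.search(buf)
        while m:
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
            m = SENTENCE_END.search(buf)
    if buf.strip():
        yield buf.strip()  # flush the final (possibly unterminated) sentence

FILLER = "Let me check on that for you..."

def bridge_phrase(estimated_inference_ms):
    # Filler injection: cover silences longer than the ~600ms budget.
    return FILLER if estimated_inference_ms > 600 else None

sentences = list(stream_sentences(["Hi", " there.", " How", " can", " I", " help?"]))
print(sentences)  # ['Hi there.', 'How can I help?']
```

The first sentence ('Hi there.') goes to TTS immediately while the rest of the response is still being generated.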
💡 Measure Turn-Taking Latency Separately
Don't just measure end-to-end response time. Measure 'turn-taking latency': the gap between the end of user speech and the start of bot speech. This is what callers actually perceive. Target under 700ms for comfortable conversation.
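In practice this metric is just a per-turn subtraction over two timestamps you already have from the telephony layer. A minimal sketch, with made-up sample timestamps:

```python
def turn_taking_latency_ms(user_speech_end_ms, bot_audio_start_ms):
    """Gap between end of user speech and start of bot audio, per turn."""
    return bot_audio_start_ms - user_speech_end_ms

# Illustrative (user_speech_end, bot_audio_start) timestamp pairs in ms.
turns = [(1000, 1620), (5000, 5540), (9000, 9810)]
latencies = [turn_taking_latency_ms(end, start) for end, start in turns]
over_budget = [ms for ms in latencies if ms > 700]
print(latencies, over_budget)  # [620, 540, 810] [810]
```

Tracking the distribution (especially the tail) matters more than the mean: one 810ms turn feels worse than ten 620ms turns.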
Conversation Design
The LLM layer is given a system prompt that establishes persona, goals, and a set of tools: check_availability, book_appointment, update_crm, escalate_to_human. The agent uses function calling to take actions during the conversation; it doesn't just generate text, it actually does things.
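Those tools are declared to the model as function-calling schemas. Here is one of them in the OpenAI-style tools format; the parameter names and descriptions are illustrative, not our actual schema.

```python
# One tool from the agent's toolset, in OpenAI-style function-calling format.
# Parameter names here are illustrative assumptions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "check_availability",
            "description": "Return open appointment slots for a date range.",
            "parameters": {
                "type": "object",
                "properties": {
                    "start_date": {"type": "string", "format": "date"},
                    "end_date": {"type": "string", "format": "date"},
                },
                "required": ["start_date"],
            },
        },
    },
    # book_appointment, update_crm, and escalate_to_human follow the same shape.
]

tool_names = [t["function"]["name"] for t in TOOLS]
```

The model decides when to call a tool; the application executes it and feeds the result back into the conversation.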
Handling objections and unexpected inputs is where most voice bots fail. We use a 'graceful recovery' design: the system prompt includes 10 example recovery phrases for common derailment scenarios (caller goes off-topic, expresses frustration, asks an out-of-scope question). The LLM is instructed to always acknowledge before redirecting, never to argue, and to offer human escalation after two failed recovery attempts.
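The "escalate after two failed recovery attempts" rule is the one piece of this that lives in application code rather than the prompt. A minimal sketch of that counter (class and method names are our own for illustration):

```python
class RecoveryTracker:
    """Trigger human escalation after N failed recovery attempts in one call."""

    def __init__(self, max_attempts=2):
        self.failed = 0
        self.max_attempts = max_attempts

    def record_failure(self):
        """Record a failed recovery; return True when escalation is due."""
        self.failed += 1
        return self.failed >= self.max_attempts

tracker = RecoveryTracker()
first = tracker.record_failure()   # one failed recovery: keep trying
second = tracker.record_failure()  # second failure: call escalate_to_human
```

Keeping this deterministic, instead of trusting the LLM to count its own failures, means escalation actually fires when it should.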
CRM Integration
Every completed call writes a structured summary back to the client's CRM (Salesforce or HubSpot, in our deployments). The summary includes: call outcome, any slots booked, detected sentiment, key points raised by the caller, and a full transcript.
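The summary that gets written back is a small structured record. A sketch of its shape as a dataclass; field names and the example values are illustrative, and the actual Salesforce/HubSpot field mapping happens downstream.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class CallSummary:
    outcome: str                                   # e.g. "appointment_booked"
    slots_booked: list = field(default_factory=list)
    sentiment: str = "neutral"                     # detected caller sentiment
    key_points: list = field(default_factory=list) # points raised by the caller
    transcript: str = ""                           # full call transcript

summary = CallSummary(
    outcome="appointment_booked",
    slots_booked=["2024-06-03T10:00"],
    key_points=["asked about parking"],
)
payload = asdict(summary)  # JSON-ready dict for the CRM API call
```

Writing a typed record rather than free text is what makes the downstream analysis (objection tracking, conversion by script variation) possible.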
This CRM integration turns the voice agent into a data asset, not just a cost-reduction tool. Managers can review call transcripts to spot training gaps, identify common objections, and track conversion rates by call script variation. For one healthcare client, this data loop helped them identify a scheduling friction point that reduced no-show rates by 22% once addressed.
Conclusion
A voice AI agent built with modern ASR, LLM reasoning, and low-latency TTS can handle the majority of routine call types with caller satisfaction comparable to human agents, at a fraction of the cost. The key investments are in latency optimization, conversation design, and the CRM data loop that makes the system improve over time.