
Voice AI agents — phone assistants in practice

Ádám Zsolt & Airon

Talking to a machine voice on the phone used to be frustrating. By 2025, the technology had improved to the point where half of callers don't even realise they aren't speaking with a human.


Executive summary

Phone-based customer service has been the same for 40 years: IVR menus, hold queues, "press 1". Until the early 2020s "AI voice bot" was a synonym for "bad experience".

Between 2024 and 2025 that changed. Three technological leaps converged:

  • STT (speech-to-text) at ~95–98% accuracy in real time (Deepgram Nova-2, OpenAI Whisper large-v3)
  • LLMs in streaming mode with 300–600ms first-token latency (Groq, fast inference)
  • TTS (text-to-speech) with natural voices (ElevenLabs, Azure Neural, Cartesia)

Together they can deliver sub-1-second end-to-end latency, which is fast enough for natural conversation. With that, voice AI has become a strategic tool, not just a technological curiosity.

This whitepaper covers when it's worth investing in voice AI, what you gain, what you lose, and when not to do it.


1. What is a voice AI agent?

A voice AI agent talks to the user on the phone or through a voice interface, in natural language, in real time. It's not a bot reading pre-written scripts — it's an interactive, context-aware system.

The 3 core components

Caller speaks
   ↓
[STT] Speech-to-text — continuous transcript
   ↓
[LLM] Language model — understands, responds
   ↓
[TTS] Text-to-speech — natural voice
   ↓
Caller hears

Plus supporting layers:

  • Phone integration: Twilio, Vonage, Plivo
  • VAD (Voice Activity Detection): when did the caller finish speaking?
  • Interruption handling: what if they cut in?
  • Tool calling: CRM lookup, booking, database
  • Sentiment detection: is the caller getting frustrated? Time to escalate?
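
To make the cascade concrete, here is a minimal sketch of one conversational turn flowing through the three stages. The class name `CascadeAgent` and the `fake_*` functions are hypothetical stand-ins invented for illustration; a real deployment would plug in vendor SDK calls and stream audio rather than passing whole byte strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CascadeAgent:
    """One turn: caller audio -> transcript -> reply text -> reply audio."""
    stt: Callable[[bytes], str]               # speech-to-text stage
    llm: Callable[[List[Dict[str, str]]], str]  # language model, sees full history
    tts: Callable[[str], bytes]               # text-to-speech stage
    history: List[Dict[str, str]] = field(default_factory=list)

    def handle_turn(self, caller_audio: bytes) -> bytes:
        transcript = self.stt(caller_audio)
        self.history.append({"role": "caller", "text": transcript})
        reply = self.llm(self.history)
        self.history.append({"role": "agent", "text": reply})
        return self.tts(reply)


# Hypothetical stand-ins so the sketch runs without any vendor SDK.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(history: List[Dict[str, str]]) -> str:
    return f"You said: {history[-1]['text']}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadeAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = agent.handle_turn(b"What are your opening hours?")
print(audio_out.decode("utf-8"))  # You said: What are your opening hours?
```

Keeping each stage behind a plain callable is one way to get the debuggability the cascade approach is known for: any stage can be swapped or logged in isolation.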

Two main paradigms

Paradigm | Example | When?
Cascade (STT→LLM→TTS) | LiveKit, Vapi, Retell | General-purpose, flexible, scales well
End-to-end voice model | GPT-4o realtime, Gemini Live | More natural prosody, more expensive

Cascade is the default today: cheaper, more controllable, more debuggable. End-to-end is emotionally richer but expensive and less predictable.


2. Where does it work well — and where doesn't it?

Sweet-spot use cases

1. Inbound — frequent, repetitive questions

  • Opening hours, address, general info
  • Booking status, account balance
  • "Where's my package?"
  • Healthcare appointment booking
  • Restaurant reservations

The common thread: 80% of callers ask the same thing. Human agents are wasted on it.

2. Outbound — informational calls

  • Appointment reminders
  • Payment reminders
  • Callback scheduling
  • Satisfaction surveys
  • Product update notifications

The common thread: not a complex request, just consistent communication at scale.

3. Lead qualification — sales support

  • First triage of inbound leads
  • Needs assessment (BANT: Budget, Authority, Need, Timeline)
  • Booking time with a sales rep

The common thread: human sales reps are expensive, and 30% of first calls come from irrelevant leads. The AI filters those out before a rep spends time on them.

4. Multilingual customer service

  • Multiple languages at once, automatically
  • Language detection mid-call
  • Smaller cities / markets without native-speaker agents

Harder use cases

Use case | Why is it hard?
Crisis lines, mental health | Ethical risk: AI can't reliably recognise a crisis
Complex legal/medical advice | Liability issues, accuracy critical
Elderly or cognitively impaired callers | Hard when they interrupt, repeat, or don't understand the AI
Strongly emotional complaint calls | The AI feels "empty" and provokes escalation
Complex product configuration | Lots of if-else, poorly navigated by voice

The rule: if the human agent's role is emotional or trust-based — don't automate fully. If the role is informational or procedural — automate.


3. The 5 strategic decisions before launch

Decision 1: Build, buy, or hybrid?

Approach | Pro | Con | For whom?
Buy (Vapi, Retell, Bland) | Quick launch (1–2 weeks) | Less customisable, vendor lock-in | Standard use cases, early validation
Build on platform (LiveKit, Twilio + own logic) | Flexible, brand-consistent | 2–4 months of dev | Custom workflows, already have an AI team
Build from scratch | Full control | 6–12 months, expensive | Only if it's core business (e.g. you're a contact-center vendor)

90% rule: if you're not an AI-platform company, buy or build-on-platform. Avoid scratch building.

Decision 2: What do you automate, what do you leave to humans?

The "containment rate" = how often the AI resolves the call without a human. Industry averages:

  • Good voice AI: 60–75% containment rate
  • Excellent: 80–85%
  • Upper ceiling: ~90% (the rest really does need a human)

Strategic call: DON'T aim for 100%. The best voice AI system recognises when it can't help and gracefully hands off to a human ("warm handoff" with full conversation context).
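
As a rough illustration of how the containment rate defined above could be computed from call logs, here is a minimal sketch. The `CallRecord` type and its `handed_off_to_human` field are assumptions made for this example, not a vendor schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CallRecord:
    call_id: str
    handed_off_to_human: bool  # hypothetical flag set when the AI escalated


def containment_rate(calls: Iterable[CallRecord]) -> float:
    """Share of calls the AI resolved without a human taking over."""
    calls = list(calls)
    if not calls:
        raise ValueError("no calls to evaluate")
    contained = sum(1 for c in calls if not c.handed_off_to_human)
    return contained / len(calls)


# Synthetic sample: every fourth call is escalated to a human.
sample = [CallRecord(f"c{i}", handed_off_to_human=(i % 4 == 0)) for i in range(20)]
rate = containment_rate(sample)
print(f"{rate:.0%}")  # 75%
```

In practice the hard part is deciding what counts as "resolved": a caller who hangs up in frustration is not a handoff, but is not a contained call either, so the flag usually needs a companion outcome field.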

Decision 3: Voice and persona

This is not a trivial choice. Your voice becomes a brand asset.

  • Female vs. male voice: industry- and culture-dependent. Banking: often male (authority). Healthcare: often female (empathy). Worth A/B testing locally.
  • Age: a 25–35-year-old voice is generally "universally acceptable".
  • Accent: a local-market system needs native-language TTS. A foreign-accented voice is catastrophic.
  • Tempo: slower speech is better for older callers, faster for sales.
  • Disclosure: does the caller know they're talking to AI? Under the EU AI Act's transparency obligations (Article 50), disclosure becomes mandatory from August 2026.

Decision 4: Latency vs. quality trade-off

Architecture | Latency (P50) | Quality | Cost
Cheap STT + GPT-4o-mini + standard TTS | 1.5–2.5s | Medium | $0.05–0.10/min
Premium STT (Deepgram) + GPT-4o + ElevenLabs | 700ms–1.2s | High | $0.15–0.30/min
End-to-end (GPT-4o realtime) | 400–800ms | Very high | $0.30–0.60/min

Empirical rule: above 1.5 seconds callers start to feel the experience is "robotic". A premium stack reliably stays below.
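
One way to reason about the 1.5-second rule is as a latency budget: add up the per-stage delays and compare the total to the threshold. The sketch below does that. Every per-stage number is an illustrative assumption, not a measurement; only the 1.5 s threshold and the 300–600 ms first-token range come from this whitepaper.

```python
from typing import Dict

# Assumed, illustrative per-stage latencies in milliseconds.
STAGE_LATENCY_MS: Dict[str, int] = {
    "vad_endpointing": 200,       # deciding the caller has stopped speaking
    "stt_final_transcript": 150,
    "llm_first_token": 400,       # inside the 300-600 ms range quoted earlier
    "tts_first_audio": 150,
    "telephony_network": 100,
}

ROBOTIC_THRESHOLD_MS = 1500  # above this, callers start to perceive the bot as "robotic"


def total_latency_ms(stages: Dict[str, int]) -> int:
    """Sum of stage latencies, assuming the stages run strictly in sequence."""
    return sum(stages.values())


def within_budget(stages: Dict[str, int], threshold_ms: int = ROBOTIC_THRESHOLD_MS) -> bool:
    return total_latency_ms(stages) <= threshold_ms


total = total_latency_ms(STAGE_LATENCY_MS)
print(total, within_budget(STAGE_LATENCY_MS))  # 1000 True
```

The strictly sequential sum is a pessimistic simplification: streaming stacks overlap the stages, which is part of how premium setups stay under the threshold.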

Decision 5: Compliance and liability

  • GDPR: a voice recording is personal data. You need a lawful basis for processing (such as consent) and a defined retention period.
  • Recording disclosure: "This call may be recorded for quality assurance."
  • AI disclosure (EU AI Act): "You are speaking with an artificial intelligence assistant."
  • PII redaction: card numbers, social security IDs etc. must be automatically redacted from recordings/transcripts.
  • Liability: who's responsible if the AI gives wrong info? The vendor contract usually limits theirs — read it.
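
The PII redaction point above can be sketched for the card-number case: find digit runs that look like card numbers, confirm them with the Luhn checksum, and mask them in the transcript. This is a simplified illustration only. The pattern, the `[REDACTED CARD]` placeholder, and the function names are assumptions, and production redaction also has to cover spoken digits ("four one one one") and the audio itself.

```python
import re

# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Replace Luhn-valid card-like numbers; leave other digit runs untouched."""
    def _replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(_replace, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real card.
redacted = redact_card_numbers("My card is 4111 1111 1111 1111, thanks.")
print(redacted)  # My card is [REDACTED CARD], thanks.
```

The Luhn check keeps false positives down, so order numbers or long reference IDs that merely look like cards are left in place.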

4. Business ROI — what is it actually worth?

Example: mid-sized customer-service operation

Baseline:

  • 500 inbound calls per day
  • Average call length: 4 minutes
  • 5 agents (monthly fully loaded cost ~€1,500/agent = €7,500/month)
  • Cost per call: ~€0.50 (€7,500 / 15,000 calls per month, at 30 days)

Voice AI rollout (75% containment):

  • 375 calls handled by AI: 375 × 4 min × $0.15 = $225/day ≈ €200/day ≈ €6,000/month
  • 125 calls handled by humans: 2 agents sufficient → €3,000/month
  • Total: €9,000/month
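
The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with your own volumes. The inputs are the figures from this example. The 30-day month and the USD-to-EUR rate are assumptions, inferred from the "$225/day ≈ €200/day ≈ €6,000/month" step.

```python
from typing import Dict

CALLS_PER_DAY = 500
AVG_CALL_MINUTES = 4
DAYS_PER_MONTH = 30              # assumption implied by EUR 200/day -> EUR 6,000/month
AGENT_MONTHLY_COST_EUR = 1_500
AGENTS_BEFORE = 5
AGENTS_AFTER = 2
CONTAINMENT = 0.75
AI_COST_USD_PER_MIN = 0.15
USD_TO_EUR = 200 / 225           # assumption implied by "$225/day ~ EUR 200/day"


def monthly_costs() -> Dict[str, int]:
    """Monthly direct cost in EUR, before and after the voice AI rollout."""
    baseline = AGENTS_BEFORE * AGENT_MONTHLY_COST_EUR
    ai_calls_per_day = CALLS_PER_DAY * CONTAINMENT
    ai_usd_per_day = ai_calls_per_day * AVG_CALL_MINUTES * AI_COST_USD_PER_MIN
    ai_eur = ai_usd_per_day * USD_TO_EUR * DAYS_PER_MONTH
    human_eur = AGENTS_AFTER * AGENT_MONTHLY_COST_EUR
    return {
        "baseline_eur": baseline,
        "ai_eur": round(ai_eur),
        "human_eur": human_eur,
        "total_eur": round(ai_eur) + human_eur,
    }


costs = monthly_costs()
print(costs["baseline_eur"], costs["total_eur"])  # 7500 9000
```

Changing `AI_COST_USD_PER_MIN` to the cheaper or more expensive tiers from the latency table shows how sensitive the direct-cost comparison is to the stack you choose.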

Hmm — more expensive? Not necessarily. Look at the full picture:

Line item | Humans only | Voice AI + humans
Direct cost | €7,500 / mo | €9,000 / mo
Availability | 8h / day | 24/7
Wait time | 3–8 min | <5 seconds
Consistency | Variable | 100%
Peak-time scaling | Fails | Trivial
Agent churn | 30–50% / year | n/a
Languages | 1–2 | 30+

"More expensive" only on direct cost. Add:

  • +15–25% lead conversion from 24/7 availability (for sales)
  • +10 NPS points from shorter wait times
  • HR savings from lower churn
  • 30–50% fewer abandoned calls thanks to peak handling

→ ROI positive within 3–6 months for most use cases.


5. Industry benchmarks (2025–26)

Industry | Typical containment | Typical latency | Dominant use case
Restaurant | 80–90% | 1.0–1.5s | Reservations, menu
Healthcare | 60–70% | 1.2–1.8s | Appointments, prescriptions
E-commerce | 70–80% | 1.0–1.5s | Package status, returns
Banking | 50–65% | 1.5–2.0s | Balance, card block
Insurance | 55–70% | 1.5–2.0s | Claims, policy info
Real estate | 75–85% | 1.0–1.5s | Lead qualification, viewings

Banking/insurance have lower containment because of compliance and complex products — they intentionally escalate to humans more often.


6. The 7 most common rollout mistakes

Mistake 1: Automate everything

The "100% AI" dream project always fails. Roughly 90% containment is the practical ceiling; 100% is not achievable. Designing the handoff moment is mandatory.

Mistake 2: Wrong voice choice

A 65-year-old banking customer is not the same as a 25-year-old tech-startup buyer. Match the voice to the audience.

Mistake 3: No warm handoff

The AI hands the call to a human — but doesn't hand over the context. The human agent "starts over" → catastrophic UX. Always pass the context (transcript, summary, data already collected).
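
A minimal sketch of what "passing the context" could look like: a small payload carrying the transcript, a summary, and the data already collected, serialised so the agent desktop or CRM can show it when the call arrives. The `HandoffContext` type and all field names are hypothetical. Real platforms define their own transfer payloads.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HandoffContext:
    """Everything the human agent needs so the caller does not start over."""
    call_id: str
    reason: str                      # why the AI is escalating
    summary: str                     # short recap of the conversation so far
    collected_data: Dict[str, str] = field(default_factory=dict)
    transcript: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


context = HandoffContext(
    call_id="call-0001",
    reason="caller asked for a human",
    summary="Caller wants to dispute a delivery fee on a recent order.",
    collected_data={"order_id": "A-1001", "name": "Example Caller"},
    transcript=[
        "caller: I was charged a delivery fee twice.",
        "agent: I can see one order, A-1001. Let me connect you to a colleague.",
    ],
)
payload = context.to_json()
print(json.loads(payload)["collected_data"]["order_id"])  # A-1001
```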

Mistake 4: Only inbound, never outbound (or vice versa)

Voice AI works best in combination: both when the customer calls in and when we call them. Plan the two directions together.

Mistake 5: Too rigid or too flexible flow

  • Too flexible: the caller gets lost, the conversation becomes endless
  • Too rigid: the caller is frustrated because the AI doesn't grasp variation

The fix: main flow + escape hatches. The caller can say at any time "I want to talk to someone" → handoff.
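
The escape hatch can be sketched as a check that runs on every caller utterance before the main flow continues. The phrase patterns below are illustrative assumptions for English only. A production system would more likely use the LLM or an intent classifier, with a keyword check like this as a cheap safety net.

```python
import re

# Illustrative English-only patterns; extend per language and market.
HANDOFF_PATTERNS = [
    r"\b(talk|speak) to (a |an )?(human|person|agent|someone)\b",
    r"\breal person\b",
    r"\boperator\b",
]
_HANDOFF_RE = re.compile("|".join(HANDOFF_PATTERNS), re.IGNORECASE)


def wants_human(utterance: str) -> bool:
    """True if the caller is explicitly asking for a human."""
    return bool(_HANDOFF_RE.search(utterance))


def next_step(utterance: str) -> str:
    """The escape hatch is checked first, regardless of where the flow is."""
    return "handoff" if wants_human(utterance) else "continue_flow"


print(next_step("I want to talk to someone"))      # handoff
print(next_step("What are your opening hours?"))   # continue_flow
```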

Mistake 6: No monitoring or evaluation

AI conversations must be logged, tagged, reviewed. Sample them weekly. If you don't, you have no idea what the experience is like.
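
For the weekly sample, one simple option is to draw a fixed number of conversations per tag, so that rare categories such as handoffs are not drowned out by routine calls. The sketch below assumes each logged conversation is a dict with a `tag` key. That structure and the function name are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def weekly_review_sample(
    conversations: List[Dict],
    per_tag: int = 5,
    seed: Optional[int] = None,
) -> List[Dict]:
    """Pick up to `per_tag` random conversations from every tag."""
    rng = random.Random(seed)
    by_tag: Dict[str, List[Dict]] = defaultdict(list)
    for conv in conversations:
        by_tag[conv["tag"]].append(conv)

    sample: List[Dict] = []
    for tag in sorted(by_tag):  # sorted for a stable order across runs
        group = by_tag[tag]
        sample.extend(rng.sample(group, min(per_tag, len(group))))
    return sample


# Synthetic log: every third call ended in a handoff.
logged = [
    {"id": i, "tag": "handoff" if i % 3 == 0 else "contained"}
    for i in range(12)
]
picked = weekly_review_sample(logged, per_tag=3, seed=42)
print(len(picked))  # 6
```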

Mistake 7: The "fake human" trap

The AI denies being AI. The caller notices. Trust collapse.

Or the AI sounds so human that the caller agrees to a purchase without the AI/human status being properly communicated.

The fix: explicit AI disclosure at the start of the call. Research shows this does not reduce containment rate, and it will be mandatory in the EU.


7. The roadmap — how do you start?

Month 1: discovery

  • Call analysis: top-10 call types, share suitable for AI
  • Containment target (realistic: 50–70% in the first round)
  • Vendor evaluation (3–5 platforms tested)

Month 2: pilot

  • 1 use case, 1 language, 1 mini-flow
  • Internal testing (own team)
  • External pilot (10–50 real calls, opt-in)

Month 3: tuning + scale

  • Flow refinement based on results
  • Measure containment rate, NPS, ASA (average speed of answer)
  • Scale: more call types, more languages, more time-of-day coverage

Month 4–6: optimisation

  • A/B testing: voices, prompts, flows
  • Outbound integration
  • Deeper CRM integration
  • Continuous improvement: weekly review of the top-10 "failed" conversation categories

8. The future: voice AI 2027–2030

What can we expect over the next 3–5 years?

  • Real-time multilingual switching — language switching within a single conversation
  • Emotional modelling — TTS with audibly emotional tone (joy, regret, excitement)
  • Multimodal — SMS / email / app push synchronised during the call
  • Personalised voice — each customer gets the voice they personally find "pleasant"
  • Real-time empathy detection — the system notices frustration and escalates automatically
  • Voice clone disclosure — regulation around when known voices may be cloned

The big question is not technological: how much do we accept from a machine voice? It will be generational. Gen Z already prefers AI for several use cases (faster, no need to "wait around"). Older generations are sceptical — here you need to prove the value.


Summary: 7 takeaways

  1. Voice AI reached natural-conversation quality in 2024–25 — STT + streaming LLM + TTS together give sub-1s latency.

  2. Sweet spot: repetitive inbound, informational outbound, lead qualification, 24/7 availability. NOT sweet spot: emotional or trust-based calls.

  3. Buy or build-on-platform, don't build from scratch — unless AI platforms are your core business.

  4. Don't aim for 100% automation — 70–85% containment is the realistic target; everything else is a warm handoff.

  5. ROI positive within 3–6 months if you look at the full picture (24/7, NPS, peak handling, churn).

  6. Avoid the 7 mistakes: 100% automation, wrong voice, no warm handoff, single-direction rollout (inbound-only or outbound-only), flows that are too rigid or too flexible, no monitoring, fake-human trap.

  7. A 4–6 month roadmap (discovery → pilot → tuning → optimisation) gets there with less risk than a "big-bang" rollout.

Voice AI doesn't replace human customer service — it lifts it. Human agents get time for the complex, emotional, trust-based calls while AI takes the repetitive 70% off their plate. A modern, efficient contact centre in 2026 cannot operate without voice AI — just as it couldn't operate without email as a channel in 2010.

The question isn't whether you deploy it. It's when, and how well.

