Voice Channel — Overview

See a real call end-to-end

A Voice Call, End-to-End traces one real four-turn cardiology appointment call through every stage of this channel — STT, language-lock, the agentic orchestrator, follow-up resolution, the deterministic capability short-circuit, and TTS — with backend cognition captured live from the pilot.

Mission

The voice channel turns every inbound telephone call to ZOL into an intelligent, knowledge-base-grounded conversation: no DTMF menus, and no call-centre queue for navigation questions. A caller who asks "Waar is de cardiologie?" ("Where is cardiology?") gets an immediate spoken answer drawn from the same 10 430-chunk corpus that powers the web chatbot, with the same zero-medical-advice safety envelope. The channel is in production on the local stack with the Twilio Belgian PSTN number <PILOT_PSTN_NUMBER>, and is awaiting pilot DNS + TLS for full external rollout (ADR-0050 Phase B).

The voice surface is the highest-leverage hospital-search differentiator: voice is where the existing call-centre load lives, where elderly callers (the majority of hospital-helpdesk traffic) are most comfortable, and where the regulatory bar — AI Act Article 50(2) per-turn transparency — is hardest to discharge. Discharging it requires the spoken disclaimer prepender (Voice Safety Architecture, Stage 4) and the caller-auditable telemetry stack (Value Framework telemetry, category_mismatch_telemetry).

Five-layer architecture

Layer	Component	Role
Telephony	Twilio Elastic SIP Trunk	Owns the Belgian PSTN number; forwards SIP INVITE to LiveKit. (RFC 3261)
SIP gateway	`livekit-sip` container	Translates PSTN μ-law audio to Opus/WebRTC participant
Room	LiveKit server (@livekit_agents_docs)	Media relay between SIP participant and agent worker
Agent	`voice_agent` worker (livekit-agents 1.5.6)	Deepgram Nova-3 STT (@deepgram_nova3), per-turn WS call to backend, ElevenLabs Multilingual v2 TTS (@elevenlabs_multilingual_v2)
Backend	`VoiceLLMOrchestrator` (GPT-4.1 with tool-use)	regex pre-filter → LLM tool-call loop → safety post-filter → answer-shaper → disclaimer-prepender
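
The stage order in the Backend row can be pictured as a chain of plain functions. The following is a minimal sketch, not the pilot's implementation: every function name, the regexes, the disclaimer wording, and the canned draft answer are hypothetical, and it assumes that a pre-filter hit skips the LLM loop but still passes through the later stages.

```python
import re
from typing import Optional

# Hypothetical disclaimer wording; the real text is defined elsewhere.
DISCLAIMER = "This is an automated assistant."


def regex_prefilter(transcript: str) -> Optional[str]:
    """Return a canned reply for an obvious terminal utterance, else None."""
    if re.search(r"\b(goodbye|bye|tot ziens)\b", transcript, re.IGNORECASE):
        return "Goodbye."
    return None


def llm_tool_loop(transcript: str) -> str:
    """Stand-in for the LLM tool-call loop; returns a fixed draft answer."""
    return "Cardiology is on **route 12** [1]."


def safety_postfilter(answer: str) -> str:
    """Placeholder check: replace drafts that look like medical advice."""
    if re.search(r"\b(dosage|diagnos)", answer, re.IGNORECASE):
        return "I cannot give medical advice."
    return answer


def answer_shaper(answer: str) -> str:
    """Strip citation markers and markdown bold so TTS reads clean text."""
    answer = re.sub(r"\s*\[\d+\]", "", answer)
    answer = answer.replace("**", "")
    return re.sub(r"\s+", " ", answer).strip()


def disclaimer_prepender(answer: str) -> str:
    """Put the spoken disclaimer in front of the shaped answer."""
    return f"{DISCLAIMER} {answer}"


def handle_turn(transcript: str) -> str:
    """Run one caller turn through the five stages in table order."""
    canned = regex_prefilter(transcript)
    draft = canned if canned is not None else llm_tool_loop(transcript)
    for stage in (safety_postfilter, answer_shaper, disclaimer_prepender):
        draft = stage(draft)
    return draft
```

Keeping each stage a pure string-to-string function makes the order explicit and lets each stage be tested in isolation.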

The cognition layer

The backend VoiceLLMOrchestrator is agentic: a single GPT-4.1 agent with three tools — search_hospital_kb, transfer_to_helpdesk, end_call — decides, per turn, whether to answer directly, search the knowledge base, transfer the caller, or close the call. It calls the same RAGService.query_stream as the chat channel with channel="voice", which activates voice-specific behaviour:

Value Framework affinity rerank (Stage 5b) and synthetic department-doctor injection (Stage 5c) steer retrieval toward the right category of content.
Chunk-derived citations: the spoken answer carries no [N] markers (they are unpronounceable), so citations are derived directly from the retrieved chunks.
Per-tenant persona: name, voice ID, greeting, and available languages are served from GET /api/v1/voice/persona/{tenant_slug}.
Doctor-profile retrieval boost: a 1.50× boost applied when a document title starts with Dr. <Name> (see ADR-0057).
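
As an illustration of the last item, the doctor-profile boost could look like the sketch below. It assumes the boost multiplies a chunk's relevance score and that results are re-sorted afterwards; the `Chunk` type, the function name, the title regex, and the example titles are hypothetical. ADR-0057 remains the authority on the actual rule.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed title pattern: "Dr." followed by whitespace and a name.
DOCTOR_TITLE = re.compile(r"^Dr\.\s+\S")
DOCTOR_BOOST = 1.50  # multiplier quoted in the text above


@dataclass
class Chunk:
    """Hypothetical retrieved chunk: document title plus relevance score."""
    title: str
    score: float


def apply_doctor_boost(chunks: List[Chunk]) -> List[Chunk]:
    """Multiply the score of doctor-profile chunks, then re-rank by score."""
    boosted = [
        Chunk(c.title, c.score * DOCTOR_BOOST if DOCTOR_TITLE.match(c.title) else c.score)
        for c in chunks
    ]
    return sorted(boosted, key=lambda c: c.score, reverse=True)


ranked = apply_doctor_boost([
    Chunk("Cardiology department", 0.60),
    Chunk("Dr. Peeters", 0.50),  # illustrative name only
])
# The doctor profile now outranks the department page (0.75 vs 0.60).
```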

See Architecture for the full composition, per-turn flow, feature-flag topology, and latency budget.
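
The per-turn choice between answering directly and calling one of the three tools can be pictured as a small dispatch table. This is a sketch under stated assumptions: the tool names come from this page, but the `ToolCall` type, the handler signatures, and the return strings are hypothetical, and the model call that makes the actual choice is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ToolCall:
    """Hypothetical shape of one per-turn decision."""
    name: Optional[str]      # None means "answer directly"
    argument: str = ""       # free-text argument for the chosen tool
    direct_answer: str = ""  # used only when name is None


def search_hospital_kb(query: str) -> str:
    # Stand-in for the knowledge-base retrieval call.
    return f"kb-results:{query}"


def transfer_to_helpdesk(reason: str) -> str:
    # Stand-in for handing the caller to a human agent.
    return f"transfer:{reason}"


def end_call(reason: str) -> str:
    # Stand-in for closing the call.
    return f"end:{reason}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search_hospital_kb": search_hospital_kb,
    "transfer_to_helpdesk": transfer_to_helpdesk,
    "end_call": end_call,
}


def dispatch(call: ToolCall) -> str:
    """Route one decision to its handler, or return the direct answer."""
    if call.name is None:
        return call.direct_answer
    handler = TOOLS.get(call.name)
    if handler is None:
        raise ValueError(f"unknown tool: {call.name}")
    return handler(call.argument)
```

Rejecting unknown tool names explicitly keeps a malformed model response from silently turning into a spoken answer.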

Pages in this section

Architecture — the VoiceLLMOrchestrator composition, module layout, per-turn flow, latency budget, and how voice differs from chat
Value Framework — intent-to-category affinity rerank, primary-category prompt guard, wheelchair regression, telemetry
Citation Pipeline — voice citations without inline [N] markers; the cascade fix; cache flush discipline
Voice Safety Architecture — two-stage safety model + Stage 4 disclaimer prepender (rewritten from "triple-defense" framing in Wave 2.C / 2.C-tail D2)
Conversational Intent — derived from classify_terminal regex pre-filter + GPT-4.1 tool-choice (replaces the deleted three-tier resolver)
Language Locking — ADR-0052: locks language at first utterance, eliminates Deepgram mid-call silence failures
Twilio + LiveKit SIP — full telephony architecture, Phase A runbook, Phase B pending
Answer Shaping — TTS-friendly deterministic post-processing (markdown / URL / citation strip, abbreviation expand, time + currency spell-out, sentence cap)
Tenant Overlay System — multi-tenant FAQ + STT mishear corrections via per-tenant overlay package
Adaptive TTS Speed, Context-aware Filler, Listening Ack, Prosody Injection — voice_agent-side UX details
Evaluation — voice golden set and SOTA benchmark harness
Local Setup — developer setup for voice on localhost
Phase B Local — SIP softphone testing before pilot DNS lands
Smoke Test Script — operator script for verifying a deploy

References

ADR-0049: Thin Voice Architecture (stepping stone — superseded by ADR-0051)
ADR-0050: Twilio + LiveKit SIP Integration
ADR-0051: Agentic VoiceLLMOrchestrator is the Only Voice Path
ADR-0052: Voice Language Locked at First Utterance
Lewis et al. 2020 — original RAG architecture; the agentic orchestrator's search_hospital_kb tool is the RAG retriever stage exposed as a tool
Gao et al. 2024 — Modular RAG taxonomy; voice channel is "Modular RAG with agentic dispatcher"
Regulation (EU) 2024/1689, AI Act Article 50 — transparency for interactive AI systems; drives the spoken disclosure requirement at call greeting AND per-turn via the disclaimer prepender. Full text at https://eur-lex.europa.eu/eli/reg/2024/1689/oj.
Belgian ePrivacy Directive transposition — governs call-recording consent; the voice channel is designed record-free
