Voice Channel — Overview
A Voice Call, End-to-End traces one real four-turn cardiology appointment call through every stage of this channel — STT, language-lock, the agentic orchestrator, follow-up resolution, the deterministic capability short-circuit, and TTS — with backend cognition captured live from the pilot.
Mission
The voice channel turns every inbound telephone call to ZOL into an intelligent, knowledge-base-grounded conversation, with no DTMF menus and no call-centre queue for navigation questions. A caller who asks "Waar is de cardiologie?" gets an immediate spoken answer drawn from the same 10 430-chunk corpus that powers the web chatbot, with the same zero-medical-advice safety envelope. The channel is in production on the local stack with Twilio Belgian PSTN number +32460256021 and is awaiting pilot DNS + TLS for full external rollout (ADR-0050 Phase B).
The voice surface is the highest-leverage hospital-search differentiator: voice is where the existing call-centre load lives, where elderly callers (the majority of hospital-helpdesk traffic) are most comfortable, and where the regulatory bar — AI Act Article 50(2) per-turn transparency — is hardest to discharge. Discharging it requires the spoken disclaimer prepender (Voice Safety Architecture, Stage 4) and the caller-auditable telemetry stack (Value Framework telemetry, category_mismatch_telemetry).
Five-layer architecture
| Layer | Component | Role |
|---|---|---|
| Telephony | Twilio Elastic SIP Trunk | Owns the Belgian PSTN number; forwards SIP INVITE to LiveKit. (RFC 3261) |
| SIP gateway | livekit-sip container | Translates PSTN μ-law audio to Opus/WebRTC participant |
| Room | LiveKit server (@livekit_agents_docs) | Media relay between SIP participant and agent worker |
| Agent | voice_agent worker (livekit-agents 1.5.6) | Deepgram Nova-3 STT (@deepgram_nova3), per-turn WS call to backend, ElevenLabs Multilingual v2 TTS (@elevenlabs_multilingual_v2) |
| Backend | VoiceLLMOrchestrator (GPT-4.1 with tool-use) | regex pre-filter → LLM tool-call loop → safety post-filter → answer-shaper → disclaimer-prepender |
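The backend row lists five stages in a fixed order. The sketch below shows one way such a chain could be composed. It is illustrative only: every function body, pattern, and reply string is a placeholder, and it assumes that a reply produced by the regex pre-filter still passes through the later stages, which the table does not confirm.

```python
import re
from typing import Callable, List, Optional

# Hypothetical pattern: utterances the pre-filter could answer without the LLM.
GOODBYE = re.compile(r"\b(tot ziens|bye)\b", re.IGNORECASE)


def regex_pre_filter(transcript: str) -> Optional[str]:
    """Return a canned reply for a terminal utterance, else None."""
    return "Tot ziens." if GOODBYE.search(transcript) else None


def llm_tool_call_loop(transcript: str) -> str:
    """Stand-in for the GPT-4.1 tool-call loop; no model is called here."""
    return f"**Antwoord** op: {transcript}"


def safety_post_filter(answer: str) -> str:
    """Stand-in: swap the answer for a hand-off line if a blocked term appears."""
    blocked = ("dosering",)  # placeholder term list
    if any(term in answer.lower() for term in blocked):
        return "Daarvoor verbind ik u door met de helpdesk."
    return answer


def answer_shaper(answer: str) -> str:
    """Stand-in: strip markdown emphasis so TTS does not read it aloud."""
    return answer.replace("**", "")


def disclaimer_prepender(answer: str) -> str:
    """Stand-in: prepend a spoken AI disclosure to every turn."""
    return "U spreekt met een AI-assistent. " + answer


POST_STAGES: List[Callable[[str], str]] = [
    safety_post_filter,
    answer_shaper,
    disclaimer_prepender,
]


def handle_turn(transcript: str) -> str:
    # The pre-filter may short-circuit the LLM entirely.
    answer = regex_pre_filter(transcript)
    if answer is None:
        answer = llm_tool_call_loop(transcript)
    # Remaining stages run in the order given in the table.
    for stage in POST_STAGES:
        answer = stage(answer)
    return answer
```

Keeping the post-LLM stages in a plain list makes the ordering explicit and easy to test in isolation.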
The cognition layer
The backend VoiceLLMOrchestrator is agentic: a single GPT-4.1 agent with three tools — search_hospital_kb, transfer_to_helpdesk, end_call — decides, per turn, whether to answer directly, search the knowledge base, transfer the caller, or close the call. It calls the same RAGService.query_stream as the chat channel with channel="voice", which activates voice-specific behaviour:
- Value Framework affinity rerank (Stage 5b) and synthetic department-doctor injection (Stage 5c) steer retrieval toward the right category of content.
- Chunk-derived citations: the spoken answer carries no `[N]` markers (unpronounceable), so citations are derived directly from the retrieved chunks.
- Per-tenant persona: name, voice ID, greeting, and available languages are served from `GET /api/v1/voice/persona/{tenant_slug}`.
- Doctor-profile retrieval boost: a 1.50× boost when a document title starts with `Dr. <Name>` (see ADR-0057).
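The doctor-profile boost in the last bullet can be pictured as a score multiplier applied before the final ranking. The following is a minimal sketch under stated assumptions: the `Chunk` type, the title regex, and the unconditional application of the boost are all hypothetical, and only the 1.50 factor and the `Dr. <Name>` title prefix come from the text.

```python
import re
from dataclasses import dataclass
from typing import List

DOCTOR_BOOST = 1.50  # factor stated in the text
# Assumed interpretation of "title starts with Dr. <Name>".
DOCTOR_TITLE = re.compile(r"^Dr\.\s+\S")


@dataclass
class Chunk:
    """Hypothetical retrieved-chunk record: a document title plus a relevance score."""
    title: str
    score: float


def apply_doctor_boost(chunks: List[Chunk]) -> List[Chunk]:
    """Multiply the score of doctor-profile chunks, then sort best-first."""
    boosted = [
        Chunk(c.title, c.score * DOCTOR_BOOST if DOCTOR_TITLE.match(c.title) else c.score)
        for c in chunks
    ]
    return sorted(boosted, key=lambda c: c.score, reverse=True)


# Usage with made-up titles and scores:
ranked = apply_doctor_boost([Chunk("Cardiologie", 0.60), Chunk("Dr. Example", 0.50)])
```

In this made-up case the profile chunk moves from second to first place because 0.50 × 1.50 exceeds 0.60.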
See Architecture for the full composition, per-turn flow, feature-flag topology, and latency budget.
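To make the per-turn decision concrete, the sketch below routes a model-selected tool call to one of the three tools named in this section, or returns the direct answer when no tool is chosen. The tool names come from the text. The argument names, return values, and the shape of the `tool_call` dictionary are assumptions made for illustration, not the orchestrator's actual interface.

```python
from typing import Any, Callable, Dict, Optional


def search_hospital_kb(query: str) -> str:
    """Stand-in for the knowledge-base search tool."""
    return f"kb-results({query})"


def transfer_to_helpdesk(reason: str) -> str:
    """Stand-in for handing the caller to a human."""
    return f"transfer({reason})"


def end_call() -> str:
    """Stand-in for closing the call."""
    return "end"


# Registry keyed by the tool names given in the text.
TOOLS: Dict[str, Callable[..., str]] = {
    "search_hospital_kb": search_hospital_kb,
    "transfer_to_helpdesk": transfer_to_helpdesk,
    "end_call": end_call,
}


def dispatch(tool_call: Optional[Dict[str, Any]], direct_answer: str = "") -> str:
    """Run the selected tool, or fall back to the direct answer."""
    if tool_call is None:
        # No tool chosen: the model answered directly.
        return direct_answer
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**tool_call.get("arguments", {}))
```

For example, `dispatch({"name": "search_hospital_kb", "arguments": {"query": "cardiologie"}})` would invoke the search stub, while `dispatch(None, "Goedemiddag.")` returns the direct answer unchanged.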
Pages in this section
- Architecture: the `VoiceLLMOrchestrator` composition, module layout, per-turn flow, latency budget, and how voice differs from chat
- Value Framework: intent-to-category affinity rerank, primary-category prompt guard, wheelchair regression, telemetry
- Citation Pipeline: voice citations without inline `[N]` markers; the cascade fix; cache flush discipline
- Voice Safety Architecture: two-stage safety model + Stage 4 disclaimer prepender (rewritten from "triple-defense" framing in Wave 2.C / 2.C-tail D2)
- Conversational Intent: derived from `classify_terminal` regex pre-filter + GPT-4.1 tool-choice (replaces the deleted three-tier resolver)
- Language Locking: ADR-0052, locks language at first utterance, eliminates Deepgram mid-call silence failures
- Twilio + LiveKit SIP: full telephony architecture, Phase A runbook, Phase B pending
- Answer Shaping: TTS-friendly deterministic post-processing (markdown / URL / citation strip, abbreviation expand, time + currency spell-out, sentence cap)
- Tenant Overlay System: multi-tenant FAQ + STT mishear corrections via per-tenant overlay package
- Adaptive TTS Speed, Context-aware Filler, Listening Ack, Prosody Injection: voice_agent-side UX details
- Evaluation: voice golden set and SOTA benchmark harness
- Local Setup: developer setup for voice on localhost
- Phase B Local: SIP softphone testing before pilot DNS lands
- Smoke Test Script: operator script for verifying a deploy
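As a taste of what the Answer Shaping entry above refers to, here is a minimal sketch of deterministic TTS post-processing. It covers only the URL, citation, and markdown strip, a tiny abbreviation map, and a sentence cap. Time and currency spell-out are omitted. The regexes, the abbreviation table, and the default cap of two sentences are assumptions rather than the project's actual rules.

```python
import re

URL = re.compile(r"https?://\S+")
CITATION = re.compile(r"\s*\[\d+\]")   # inline markers such as [1]
MARKDOWN = re.compile(r"[*_`]+")        # emphasis and code marks
# Hypothetical abbreviation table; a real one would be per language.
ABBREVIATIONS = {"bv.": "bijvoorbeeld"}


def shape_for_tts(text: str, max_sentences: int = 2) -> str:
    """Strip unpronounceable markup and cap the number of sentences."""
    text = URL.sub("", text)        # remove URLs before stripping "_" and "*"
    text = CITATION.sub("", text)
    text = MARKDOWN.sub("", text)
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = re.sub(r"\s+", " ", text).strip()
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])
```

Because every step is a pure string transformation, the same input always yields the same spoken text, which is what makes this kind of stage easy to cover with golden tests.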
References
- ADR-0049: Thin Voice Architecture (stepping stone — superseded by ADR-0051)
- ADR-0050: Twilio + LiveKit SIP Integration
- ADR-0051: Agentic VoiceLLMOrchestrator is the Only Voice Path
- ADR-0052: Voice Language Locked at First Utterance
- Lewis et al. 2020: original RAG architecture; the agentic orchestrator's `search_hospital_kb` tool is the RAG retriever stage exposed as a tool
- Gao et al. 2024: Modular RAG taxonomy; voice channel is "Modular RAG with agentic dispatcher"
- Regulation (EU) 2024/1689, AI Act Article 50 — transparency for interactive AI systems; drives the spoken disclosure requirement at call greeting AND per-turn via the disclaimer prepender. Full text at https://eur-lex.europa.eu/eli/reg/2024/1689/oj.
- Belgian ePrivacy Directive transposition — governs call-recording consent; the voice channel is designed record-free