
Voice Channel — Overview

See a real call end-to-end

The page "A Voice Call, End-to-End" traces one real four-turn cardiology appointment call through every stage of this channel (STT, language lock, the agentic orchestrator, follow-up resolution, the deterministic capability short-circuit, and TTS), with backend cognition captured live from the pilot.

Mission

The voice channel turns every inbound telephone call to ZOL into an intelligent, knowledge-base-grounded conversation — no DTMF menus, no call-center queue for navigation questions. A caller who asks "Waar is de cardiologie?" gets an immediate spoken answer drawn from the same 10 430-chunk corpus that powers the web chatbot, with the same zero-medical-advice safety envelope. The channel is in production on the local stack with Twilio Belgian PSTN number +32460256021 and awaiting pilot DNS + TLS for full external rollout (ADR-0050 Phase B).

The voice surface is the highest-leverage hospital-search differentiator: voice is where the existing call-centre load lives, where elderly callers (the majority of hospital-helpdesk traffic) are most comfortable, and where the regulatory bar — AI Act Article 50(2) per-turn transparency — is hardest to discharge. Discharging it requires the spoken disclaimer prepender (Voice Safety Architecture, Stage 4) and the caller-auditable telemetry stack (Value Framework telemetry, category_mismatch_telemetry).

Five-layer architecture

| Layer | Component | Role |
| --- | --- | --- |
| Telephony | Twilio Elastic SIP Trunk | Owns the Belgian PSTN number; forwards SIP INVITE to LiveKit (RFC 3261). |
| SIP gateway | livekit-sip container | Translates PSTN μ-law audio to an Opus/WebRTC participant. |
| Room | LiveKit server (@livekit_agents_docs) | Media relay between SIP participant and agent worker. |
| Agent | voice_agent worker (livekit-agents 1.5.6) | Deepgram Nova-3 STT (@deepgram_nova3), per-turn WS call to backend, ElevenLabs Multilingual v2 TTS (@elevenlabs_multilingual_v2). |
| Backend | VoiceLLMOrchestrator (GPT-4.1 with tool-use) | Regex pre-filter → LLM tool-call loop → safety post-filter → answer-shaper → disclaimer-prepender. |
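
The backend stage order in the last row can be illustrated with a minimal Python sketch. Only the ordering of the five stages comes from the table; every function name, regex, and string below is a hypothetical placeholder, and routing a pre-filter reply through the remaining stages is an assumption rather than confirmed behaviour.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns and texts; the real rules are not given in this overview.
TERMINAL = re.compile(r"\b(goodbye|tot ziens)\b", re.IGNORECASE)
BLOCKED = re.compile(r"\b(dosage|diagnosis)\b", re.IGNORECASE)
DISCLAIMER = "You are speaking with an AI assistant."
REFUSAL = "I cannot give medical advice. I can transfer you to the helpdesk."


def regex_prefilter(utterance: str) -> Optional[str]:
    """Stage 1: return a canned reply on a match, or None to fall through."""
    return "Goodbye." if TERMINAL.search(utterance) else None


def safety_postfilter(answer: str) -> str:
    """Stage 3: swap a draft that looks like medical advice for a refusal."""
    return REFUSAL if BLOCKED.search(answer) else answer


def shape_answer(answer: str) -> str:
    """Stage 4: strip URLs, [N] markers, and markdown so TTS reads cleanly."""
    answer = re.sub(r"https?://\S+", "", answer)
    answer = re.sub(r"\[\d+\]", "", answer)
    answer = re.sub(r"[*_`#]", "", answer)
    return re.sub(r"\s+", " ", answer).strip()


def prepend_disclaimer(answer: str) -> str:
    """Stage 5: put the spoken transparency notice in front of the answer."""
    return f"{DISCLAIMER} {answer}"


def run_turn(utterance: str, llm_loop: Callable[[str], str]) -> str:
    """Run one turn; llm_loop stands in for stage 2, the LLM tool-call loop."""
    draft = regex_prefilter(utterance)
    if draft is None:
        draft = llm_loop(utterance)
    return prepend_disclaimer(shape_answer(safety_postfilter(draft)))
```

Passing the LLM loop in as a callable keeps the deterministic stages testable without a model call, for example `run_turn("Tot ziens", my_loop)` never invokes `my_loop`.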

The cognition layer

The backend VoiceLLMOrchestrator is agentic: a single GPT-4.1 agent with three tools — search_hospital_kb, transfer_to_helpdesk, end_call — decides, per turn, whether to answer directly, search the knowledge base, transfer the caller, or close the call. It calls the same RAGService.query_stream as the chat channel with channel="voice", which activates voice-specific behaviour:

  • Value Framework affinity rerank (Stage 5b) and synthetic department-doctor injection (Stage 5c) steer retrieval toward the right category of content.
  • Chunk-derived citations — the spoken answer carries no [N] markers (un-pronounceable), so citations are derived directly from the retrieved chunks.
  • Per-tenant persona — name, voice ID, greeting, and available languages are served from GET /api/v1/voice/persona/{tenant_slug}.
  • Doctor-profile retrieval boost — a 1.50× boost when a document title starts with Dr. <Name> (see ADR-0057).
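
The doctor-profile boost in the last bullet could look roughly like the following sketch. The 1.50 factor and the "title starts with Dr. <Name>" condition come from the text above; the `Hit` type, the function name, and the assumption that the boost multiplies a relevance score before re-sorting are illustrative only (ADR-0057 holds the actual rule).

```python
import re
from dataclasses import dataclass
from typing import List

DOCTOR_BOOST = 1.50                        # factor stated in the overview
DOCTOR_TITLE = re.compile(r"^Dr\.\s+\S")   # title begins with "Dr. <Name>"


@dataclass
class Hit:
    """Hypothetical retrieval hit: a document title and a relevance score."""
    title: str
    score: float


def apply_doctor_boost(hits: List[Hit]) -> List[Hit]:
    """Multiply doctor-profile scores by the boost, then sort best first."""
    boosted = [
        Hit(h.title, h.score * DOCTOR_BOOST) if DOCTOR_TITLE.match(h.title) else h
        for h in hits
    ]
    return sorted(boosted, key=lambda h: h.score, reverse=True)
```

With this sketch, a doctor profile scored 0.50 outranks a department page scored 0.60, because 0.50 × 1.50 = 0.75.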

See Architecture for the full composition, per-turn flow, feature-flag topology, and latency budget.
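
As an illustration of the three-tool surface described in this section, the sketch below declares the tools in the OpenAI function-calling schema shape and dispatches a tool call to a handler. The tool names come from the text; the descriptions, parameter names (`query`, `reason`), and the `dispatch_tool_call` helper are assumptions made for the example, not the orchestrator's actual definitions.

```python
from typing import Any, Callable, Dict, List


def _tool(name: str, description: str, properties: Dict[str, Any],
          required: List[str]) -> Dict[str, Any]:
    """Build one tool entry in the function-calling schema shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names are from the overview; parameters are hypothetical.
VOICE_TOOLS = [
    _tool("search_hospital_kb", "Search the hospital knowledge base.",
          {"query": {"type": "string"}}, ["query"]),
    _tool("transfer_to_helpdesk", "Hand the caller over to a human.",
          {"reason": {"type": "string"}}, []),
    _tool("end_call", "Close the call politely.", {}, []),
]
TOOL_NAMES = {t["function"]["name"] for t in VOICE_TOOLS}


def dispatch_tool_call(name: str, arguments: Dict[str, Any],
                       handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Route a model-issued tool call to its handler; reject unknown tools."""
    if name not in TOOL_NAMES or name not in handlers:
        raise ValueError(f"unknown tool: {name}")
    return handlers[name](**arguments)
```

Rejecting undeclared tool names before dispatch keeps the agent limited to the three actions listed, even if the model emits an unexpected name.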

Pages in this section

  • Architecture — the VoiceLLMOrchestrator composition, module layout, per-turn flow, latency budget, and how voice differs from chat
  • Value Framework — intent-to-category affinity rerank, primary-category prompt guard, wheelchair regression, telemetry
  • Citation Pipeline — voice citations without inline [N] markers; the cascade fix; cache flush discipline
  • Voice Safety Architecture — two-stage safety model + Stage 4 disclaimer prepender (rewritten from "triple-defense" framing in Wave 2.C / 2.C-tail D2)
  • Conversational Intent — derived from classify_terminal regex pre-filter + GPT-4.1 tool-choice (replaces the deleted three-tier resolver)
  • Language Locking (ADR-0052): locks the language at first utterance, eliminating Deepgram mid-call silence failures
  • Twilio + LiveKit SIP — full telephony architecture, Phase A runbook, Phase B pending
  • Answer Shaping — TTS-friendly deterministic post-processing (markdown / URL / citation strip, abbreviation expand, time + currency spell-out, sentence cap)
  • Tenant Overlay System — multi-tenant FAQ + STT mishear corrections via per-tenant overlay package
  • Adaptive TTS Speed, Context-aware Filler, Listening Ack, Prosody Injection — voice_agent-side UX details
  • Evaluation — voice golden set and SOTA benchmark harness
  • Local Setup — developer setup for voice on localhost
  • Phase B Local — SIP softphone testing before pilot DNS lands
  • Smoke Test Script — operator script for verifying a deploy

References

  • ADR-0049: Thin Voice Architecture (stepping stone — superseded by ADR-0051)
  • ADR-0050: Twilio + LiveKit SIP Integration
  • ADR-0051: Agentic VoiceLLMOrchestrator is the Only Voice Path
  • ADR-0052: Voice Language Locked at First Utterance
  • Lewis et al. 2020 — original RAG architecture; the agentic orchestrator's search_hospital_kb tool is the RAG retriever stage exposed as a tool
  • Gao et al. 2024 — Modular RAG taxonomy; voice channel is "Modular RAG with agentic dispatcher"
  • Regulation (EU) 2024/1689, AI Act Article 50 — transparency for interactive AI systems; drives the spoken disclosure requirement at call greeting AND per-turn via the disclaimer prepender. Full text at https://eur-lex.europa.eu/eli/reg/2024/1689/oj.
  • Belgian ePrivacy Directive transposition — governs call-recording consent; the voice channel is designed record-free