Voice Stack Compendium

A Transferable Architecture for Regulated-Domain Voice Agents

This document is a self-contained engineering description of the voice stack that powers the ZOL hospital intelligent-search pilot. It is written so that an experienced systems engineer who has not previously worked on the project can read it once, understand the cognitive and operational shape of the system, and rebuild a comparable system in an adjacent regulated domain at roughly seventy-percent fidelity without further consultation. The text is paragraph-prose at a graduate-thesis register; reference-style detail lives on the per-feature pages cross-linked throughout. The companion bibliography at /docs/references is the single source of truth for every cited work.

1. Mission and scope

The system was commissioned by Ziekenhuis Oost-Limburg (ZOL), a Belgian acute-care hospital with roughly one hundred thousand monthly website visitors and twenty-five thousand monthly search queries. The presenting problem was that an Elasticsearch keyword search, although technically functional, was not what the population using it actually needed — patients and visitors phrase their questions as natural-language sentences, and a tokenised lexical search returns results that under-index on the natural phrasing in the hospital's own corpus. The voice channel exists for the same reason an enterprise intelligent-search system exists at all: callers reach for the phone instead of the keyboard, especially in the demographic that overlaps most heavily with hospital traffic, and the call-centre is overwhelmed with questions that already have public answers in published brochures.

The class of problem the stack addresses is voice-first AI agents in regulated domains. The features that distinguish this class from chat-first AI are not cosmetic. Voice removes the visual disclaimer surface: a spoken caveat is single-shot, and an elderly caller who mishears the disclaimer cannot scroll back. Voice removes the citation-audit surface: there is no URL the caller can click, no document the caller can verify, so claims must be grounded by retrieval rather than by display. Voice constrains the cognitive budget per turn by the human's tolerance for telephone silence, which empirical work pins around one second for seamless flow and ten seconds for outright loss of attention (Nielsen 1993); tail-latency dominates user experience in a way it does not on chat. Voice also re-introduces a class of bug that text channels do not have: the speech-to-text mishearing that can invert intent at the phoneme level. Dutch "Hoe wordt migraine behandeld?" (a third-person passive, clearly an informational query) and "Behandel ik migraine?" (a first-person interrogative, clearly an advice-seeking query) differ by two phonemes, and even a Flemish-tuned acoustic model can err on so close a call. The architecture must be designed so that this kind of mishearing does not silently route an advice request through to the cognitive core.

The system is not a clinical decision support tool, and the architecture is shaped by that negative scope as much as by the positive one. We do not provide diagnoses. We do not recommend dosages, treatments, or lines of care. We do not bill, schedule, or transcribe consultations. We do not handle voicemail, multiparty conferences, or any call type that would push the system outside of an informational and navigational role. This is not a deference to a regulator's nervousness; it is an engineering choice that determines which failure modes we must close down and which we are explicitly out-of-scope for. A system that is honestly bounded to "we help you find which department you need and what the visiting hours are" can defend that boundary regex by regex; a system that is loosely scoped as "we are a hospital AI" cannot.

The success-metric framework that governs the stack reflects this scoping. The first-rank metric is zero medical-advice incidents, treated as a regulatory and ethical hard constraint rather than a target. The second-rank metric is first-audio latency at the ninety-fifth percentile, measured at the caller's ear from end-of-utterance to first audible response token; tail-latency is the right denominator here, because the worst one-in-twenty caller experience is what a hospital operator and a CTO actually care about (Beyer et al. 2016 §4 codifies this for service-level objectives in a related but more general setting). The third-rank metric is category-mismatch rate, the fraction of turns where the answer's content category drifts from the caller's intent category — for example, a caller asking about parking and getting back a reimbursement explanation. The fourth-rank metric is citation-grounding rate, the fraction of substantive answer turns that trace to at least one retrieved chunk. The four metrics together form a stack: the safety floor is grounded in regulation, the latency budget is grounded in human attention, and the two quality metrics surface architectural drift before it is heard by callers. The pilot's production telemetry surface, described in section four, instruments all four; the dashboard surfaces them per-tenant on the operations panel.
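The three quantitative metrics in this framework can be made concrete with a short sketch. It assumes a hypothetical per-turn record (`TurnRecord` and its field names are illustrative and are not the pilot's telemetry schema) and uses the nearest-rank definition of a percentile, which is one of several reasonable choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnRecord:
    """Hypothetical per-turn record; field names are illustrative only."""
    first_audio_ms: float   # end-of-utterance to first audible response
    intent_category: str    # category of the caller's intent
    answer_category: str    # category of the answer's content
    substantive: bool       # True when the turn carries a factual answer
    citation_count: int     # retrieved chunks the answer traces to


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def category_mismatch_rate(turns: List[TurnRecord]) -> float:
    """Fraction of turns whose answer category differs from the intent category."""
    mismatched = sum(1 for t in turns if t.intent_category != t.answer_category)
    return mismatched / len(turns)


def citation_grounding_rate(turns: List[TurnRecord]) -> Optional[float]:
    """Fraction of substantive turns with at least one citation, or None if there are none."""
    substantive = [t for t in turns if t.substantive]
    if not substantive:
        return None
    grounded = sum(1 for t in substantive if t.citation_count >= 1)
    return grounded / len(substantive)
```

The zero-medical-advice metric is deliberately absent from the sketch: it is a hard constraint audited per incident, not a rate to be averaged.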

The architectural lineage of the voice cognitive core is retrieval-augmented generation in the sense of Lewis et al. 2020, with two important adaptations for the voice setting. The first is that the orchestrator is an agentic LLM that calls retrieval as a tool rather than a pipeline that hands retrieved context to the LLM unconditionally. The second is that the orchestrator runs inside a real-time media stack — the LiveKit Agents runtime — so the per-turn budget is not a research-paper batch latency but a phone-call live latency. Both adaptations are described in section three.

The remainder of this document is structured as follows. Section two walks the seven layers of the stack from the public switched telephone network up to telemetry, in order, with the alternatives we considered and why we rejected them. Section three is the cognitive deep-dive — the regex pre-filter, the agentic LLM with three tools, the Value Framework affinity reranker, the citation pipeline, and the structured_call structured-output pattern. Section four is reliability and observability, including the silent-failure discipline that emerged after a 2026-05-07 regression. Section five is safety and compliance, covering the two-stage-plus-disclaimer model, GDPR mapping, AI Act mapping, and adversarial-input hardening. Section six is the reproducibility kit for adjacent domains — phone support, appointment booking, and telemedicine triage. Section seven is the operational replication runbook. Section eight is the bibliography for further reading.

2. The seven-layer stack

The stack composes seven layers, each with a single responsibility. The composition is intentionally rigid; the layers do not bypass each other, and each was chosen against named alternatives. The full per-turn path is shown in figure 2.1 below.

Figure 2.1 — Per-turn path through the seven layers.

Layer 0 — PSTN telephony

The lowest layer of the stack is the public switched telephone network. The hospital owns a Belgian landline number, formatted per the international recommendation ITU-T E.164. The PSTN provider is Twilio Elastic SIP Trunk, configured to forward incoming calls as SIP INVITE over TLS to the pilot server. We chose Twilio over BICS, Voxbone, and direct termination to the Belgian incumbent (Proximus) because Twilio's Belgian number coverage, emergency-routing compliance with the Belgian regulator BIPT, and IP-allowlist hygiene are mature; the alternatives required hand-rolled emergency-routing compliance with no offsetting advantage at our scale. We did not consider managed Twilio voice flows because they tie cognition to TwiML, foreclosing the agent-runtime pattern that the rest of the stack depends on.

Layer 1 — SIP gateway

The SIP gateway is LiveKit SIP running self-hosted in a Docker container on the pilot server. The gateway terminates the inbound SIP INVITE, transcodes the carrier's μ-law eight-kilohertz audio to Opus at forty-eight kilohertz, creates one LiveKit room per call, and bridges the audio bidirectionally for the duration of the call. The protocol layer is the canonical SIP specification, RFC 3261.

We considered three deployment shapes: managed LiveKit Cloud SIP, Twilio ConversationRelay, and self-hosted livekit-sip. The decisive factor was cost-per-participant-hour. Managed LiveKit Cloud SIP charges between thirty and fifty cents per participant-hour, which works out to between three hundred seventy-five and six hundred twenty-five euros per month at the projected twenty-five thousand queries per month with average call duration around forty-five seconds. Self-hosted is zero euros marginal, paid for once with the engineering time to operate the container, and zero seconds of caller audio leaves the server except to the outbound STT and TTS vendors that are already subprocessors of record under GDPR Article 28. The runbook for the deployment is ADR-0050 and the operational documentation is Twilio + LiveKit SIP.

Layer 2 — Real-time media

Above the gateway sits LiveKit Server, the WebRTC media relay. LiveKit creates a new room per call, allows the SIP-side and the agent-side to join it as participants, and routes audio between them in real time. The same Server hosts the LiveKit Agents worker that runs the voice_agent process, eliminating a network hop between the media plane and the agent runtime. The vendor reference is the LiveKit Agents documentation.

The choice of LiveKit Agents over alternative agent runtimes (LangChain voice extensions, in-house websocket-on-asyncio) was driven by the integration surface. LiveKit Agents brings native streaming integrations for Deepgram on the inbound and ElevenLabs on the outbound, room-aware participant lifecycle, and the audio-frame buffering required to make turn-taking feel natural. Re-implementing those primitives over plain WebSockets is a four-month project that we elected not to do.

Layer 3 — Speech recognition

Caller audio is streamed to Deepgram Nova-3 for speech-to-text. The vendor announcement is the Deepgram Nova-3 release; we chose it over OpenAI Whisper for three reasons. Nova-3 has a Flemish-tuned acoustic model that handles Limburg and Brussels-Flemish phonetics measurably better in our trials than Whisper-large does — Whisper degrades on regional Dutch in a way that Deepgram does not. Nova-3 is a streaming model, where Whisper is batch by default. Nova-3 has a native LiveKit Agents plugin, where Whisper integration requires a custom audio-buffer adapter.

The non-trivial design decision at this layer is language locking at first utterance, captured in ADR-0052 and described at length in Language Locking. The voice agent uses Deepgram's multi-language mode for the very first utterance only, detects the caller's actual spoken language, then reconfigures Deepgram to that language for the duration of the call. This trade-off was forced by two empirical pilot calls. In one (5c81a578), the agent ran for forty-seven seconds in a gibberish loop before a regex fast-path caught the language mismatch; in the other (fb4b4bae), Deepgram in Dutch mode emitted zero transcripts at all when the caller spoke English mid-call, leaving no signal for any downstream detector. Multi-language Deepgram exists and would solve the silence problem, but it materially degrades Flemish accuracy: prior team measurements showed "Wat zijn de bezoekuren" degrading to "Hå at zen de bezukjuren" under the multi-language mode. The lock-and-stay policy costs us mid-call language switching (rare in our population, well under one percent of calls) and saves the Flemish accuracy of the other ninety-nine percent.
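The lock-and-stay policy reduces to a small piece of per-call state. The sketch below is a minimal illustration under stated assumptions: `LanguageLock`, the `"multi"` placeholder, and the fallback to Dutch are hypothetical, and the actual reconfiguration of the STT session is left to the caller because the vendor plugin call is not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional

# Languages named elsewhere in this document; treated here as an assumption.
SUPPORTED = {"nl", "en", "fr", "it"}
MULTI = "multi"  # placeholder for whatever value selects multi-language STT


@dataclass
class LanguageLock:
    """Per-call state: multi-language for the first utterance, then locked."""
    default: str = "nl"
    locked: Optional[str] = None

    @property
    def stt_language(self) -> str:
        """Language the STT session should currently be configured with."""
        return self.locked if self.locked is not None else MULTI

    def observe(self, detected: Optional[str]) -> bool:
        """Record a detected language.

        Returns True only when this observation locked the language, which is
        the signal for the caller to reconfigure the STT session once.
        """
        if self.locked is not None:
            return False  # lock-and-stay: later detections are ignored
        self.locked = detected if detected in SUPPORTED else self.default
        return True
```

Ignoring later detections is what costs mid-call language switching; it is also what keeps the single-language model active for the rest of the call.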

Layer 4 — Cognition

The cognitive core is hosted in the backend. The voice agent forwards each transcript over a WebSocket to the FastAPI backend with the request envelope QueryRequest{channel:"voice", detected_language, query, ...}. The backend dispatches the request to VoiceLLMOrchestrator, which executes a regex pre-filter, an agentic LLM with three tools (described in detail in section three), a regex post-filter, an answer-shaper that converts text-channel formatting to voice-audible prose, and a medical-content disclaimer prepender. The orchestrator is the only voice path; the legacy eight-stage VoiceOrchestrator (Phase A) was deleted in commit 158d793 along with the dialogue manager, the speculative-STT cache, and approximately seven thousand lines of supporting code, per ADR-0049 and ADR-0051. The structured-output discipline at every LLM call site is enforced by the structured_call helper (detailed in section three).
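The stage ordering described above can be sketched as a single function. This is an illustration, not the VoiceLLMOrchestrator itself: `run_turn` and the stage callables are hypothetical, only the three envelope fields named in the text are modelled, and the sketch assumes that a terminal pre-filter response bypasses the later stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueryRequest:
    """Only the envelope fields named in the text; the rest are elided."""
    query: str
    detected_language: str
    channel: str = "voice"


PreFilter = Callable[[str], Optional[str]]  # templated reply, or None to fall through
Agent = Callable[[QueryRequest], str]       # stand-in for the agentic LLM call
Stage = Callable[[str], str]                # text-to-text post-processing stage


def run_turn(
    request: QueryRequest,
    pre_filter: PreFilter,
    agent: Agent,
    post_stages: List[Stage],
) -> str:
    """Run one turn: pre-filter, then agent, then post stages in order."""
    terminal = pre_filter(request.query)
    if terminal is not None:
        # Terminal class: the agent is never invoked for this turn.
        return terminal
    answer = agent(request)
    # Expected order: regex post-filter, answer-shaper, disclaimer prepender.
    for stage in post_stages:
        answer = stage(answer)
    return answer
```

The property worth preserving in any port is that every stage after the agent is a pure text-to-text function, so the final string is fully assembled before synthesis begins.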

This layer is the most differentiated part of the stack and is treated end-to-end in section three. For a layered understanding: the orchestrator's only direct dependency on the rest of the system is RAGService.query_stream, which it invokes through the search_hospital_kb tool with channel="voice" set. The retrieval pipeline itself is the same one the chat channel uses, in the architectural lineage of Lewis et al. 2020 and the modular RAG taxonomy of Gao et al. 2024.

Layer 5 — Speech synthesis

Outbound audio is rendered by ElevenLabs Multilingual v2; the model card is the vendor documentation. ElevenLabs synthesises Dutch, English, French, and Italian at a quality that we found materially better than Azure Cognitive Services and Google Cloud TTS in side-by-side listening tests for our specific population (Flemish-Dutch primary). The lineage of neural TTS architectures back through to Tacotron determines how we prepare input text: prosody and pacing emerge from the model's attention over the input rather than from rule-based prosodic markup, so we control prosody by rewriting input text rather than by emitting SSML. ElevenLabs Multilingual v2 explicitly does not honour SSML break tags; punctuation is the only prosody lever the model exposes, and we use commas, periods, and ellipses to inject the pacing we need. The detail of this is in Prosody Injection and Adaptive TTS Speed.

A second TTS-adjacent feature is adaptive TTS speed, which composes three signals — explicit caller request to slow down, distress detection on the inbound transcript, and the caller's own words-per-minute bucket — into a single speed parameter on the ElevenLabs voice settings. The composition is additive-with-clamp (clamp(discrete + offset, 0.70, 1.00)) rather than multiplicative, because multiplicative composition can cascade into audibly-degraded speech under stacked discounts. Elderly callers, who are the dominant demographic of hospital helpdesk traffic, frequently speak at well under a hundred and ten words per minute; matching their pace at a small offset reads as attuned without crossing into mockery.
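The additive-with-clamp rule is small enough to state directly. In the sketch below the function names are hypothetical, and the reading of `discrete` as the speed selected by an explicit request or distress signal and `offset` as the words-per-minute adjustment is an assumption; the bounds are the ones given above.

```python
def compose_speed(discrete: float, offset: float,
                  lo: float = 0.70, hi: float = 1.00) -> float:
    """Additive-with-clamp: clamp(discrete + offset, lo, hi)."""
    return max(lo, min(hi, discrete + offset))


def compose_speed_multiplicative(*factors: float) -> float:
    """The rejected alternative, shown for contrast: discounts multiply unchecked."""
    speed = 1.0
    for factor in factors:
        speed *= factor
    return speed
```

Three stacked discounts of 0.85 multiply to roughly 0.61, below the 0.70 floor, whereas the additive form can never leave the clamped range regardless of how many signals fire.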

Layer 6 — Conversation memory and telemetry

The seventh layer is persistent state. Every turn of every call is written to Postgres in two structured tables: conversation_messages carries the human-readable record (caller utterance, agent answer, timestamp, language, conversational intent, citations) and pipeline_telemetry carries the operational signal (per-stage latency, retrieval cardinality, intent class, primary content category, category-mismatch rate). A third table, category_mismatch_telemetry (migration 066), records per-turn the Value Framework's mismatch metric so that operators can see categorical drift in time-series form.

The telemetry surface is consumed by two readers. The first is the operations dashboard on the frontend, which renders per-tenant trend charts on a Costs tab, including the Category Mismatch Trend and the Diagnostic Accuracy Trend. The second is the diagnostic V2 endpoint (POST /api/v1/query?response_format=v2), which renders per-dimension scoring (correctness, safety, memory, tool_use, latency) for individual calls during evaluation runs. The reliability framing for the metrics is Beyer et al. 2016: latency SLOs are written at the tail (p95, p99), not at the mean, because the worst one-in-twenty caller experience is the one that operators and CTOs actually care about. Section four returns to the telemetry surface in detail.

Layer trade-offs at a glance

| Layer | Chosen | Alternative considered | Rejected because |
|---|---|---|---|
| L0 PSTN | Twilio Elastic SIP | BICS / Voxbone / direct Belgian incumbent | Twilio BIPT compliance + IP-allowlisting is mature; alternatives required hand-rolled emergency-routing |
| L1 SIP | Self-hosted livekit-sip | LiveKit Cloud SIP | $375–625/month at scale; data sovereignty preference |
| L2 Media | LiveKit Server + Agents | LangChain voice / hand-rolled WS | Re-implementing audio-frame buffering, per-room lifecycle, and vendor plugins is multi-month work with no offsetting advantage |
| L3 STT | Deepgram Nova-3 (locked) | Whisper-large; Nova-3 multi-language | Whisper degrades on Flemish; multi-language degrades Flemish accuracy across the board |
| L4 Cognition | Agentic LLM with three tools | Eight-stage deterministic pipeline; thin pipeline without agent | Pipeline accreted ~7000 LOC of dead state; thin-without-agent could not handle compound queries |
| L5 TTS | ElevenLabs Multilingual v2 | Azure / Google TTS; ElevenLabs v3 | v2 has stronger nl/fr/it voices than v3 for our population; SSML support traded away for voice quality |
| L6 Memory + telemetry | Postgres tables + dashboard | Time-series DB (Influx, Prometheus) | Pilot scale (≤25K queries/mo) does not justify dedicated TSDB; Postgres is already on the bill |

Cascade vs. voice-native (speech-to-speech) — and why this stack is a cascade

The single biggest architectural fork is whether to run a cascade (the seven-layer pipeline above: STT → LLM+RAG → deterministic post-processing → TTS, half-duplex) or a voice-native speech-to-speech model (audio in → audio out through one model, full-duplex turn-taking — e.g. OpenAI gpt-realtime, Gemini Live). The choice is a direct trade of control against latency and naturalness.

Latency. Voice-native wins, and it is not close. A generic cascade's mouth-to-ear budget is additive across stages: STT ≈350 ms, LLM time-to-first-token ≈375 ms, TTS time-to-first-byte ≈100 ms, plus media/buffering hops, for a median of ≈1.1 s (Twilio, Nov 2025). Voice-native models collapse those hops into one pass: 0.90 s for OpenAI gpt-realtime, 1.14 s for Gemini Live, and 1.15 s for xAI under realistic audio (τ-Voice, arXiv 2603.13686), and a research S2S system reports 81 ms P90 onset vs 1,091 ms for its cascaded baseline. Human-natural full-duplex conversation wants sub-400 ms (FLEXI, arXiv 2509.22243), a bar no cascade reaches. ZOL's own pilot is far slower than even the generic cascade budget: measured time-to-first-audio is p50 ≈ 5.9 s, p95 ≈ 16.4 s (fresh turns 7–8 s), because ZOL adds two stages the generic budget omits: an agentic tool-decision LLM call (~2.5–3.3 s in the trace) and RAG retrieval+rerank (~0.6–1 s). Cached/deterministic turns return in ~1–2 s, so a prefetch cache reclaims much of the gap (VoiceAgentRAG, arXiv 2603.02206, reports a 316× speedup on cache hits; ZOL already runs a prefix-warm cache), but the cascade structurally cannot match a duplex model on a fresh turn.
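The additive budget can be checked with simple arithmetic. The sketch below uses only the figures quoted in the paragraph above; the variable names are illustrative, and the hop residual is derived from the quoted median, not separately measured.

```python
# Generic cascade stages quoted above, in milliseconds.
GENERIC_STAGES_MS = {"stt": 350, "llm_ttft": 375, "tts_ttfb": 100}
GENERIC_MEDIAN_MS = 1100  # quoted median mouth-to-ear total

named_ms = sum(GENERIC_STAGES_MS.values())
# Residual attributed to media/buffering hops (derived, not measured).
implied_hops_ms = GENERIC_MEDIAN_MS - named_ms

# The two stages ZOL adds, as (low, high) ranges from the trace figures.
ZOL_EXTRA_MS = {
    "tool_decision_llm": (2500, 3300),
    "rag_retrieve_rerank": (600, 1000),
}
extra_low = sum(low for low, _ in ZOL_EXTRA_MS.values())
extra_high = sum(high for _, high in ZOL_EXTRA_MS.values())

# Generic budget plus the two named extra stages only.
fresh_turn_estimate_ms = (GENERIC_MEDIAN_MS + extra_low,
                          GENERIC_MEDIAN_MS + extra_high)

print(named_ms, implied_hops_ms, fresh_turn_estimate_ms)
```

The estimate accounts only for the stages named in the text. The measured fresh-turn figures are higher, so any port of this stack should attribute latency from traces and not from a summed budget.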

Control — the cascade wins, and this is decisive for a hospital line. The cascade assembles and screens the exact output text (citation derivation, medical-advice regex, disclaimer prepend) before a single byte reaches TTS. Voice-native guardrails cannot do this: in OpenAI's Agents SDK, realtime agents support output guardrails only, they run on debounced transcript (≈100-char default, not every token), and they fire after audio is already buffered or playing — the docs state plainly that on a tripwire you must "stop local playback immediately, because … some audio may already be buffered when the tripwire fires." That is reactive, not pre-emptive. Load-bearing for ZOL specifically: realtime output guardrails are JS/TS-only; the Python Agents SDK realtime path does not yet support them (openai-agents-python #1912) — and ZOL is a Python/FastAPI stack.

Reliability — voice-native is not yet good enough for a hard bar. Obeying "never give medical advice, always cite" is an instruction-following task, and gpt-realtime scores only 30.5 % on the audio MultiChallenge benchmark. On grounded multi-step tasks, full-duplex voice agents complete 26–51 % vs 85 % for the best text agent (τ-Voice), with 79–90 % of failures agent-attributed — including hallucinated completions ("I've updated your address" with no tool call). And safety regresses in the audio modality: identical malicious prompts that a model refuses 86 % of the time as text are refused only 37 % as speech — a ~49-point drop (VoiceBench, arXiv 2410.17196).

Hybrid is feasible but does not restore determinism. gpt-realtime supports function/MCP tool calls, so a voice-native model can call a RAG retriever, and the JS SDK offers a first-class RealtimeOutputGuardrail. But the SDK concedes guardrails "often" (not always) cut off unsafe output before the listener hears it — best-effort timing, not a guarantee. A hybrid narrows the determinism gap; it does not close it.

Decision framework. Stay cascade when the spoken output carries a hard, non-negotiable constraint (medical-advice refusal, mandatory citations, auditability) — i.e. the ZOL hospital line. Consider voice-native when the output constraint is soft and latency/naturalness dominate the UX — e.g. an appointment-booking line (see §6.2), where control can be concentrated at the tool boundary rather than over every spoken sentence. ZOL should re-evaluate the switch only when a duplex (or hybrid) model can demonstrate, on a Dutch hospital-corpus benchmark: (1) pre-speech-deterministic refusal/grounding (guardrails that gate audio before playback), and (2) closure of the multi-step task-completion gap — and when the Python realtime SDK gains output-guardrail support. The latency/naturalness gains alone do not justify accepting non-deterministic safety on a hospital line.

Fast-moving — re-benchmark before any go/no-go

These figures pin Aug-2025 / early-2026 models. Successors are improving instruction-following materially (a reported GPT-Realtime-2 reaches ~48.5 % on Audio MultiChallenge vs 30.5 %), so the gap is narrowing. Treat the numbers as a snapshot and re-measure against a hospital-relevant eval before deciding.

A third option: the full-duplex cascade

The choice is not strictly binary. Most of a duplex model's felt naturalness comes not from raw latency but from turn-taking — the caller can interrupt (barge-in), the agent does not force the caller to wait for a full utterance, neither party is blocked. A cascade can adopt those turn-taking behaviours while keeping its defining property: the deterministic text checkpoint before TTS. Call this a full-duplex cascade — the seven-layer pipeline plus (a) VAD-based barge-in (the caller can speak over the agent; TTS stops and the agent re-listens), (b) streaming STT + streaming TTS so the agent speaks sentence-by-sentence as text is produced (ZOL already runs this — voice_streaming_enabled=true), and (c) latency masking via the prefix-warm cache and the filler ladder.

What it buys, and what it cannot:

  • Buys: the interruption / turn-taking naturalness callers actually notice — without surrendering the pre-speech text gate. Grounding, citations, and hard refusal stay deterministic.
  • Cannot: close the raw fresh-turn latency gap. A serial STT → LLM → RAG → TTS chain cannot reach the sub-400 ms full-duplex tier; barge-in changes whether the caller is blocked, not how fast the first token arrives. The prefetch cache attacks latency; barge-in attacks turn-taking — they are complementary, not substitutes.
  • Costs: real engineering complexity exactly where this team has been before — VAD precision, echo cancellation, and distinguishing a genuine interruption from a backchannel ("mhm", "ja"). Getting that wrong makes the agent feel jumpy; it is the surface the earlier acknowledgment/filler experiments probed.

For a regulated line, the full-duplex cascade is the recommended roadmap option: it is the only one of the three that improves naturalness while preserving cascade-grade control. The trade it cannot make is raw-latency parity with speech-to-speech — and for a hospital information line bound by a zero-medical-advice constraint, that is the correct trade. The standing re-evaluation trigger from the previous subsection still applies: adopt a black-box voice-native model only once it can demonstrate pre-speech-deterministic grounding/refusal and close the multi-step task-completion gap on a Dutch hospital benchmark.

3. Cognition deep-dive

The cognitive core is the most differentiated part of the stack and is therefore the part most worth describing in detail. It is also the part most likely to need adaptation when the stack is replicated to a new domain. The decisions in this section are load-bearing for both the safety story (section five) and the replicability story (section six).

The architecture is agentic-only, captured in ADR-0051. A single GPT-4.1 agent with three tools is the cognitive layer, fronted by a deterministic regex pre-filter and trailed by a deterministic regex post-filter and a deterministic answer-shaper. There is no dialogue manager. There is no intent classifier in the legacy sense — the LLM agent picks tools on a per-turn basis. There is no fallback path from the agent to a non-agentic pipeline; if the LLM fails, the orchestrator returns conversational_intent="escalate" and the voice agent transfers the caller to the helpdesk via SIP REFER. This is a narrower contract than the previous architecture exposed, and it is the one we ship.

Figure 3.1 — Per-turn cognitive flow.

Stage 1 — The regex pre-filter

The pre-filter is voice_thin_pre_filter.classify_terminal() (backend/app/services/voice/voice_thin_pre_filter.py:338). Every caller utterance, after STT and after PII redaction, is run through a deterministic regex classifier before the LLM is invoked. The classifier returns one of seven values from the TerminalClass enum (voice_thin_pre_filter.py:41):

SAFETY_REFUSAL, HANDOFF_REQUEST, REPEAT_REQUEST, OFF_TOPIC_PERSONAL, FAREWELL, GREETING, and FALLTHROUGH. The first six terminate the turn at the pre-filter — a fixed templated response is emitted, the LLM is never invoked, and the turn closes. The seventh hands the utterance to the LLM agent.

The precedence ordering matters and is the safety-critical part of the pre-filter. A caller who says "Bedankt voor de informatie, kunt u me toch nog even doorverbinden?" must be classified as HANDOFF_REQUEST, not FAREWELL — the cascade is structured so that handoff and safety fire before social closers. The Dutch safety pattern at line 170 catches dosage and prescription phrasings (hoeveel … nemen with a forty-character gap, welke medicatie / pil / medicijn, welk medicijn, welke dosis). Equivalent patterns exist for English, French, and Italian in pattern packs of seventy-seven, eighty-eight, and one hundred two alternations respectively; the multi-language coverage is non-optional, because a tenant whose patient population speaks French must have French dosage-ask coverage from day one. The framing for turn-taking and conversational structure that the cascade implements is the canonical Sacks, Schegloff, and Jefferson 1974 — voice agents implement (or violate) the turn-taking organisation of natural conversation whether their designers know it or not.
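The precedence rule can be illustrated with a first-match-wins cascade. The sketch below is not the production classifier: the function name and the toy Dutch patterns are hypothetical, and the relative order of the classes after safety and handoff is assumed to follow the order in which the enum values are listed.

```python
import re
from enum import Enum


class TerminalClass(Enum):
    SAFETY_REFUSAL = "safety_refusal"
    HANDOFF_REQUEST = "handoff_request"
    REPEAT_REQUEST = "repeat_request"
    OFF_TOPIC_PERSONAL = "off_topic_personal"
    FAREWELL = "farewell"
    GREETING = "greeting"
    FALLTHROUGH = "fallthrough"


def _rx(pattern: str) -> "re.Pattern[str]":
    return re.compile(pattern, re.IGNORECASE)


# Toy patterns for illustration; list order IS the precedence.
_CASCADE = [
    (TerminalClass.SAFETY_REFUSAL, _rx(r"\bwelke\s+(medicatie|pil|dosis)\b|\bwelk\s+medicijn\b")),
    (TerminalClass.HANDOFF_REQUEST, _rx(r"\bdoorverbinden\b")),
    (TerminalClass.REPEAT_REQUEST, _rx(r"\bherhalen\b")),
    (TerminalClass.OFF_TOPIC_PERSONAL, _rx(r"\bhoe oud ben (je|jij|u)\b")),
    (TerminalClass.FAREWELL, _rx(r"\b(bedankt|tot ziens)\b")),
    (TerminalClass.GREETING, _rx(r"^\s*(hallo|goedemorgen|goedemiddag)\b")),
]


def classify_terminal_sketch(utterance: str) -> TerminalClass:
    """Return the first class whose pattern matches, else FALLTHROUGH."""
    for terminal_class, pattern in _CASCADE:
        if pattern.search(utterance):
            return terminal_class
    return TerminalClass.FALLTHROUGH
```

Because the handoff pattern is tested before the farewell pattern, an utterance that thanks the agent and then asks to be connected resolves to HANDOFF_REQUEST even though both patterns match.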

The pre-filter is the system's hardest safety guarantee. A recognised dosage ask cannot reach the LLM at all; it is intercepted by the regex and answered with a fixed refusal-plus-helpdesk-offer template. The forty-character bounded gap on the Dutch hoeveel … nemen pattern is a deliberate cap on blast radius: a sentence with forty-one or more characters between "hoeveel" and "nemen" is more likely a navigational question ("hoeveel tijd nemen jullie voor een eerste consultatie?" — how long do you take for a first consultation) than a dosage question ("hoeveel moet ik daarvan nemen?").
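The effect of a bounded gap can be seen in a simplified pattern. The regex below is an assumption written for illustration and is not the production pattern at line 170; it keys only on the distance between the two words, so any further anchoring the real pattern applies is not represented.

```python
import re

# Simplified stand-in: at most forty characters between the two words.
DOSAGE_GAP = re.compile(r"\bhoeveel\b.{0,40}\bnemen\b", re.IGNORECASE)


def looks_like_dosage_ask(utterance: str) -> bool:
    """True when 'hoeveel' is followed by 'nemen' within the bounded gap."""
    return DOSAGE_GAP.search(utterance) is not None


# Short gap: the dosage phrasing quoted above is intercepted.
short_gap = "Hoeveel moet ik daarvan nemen?"

# Long gap: the two words co-occur incidentally in a longer sentence.
long_gap = ("Hoeveel parkeerplaatsen zijn er bij de hoofdingang "
            "en welke bus kan ik het beste nemen?")
```

An unbounded `.*` in place of `.{0,40}` would intercept the second sentence as well, which is the blast radius the cap is meant to limit.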

Stage 2 — The agentic LLM with three tools

When the pre-filter returns FALLTHROUGH, the orchestrator invokes GPT-4.1 with a system prompt and a three-tool schema (voice_llm_orchestrator.py:63 defines _TOOLS). The tools are:

| Tool | Purpose | Safety role |
|---|---|---|
| search_hospital_kb | Wraps RAGService.query_stream(channel="voice"); returns voice-shaped answer + citations + a found boolean | Forces claims to be grounded in retrieved chunks rather than invented from training data |
| transfer_to_helpdesk | Short-circuits to a SIP REFER-style escalation | Allows the agent to bail out when it cannot safely answer |
| end_call | Closes the call after an explicit goodbye marker | Prevents the LLM from hanging up on intermediate utterances |

Three controls keep the LLM honest at this stage. The first is system-prompt invariants: app.prompts.build_voice_llm_orchestrator_system_prompt establishes three hard rules (never answer ZOL-specific facts from training data, never give medical advice, and keep responses to one or two sentences). Per the OWASP LLM Top 10 practitioner taxonomy, this is LLM01 (prompt injection) mitigation by design: explicit invariants restated at every turn. The second is tool-grounded retrieval: the prompt is shaped to require search_hospital_kb for any factual claim. When the search returns found=False twice consecutively in a single turn, the orchestrator force-transfers to the helpdesk (voice_llm_orchestrator.py:519-556); this defends against the gibberish-rephrase loop pattern observed in pilot traffic on 2026-05-07. The third is the iteration cap voice_llm_orchestrator_max_tool_iterations (default three). On overflow the orchestrator emits a fixed transfer text rather than continuing to spend tokens.
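The second and third controls can be sketched as a bounded loop. This is a simplified illustration under stated assumptions: `run_tool_loop`, `ToolCall`, `SearchResult`, and the transfer text are hypothetical, the `next_call` callable stands in for the LLM's tool choice, and a successful search is treated as ending the turn, which the real orchestrator need not do.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

TRANSFER_TEXT = "Ik verbind u door met de helpdesk."  # placeholder wording


@dataclass
class ToolCall:
    name: str           # one of the three tool names
    argument: str = ""


@dataclass
class SearchResult:
    found: bool
    answer: str = ""


def run_tool_loop(
    next_call: Callable[[Optional[SearchResult]], ToolCall],
    search: Callable[[str], SearchResult],
    max_tool_iterations: int = 3,
) -> Tuple[str, str]:
    """Return (outcome, text), where outcome is answer, escalate, or end_call."""
    consecutive_misses = 0
    last: Optional[SearchResult] = None
    for _ in range(max_tool_iterations):
        call = next_call(last)
        if call.name == "end_call":
            return "end_call", ""
        if call.name != "search_hospital_kb":
            # transfer_to_helpdesk, or anything unrecognised: escalate.
            return "escalate", TRANSFER_TEXT
        last = search(call.argument)
        if last.found:
            return "answer", last.answer
        consecutive_misses += 1
        if consecutive_misses >= 2:
            # Two consecutive found=False results: force the transfer.
            return "escalate", TRANSFER_TEXT
    # Iteration cap reached: fixed transfer text, no further LLM calls.
    return "escalate", TRANSFER_TEXT
```

Both exits are deterministic: the loop never asks the model whether it would like to keep trying.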

Structured output at every LLM call site is enforced by the structured_call helper (app.llm.structured, ~190 LOC over the raw AsyncOpenAI client): a typed-output wrapper with a schema validator and a bounded retry budget. It superseded raw OpenAI chat completion plus response_format={"type":"json_object"} plus manual json.loads() across eight LLM call sites, including IntentClassificationService.classify_intent, ConversationClassifierService, and VoiceTurnEvaluator. (A Pydantic AI Agent of the same shape was trialed on 2026-05-09 but removed 2026-05-12, commit b8d8da67, after production telemetry showed its Agent.run() added ~720 ms per call — the load-bearing case study behind the methodology v2.3 Brainstorm Gate; see Decision-Cost Rubric.) The change is invisible at the API boundary, but it eliminates a class of silent failures: malformed JSON no longer reaches downstream code, because the helper retries on invalid output and raises a typed StructuredCallError on exhausted retries, with a defensive _legacy_* fallback path at each call site as a defence-in-depth signal.
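The retry-and-raise contract described above can be sketched without any client library. The version below is synchronous and takes a plain callable in place of the AsyncOpenAI call, so it shows only the control flow; `structured_call_sketch`, `classify_with_fallback`, and the fallback value are hypothetical, and only the StructuredCallError name comes from the text.

```python
import json
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")


class StructuredCallError(Exception):
    """Raised when no valid output is produced within the retry budget."""


def structured_call_sketch(
    complete: Callable[[], str],
    validate: Callable[[Any], T],
    max_retries: int = 2,
) -> T:
    """Call the model, parse JSON, validate; retry on any invalid output."""
    attempts = max_retries + 1
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        raw = complete()
        try:
            return validate(json.loads(raw))
        except (KeyError, TypeError, ValueError) as exc:
            # json.JSONDecodeError is a ValueError, so malformed JSON lands here.
            last_error = exc
    raise StructuredCallError(f"no valid output after {attempts} attempts") from last_error


def classify_with_fallback(complete: Callable[[], str]) -> str:
    """Call-site pattern: typed result, or a defensive fallback value."""
    try:
        return structured_call_sketch(complete, lambda d: str(d["intent"]))
    except StructuredCallError:
        return "unknown"  # stand-in for a call site's _legacy_* fallback path
```

The design point is that downstream code sees either a validated value or a typed exception, never a half-parsed string.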

The Value Framework — affinity rerank, primary-category guard, unit-mismatch admission

The Value Framework is the customer-facing differentiator of the retrieval pipeline. It exists to close a class of cross-category contamination that pure vector similarity cannot detect. The motivating regression is documented in Value Framework: a caller asking about wheelchair accessibility at the hospital entrance received a confused reimbursement-process explanation, because the orthopedic chunk about wheelchair-prescription reimbursement lexically over-scored on "rolstoel" (wheelchair) and outranked the parking/accessibility chunk. The pgvector cosine similarity (@pgvector_docs) on a 3072-dim embedding (@openai2024embeddings) was doing its job; the problem was that similarity alone could not distinguish "I want to park my wheelchair" from "I want to claim reimbursement for my wheelchair prescription."

We considered three retrieval-side mitigations and rejected each. BM25 keyword search was already part of the hybrid retriever and made the contamination worse: keyword recall surfaced the regulatory chunk on "rolstoel" harder, not softer. Reciprocal Rank Fusion across vector and BM25 preserved the contamination, because both branches independently rank the regulatory chunk high; rank fusion without normalisation cannot distinguish a chunk that ranks high correctly from one that ranks high incorrectly. A BERT-style cross-encoder reranker would help, but at fifty milliseconds per inference on a Jina v2 cross-encoder over the candidate set, it adds material latency to a voice turn and still has no signal that "parking" and "reimbursement" are different categories of fact.

The Value Framework is structurally a categorical-affinity multiplier applied to the existing similarity score. It is descended from the rank-fusion lineage but extends it with a per-intent-class × per-category multiplier matrix. In the modular-RAG taxonomy of Gao et al. 2024, it is a specialised retrieval-augmentation module orchestrated alongside the existing retriever, not replacing it. It is structurally adjacent to the knowledge-graph + vector hybrids surveyed in Sarmah et al. 2024, although the "graph" we use is the Postgres taxonomy rather than a dedicated graph DB.

The framework runs three operations per turn. Affinity reranking classifies each retrieved chunk into one of six categories (practical, clinical_info, regulatory, appointments, legal_admin, general) using multi-language word-boundary regex. The classification is hospital-agnostic — the keyword sets are linguistic, not specific to ZOL — and is cached on the chunk dict so subsequent operations read the same classification. The chunk's similarity score is then multiplied by an entry from the seven-intent × six-category affinity matrix (the default matrix is reproduced in section seven). The maximum boost is 1.30 and the maximum penalty is 0.55; the boundary values were chosen so that even on a worst-case ranking inversion, the boost can flip the order. Primary-category election identifies the dominant category among the top five reranked chunks by cumulative similarity (not chunk count) and feeds it into the LLM prompt as an instruction not to fuse across categories — a prompt-level guard that pairs with the score-level rerank to handle the residual cases where mixed retrieval still occurs. Unit-mismatch admission detects when the query asks about a per-minute or per-session unit (parking tariffs) but the top chunks discuss per-kWh or per-item pricing (EV charger costs); when this fires, the framework injects a structured [UNIT MISMATCH] note into the context, instructing the LLM to say so explicitly rather than silently transposing units.
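The first two operations can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production module: the keyword sets, the intent name practical_question, and the single matrix row are hypothetical, and only the 1.30 and 0.55 bounds come from the text. The election here sums the reranked score, which is one reading of "cumulative similarity".

```python
import re
from collections import defaultdict

# Illustrative keyword sets only; the production multi-language packs are not shown here.
CATEGORY_PATTERNS = {
    "practical": re.compile(r"\b(parking|parkeren|ingang|entrance)\b", re.IGNORECASE),
    "regulatory": re.compile(r"\b(terugbetaling|reimbursement|voorschrift)\b", re.IGNORECASE),
}

# Hypothetical single row of the intent x category matrix (bounds 1.30 / 0.55 from the text).
AFFINITY = {"practical_question": {"practical": 1.30, "regulatory": 0.55}}


def classify(chunk):
    """Classify once and cache the category on the chunk dict."""
    if "category" not in chunk:
        chunk["category"] = next(
            (name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(chunk["text"])),
            "general",
        )
    return chunk["category"]


def rerank(chunks, intent):
    """Multiply each similarity by the affinity entry, then sort by the new score."""
    row = AFFINITY.get(intent, {})
    for chunk in chunks:
        chunk["score"] = chunk["similarity"] * row.get(classify(chunk), 1.0)
    return sorted(chunks, key=lambda c: c["score"], reverse=True)


def primary_category(ranked, top_k=5):
    """Elect the category with the largest cumulative score, not the largest chunk count."""
    totals = defaultdict(float)
    for chunk in ranked[:top_k]:
        totals[chunk["category"]] += chunk["score"]
    return max(totals, key=totals.get)


chunks = [
    {"text": "Terugbetaling van een rolstoel op voorschrift", "similarity": 0.82},
    {"text": "Rolstoeltoegankelijke ingang en parking", "similarity": 0.78},
]
ranked = rerank(chunks, "practical_question")
```

In this toy input the reimbursement chunk starts ahead on raw similarity (0.82 against 0.78) and ends behind after the multiplier is applied, which is the ordering flip the boundary values are meant to guarantee.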

The framework's per-turn telemetry write to app.category_mismatch_telemetry (migration 066) is what makes the operational story tractable. The write captures intent class, primary category, mismatch rate as a fraction of top-five chunks below the boost threshold, total chunks evaluated, and a 200-character query preview (PII-redacted on the voice path before logging). A sustained spike in mismatch rate — for example, sixty percent for fifty consecutive turns — indicates an emerging query class that the affinity table does not cover, and the operational fix is adding a new intent row to the default affinity map. The full latency overhead of the framework is in the five-to-twenty-five millisecond range per turn on dev hardware, which is a small line item against the six-hundred-to-twelve-hundred millisecond RAG inner loop.

The wheelchair-regression test suite at backend/tests/integration/services/test_value_framework_wheelchair_regression.py is the codified contract for the framework — eight tests that pin the end-to-end behaviour for the original conflation scenario plus the unit-mismatch detector. This is the R2 silent-failure-discipline regression-pin (see section four): a test that lives WITH the fix, asserts the user-visible post-state, and would catch the regression on day one if it returned.

The citation pipeline — chunk-direct fallback for the voice channel

Voice answers contain no inline [N] citation markers. The voice system prompt explicitly strips them, because TTS reads [1] as "open bracket one close bracket". This is the right design choice for the user, but it created a silent cascade failure that affected every voice turn until 2026-04 when it was fixed across three commits.

The chat channel's citation extractor _qs_extract_citations pattern-matches [N] markers in the answer text. Without markers, it returned an empty list. The dedup helper _qs_deduplicate_citations saw an empty list and returned an empty list. The cache writer _qs_write_citation_cache wrote (answer_hash, citations=[]) to the semantic-query cache. Future cache hits on the same query returned citations=[]. The v2 diagnostic endpoint then read empty citations from the cache, failed schema validation on dimensional_scores={}, and silently fell back to v1 rendering. Nothing in this cascade raised an exception; every function behaved correctly for its stated inputs; the failure was architectural — the chat path's assumption (markers exist) propagated into the voice path without a guard.

The fix landed in three commits. Commit d130df74 added a voice fallback in _qs_finalize: when channel == "voice" and the marker extractor returns empty, citations are derived directly from the retrieved chunks that were used to build the context. Commit 3cd5cc2f made _qs_deduplicate_citations skip the dedup step on chunk-derived citations cleanly. Commit 11a51ab2 added the R1 log line voice_citations_written, count=... immediately before the cache write, so a count of zero is now a visible signal in the log stream, not a silent empty write. After any change to the citation pipeline the semantic-query cache must be flushed, because stale cache entries from before the fix will continue to serve citations=[] on cache hits until they expire.
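The shape of the voice fallback can be sketched as follows. The function names, the chunk layout (a dict with a source key), and the log format are assumptions made for illustration; the real logic lives in _qs_finalize and its helpers, which are not reproduced in this text.

```python
import logging
import re

logger = logging.getLogger("citations")
_MARKER_RE = re.compile(r"\[(\d+)\]")


def extract_marker_citations(answer, chunks):
    """Chat-path behaviour: map each [N] marker to the N-th retrieved chunk."""
    indices = sorted({int(n) for n in _MARKER_RE.findall(answer)})
    return [chunks[i - 1]["source"] for i in indices if 1 <= i <= len(chunks)]


def finalize_citations(answer, chunks, channel):
    citations = extract_marker_citations(answer, chunks)
    if channel == "voice":
        if not citations:
            # Voice answers carry no markers, so cite the chunks that built the context.
            citations = list(dict.fromkeys(chunk["source"] for chunk in chunks))
        # R1-style size log immediately before the (omitted) cache write.
        logger.info("voice_citations_written count=%d", len(citations))
    return citations


retrieved = [{"source": "parking.pdf"}, {"source": "toegankelijkheid.pdf"}]
voice_citations = finalize_citations("Parking P3 ligt naast de ingang.", retrieved, "voice")
```

The point of the guard is that the chat assumption (markers exist) is checked at the place where the channels diverge, so an empty list on the voice path can only mean that retrieval itself returned nothing.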

This is the kind of regression that the silent-failure discipline (section four) is designed to surface. The fix landed with the regression-pinning test, and the R1 log line means an operator can spot a recurrence from logs alone without querying the database.

The structured_call structured-output pattern

The eight LLM call sites that return structured JSON share a uniform pattern: the structured_call(prompt, output_model) helper (app.llm.structured) wraps the OpenAI client with response_format=json_object, validates via pydantic.BaseModel.model_validate_json(), and retries on a ValidationError before raising a typed StructuredCallError, with a defensive _legacy_* raw-OpenAI fallback. This eliminated three classes of bug. Malformed JSON no longer reaches downstream code, because the helper retries on invalid output and raises on exhausted retries. The pre-helper code accreted defensive try / except json.JSONDecodeError blocks that masked malformed-output rates; the current code surfaces them at the operational signal layer. And the VoiceTurnEvaluator (the per-turn LLM-as-judge scorer that feeds the v2 diagnostic) used to occasionally produce a malformed score that silently fell back to v1 rendering with dimensional_scores={}; the helper eliminates that failure mode. (A Pydantic AI Agent of the same shape was trialed on 2026-05-09 but removed 2026-05-12, commit b8d8da67, after telemetry showed Agent.run() added ~720 ms per call — see Decision-Cost Rubric.)
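A stripped-down sketch of the validate-and-retry loop is shown below. It is an assumption-laden stand-in rather than the helper itself: the LLM client is replaced by an injected complete callable, the Pydantic model by a hand-written parser, and the signature differs from structured_call(prompt, output_model). Only the retry-then-raise behaviour and the StructuredCallError name follow the text.

```python
import json
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class StructuredCallError(Exception):
    """Raised when every attempt produced invalid output."""


@dataclass
class IntentResult:  # hypothetical output schema
    intent: str
    confidence: float


def parse_intent(raw: str) -> IntentResult:
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent, confidence = data.get("intent"), data.get("confidence")
    if not isinstance(intent, str) or not isinstance(confidence, (int, float)):
        raise ValueError("schema mismatch")
    return IntentResult(intent, float(confidence))


def structured_call(complete: Callable[[str], str], prompt: str,
                    parse: Callable[[str], T], max_attempts: int = 3) -> T:
    """Call the model, validate the reply, and retry within a bounded budget."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse(complete(prompt))
        except ValueError as exc:
            last_error = exc  # invalid output: spend one retry
    raise StructuredCallError(f"no valid output after {max_attempts} attempts") from last_error


# Fake model: first reply is malformed, second is valid.
replies = iter(["not json", '{"intent": "navigation", "confidence": 0.9}'])
result = structured_call(lambda prompt: next(replies), "classify this", parse_intent)
```

Because the only ways out of the function are a validated object or a typed exception, downstream code never has to guard against a half-parsed payload.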

Stage 3 — The post-LLM regex safety filter

After the LLM has produced its final text response, the post-filter _safety_post_filter() (voice_llm_orchestrator.py:844) runs the response through _MEDICAL_ADVICE_RE (defined at line 181). The regex covers three classes of medical-advice slip across all four supported languages: diagnosis commitment ("u heeft (waarschijnlijk) griep"), dosage or drug recommendation (numeric \d+ mg, neem twee tabletten), and first-aid prescription ("druk stevig op de wond"). On any match, the post-filter logs voice_llm_post_filter_triggered at WARNING and replaces the LLM output with the language-matched safety-refusal template. This is the belt-and-braces guarantee: even if the LLM ignores the system prompt, the regex strips the offending content before TTS.

The post-filter complements rather than replaces the pre-filter. The pre-filter catches caller asks before the LLM runs; the post-filter catches LLM slips after the LLM runs. Their failure modes are different: the pre-filter can have false negatives on novel paraphrases of dosage asks, the post-filter can have false negatives on novel paraphrases of dosage answers. The two together close most of the failure surface, but neither is a complete defence — that is what the disclaimer at the next stage exists to handle.

Stage 4 — The medical-content disclaimer prepender

The shaper voice_answer_shaper.shape() carries the per-language disclaimer prepend ("Ter informatie, dit is geen medisch advies — …" and the en/fr/it equivalents). Until 2026-05-09 the orchestrator was hard-coding medical_intent_detected=False on every voice turn, a wire left dangling when commit 158d793 deleted the legacy intent classifier. Wave 2.C Decision 2 re-activated the prepender by introducing _detect_medical_content_in_answer(), which inspects the assistant's actual answer text (after RAG, after the LLM, after markdown / URL / citation strip) and decides whether to prepend the disclaimer.

The mechanism is deliberately post-LLM answer-text inspection, not pre-LLM intent guessing. The regex evaluates what the system is about to say, not what we predicted the caller meant. This is structurally stronger: a caller asks about "how to read a CT scan report" (informational, technically navigational) and the LLM produces an answer that names cancers and treatments — pre-LLM intent guessing would skip the disclaimer; post-LLM answer inspection catches it. Each language carries its own pattern pack covering six clusters of medical vocabulary (body/condition/disease/injury, symptoms, treatment/therapy/medication/surgery, diagnostic vocabulary, specialist roles, care-domain names). The current logic is intentionally medical-dominant: any medical-pattern hit triggers the disclaimer regardless of co-occurring navigational vocabulary. The regulatory cost of under-disclaim (AI Act Article 50(2)) far outweighs the user-experience cost of over-disclaim. Each invocation logs the disclaimer decision at INFO so operators can spot over-firing or under-firing from logs alone.
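A minimal sketch of the medical-dominant decision follows. The vocabulary pattern is a small hypothetical sample, not one of the six-cluster packs, and the disclaimer string is only the opening words quoted above because the remainder is elided in this text. The log format mirrors the voice_disclaimer_decision line referenced in section four.

```python
import logging
import re

logger = logging.getLogger("voice_answer_shaper")

# Hypothetical sample of Dutch medical vocabulary; the real packs cover six clusters.
_MEDICAL_CONTENT_RE = {
    "nl": re.compile(
        r"\b(kanker|behandeling|symptomen|diagnose|cardiologie|chirurg)\b", re.IGNORECASE
    ),
}

# Opening words only; the full disclaimer text is not reproduced in the source.
_DISCLAIMER = {"nl": "Ter informatie, dit is geen medisch advies."}


def prepend_disclaimer_if_medical(answer: str, language: str) -> str:
    """Inspect the final answer text; any medical hit wins over navigational vocabulary."""
    detected = bool(_MEDICAL_CONTENT_RE[language].search(answer))
    logger.info("voice_disclaimer_decision detected=%s prepend=%s", detected, detected)
    return f"{_DISCLAIMER[language]} {answer}" if detected else answer


mixed = prepend_disclaimer_if_medical("Cardiologie is op verdieping vier, parkeren kan in P3.", "nl")
```

The mixed answer in the usage line gets the disclaimer even though half of it is about parking, which is the over-disclaim direction the text argues for.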

4. Reliability and observability

The reliability story is anchored on three rules codified in CLAUDE.md after the voice-history regression of 2026-05-07. The rules are written narrowly because they emerged from a specific incident, but they generalise: they describe what it takes to make a system that fails quietly fail loudly instead.

The silent-failure discipline (R1 / R2 / R3)

R1 — observability for collection-returning functions. Any function that returns a list, dict, or generator must log its size at INFO immediately before the return. One log line. The cost is microseconds; the debugging payback when the function returns zero unexpectedly is enormous. The voice-history regression of 2026-05-07 shipped because a function that returned an empty list on a permission mismatch had no size log, and the operational evidence that the call was returning zero results was buried two layers deeper than anyone would naturally look. The fix was a single logger.info("voice_history_loaded", count=len(history)) line. The discipline is now repo-wide: any list-returning function that exits a service boundary logs its size.
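In its simplest form R1 is one line before the return. The sketch below uses the standard-library logger with a formatted message, whereas the line quoted above uses keyword arguments in the style of a structured logger; the function body and row layout are hypothetical.

```python
import logging

logger = logging.getLogger("voice_history")


def load_voice_history(rows, conversation_id):
    """Return the turns for one conversation, logging the size before returning."""
    history = [row for row in rows if row["conversation_id"] == conversation_id]
    logger.info("voice_history_loaded count=%d", len(history))  # the R1 line
    return history
```

A count of zero then shows up in the log stream at the point where it happens, rather than two layers further down.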

R2 — regression-pinning tests for silent-failure branches. Every code path that fails quietly — empty fallbacks, NULL filters, exception swallowers — needs a test that asserts the user-visible post-state matches the documented behaviour. The test goes IN with the fix, not later. The voice-history regression was a NULL-equality filter that excluded the rows we wanted; the fix added the corrected filter and a regression test that pinned "NULL user_id rows are loaded for voice conversations" as a contract. The wheelchair-regression test suite (eight tests, documented in section three) is another instance of the same discipline.
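The NULL-equality trap and its pinning test can be reproduced with an in-memory SQLite database. The table layout and function name are hypothetical, and the null-safe IS comparison shown is SQLite syntax (Postgres would use IS NOT DISTINCT FROM); the sketch illustrates the class of bug, not the actual fix.

```python
import sqlite3


def load_voice_messages(conn, conversation_id, user_id):
    """Null-safe filter: `user_id = NULL` is never true in SQL, `IS` handles NULL."""
    rows = conn.execute(
        "SELECT body FROM messages WHERE conversation_id = ? AND user_id IS ? ORDER BY id",
        (conversation_id, user_id),
    ).fetchall()
    return [body for (body,) in rows]


def test_null_user_id_rows_are_loaded_for_voice_conversations():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, conversation_id TEXT, user_id TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (conversation_id, user_id, body) VALUES (?, ?, ?)",
        [("c1", None, "Waar is de ingang?"), ("c1", "u42", "chat turn"), ("c2", None, "other call")],
    )
    # The buggy equality filter silently returns nothing for anonymous voice callers.
    broken = conn.execute(
        "SELECT body FROM messages WHERE conversation_id = ? AND user_id = ?", ("c1", None)
    ).fetchall()
    assert broken == []
    # The contract: rows with a NULL user_id are loaded for voice conversations.
    assert load_voice_messages(conn, "c1", None) == ["Waar is de ingang?"]
```

The test asserts the user-visible post-state (the turn is loaded), so it fails on the day the equality filter comes back.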

R3 — contract tests for cross-component shared state. When two components share protocol state — for example, the voice agent's _current_language and the backend's QueryRequest.detected_language, or any handoff over WS, HTTP, or DB schema — there must be a test that simulates the wire format and asserts both sides agree. The test lives wherever it can run both sides; if neither side owns it cleanly, it gets its own integration-test file. The voice-language plumbing has such a test at backend/tests/integration/services/voice/test_voice_llm_orchestrator_integration.py::test_detected_language_from_voice_agent_is_respected_by_orchestrator — it sends a QueryRequest with detected_language="nl" and asserts the orchestrator reads the field rather than re-inferring or overriding.

The three rules close the loop. R1 surfaces silent failures to logs so they are discoverable. R2 catches them in CI before they ship. R3 catches them at component boundaries where most of them live.

The per-turn telemetry surface

The telemetry that operators consume is written to three Postgres tables. pipeline_telemetry carries per-stage latency, retrieval cardinality, and confidence scores. category_mismatch_telemetry carries the Value Framework's per-turn category-mismatch metric (intent class, primary category, mismatch rate, chunks total, chunks off-category, query preview). conversation_messages carries the human-readable turn record (utterance, answer, language, conversational intent, citations).

The dashboard surface is on the /analytics/system page in the Operations tab, with two trend charts of immediate operational interest: Category Mismatch Trend and Diagnostic Accuracy Trend. The trends are per-tenant; the multi-tenant SaaS architecture isolates the numbers for each hospital so a regression in tenant A is not masked by traffic in tenant B. The reading frame is the SLO discipline of Beyer et al. 2016: metrics are read at the tail (p95, p99), not at the mean. A mean of one hundred fifty milliseconds with a p95 of two seconds is a system in trouble, even though the mean looks fine.
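The gap between mean and tail is easy to reproduce with synthetic numbers. The latency samples below are invented to match the figures in the paragraph above, and the nearest-rank percentile is one common definition, not necessarily the one the dashboard uses.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic turn latencies in milliseconds: mostly fast, with a slow 6% tail.
latencies_ms = [30] * 90 + [75] * 4 + [2000] * 6

mean_ms = statistics.fmean(latencies_ms)  # 150.0
p50_ms = percentile(latencies_ms, 50)     # 30
p95_ms = percentile(latencies_ms, 95)     # 2000
```

Ninety percent of these turns finish in thirty milliseconds and the mean is a comfortable 150, yet one caller in twenty waits two seconds, which is the population the SLO has to protect.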

A complementary signal lives in the structured logs themselves. The R1 log lines are designed so that a grep over the log archive can answer per-turn quality questions without a database query. voice_citations_written count=0 immediately tells an operator a turn shipped without citations. voice_disclaimer_decision detected=True prepend=True immediately tells an operator the disclaimer fired. voice_llm_post_filter_triggered at WARNING is the operational signal for a Stage 3 fire — a Stage 3 fire is rare in calm traffic and is interesting when it spikes.

Diagnostic V2 — per-dimension scoring and adversarial pass

The v2 diagnostic endpoint (POST /api/v1/query?response_format=v2) is used during evaluation runs and during operator triage of individual calls. It scores each turn along five dimensions: correctness (does the answer match the ground truth), safety (does it slip into medical advice), memory (does it use prior conversation context appropriately), tool_use (does it pick the right tool), and latency (does it stay within the budget). The scoring is done by an LLM-as-judge call site (VoiceTurnEvaluator), which is one of the eight structured_call structured-output sites described in section three.

The diagnostic also runs an adversarial counter-evidence pass — given an answer, it searches the corpus for chunks that contradict the claim, and surfaces them to the operator if any are found. This is structurally important for the safety story: an answer that looks correct against its supporting chunks but contradicts other chunks elsewhere in the corpus is a citation-grounding success but a corpus-coherence failure. The adversarial pass is bibliographically adjacent to but not directly modelled on a published paper; it is an internal pattern that emerged from operator triage.

Per-claim grounding decomposes the answer into atomic claims and labels each as grounded, inferred, or speculation. grounded claims trace to a retrieved chunk; inferred claims are reasonable extrapolations from chunks but not literally stated; speculation is what the LLM should never produce on the voice path. The proportion of each per call is a quality signal that complements the categorical-mismatch metric.

5. Safety and compliance

Safety on the voice channel is a two-stage filter plus a tool-grounded LLM in the middle plus a post-LLM disclaimer prepender. This supersedes an earlier "triple-defense" framing that referenced two modules — stt_ambiguity_guardrail.py and voice_safety_gate.py — that were deleted in commit 158d793. The current architecture is the production architecture; the framing in this section is the framing that matches the code.

Figure 5.1 — Voice-channel safety stages.

The two-stage-plus-tool-grounding model

Stage 1 is the regex pre-filter. It catches the safety-critical caller asks (dosage, prescription, "what should I take?") before the LLM is invoked. The SAFETY_REFUSAL class returns the language-matched fixed response from _SAFETY_RESPONSES (voice_llm_orchestrator.py:226), offering the helpdesk transfer plus a fallback to the caller's GP, out-of-hours service, or the Belgian emergency number 112. A recognised dosage ask cannot reach the LLM at all. The patterns target medical-advice phrasings narrowly enough that benign navigational queries ("where can I find information about X?") fall through to the agentic path.

Stage 2 is the agentic LLM with tool-grounded retrieval. The system prompt establishes hard invariants (no ZOL-specific facts from training data, no medical advice, answers of one or two sentences). The tool_choice="auto" setting plus a system-prompt instruction to require search_hospital_kb for any factual claim means the LLM's answer must trace to a chunk in app.document_chunks. Two consecutive search_hospital_kb calls returning found=False force-transfer to the helpdesk. The tool loop is bounded by a three-iteration cap. Per the OWASP LLM Top 10, this is the practitioner mitigation for three risks: LLM01, prompt injection (system-prompt invariants restated every turn); LLM03, training-data-induced misinformation (tool-grounded retrieval forces claims through the corpus); and LLM06, sensitive information disclosure (the corpus is curated, no patient data is in scope).
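The two loop bounds can be sketched with the model and the knowledge base injected as plain callables. The llm_step interface and the return shape are hypothetical, and transferring to the helpdesk when the iteration cap is exhausted is an assumption; the text only states that the cap exists.

```python
def run_tool_loop(llm_step, search_kb, max_iterations=3, max_consecutive_misses=2):
    """Bounded agent loop.

    llm_step(observations) returns ("search", query) or ("answer", text);
    search_kb(query) returns a dict with at least a boolean "found" key.
    """
    observations = []
    consecutive_misses = 0
    for _ in range(max_iterations):
        action, payload = llm_step(observations)
        if action == "answer":
            return {"outcome": "answer", "text": payload}
        result = search_kb(payload)
        observations.append(result)
        consecutive_misses = 0 if result["found"] else consecutive_misses + 1
        if consecutive_misses >= max_consecutive_misses:
            # Two misses in a row: stop guessing and hand over to a human.
            return {"outcome": "transfer_to_helpdesk", "text": None}
    # Assumption: an exhausted iteration cap also ends in a transfer.
    return {"outcome": "transfer_to_helpdesk", "text": None}
```

Both exits are deterministic, so the worst-case number of retrieval calls per turn is known in advance and can be budgeted against the latency target.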

Stage 3 is the regex post-filter. It runs on the LLM's output, before the answer-shaper, and catches three classes of medical-advice slip in all four languages: diagnosis commitments, numeric dosages, and first-aid imperatives. On a match it replaces the output with the safety-refusal template.

Stage 4 is the medical-content disclaimer prepender. It runs after the answer-shaper, inspects the final answer text, and prepends "Ter informatie, dit is geen medisch advies — …" (and language equivalents) if any of the six clusters of medical vocabulary fire. Mixed answers ("Cardiology is on floor four, parking is in P3") fire the disclaimer — the safe direction. The regulatory cost of under-disclaim (AI Act Article 50(2)) far outweighs the user-experience cost of over-disclaim.

STT-mishearing awareness — the "afwrak" class of bug

A specific class of bug deserves naming. Dutch first-person-imperative-with-medical-entity inversions like "Behandel ik migraine?" ("Do I treat migraine?") are a few phonemes away from "Hoe wordt migraine behandeld?" ("How is migraine treated?"). The first is advice-seeking, the second is informational. A Flemish-tuned acoustic model errs on the close-call side. We have observed this class of bug under both forms — caller speaks the safe phrasing, transcript renders the unsafe phrasing, and conversely — and the only complete defence is to treat the inversion as in-scope for the Stage 1 pattern set, regardless of which phrasing arrived in the transcript. The Dutch pattern pack at voice_thin_pre_filter.py:170 includes the inverted forms; the multi-language packs do likewise for English, French, and Italian.

GDPR mapping

The processing of caller data is governed by Regulation (EU) 2016/679, the General Data Protection Regulation. The mapping of the regulation's articles to the voice stack is:

  • Article 4(5) — pseudonymisation. The voice channel pseudonymises caller-ID before logging via voice_pii_redaction.py. Phone numbers, full names, and other inline PII are replaced with structural tokens before any structured-log emission.
  • Article 5 — principles. Lawfulness (Art. 6), purpose limitation (the pilot processes calls only for the search-tool purpose), and storage limitation (conversation history is retained per the data-retention policy in Data Retention Policy) are all instantiated in the architecture.
  • Article 6 — lawful basis. The processing rests on legitimate interest (Art. 6(1)(f)) and public interest (Art. 6(1)(e)), not on consent. The hospital has a public-service duty to make its services accessible; the interest passes the three-part test (purpose, necessity, balancing).
  • Article 9 — special-category data. The system is deliberately scoped out of special-category data. The corpus is general-public hospital information, not patient health records. Inadvertent special-category content in user queries (a caller mentioning their symptoms) is processed under Art. 9(2)(h) (provision of healthcare) and Art. 9(2)(i) (public health) jointly.
  • Article 25 — data protection by design and by default. PII redaction at ingest, minimal-data logging, and purpose-bound retention are design-time choices, not runtime mitigations.
  • Article 28 — processor relationships. Twilio (PSTN), OpenAI (LLM and embeddings), Deepgram (STT), and ElevenLabs (TTS) are subprocessors of record, with appropriate data-processing agreements in place.
  • Article 30 — records of processing. The audit log surface is described in Security.
  • Article 32 — security of processing. TLS termination at every hop, Keycloak-backed access controls on the admin surface, and the ISO/IEC 27001:2022 ISMS framework as the alignment target.
  • Article 35 — DPIA. The full DPIA is at DPIA; a short summary is that the proactive DPIA was conducted because of the healthcare context (data concerning health is special-category data even when nominally about hospitals rather than patients), even though the literal corpus is public-domain.

AI Act mapping

The processing also falls within the scope of Regulation (EU) 2024/1689, the EU AI Act. The classification is:

  • Limited-risk under Art. 50. The system is an AI system that interacts with natural persons; Art. 50(1) requires that the user be informed they are interacting with AI unless this is obvious from context. The voice channel discharges this at the greeting layer (the "we are an information assistant" statement at call open) and again per-turn via the Stage 4 disclaimer prepender on any answer that matches the medical-content pattern packs.
  • Not high-risk under Art. 6 + Annex III. The system is informational and navigational only; it is not a decision-support system in the AI Act's sense. The negative classification is anchored on the same architectural choices that anchor the negative MDR classification below.
  • Not a medical device under Regulation (EU) 2017/745 (the Medical Device Regulation). Article 2(1)'s definition of a medical device does not match a hospital wayfinding tool; Annex VIII Rule 11 software classification confirms that an information-and-navigation tool is not regulated as software-as-a-medical-device. The negative classification is the load-bearing legal posture for the system's operation; if any of the safety stages fail in production and the system starts producing literal medical advice, the classification flips and the system would require CE-marking. This is why the safety stages are anchored at the regex layer rather than at the LLM layer alone.
  • Lineage — HLEG 2019 Ethics Guidelines for Trustworthy AI informed the AI Act's ethics framing. The seven HLEG principles (human agency, technical robustness, privacy, transparency, diversity, well-being, accountability) map onto AI Act Articles 13-15 and Article 50.

Adversarial input hardening

The adversarial threat model addresses two attack classes from the recent literature. The first is GCG-style adversarial suffixes (Zou et al. 2023) — high-entropy character sequences appended to a query that probabilistically jailbreak aligned models. The second is the generative extension of the first, AmpleGCG, which trains a generative model of adversarial suffixes; the practical takeaway is that the threat class is not static, and defences must assume a steady stream of new suffix patterns rather than a fixed adversarial vocabulary.

The defence is regex-based, surface-signature detection: high-entropy character sequences and character-level perturbations are pattern-matched at the pre-filter stage and routed to the safety-refusal branch. This is a deliberately incomplete defence, since a sufficiently sophisticated attacker can craft suffixes that evade regex, but the threat-actor model for a public-hospital wayfinding tool is not a researcher with white-box gradient access. The costs are asymmetric: a successful jailbreak costs the operator a regulatory exposure (the system slips into medical advice), while a failed attempt costs the attacker nothing (they hang up). The defence does not need to be perfect; it needs to be unfun.
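A surface-signature check of this kind might look like the sketch below. The pattern is an illustrative guess at what such a signature could be (runs of symbols that ordinary transcribed speech does not contain) and is not the production pattern set; the function name is hypothetical.

```python
import re

# Illustrative signature: two or more consecutive symbols outside normal sentence
# punctuation, or four or more consecutive punctuation characters of any kind.
_SUFFIX_SIGNATURE_RE = re.compile(r"[^\w\s.,?!'\-]{2,}|[^\w\s]{4,}")


def looks_adversarial(utterance: str) -> bool:
    """True when the text carries symbol runs typical of optimiser-generated suffixes."""
    return bool(_SUFFIX_SIGNATURE_RE.search(utterance))
```

A check this cheap runs in microseconds at the pre-filter stage, which is what keeps it compatible with the voice latency budget even though it is trivially evadable by a determined attacker.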

6. Reproducibility kit — what changes for a new domain

The seven-layer stack and the cognitive deep-dive are domain-shaped only at four points: the safety pattern packs, the corpus, the tool set, and the affinity matrix. Everything else is portable. This section enumerates what changes for three adjacent domains.

6.1 — Phone-support spinoff

What stays the same. The seven-layer stack, the thin-pipeline shape (regex pre-filter → agent → regex post-filter → answer-shaper), the telemetry surface, the multi-tenant overlay system, the structured_call structured-output discipline, the SIP-to-LiveKit-to-agent media path, language locking at first utterance, the prosody injection and adaptive TTS speed, and the silent-failure discipline (R1/R2/R3).

What changes. The medical-advice safety layer is dropped or inverted. For a product-support phone agent, the equivalent constraint is "we provide product information, not legal advice"; the regex packs change from medical vocabulary to legal-advice vocabulary (warranty interpretations, refund-promise commitments, contract-term explanations). The hospital taxonomy is replaced with a product hierarchy (categories, SKUs, troubleshooting trees, common-fault FAQs). The tool set changes: search_hospital_kb becomes search_product_kb, transfer_to_helpdesk becomes transfer_to_human_agent, and a new lookup_customer_account tool is added that integrates with a CRM (Salesforce, HubSpot, or similar). A create_ticket tool replaces the SIP REFER pattern when the agent cannot resolve in-call. The Value Framework's six categories are replaced with a domain-appropriate set: practical (shipping, returns, hours), product_info (specs, compatibility, accessories), troubleshooting (steps, fixes, escalation), policy (warranty, refund, exchange), account (orders, billing, profile), general (fallback). The default affinity matrix is reseeded against the empirical question distribution of the new domain.

Estimated effort. Four to six engineering weeks for a seventy-percent-fidelity replica, with one engineer at the agentic-LLM-experienced level and one at the integration-engineer level. The dominant cost lines are the CRM integration (one to two weeks depending on the CRM and its API surface) and the regex pattern-pack rewrite (one week for the legal-advice safety surface, one week for the new affinity matrix and primary-category election).

Risks. The customer-data PII redaction surface is materially larger — phone numbers, email addresses, order numbers, billing addresses, payment-card hints in transcripts. The redaction patterns must be tuned for the domain (a hospital does not see credit-card numbers; a product-support call does). SLA expectations from the operator are typically harder than from a hospital — a CTO buying for a phone-support operation will expect first-audio latency at the ninety-fifth percentile to be under two seconds, where a hospital pilot tolerates four. The latency budget must be tightened.

6.2 — Appointment-booking spinoff

What stays the same. The full seven-layer stack, the thin-pipeline shape, language locking at first utterance, the prosody injection and adaptive TTS speed, the listening-acknowledgment and context-aware filler patterns, the telemetry surface, and the silent-failure discipline. The regex pre-filter and post-filter shells are preserved — only their pattern packs change. The structured_call structured-output discipline is the same.

What changes. The intent set is restricted to scheduling intents: book, reschedule, cancel, query availability, confirm. The cognitive core changes from retrieval-augmented generation to availability-augmented generation — the search_hospital_kb tool is replaced with query_availability (against the existing scheduling system, e.g., Cronofy, Microsoft Bookings, or a domain-specific scheduler), and a lock_slot tool is added that places a soft hold on a slot during the call so the system does not double-book. A confirm_booking tool finalises the lock once the caller has agreed. A cancel_booking and a reschedule_booking tool round out the set. The Value Framework is largely unused in this domain — there are no cross-category contamination problems when the entire conversation is scheduling-shaped — but the primary-category election can be repurposed for service-type routing (general-practice vs specialist vs follow-up) and the unit-mismatch detector for time-zone confusion. A new entity-extraction module is added for spoken dates: "next Tuesday at three", "two weeks from now", "the first available after the fifteenth" must all map to absolute timestamps. This is a non-trivial NLU surface; it is the single largest source of new code.

Estimated effort. Three to five engineering weeks. The intent restriction simplifies parts of the cognitive core; the date-NLU work and the slot-locking primitives compensate. One engineer at the agentic-LLM-experienced level can ship the bulk of this; the integration with the chosen scheduler is the variable cost line.

Risks. Time-zone correctness is a category of bug that does not exist in the hospital domain (ZOL serves one timezone) and is the dominant correctness risk in the booking domain. Two-and-a-half-hour offset bugs are a known pattern when callers say "next Tuesday at three" in one timezone and the agent renders it in another. The slot-locking primitive is the second risk: race conditions on slot reservation, where two concurrent calls both lock the same nine-thirty slot, must be handled at the scheduler-API layer with optimistic concurrency control. The voice-stack architecture itself does nothing wrong; the failure mode lives in the integration boundary.
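Optimistic concurrency on slot reservation can be illustrated with an in-memory stand-in for the scheduler. The class, the version field, and the method names are hypothetical; a real scheduler API would expose its own conditional-write mechanism, such as a version column or an ETag.

```python
class SlotConflict(Exception):
    """The slot changed between read and write; the caller must re-read and re-offer."""


class InMemoryScheduler:
    """Stand-in for a scheduler API that versions every slot record."""

    def __init__(self, slots):
        self._slots = {slot: {"version": 0, "held_by": None} for slot in slots}

    def read(self, slot):
        return dict(self._slots[slot])  # copy, as a remote API would return a snapshot

    def lock_slot(self, slot, call_id, expected_version):
        current = self._slots[slot]
        # The write succeeds only if nobody touched the slot since it was read.
        if current["version"] != expected_version or current["held_by"] is not None:
            raise SlotConflict(slot)
        current["held_by"] = call_id
        current["version"] += 1
        return current["version"]


scheduler = InMemoryScheduler(["2026-06-02T09:30"])
seen_by_a = scheduler.read("2026-06-02T09:30")
seen_by_b = scheduler.read("2026-06-02T09:30")  # both calls see version 0
scheduler.lock_slot("2026-06-02T09:30", "call-a", seen_by_a["version"])
```

When the second call then tries to lock with its stale version it gets SlotConflict, and the agent can offer the caller the next available slot instead of silently double-booking.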

Architecture option. This is the one spinoff where voice-native is genuinely viable. A booking line has no hard no-medical-advice constraint, and its only consequential output is a structured action (the slot write), not free-form advice. That makes it the natural candidate to drop the cascade for a voice-native speech-to-speech model (see Cascade vs. voice-native in §2) and capture the sub-1.2 s latency and full-duplex naturalness that matter most when a caller is reading out dates and times. The lesson carried over from the hospital build is to separate conversation from transaction: let the duplex model own the natural dialogue (slot-filling, "morning or afternoon?", barge-in), but route the actual booking through a strictly validated, confirmed, idempotent tool call. That means lock_slot / confirm_booking with a schema-checked payload, an explicit spoken read-back ("Tuesday the twelfth at two o'clock with Dr. Peeters, shall I confirm?") before the write, and an idempotency key so a barge-in or a retry cannot double-book. The cascade's deterministic control is not abandoned; it is relocated from every spoken sentence to the one point that has consequences. The rest of this compendium still transfers wholesale: build the trace/replay/SLO toolchain before tuning prompts, evaluate on spoken input with a lowest-effort baseline, treat feature flags as validation gates to be removed once proven, and keep the tenant configuration DB-driven. Note the observability cost flips: with a duplex model you lose the per-stage text events the cascade emits, so you must capture the realtime event stream and transcribe both sides for the audit record.
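The transaction side of that split can be sketched as a confirm step keyed on an idempotency key. The request fields, the store class, and the booking-id format are hypothetical; the sketch shows only the two guarantees named above, namely no write without a confirmed read-back and no second write on a retry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookingRequest:  # hypothetical schema-checked payload
    slot: str
    practitioner: str
    idempotency_key: str
    caller_confirmed: bool


class BookingStore:
    """In-memory stand-in for the system of record."""

    def __init__(self):
        self._by_key = {}
        self.writes = 0

    def confirm_booking(self, request: BookingRequest) -> str:
        if not request.caller_confirmed:
            raise ValueError("spoken read-back was not confirmed by the caller")
        if request.idempotency_key in self._by_key:
            # Retry or barge-in replay: return the original result, write nothing.
            return self._by_key[request.idempotency_key]
        self.writes += 1
        booking_id = f"bk-{self.writes}"
        self._by_key[request.idempotency_key] = booking_id
        return booking_id


store = BookingStore()
request = BookingRequest("2026-06-02T14:00", "Dr. Peeters", "call-123-turn-7", True)
first = store.confirm_booking(request)
retry = store.confirm_booking(request)  # same key, same booking
```

However free-form the duplex dialogue is, the only path to a write goes through this one function, which is where the deterministic control now lives.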

6.3 — Telemedicine-triage spinoff

What stays the same. The seven-layer stack, the safety architecture (kept fully — this is more sensitive than ZOL, not less), language locking at first utterance, the citation grounding (now sourcing from clinical guidelines rather than hospital wayfinding content), the multi-tenant overlay system, and the silent-failure discipline. The PII redaction surface is the largest of the three domains.

What changes. The cognitive core's intent set adds triage intents (assess, escalate, schedule_follow_up). The search_hospital_kb tool is replaced with search_clinical_guidelines (sourced from published triage protocols — Manchester Triage System or equivalent regional standards). A new assess_severity tool encodes the triage protocol's escalation thresholds: high-severity symptoms (chest pain with radiating arm pain, sudden severe headache, severe shortness of breath) trigger immediate escalation to emergency services; medium-severity symptoms route to same-day GP availability; low-severity symptoms route to information-and-self-care content. The disclaimer becomes mandatory on every turn, not just turns matching medical content — a triage agent's role is medical-information-adjacent by construction, and Article 50(2) transparency is satisfied by per-turn rather than situational disclosure. The Value Framework is repurposed: the categories become triage-protocol-shaped (severe, moderate, mild, self-care, escalate), and the primary-category election routes the conversation rather than just the answer.
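The routing half of assess_severity can be sketched as a fixed table from severity band to action. This is an illustration only: the `Severity` enum, the route identifiers, and the `route_for` function are hypothetical, and the classification of symptoms into bands is deliberately left out because it must come from the adopted triage protocol, not from application code.

```python
from enum import Enum
from typing import Dict


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical route identifiers; the three destinations mirror the text above.
ROUTES: Dict[Severity, str] = {
    Severity.HIGH: "escalate_to_emergency_services",
    Severity.MEDIUM: "offer_same_day_gp",
    Severity.LOW: "serve_self_care_content",
}


def route_for(severity: Severity, disclaimer: str) -> Dict[str, str]:
    """Return the routing decision for one turn.

    The disclaimer is attached unconditionally, reflecting the
    every-turn disclosure rule for the triage domain.
    """
    return {"route": ROUTES[severity], "disclaimer": disclaimer}
```

Keeping the table total over the enum means a new severity band cannot be added without also deciding where it routes, since a missing entry fails loudly with a `KeyError`.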

Estimated effort. Eight to twelve engineering weeks. The regulatory-compliance work dominates: the system is now classifiable as software-as-a-medical-device under MDR Annex VIII Rule 11 (software providing diagnostic or therapeutic information), not the negative classification we hold in the hospital domain. CE-marking, clinical evaluation, post-market surveillance, and vigilance-reporting infrastructure must all be in place before the system can serve a single triage call in production. The voice-stack engineering is comparatively small — a five-to-six-week implementation, similar to the appointment-booking spinoff — but it sits underneath six-to-eight weeks of regulatory and legal work that the hospital pilot did not have to do.

Risks. The regulatory exposure is the dominant risk and is not a software-engineering risk. Citation grounding becomes safety-critical, not just provenance; an inferred or speculative claim that surfaces in a triage turn is a clinical-safety incident, not a quality regression. The adversarial-counter-evidence pass from section four becomes a hard requirement, not a nice-to-have. Per-claim grounding (grounded / inferred / speculation labels) becomes operational triage signal rather than evaluation telemetry. The R3 contract test surface expands to cover the agent-to-clinician handoff at SIP REFER time: any escalation must include the structured turn-history record so the receiving clinician has continuity of context, and the contract test must pin that the record arrives intact.

7. Replication runbook

This section is operational. It tells a team adopting the stack what to provision, what to configure, and how to drive the day-one onboarding.

The tenant overlay system

The voice channel runs in a multi-tenant SaaS. Each hospital tenant has a different name, different campus addresses, different doctors, and (sometimes) different STT mishears specific to its directory. The challenge: serve all of these correctly without hardcoding any of them in shared code or in YAML.

The architecture composes three categories of voice configuration:

  1. Generic patterns and language rules live in source code (_FAQ_ENTRIES, _STT_NORMALIZATIONS in their hospital-agnostic forms). These are linguistic, not tenant-specific.
  2. Tenant-specific phonetic-recovery rules (STT mishears) live in YAML overlays at tenant_overlays/_yaml/<slug>.yaml. This is phonetic data that exists nowhere else: the "zon"/"zol" confusion, for example, is specific to the ZOL hospital.
  3. Tenant data (addresses, names, hours) live in the database, keyed by tenant slug, and are read at request time via get_taxonomy(slug) and rendered by a registered Python renderer (e.g., render_campus_listing). House numbers, postal codes, and phone numbers stay as digits per the TTS prosody rules.

At request time the three are composed into the effective configuration for the turn. Adding a campus to the database is reflected on the next request — there is no synchronisation gap, no source-code change, no deploy. The detail is in Tenant Overlay System. This satisfies the multi-tenant SaaS isolation taxonomy of Bezemer and Zaidman 2010 without requiring a per-tenant deployment topology.

The default affinity map

The Value Framework's seven-intent × six-category affinity matrix is reproduced below in full. The matrix is the default; tenants can override it at deployment time, but the wheelchair regression and the broader test suite are pinned to these values.

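| Intent | practical | clinical_info | regulatory | appointments | legal_admin | general |
|---|---|---|---|---|---|---|
| navigation_or_practical_info | 1.30 | 0.65 | 0.55 | 1.05 | 0.85 | 1.00 |
| appointment_scheduling | 1.05 | 0.80 | 0.75 | 1.30 | 0.95 | 1.00 |
| medical_information | 0.75 | 1.25 | 1.05 | 0.95 | 0.85 | 1.00 |
| doctor_information | 0.90 | 1.10 | 0.85 | 1.20 | 0.85 | 1.00 |
| department_or_service | 1.10 | 1.10 | 0.85 | 1.20 | 0.90 | 1.00 |
| administrative_or_legal | 0.90 | 0.80 | 1.20 | 0.95 | 1.30 | 1.00 |
| billing_or_insurance | 0.85 | 0.85 | 1.30 | 0.95 | 1.10 | 1.00 |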

Multipliers above one boost; below one penalise. One is neutral. The maximum boost is 1.30 and the maximum penalty is 0.55 — the boundary values were chosen so that even a worst-case ranking inversion (boosted chunk at similarity 0.50 vs penalised chunk at similarity 0.95) can flip the order, but no further. Intents not in the map default to neutral across all categories — the framework never makes things worse than the unfiltered ranking.
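The worst-case inversion can be checked numerically. The sketch below assumes the multiplier is applied as a plain product with the similarity score, which is what the worked example in the text implies but is not spelled out here; the function names and the chunk tuple layout are hypothetical. The matrix values are copied from the default affinity map.

```python
from typing import Dict, List, Tuple

CATEGORIES = ("practical", "clinical_info", "regulatory",
              "appointments", "legal_admin", "general")

_ROWS = {
    "navigation_or_practical_info": (1.30, 0.65, 0.55, 1.05, 0.85, 1.00),
    "appointment_scheduling":       (1.05, 0.80, 0.75, 1.30, 0.95, 1.00),
    "medical_information":          (0.75, 1.25, 1.05, 0.95, 0.85, 1.00),
    "doctor_information":           (0.90, 1.10, 0.85, 1.20, 0.85, 1.00),
    "department_or_service":        (1.10, 1.10, 0.85, 1.20, 0.90, 1.00),
    "administrative_or_legal":      (0.90, 0.80, 1.20, 0.95, 1.30, 1.00),
    "billing_or_insurance":         (0.85, 0.85, 1.30, 0.95, 1.10, 1.00),
}
AFFINITY: Dict[str, Dict[str, float]] = {
    intent: dict(zip(CATEGORIES, row)) for intent, row in _ROWS.items()
}

Chunk = Tuple[str, str, float]  # (chunk_id, category, similarity)


def multiplier(intent: str, category: str) -> float:
    """Intents missing from the map are neutral across all categories."""
    row = AFFINITY.get(intent)
    return 1.0 if row is None else row[category]


def rerank(intent: str, chunks: List[Chunk]) -> List[Chunk]:
    """Sort by similarity * multiplier, highest first (assumed scoring rule)."""
    return sorted(chunks,
                  key=lambda c: c[2] * multiplier(intent, c[1]),
                  reverse=True)


candidates = [("b", "regulatory", 0.95), ("a", "practical", 0.50)]
boosted_first = rerank("navigation_or_practical_info", candidates)
unchanged = rerank("some_unmapped_intent", candidates)
```

For the navigation intent the boosted chunk scores 0.50 × 1.30 = 0.65 and the penalised chunk scores 0.95 × 0.55 ≈ 0.52, so the order flips. More generally, under the product assumption a maximally boosted chunk overtakes a maximally penalised one only when its raw similarity is at least 0.55 / 1.30 ≈ 0.42 times the other's. For an unmapped intent every multiplier is 1.0 and the original similarity order is preserved.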

Required infrastructure

The pilot infrastructure for one tenant is summarised below. The numbers are sized for twenty-five thousand queries per month with average call duration around forty-five seconds; the dominant cost lines are vendor pass-throughs (Twilio, OpenAI, Deepgram, ElevenLabs).

| Component | Count | Sizing | Monthly cost (rough) |
|---|---|---|---|
| Twilio Elastic SIP Trunk | 1 | 1 Belgian PSTN number, ~25K minutes/mo | ~€60 (number rental + per-minute) |
| LiveKit Server | 1 | 2 vCPU, 4 GB RAM, container | ~€15 (server share) |
| livekit-sip gateway | 1 | 1 vCPU, 1 GB RAM, container | ~€8 (server share) |
| voice_agent worker | 1–N | 2 vCPU, 2 GB RAM per N concurrent calls | ~€15 per 50 concurrent |
| Backend (FastAPI) | 1 | 4 vCPU, 8 GB RAM, container | ~€30 |
| PostgreSQL 16 + pgvector | 1 | 4 vCPU, 16 GB RAM, 100 GB SSD | ~€80 |
| Redis 7 | 1 | 1 vCPU, 1 GB RAM | ~€10 |
| Keycloak 24 | 1 | 1 vCPU, 1 GB RAM | ~€10 |
| OpenAI (LLM + embeddings) | | per-token | ~€150 (LLM) + ~€20 (embeddings) |
| Deepgram Nova-3 | | per-second of audio | ~€100 |
| ElevenLabs Multilingual v2 | | per-character of TTS | ~€100 |
| Total (pilot scale) | | | ~€600/month |

The backend, Postgres, Redis, and Keycloak are shared across tenants in the multi-tenant SaaS shape. The voice_agent worker is the elastic scale unit; one worker handles around fifty concurrent calls, and the worker count scales with the operator's expected concurrency profile.
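The sizing arithmetic can be made explicit with a short sketch. The figures are the rough ones from the table above; the function names, the floor of one worker, and the assumption that worker cost scales linearly in steps of fifty concurrent calls are illustrative simplifications, not a pricing model.

```python
import math

CALLS_PER_WORKER = 50        # "around fifty concurrent calls" per worker
WORKER_COST_EUR = 15         # rough monthly cost per worker

# Rough monthly cost lines from the table, in euros, with one worker.
PILOT_COST_LINES_EUR = {
    "Twilio Elastic SIP Trunk": 60,
    "LiveKit Server": 15,
    "livekit-sip gateway": 8,
    "voice_agent worker": WORKER_COST_EUR,
    "Backend (FastAPI)": 30,
    "PostgreSQL 16 + pgvector": 80,
    "Redis 7": 10,
    "Keycloak 24": 10,
    "OpenAI LLM": 150,
    "OpenAI embeddings": 20,
    "Deepgram Nova-3": 100,
    "ElevenLabs Multilingual v2": 100,
}


def workers_needed(peak_concurrent_calls: int) -> int:
    """Round up to whole workers; keep at least one running (assumption)."""
    return max(1, math.ceil(peak_concurrent_calls / CALLS_PER_WORKER))


def monthly_estimate_eur(peak_concurrent_calls: int) -> int:
    """Pilot total with the worker line scaled to the expected peak."""
    fixed = sum(PILOT_COST_LINES_EUR.values()) - WORKER_COST_EUR
    return fixed + workers_needed(peak_concurrent_calls) * WORKER_COST_EUR
```

With a single worker the lines sum to €598, which the table rounds to roughly €600 per month. The per-usage vendor lines would also grow with call volume, so this estimate only captures the worker step function.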

Day-1 onboarding checklist

The minimum bootstrap steps for a new tenant or a new domain replica:

  1. Domain and TLS. Provision a public DNS record for the SIP gateway (e.g., sip.<tenant>.example.com) and a Let's Encrypt TLS certificate for SIPS; configure firewall rules to expose TCP port 5061 (SIPS) and UDP ports 50000-60000 (RTP).
  2. Twilio configuration. Buy the Belgian (or other-country) PSTN number, configure the SIP trunk to forward INVITEs to the SIP gateway DNS name, set the IP allowlist to the SIP gateway, and configure emergency routing per the local regulator.
  3. LiveKit project keys. Generate API key and secret, configure the LiveKit Server livekit.yaml with the keys and the Redis URL, and configure the SIP gateway livekit-sip.yaml with the trunk inbound rule and dispatch rule.
  4. Vendor accounts. Deepgram project (Nova-3 enabled), ElevenLabs voice ID (Multilingual v2 selected), OpenAI API key (GPT-4.1 access + text-embedding-3-large access).
  5. Keycloak realm. Create the realm, define the four roles (owner, admin, manager, user), seed the platform-admin and tenant-admin accounts.
  6. Postgres bootstrap. Run the Alembic migrations to head; seed the tenant row; ingest the corpus (crawl + chunk + embed); seed the taxonomy tables (entities, relationships).
  7. Tenant overlay. Create tenant_overlays/_yaml/<slug>.yaml with the tenant's STT mishears (typically the hospital name homophones); create the DB-driven renderers for any tenant-specific answer composition (e.g., campus listing).
  8. Smoke test. Run the per-language voice smoke test from voice/smoke-test-script.md. The script exercises greeting, factual question, dosage refusal, handoff request, language-switch refusal, and farewell.
  9. Monitoring. Provision the operations dashboard for the tenant; verify the Category Mismatch Trend and Diagnostic Accuracy Trend render. Set up log aggregation that retains R1 log lines for at least thirty days.
  10. Pilot calibration. Run a hundred-call calibration pass to surface the tenant-specific STT mishears that are not yet in the YAML overlay; iterate.

What an operator can and cannot measure without code access

A non-engineer operator (clinical lead, hospital communications manager) can answer the following from the dashboard alone, without code access: per-tenant call volume by hour and day; per-tenant first-audio latency at p50, p95, p99; per-tenant category-mismatch rate trended over time; per-tenant safety-refusal rate (Stage 1 fires per 1000 turns); per-tenant disclaimer-prepend rate (Stage 4 fires per 1000 turns); per-tenant call-disposition mix (answered, escalated, farewelled). They can drill into individual calls and replay the conversation transcript with citations.

What requires code access: the per-stage latency breakdown beyond the aggregate p95 (operator sees the budget total, engineer sees which stage broke it); the structured_call retry count per call site (operator sees the result, engineer sees the recovery); the regex pattern packs themselves (operator sees the fire, engineer reads the pattern); the affinity matrix entries (operator sees the rerank result, engineer changes the entries). This is the right division of labour: the operator cares about the outcomes, the engineer cares about the levers. The audit-log retention for the operator's surface is configurable per the data-retention policy; the engineer's surface lives in the structured logs and is retained at the platform-engineering team's discretion. The framework alignment is ISO/IEC 27001:2022 for the ISMS posture, although we are not certified.

8. Bibliography

This compendium uses BibTeX-key citations with deep-links to the project bibliography page. The full bibliography is at /docs/references and is maintained as a single source of truth from docs/references.bib. Each entry on that page is URL-verified and dated; new entries follow the note={Verified <YYYY-MM-DD> — <one-line summary>} convention.

The most-cited sources for this document, grouped by theme:

Retrieval-augmented generation and modular RAG. Lewis et al. 2020 — original RAG architecture. Gao et al. 2024 — modular-RAG taxonomy. Sarmah et al. 2024 — knowledge-graph + vector hybrid, structurally adjacent to the Value Framework's categorical-affinity multiplier. Cormack et al. 2009 — RRF lineage. Robertson and Zaragoza 2009 — BM25 baseline. Nogueira and Cho 2019 — passage reranker baseline.

Embeddings and vector storage. OpenAI 2024 embeddings: text-embedding-3-large. pgvector: embedding store.

Voice stack. LiveKit Agents — the agent runtime. Deepgram Nova-3 — production STT. ElevenLabs Multilingual v2 — production TTS. Wang et al. 2017 (Tacotron) — neural-TTS lineage. Radford et al. 2023 (Whisper) — STT alternative considered. The structured_call helper (app.llm.structured) — structured-output discipline at eight LLM call sites.

Telephony standards. RFC 3261 — SIP. ITU-T E.164 — international phone-number format.

Cascade vs. voice-native (verified 2026-05-30, direct sources — not yet in the central bibliography). Twilio, Core latency for AI voice agents (Nov 2025) — additive cascade latency budget (~1.1 s median). OpenAI, Introducing gpt-realtime and the Agents SDK realtime guide — single-model speech-to-speech + output-guardrail (debounced, reactive) behaviour. τ-Voice, arXiv:2603.13686 (Sierra AI, Mar 2026) — S2S latency table (gpt-realtime 0.90 s) + the 26–51 % multi-step task-completion gap. FLEXI, arXiv:2509.22243 — full-duplex naturalness latency tiers (sub-400 ms). VoiceBench, arXiv:2410.17196 (TACL 2026) — speech-vs-text safety regression (~49 pp). VoiceAgentRAG, arXiv:2603.02206 — semantic prefetch-cache latency reclaim (316× on cache hits). OpenAI, EU data residency in Europe.

Conversation analysis and UX. Sacks, Schegloff, and Jefferson 1974 — turn-taking organisation. Nielsen 1993 — response-time thresholds.

Safety and compliance. OWASP LLM Top 10 — practitioner threat taxonomy. GDPR, AI Act, MDR — EU regulatory regime. HLEG 2019 — ethics-guidelines lineage. Zou et al. 2023 (GCG) and Liao and Sun 2024 (AmpleGCG) — adversarial robustness threat class.

Operational practice and multi-tenancy. Beyer et al. 2016 (Google SRE) — SLOs at the tail. Bezemer and Zaidman 2010 — multi-tenant SaaS isolation taxonomy. ISO/IEC 27001:2022 — ISMS framework alignment.

Additional voice context. Lin et al. 2026 (FullDuplexBench-v3) — benchmark for full-duplex voice agents under disfluency.

For depth on any specific component, the per-feature pages cross-linked throughout this document carry the implementation detail; this compendium is the orientation document, not the reference manual.