Voice Channel — Architecture

The voice channel is the production telephony surface for the ZOL intelligent search system. A caller dialling the Belgian PSTN number reaches a self-hosted SIP gateway, which bridges into a LiveKit room (@livekit_agents_docs) where a voice_agent worker runs Deepgram Nova-3 streaming ASR (@deepgram_nova3) on the inbound audio and ElevenLabs Multilingual v2 (@elevenlabs_multilingual_v2) on the outbound. Per-turn cognition is delegated over a WebSocket to the backend, where VoiceLLMOrchestrator runs the regex pre-filter → GPT-4.1 tool-call loop → safety post-filter → answer-shaper sequence described below.

This page documents the composition principle, the module layout, the per-turn flow, the cross-component sequence, and the architectural trade-offs that shaped the current design.

The composition principle

The voice channel is agentic-only: a single GPT-4.1 agent with three tools is the cognition layer. ADR-0049 and ADR-0051 retired the legacy 8-stage VoiceOrchestrator (deleted in commit 158d793, ~7,000 LOC removed) in favour of an LLM agent that calls the existing RAGService.query_stream directly via a search_hospital_kb tool, with channel="voice" set on the request.

ADR-0053 (2026-05-22) further established native OpenAI streaming-with-tools as the dispatch pattern: each tool-decision iteration is a single chat.completions.create(stream=True, tools=_TOOLS, tool_choice="auto") call rather than the earlier two-call pattern (non-streaming tool decision → separate streaming final). For direct-response queries ("Do you speak English?", "Wat zijn jullie openingsuren?"), this halves OpenAI round-trips: ONE call from query to last token. For tool-using queries, the loop is N+1 calls (one per iteration, one for the final synthesis), all streamed. Both cases produce a chunk event stream that the voice_agent transforms into sentence-grain session.say() invocations via a regex-bounded sentence buffer.
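The sentence-grain buffering mentioned above can be illustrated with a minimal sketch. It is not the voice_agent code: the boundary regex and the function name `sentences_from_chunks` are assumptions made for illustration, and the real buffer may treat abbreviations and decimals differently.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the streamed chunks close them."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush the remainder once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Yes, I sp", "eak English. How ", "can I help", " you?"]
    for sentence in sentences_from_chunks(stream):
        print(sentence)  # each sentence would become one session.say() call
```

A sentence is released only once whitespace follows its closing punctuation, so the final sentence of a turn is spoken when the stream ends. A naive boundary like this one would also split after abbreviations such as "e.g.", which a production buffer would need to guard against.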

voice_agent (livekit-agents)
  → backend WS /ws/public-query
    → VoiceLLMOrchestrator
        1. regex pre-filter (classify_terminal)
        2. GPT-4.1 tool loop (max 3 iterations):
             search_hospital_kb     →  RAGService.query_stream (channel="voice")
             transfer_to_helpdesk
             end_call
        3. safety post-filter (regex on LLM output)
        4. VoiceAnswerShaper (TTS phone formatting)
        5. medical-disclaimer prepender (if Stage 4 detects medical content)

RAGService.query_stream is unchanged from the chat channel. Setting channel="voice" activates voice-specific behaviours inside the RAG pipeline: the Value Framework affinity rerank (@cormack2009rrf is the rank-fusion lineage; we extend it with a categorical-affinity multiplier — see Value Framework), citation derivation from chunks rather than inline [N] markers (see Citation Pipeline), and a voice-shaped LLM system prompt from app.prompts.build_voice_llm_orchestrator_system_prompt.
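To make the affinity rerank concrete, here is a hedged sketch of reciprocal rank fusion with a categorical multiplier applied afterwards. The function name `rrf_with_affinity`, the `boost` value, and the category labels are hypothetical; `k=60` is the constant commonly used with RRF, and the actual apply_intent_category_affinity may combine the scores differently.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_with_affinity(
    rankings: List[List[str]],
    chunk_category: Dict[str, str],
    intent_category: str,
    k: int = 60,
    boost: float = 1.5,  # hypothetical multiplier, not a documented value
) -> List[str]:
    """Fuse several rankings with RRF, then boost chunks matching the intent category."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)  # standard RRF contribution
    for chunk_id in scores:
        if chunk_category.get(chunk_id) == intent_category:
            scores[chunk_id] *= boost  # categorical-affinity multiplier
    return sorted(scores, key=lambda c: scores[c], reverse=True)


if __name__ == "__main__":
    dense = ["a", "b", "c"]    # e.g. a vector-similarity ranking
    lexical = ["b", "a", "c"]  # e.g. a BM25 ranking
    categories = {"a": "doctor", "b": "doctor", "c": "visiting_hours"}
    print(rrf_with_affinity([dense, lexical], categories, "visiting_hours"))
```

In this toy input the chunk ranked last by both retrievers moves to the top once its category matches the caller's intent, which is the effect a multiplier of this kind is meant to have.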

Architectural trade-offs

Four foundational decisions define the voice channel's shape; each is captured below with the alternatives that were considered and rejected, and the first three are backed by an ADR.

Decision	Chosen	Alternatives considered	Rejected because
Cognition topology	Agentic LLM with tools (ADR-0051)	8-stage deterministic pipeline (Phase A `VoiceOrchestrator`); thin pipeline (regex pre-filter → FAQ → RAG, no LLM agent)	The 8-stage pipeline accumulated ~7 000 LOC, six dangling feature flags, and a sub-5% cache hit rate on speculative-STT; the thin pipeline could not handle compound queries ("which doctor at cardiology AND what are the visiting hours") without re-introducing intent state. The agentic LLM lets the model decide tool dispatch on a per-turn basis without persisted dialogue state.
Telephony stack	Self-hosted Twilio SIP + LiveKit (ADR-0050)	LiveKit Cloud SIP managed gateway; Twilio voice with no LiveKit	LiveKit Cloud SIP is $0.30–0.50/participant-hour ($375–625/month at the projected 25K queries/month scale); self-hosted is $0 marginal. Twilio voice without LiveKit ties cognition to TwiML, foreclosing the agent runtime. (RFC 3261)
Mid-call language	Locked at first utterance (ADR-0052)	Multi-language Deepgram on every turn; switch tool for explicit language requests	Multi-language Deepgram degraded Flemish accuracy materially (`"bezoekuren"` → `"bezukjuren"`); the switch tool was structurally broken because Deepgram emits zero transcripts on speech in the locked-out language, leaving no signal for any detector. See Language Locking.
Citation strategy	Chunk-derived fallback (no inline markers in answer)	Inline `[N]` markers as on chat	TTS reads `[1]` as `"open bracket one close bracket"`; the voice prompt strips markers, so the chat channel's marker-extractor produces empty citations. See Citation Pipeline.

Module layout

backend/app/
├── api/
│   ├── query.py                          # channel-dispatch → VoiceLLMOrchestrator
│   └── public_websocket.py               # channel-dispatch → VoiceLLMOrchestrator
├── models/
│   └── schemas.py                        # QueryRequest.channel + detected_language
├── config.py                             # voice_llm_orchestrator_enabled (no-op, ADR-0051)
├── prompts.py                            # build_voice_llm_orchestrator_system_prompt
├── services/
│   ├── rag_service.py                    # untouched; called via search_hospital_kb
│   ├── voice/
│   │   ├── voice_llm_orchestrator.py     # THE integration seam (ADR-0051)
│   │   ├── voice_thin_pre_filter.py      # classify_terminal + shared helpers
│   │   ├── voice_routing_dispatch.py     # unified voice_routing_rules dispatcher (Sprint E / Wave A)
│   │   ├── voice_answer_shaper.py        # TTS phone formatting + disclaimer prepender
│   │   ├── voice_faq_renderers.py        # DB-driven FAQ answer renderers
│   │   ├── voice_pii_redaction.py        # caller-ID pseudonymisation
│   │   ├── voice_turn_evaluator.py       # per-turn LLM-as-judge scorer (structured_call)
│   │   ├── sip_concurrency_limiter.py    # max-concurrent-calls limiter
│   │   ├── sip_rate_limiter.py           # per-caller rate limiter (Redis)
│   │   └── tenant_overlays/              # multi-tenant FAQ + STT overlay package
│   │       ├── __init__.py               # public get_overlay()
│   │       ├── loader.py                 # YAML loader + tenant resolution
│   │       ├── registry.py               # in-process LRU
│   │       └── schema.py                 # Pydantic v2 overlay schema
│   └── value_framework/                  # intent-to-category affinity rerank
│       ├── __init__.py
│       ├── affinity.py                   # apply_intent_category_affinity
│       ├── category_classifier.py        # classify_chunk_category
│       ├── telemetry.py                  # record_category_mismatch
│       └── unit_mismatch.py              # admit unit-mismatch gaps

Per-turn flow

The tool-call loop is bounded by voice_llm_orchestrator_max_tool_iterations (default 3). On overflow the orchestrator emits a fixed transfer text rather than continuing to spend tokens. A second short-circuit guards against the gibberish-rephrase loop pattern from the 2026-05-07 traffic: two consecutive search_hospital_kb calls returning found=False force-transfers to the helpdesk (voice_llm_orchestrator.py:519–556).
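The two guards described above (the iteration cap and the two-consecutive-misses escalation) can be sketched as follows. This is an illustration under stated assumptions, not the code at voice_llm_orchestrator.py: `Decision`, `run_tool_loop`, the injected `decide` and `search` callables, and the transfer wording are all hypothetical, and the transfer_to_helpdesk and end_call tools are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_TOOL_ITERATIONS = 3     # mirrors the documented default
MAX_CONSECUTIVE_MISSES = 2  # two empty searches in a row force a transfer
TRANSFER_TEXT = "I will transfer you to the helpdesk."  # placeholder wording


@dataclass
class Decision:
    tool: Optional[str] = None  # None means the model answered directly
    query: str = ""
    answer: str = ""


def run_tool_loop(
    decide: Callable[[List[dict]], Decision],
    search: Callable[[str], dict],
) -> str:
    """Run the bounded loop; only the search tool is modelled in this sketch."""
    history: List[dict] = []
    misses = 0
    for _ in range(MAX_TOOL_ITERATIONS):
        decision = decide(history)
        if decision.tool is None:
            return decision.answer
        if decision.tool != "search_hospital_kb":
            raise ValueError(f"tool not modelled in this sketch: {decision.tool}")
        result = search(decision.query)
        history.append(result)
        # Reset the miss counter on any hit; count consecutive misses otherwise.
        misses = 0 if result.get("found") else misses + 1
        if misses >= MAX_CONSECUTIVE_MISSES:
            return TRANSFER_TEXT
    return TRANSFER_TEXT  # iteration cap reached without a final answer
```

Injecting `decide` and `search` keeps the guards testable without an LLM or a database, which is the same reason the filler gate is kept as a pure module.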

Latency budget per stage

The per-turn latency budget below is taken from local dev measurement; production pilot measurement is pending. * markers indicate stages whose timings have not yet been pinned to a histogram on the pilot.

Stage	Local-dev p50	Notes
`classify_terminal` (regex)	< 1 ms	Pure-Python regex; deterministic
GPT-4.1 first-token (tool decision)	300–600 ms*	OpenAI chat completion w/ tools; bounded by network + LLM provider
`search_hospital_kb` round-trip (RAG inner loop)	600–1 200 ms*	Embedding + pgvector + BM25 + rerank + LLM stream
Safety post-filter (regex)	< 1 ms	Pure-Python regex
`VoiceAnswerShaper.shape`	5–20 ms*	Six regex transforms + medical-content detector
ElevenLabs TTS first audio chunk	200–400 ms*	ElevenLabs Multilingual v2 streaming

* Per Beyer et al. 2016 §4 (Service Level Objectives), latency SLOs should be written at p95/p99, not the mean. A pilot measurement pass (Phase 5 of the readiness plan) will replace the dev p50s above with pilot p95s. See also Nielsen 1993.
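The gap between a p50 and a p95 is easy to underestimate, so a small worked example may help. The sketch below uses the nearest-rank percentile definition on invented latency samples; neither the helper name `percentile` nor the numbers come from the pilot.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented first-token latencies: nine ordinary turns and one slow one.
    samples = [310, 320, 330, 340, 350, 360, 370, 380, 390, 1900]
    print("p50:", percentile(samples, 50))
    print("p95:", percentile(samples, 95))
```

With these samples the median sits inside the 300–600 ms band while the p95 is the single slow turn, which is the case a p95-based SLO is designed to surface.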

Voice_agent filler ladder and grace tuning

The voice_agent runs a three-tier filler ladder concurrently with the backend RAG dispatch. Each tier asks the pure voice_agent.filler_gate module: "given current state, should I fire?" The decision is small (a few boolean conditions), but errors in it have bitten production multiple times, so the history is captured here for future maintainers.

Tier model

Tier	Grace before firing	Purpose	Example phrase
Tier 1	1500 ms (was 800 ms before commit `38d7f6be`)	Natural pause bridge — masks the gap between `voice_turn_start` and the first sentence event from backend	"Een ogenblikje", "Let me check that for you"
Tier 2	4 000 ms	Acknowledge ongoing search	"Almost there, just another second."
Tier 3	10 000 ms	Unusual delay — acknowledge once, then go silent	"Almost ready, I have nearly all the information."

Each tier is cancelled the moment the first streaming sentence event arrives from backend — the gate's streamed_answer_spoken predicate flips True inside _on_streaming_chunk and subsequent tiers no-op.

Grace tuning chain (2026-05-22 → 2026-05-26)

The tier-1 grace and its gating were adjusted three times in rapid succession after the LLM-first agentic pipeline (ADR-0053) shifted the latency distribution. Each change is preserved here, alongside the original value, because the reasoning chain is non-obvious from the constants alone:

Commit	Date	Change	Reason
(pre-trust-LLM)	—	`800 ms`	Original Task 8 / 2026-05-22 spec — sat between the 600 ms human-noticed-silence floor (arXiv 2507.22352) and the ~1 s mark where the answer LLM typically produced its first complete sentence under the older two-call pattern
`38d7f6be`	2026-05-23	`800 → 1500 ms`	Pilot SIP call with `Do you speak English?` fired a spurious "Let me search that for you" filler before the actual `Yes, I speak English` response. Backend turn was 622 ms, but LiveKit + Twilio SIP transport adds 50-200 ms each way, landing `_streamed_answer_spoken=True` at ~700-1100 ms — right at the original threshold. The third clock (transport latency on the SIP path) was unaccounted for.
`a54ce8de`	2026-05-23	+ `language_switch_recent` gate	The first turn after an NL→EN language switch has a ~3.3 s gap between `voice_turn_start` and the backend WS receipt — STT/TTS plugin reload eats the wait. The 1500 ms grace expired during the reload window. Solution: skip tier-1 entirely if a language switch happened within `LANGUAGE_SWITCH_GRACE_WINDOW_S` (default 5 s). Tier-2 at 4 s still fires for genuine stalls.
`a0757271`	2026-05-26	Same-language probe no longer flags as a switch	Pilot trace `62321b74` (2026-05-26 11:46 UTC) showed Turn 1 of every call firing tier-2 at +7003 ms instead of tier-1 at +1500 ms. Root cause: the first-utterance language probe (`on_user_turn_completed:~998`) calls `_switch_language(detected)` even when `detected == current` to flip Deepgram STT from multi-mode to single-language mode. The old code set `_last_language_switch_at` unconditionally — so the `language_switch_recent` gate added in `a54ce8de` false-positived on every Turn 1, suppressing tier-1 for the entire 5 s grace window. Fix: gate the timestamp assignment inside the existing `_previous_language != target_language` block so the same-language Deepgram-mode flip no longer poses as a switch. Real cross-language switches still set the timestamp (TTS reload protection preserved). Saves ~5.5 s of perceived silence on the first turn of 100% of calls.
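The a0757271 fix comes down to where the timestamp assignment sits. A minimal sketch of that idea, assuming a hypothetical `LanguageState` holder rather than the real agent class (only LANGUAGE_SWITCH_GRACE_WINDOW_S and its 5 s default come from the table above):

```python
import time
from typing import Optional

LANGUAGE_SWITCH_GRACE_WINDOW_S = 5.0  # documented default


class LanguageState:
    """Hypothetical holder for the current language and the last real switch."""

    def __init__(self, initial: str) -> None:
        self.current = initial
        self.last_switch_at: Optional[float] = None

    def switch_language(self, target: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        previous = self.current
        self.current = target  # STT reconfiguration would happen here in either case
        # Record the timestamp only when the language actually changed.
        if previous != target:
            self.last_switch_at = now

    def language_switch_recent(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.last_switch_at is None:
            return False
        return (now - self.last_switch_at) < LANGUAGE_SWITCH_GRACE_WINDOW_S
```

Passing `now` explicitly keeps the behaviour deterministic in tests: a same-language call leaves the gate open, while a real NL→EN switch closes it for the grace window.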

Four independent skip conditions

Tier-1 fires only when ALL of these are false:

rag_task_done — backend has returned the full response (non-streaming path)
streamed_answer_spoken — first sentence already arrived; agent is already speaking
is_terminal_phrase — caller said farewell / handoff / appointment; backend short-circuits, a filler would auto-complete before the terminal response
language_switch_recent — STT/TTS reload in progress; the rag_task hasn't even reached backend yet. Important nuance: this flag is set by _switch_language only when the language actually changed (per fix a0757271, 2026-05-26). The first-utterance probe that flips Deepgram from multi-mode to single-language mode with the SAME target language does not trip this gate, so Turn 1 of every call now fires tier-1 at +1500 ms instead of tier-2 at +7000 ms.

The function lives in voice_agent/filler_gate.py as a pure module — unit-tested without LiveKit fixtures, runs instantly. Adding a fifth skip condition would touch only the gate predicate + a new test; the dispatch loop in agent.py stays untouched.
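As an illustration of why a pure gate is cheap to test, the skip conditions can be expressed as a single predicate over a small state object. The names `GateState` and `should_fire_tier1` are assumptions for this sketch; the real filler_gate signature may differ.

```python
from dataclasses import dataclass

TIER1_GRACE_MS = 1500  # current tier-1 grace from the tier table


@dataclass(frozen=True)
class GateState:
    elapsed_ms: int                       # time since voice_turn_start
    rag_task_done: bool = False           # backend already returned the full response
    streamed_answer_spoken: bool = False  # first sentence already arrived
    is_terminal_phrase: bool = False      # farewell / handoff / appointment
    language_switch_recent: bool = False  # STT/TTS reload in progress


def should_fire_tier1(state: GateState) -> bool:
    """Fire only after the grace period and only if every skip condition is false."""
    if state.elapsed_ms < TIER1_GRACE_MS:
        return False
    return not (
        state.rag_task_done
        or state.streamed_answer_spoken
        or state.is_terminal_phrase
        or state.language_switch_recent
    )
```

Under this shape, a new skip condition means one more field and one more term in the predicate, plus a test case; nothing in the dispatch loop has to change.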

Session-start cache warmup — disabled by default

HospitalVoiceAgent.warmup_cache() exists but is gated behind VOICE_WARMUP_CACHE_ENABLED (default false). When enabled, the agent fires the 4 most-common-question queries at the backend immediately after greeting, in parallel, to pre-populate the semantic cache. The intent was sub-100 ms first answers when the caller's question matched one of the warmup queries.

In practice the warmup queries serialise on the backend's OpenAI client / WS handler chain — each takes 4-7 s, sequentially — and the caller's actual first query waits 15-20 s behind the warmup batch. The voice_agent's own filler ladder then plays tier-2 and tier-3 fillers because the backend genuinely IS taking that long. The docstring at voice_agent/agent.py:warmup_cache calls this out explicitly: "caller hears 3 fillers and an offer to transfer before the agent speaks the actual answer."

The flag was re-enabled briefly during streaming-TTFT validation on 2026-05-22 and disabled again on 2026-05-23 after a pilot SIP test reproduced the exact failure mode the docstring described. The underlying serialisation has not been fixed; the flag stays at false until a backend WS / OpenAI-client concurrency investigation produces a fix.

Post-refit fixes (2026-05-26 evening)

Two more production fixes landed the same day as the a0757271 grace re-tune, on different layers but originating from the same pilot trace session. They are listed here rather than in the grace-tuning chain because neither touches the filler ladder — they belong to the wider "things that happen between STT and the LLM" surface.

Commit	Layer	Change	Reason
`eb5ebf66`	voice_agent / `_is_short_clarification`	Bare emphasis affirmations ("ja graag", "yes please", "graag", "ja alsjeblieft", "yes thank you") no longer trigger FAQ-followup context-carry. The `_EMPHASIS_ONLY_NL_EN` set in `voice_agent/agent.py:446` short-circuits the carry; the turn falls through to the LLM as a fresh utterance.	Pilot trace `92e11ea3` turn 12 (2026-05-26) showed the agent stitching the previous question ("wat zou ik dat zijn…") with the caller's bare "ja graag" into a malformed compound query. Emphasis-only affirmations carry no new content to combine with the prior turn.
`1be07148`	backend ZOL tenant overlay (`zol.yaml`)	New `public_service_numbers_lookup` rule in `voice_routing_rules` matches `112` / `1733` / `huisartsenwacht` / `noodnummer` / `spoednummer` (plus EN/FR/IT equivalents) and returns deterministic text listing the three Belgian public-service numbers (112 EU emergency, 1733 huisartsenwacht / out-of-hours GP, 089 80 80 80 ZOL spoed).	Pilot voice call `62321b74` turns 22–23 (2026-05-26) showed the agent refusing to read out `1733` and `112` as "specifiek medisch advies". These are national public-service numbers — refusing them is unsafe. The rule fires inside `voice_routing_dispatch.dispatch()` BEFORE the LLM sees the query. Regular doctor / department phone lookups still flow through RAG.

The second fix sits on voice_routing_dispatch.py — the unified Sprint-E rule dispatcher whose schema (crisis → emergency → pii_refusal → identity → clarification → symptom_triage → out_of_scope_redirect → faq) is loaded from _yaml/_defaults.yaml plus per-tenant overlays. Any rule whose pattern matches short-circuits the LLM with the rule's response_text; the FALLTHROUGH set is what reaches the GPT-4.1 tool loop described above.
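The priority-ordered short-circuit can be sketched as below. The category order is taken from the schema above; everything else is hypothetical, including the `Rule` tuple, the example patterns, the response texts, and the choice to file the public-service-number rule under `faq`.

```python
import re
from typing import List, NamedTuple, Optional

# Category order as listed in the rule schema above.
PRIORITY = [
    "crisis", "emergency", "pii_refusal", "identity",
    "clarification", "symptom_triage", "out_of_scope_redirect", "faq",
]


class Rule(NamedTuple):
    category: str
    pattern: "re.Pattern[str]"
    response_text: str


def dispatch(query: str, rules: List[Rule]) -> Optional[str]:
    """Return the first matching rule's text in priority order, or None to fall through."""
    for rule in sorted(rules, key=lambda r: PRIORITY.index(r.category)):
        if rule.pattern.search(query):
            return rule.response_text
    return None  # fallthrough: the query goes on to the GPT-4.1 tool loop


# Example rules for illustration only; real rules live in YAML overlays.
EXAMPLE_RULES = [
    Rule("faq", re.compile(r"\b(112|1733|huisartsenwacht)\b", re.IGNORECASE),
         "Public-service numbers: 112, 1733, 089 80 80 80."),
    Rule("emergency", re.compile(r"\bchest pain\b", re.IGNORECASE),
         "Please call 112 immediately."),
]
```

Sorting by category index makes the outcome independent of the order in which overlays were loaded, so a tenant rule cannot accidentally outrank a higher-priority default.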

Cross-component sequence (voice_agent ↔ backend)

Two views of the same conversation are maintained side by side. The summary view is the right diagram when the audience needs to see the orchestrator–RAG split and the role of the Value Framework. The full view is the right diagram when the audience needs to see every external SaaS call, every short-circuit before the LLM, the tool loop, the streaming-audio-return pattern, and the post-turn evaluator — i.e. the operational reality of a single SIP call from dial to BYE.

Summary view — 6 participants

The Value Framework is a child of RAGService.query_stream, not a peer of the orchestrator. The orchestrator never calls apply_intent_category_affinity directly. This separation matters because the chat channel benefits from the same affinity rerank without requiring orchestrator changes.

Full view — 12 participants, dial through BYE

This view expands the summary in five ways. (1) Two short-circuits before the LLM — classify_terminal and the unified voice_routing_dispatch (rules: crisis → emergency → pii_refusal → identity → clarification → symptom_triage → out_of_scope_redirect → faq, including the public_service_numbers_lookup rule added by 1be07148 for 112 / 1733 / spoed) — together handle a large fraction of real-world traffic with zero LLM cost. (2) The GPT-4.1 tool loop is the actual control flow of agentic-only voice (max 3 iterations, with a 2-consecutive-empty-search escalation guard). (3) The RAG-internal sub-pipeline reached via the search_hospital_kb tool — semantic cache → intent classification → hybrid retrieval (pgvector + BM25 + taxonomy) → Stage 5b VF affinity rerank → Stage 5c doctor-list injection → context assembly → response generation — is drawn explicitly. (4) The external SaaS calls (Deepgram, OpenAI, ElevenLabs) get their own lanes because they dominate the latency budget. (5) The concurrent filler ladder that masks backend latency on the caller side gets its own par block.

Honesty caveats on the full view. The diagram simplifies real behaviour in three places that a future maintainer should be aware of:

The par block for the filler ladder depicts three concurrent timers, which is correct, but a real call only fires the tier(s) whose timeout expires before streamed_answer_spoken flips. In practice you'll see one filler fire (or none if the backend is fast enough), not all three. The diagram shows concurrent dispatch, not typical observed behaviour.
Block F (post-turn evaluator) is drawn as a sequential block after E for visual clarity, but in practice it runs as a fire-and-forget asyncio task — caller-perceived latency does not include it.
The DB lane rolls up four very different stores: pgvector (semantic similarity), BM25 tsvector (lexical), taxonomy (relational), and telemetry tables (conversation_messages, category_mismatch_telemetry, voice_turn_evaluations). If a presentation needs to make the storage diversity explicit, split this lane.

Where voice differs from chat

Concern	Chat channel	Voice channel
Citations	Marker-based extractor (`[1]`, `[2]`, …)	Chunk-derived fallback — voice prompt strips inline markers (un-speakable); `_qs_finalize` derives citations from retrieved chunks directly
Language	Per-query detection, can switch any turn	Locked at first STT-confirmed utterance for the duration of the call (ADR-0052)
LLM model	`gpt-4.1` (quality-first)	`gpt-4.1` with tool-use (latency-conscious; max 3 tool iterations)
Follow-up suggestions	Rendered in widget	Suppressed — no UI to display them; voice shaper strips them
Markdown	Rendered by widget	Stripped — ElevenLabs speaks raw text
Answer length	Up to 500 tokens	Enforced ≤ 2 sentences by system prompt + `VoiceAnswerShaper.max_sentences=2`
Error routing	HTTP 4xx / chat error message	`transfer_to_helpdesk` tool call → SIP REFER
STT	Not applicable	Deepgram Nova-3, Dutch-primary, language locked at first utterance
Disclaimer	Shown inline	Spoken; auto-detected from answer text by `_detect_medical_content_in_answer()` (Wave 2.C-tail D2)
Structured-output safety	`structured_call` thin helper (~190 LOC over `AsyncOpenAI`) at 8 LLM call sites (turn evaluator, intent classifier, etc.). Replaced Pydantic AI on 2026-05-12 — see Decision-Cost Rubric	Same; the change applies channel-uniformly
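Several rows of the table (citations, Markdown, answer length) describe text transforms applied before TTS. The sketch below shows three such transforms; it is a simplified stand-in, not VoiceAnswerShaper, and the regexes and the function name `shape_for_voice` are assumptions.

```python
import re

_CITATION = re.compile(r"\s*\[\d+\]")           # inline markers such as [1]
_MARKDOWN = re.compile(r"[*_`#]+")              # emphasis, code ticks, heading marks
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary


def shape_for_voice(answer: str, max_sentences: int = 2) -> str:
    """Drop unspeakable markup and cap the answer at max_sentences sentences."""
    text = _CITATION.sub("", answer)
    text = _MARKDOWN.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
    sentences = _SENTENCE_SPLIT.split(text)
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    raw = "**Visiting hours** are 14:00 to 20:00 [1]. Parking is free [2]. Ask at the desk."
    print(shape_for_voice(raw))
```

Removing the markers here is what makes the chat channel's marker-based extractor return nothing on voice, hence the chunk-derived citation fallback.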

Feature-flag topology (current)

Setting	Default	Status
`voice_llm_orchestrator_enabled`	`true`	No-op — kept only so existing `.env` files don't trip `extra="forbid"`. Will be removed in follow-up cleanup per ADR-0051.
`voice_disclaimer_enabled`	`true`	Spoken medical disclaimer prepended on detected medical content (post-LLM, post-shape).
`voice_stt_ambiguity_guardrail_enabled`	`true`	Dangling. The `stt_ambiguity_guardrail.py` module was deleted in `158d793`; safety-refusal is now always-on inside `classify_terminal()`. The setting is read but has no consumer. Removal pending.
`voice_escalation_confidence_threshold`	`0.65`	Dangling. Consumer `voice_safety_gate.py` was deleted in `158d793`. Removal pending.
`voice_conversational_intent_llm_model`	`"gpt-4.1-nano"`	Dangling. Consumer `conversational_intent_resolver.py` was deleted in `158d793`. Removal pending.
`voice_llm_orchestrator_max_tool_iterations`	`3`	Tool-call iteration cap. On overflow the orchestrator emits a fixed transfer text.

The three dangling settings, together with the no-op voice_llm_orchestrator_enabled, form a four-item cleanup batch: their .env.example entries, Settings fields, and any docs referencing them as live should be removed in a follow-up sprint. The settings persist for now because removing them mid-deploy would trip extra="forbid" on existing operator .env files.

Per-channel LLM fallback

The llm_fallback_chain utility is still registered but the agentic path goes directly to GPT-4.1. On any tool/LLM exception, VoiceLLMOrchestrator returns conversational_intent="escalate" and the voice_agent SIP-transfers the caller to the ZOL helpdesk. There is no "fall back to thin" path — ADR-0051 removed it.

Structured-output safety

Eight LLM call sites, including IntentClassificationService.classify_intent, ConversationClassifierService, and VoiceTurnEvaluator, use a thin structured_call(prompt, output_model) helper that wraps the OpenAI client with response_format={"type": "json_object"}, validates via pydantic.BaseModel.model_validate_json(), and retries once on a ValidationError before raising a typed fallback. Malformed JSON never reaches downstream code, and the defensive json.loads try/except blocks that older call sites carried are gone. VoiceTurnEvaluator (the per-turn LLM-as-judge scorer) depends on this contract: a malformed score would otherwise silently degrade the diagnostic.
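The validate-then-retry-once contract can be shown without any external dependency. In the sketch below a hand-written validator stands in for the Pydantic model and an injected `complete` callable stands in for the OpenAI client, so the signature differs from the real helper; `TurnScore` and `StructuredCallError` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class StructuredCallError(Exception):
    """Raised when the model output still fails validation after the retry."""


@dataclass
class TurnScore:
    score: int
    reason: str

    @classmethod
    def validate_json(cls, raw: str) -> "TurnScore":
        data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        score, reason = data.get("score"), data.get("reason")
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError("score must be an integer")
        if not isinstance(reason, str):
            raise ValueError("reason must be a string")
        return cls(score=score, reason=reason)


def structured_call(
    prompt: str,
    validate: Callable[[str], T],
    complete: Callable[[str], str],
    retries: int = 1,
) -> T:
    """Call the model, validate the reply, and retry once on a validation failure."""
    last_error: Exception = ValueError("no attempt made")
    for _ in range(retries + 1):
        raw = complete(prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
    raise StructuredCallError(str(last_error)) from last_error
```

Callers then handle exactly one exception type, which is what allows the per-site json.loads guards to be removed.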

Voice ops infrastructure

A two-day calibration sprint ending 2026-05-23 hardened the voice path against the failure modes that surfaced in eight weeks of pilot SIP testing. Five changes shipped as a bundle:

Change	File	Reason
Rule 4.5 — no repeated clarifications	`prompts.py` system prompt	Voice eval showed agents asking the same clarification 3 times across a single conversation; rule forbids it and commits to a search-or-handoff after one clarification
Temperature 0.3 → 0.0	`config.py:voice_llm_orchestrator_temperature`	Removes per-turn answer variance; the agent's job is to commit to a single grounded response, not to sample creatively
Rule 6.5 — list what the corpus lists	`prompts.py` system prompt	Procedure questions ("How does an MRI work?") now surface the corpus's actual content, not generic answers; turns procedure questions into search hits rather than reflexive disclaimers
Tier-0 ack removed + tier-1 session rate limit	`voice_agent/agent.py`, `filler_gate.py`	Pilot SIP calls accumulated "mhm" acks that read as nervous; tier-0 disabled, tier-1 capped at 1 per 3 turns
STT phonetic-recovery sweep (80 Belgian-Dutch medical terms)	`intent_classification_service.py:_STT_NORMALIZATIONS` + wired into `voice_llm_orchestrator.query_stream`	Dutch + Limburgs dialect STT mishearings (elektrocaduwraam → elektrocardiogram, kolonoskopi → colonoscopie, polismografie → polysomnografie, …) recovered before intent classification
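The phonetic-recovery sweep in the last row is, in essence, a whole-word lookup applied to the transcript before intent classification. A minimal sketch using only the three pairs quoted above; the function name `normalize_transcript` and the case-insensitive whole-word matching are assumptions about how _STT_NORMALIZATIONS is applied.

```python
import re

# Only the three mishearings quoted in the table; the real map holds about 80 terms.
_STT_NORMALIZATIONS = {
    "elektrocaduwraam": "elektrocardiogram",
    "kolonoskopi": "colonoscopie",
    "polismografie": "polysomnografie",
}

# One alternation over all keys, matched as whole words and case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, _STT_NORMALIZATIONS)) + r")\b",
    re.IGNORECASE,
)


def normalize_transcript(text: str) -> str:
    """Replace known STT mishearings with the intended medical term."""
    return _PATTERN.sub(lambda m: _STT_NORMALIZATIONS[m.group(1).lower()], text)


if __name__ == "__main__":
    print(normalize_transcript("Ik kom morgen voor een kolonoskopi"))
```

Compiling a single alternation keeps the sweep to one pass over the transcript regardless of how many terms the map holds.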

The bundle was verified against a 10-persona / 89-turn voice eval (Claude-as-judge): 88/89 turns at production quality. The one outlier (persona_03/T3 cancer-staging interpretation) was investigated and proven empirically (0/8 reproductions across replay + live re-runs) to be OpenAI temp=0 token-level non-determinism, not a deterministic prompt deficiency.

Diagnostic toolchain — trace, replay, SLO

A three-tool diagnostic chain landed in 3bda7f00 and is documented in backend/scripts/VOICE_OPERATOR_RUNBOOK.md. The discipline is enforcement-by-visible-artifact: do not patch a voice prompt rule without using these tools first.

Tool	Question it answers	Where to run
`voice_trace.py <conv_id>`	What actually happened on each turn? Events, telemetry, computed latencies.	`docker exec zol-app` on pilot
`voice_replay.py <conv_id> --turn N --runs M`	Does the current code produce a different answer on this exact input? Mocks RAG with the original payload so the LLM's DECISION is what's tested.	`docker exec zol-app` on pilot
`voice_slo_report.py --since 24h`	Is the modal call meeting time-to-first-audio / clarification-rate / filler-rate targets?	`docker exec zol-app` on pilot

The first production use of this chain (2026-05-23) caught a phantom safety bug — a "no cancer staging" prompt rule that would have shipped on a single-eval-sample signal had the tools not been available. Replay refused 3/3 against the same input; five live re-runs refused 5/5. The proposed rule was withdrawn. See SLO Discipline First Win and the baseline at backend/docs/slo-baseline-2026-05-23.md for the artifacts.

References

backend/app/services/voice/voice_llm_orchestrator.py — the authoritative implementation
voice_agent/agent.py, voice_agent/filler_gate.py — the filler ladder + grace gates
ADR-0051: Agentic VoiceLLMOrchestrator is the Only Voice Path (2026-05-07)
ADR-0049: Thin Voice Architecture (2026-04-30) — superseded stepping stone; documents the 8-stage → 3-stage simplification rationale
ADR-0050: Twilio + LiveKit SIP Integration
ADR-0052: Voice Language Locked at First Utterance
ADR-0053: LLM-First Agentic Voice Pipeline (2026-05-22) — native streaming-with-tools dispatch pattern
Lewis et al. 2020 — original RAG architecture; the orchestrator's search_hospital_kb tool is the agentic equivalent of the RAG retriever stage
LiveKit Agents Documentation — the runtime that hosts voice_agent
Deepgram Nova-3 — production STT model
ElevenLabs Multilingual v2 — production TTS model
backend/scripts/VOICE_OPERATOR_RUNBOOK.md — trace/replay/SLO diagnostic toolchain (committed 2026-05-23 in 3bda7f00)
backend/docs/slo-baseline-2026-05-23.md — reference SLO snapshot for drift detection
Decision-Cost Rubric — pydantic-ai removal case study + SLO discipline first win
