
Voice Channel — Architecture

The voice channel is the production telephony surface for the ZOL intelligent search system. A caller dialling the Belgian PSTN number reaches a self-hosted SIP gateway, which bridges into a LiveKit room (@livekit_agents_docs) where a voice_agent worker runs Deepgram Nova-3 streaming ASR (@deepgram_nova3) on the inbound audio and ElevenLabs Multilingual v2 (@elevenlabs_multilingual_v2) on the outbound. Per-turn cognition is delegated over a WebSocket to the backend, where VoiceLLMOrchestrator runs the regex pre-filter → GPT-4.1 tool-call loop → safety post-filter → answer-shaper sequence described below.

This page documents the composition principle, the module layout, the per-turn flow, the cross-component sequence, and the architectural trade-offs that shaped the current design.

The composition principle

The voice channel is agentic-only: a single GPT-4.1 agent with three tools is the cognition layer. ADR-0049 and ADR-0051 retired the legacy 8-stage VoiceOrchestrator (deleted in commit 158d793, ~7,000 LOC removed) in favour of an LLM agent that calls the existing RAGService.query_stream directly via a search_hospital_kb tool, with channel="voice" set on the request.

ADR-0053 (2026-05-22) further established native OpenAI streaming-with-tools as the dispatch pattern: each tool-decision iteration is a single chat.completions.create(stream=True, tools=_TOOLS, tool_choice="auto") call rather than the earlier two-call pattern (non-streaming tool decision → separate streaming final). For direct-response queries ("Do you speak English?", "Wat zijn jullie openingsuren?"), this halves OpenAI round-trips: ONE call from query to last token. For tool-using queries, the loop is N+1 calls (one per iteration, one for the final synthesis), all streamed. Both cases produce a chunk event stream that the voice_agent transforms into sentence-grain session.say() invocations via a regex-bounded sentence buffer.
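The sentence buffer mentioned above can be sketched as follows. This is a minimal illustration, not the voice_agent's actual code: the function name `sentences_from_chunks` and the boundary regex are assumptions, and the real buffer may handle abbreviations and numbers differently.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary: '.', '!' or '?' followed by whitespace.
# A naive rule like this would also split after "Dr. "; illustration only.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as its boundary appears in the stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last ends at a confirmed boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be mid-sentence; keep it buffered.
        buffer = parts[-1]
    # Flush the remainder once the stream closes.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would then be handed to one session.say() call, so speech can start before the LLM has finished streaming.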

voice_agent (livekit-agents)
  → backend WS /ws/public-query
    → VoiceLLMOrchestrator
      1. regex pre-filter (classify_terminal)
      2. GPT-4.1 tool loop (max 3 iterations):
           search_hospital_kb → RAGService.query_stream (channel="voice")
           transfer_to_helpdesk
           end_call
      3. safety post-filter (regex on LLM output)
      4. VoiceAnswerShaper (TTS phone formatting)
      5. medical-disclaimer prepender (if Stage 4 detects medical content)

RAGService.query_stream is unchanged from the chat channel. Setting channel="voice" activates voice-specific behaviours inside the RAG pipeline: the Value Framework affinity rerank (@cormack2009rrf is the rank-fusion lineage; we extend it with a categorical-affinity multiplier — see Value Framework), citation derivation from chunks rather than inline [N] markers (see Citation Pipeline), and a voice-shaped LLM system prompt from app.prompts.build_voice_llm_orchestrator_system_prompt.

Architectural trade-offs

Four foundational decisions define the voice channel's shape; each is captured in an ADR with the alternatives that were considered and rejected.

| Decision | Chosen | Alternatives considered | Rejected because |
| --- | --- | --- | --- |
| Cognition topology | Agentic LLM with tools (ADR-0051) | 8-stage deterministic pipeline (Phase A VoiceOrchestrator); thin pipeline (regex pre-filter → FAQ → RAG, no LLM agent) | The 8-stage pipeline accumulated ~7 000 LOC, six dangling feature flags, and a sub-5% cache hit rate on speculative-STT; the thin pipeline could not handle compound queries ("which doctor at cardiology AND what are the visiting hours") without re-introducing intent state. The agentic LLM lets the model decide tool dispatch on a per-turn basis without persisted dialogue state. |
| Telephony stack | Self-hosted Twilio SIP + LiveKit (ADR-0050) | LiveKit Cloud SIP managed gateway; Twilio voice with no LiveKit | LiveKit Cloud SIP is $0.30–0.50/participant-hour ($375–625/month at the projected 25K queries/month scale); self-hosted is $0 marginal. Twilio voice without LiveKit ties cognition to TwiML, foreclosing the agent runtime. (RFC 3261) |
| Mid-call language | Locked at first utterance (ADR-0052) | Multi-language Deepgram on every turn; switch tool for explicit language requests | Multi-language Deepgram degraded Flemish accuracy materially ("bezoekuren" → "bezukjuren"); the switch tool was structurally broken because Deepgram emits zero transcripts on speech in the locked-out language, leaving no signal for any detector. See Language Locking. |
| Citation strategy | Chunk-derived fallback (no inline markers in answer) | Inline [N] markers as on chat | TTS reads [1] as "open bracket one close bracket"; the voice prompt strips markers, so the chat channel's marker-extractor produces empty citations. See Citation Pipeline. |
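The telephony cost figures can be sanity-checked with a little arithmetic. The sketch below backs out the participant-hours implied by the quoted rates and monthly totals; the per-query duration it derives is an inference from those numbers, not a figure stated in ADR-0050.

```python
QUERIES_PER_MONTH = 25_000

# Quoted LiveKit Cloud SIP rates (USD per participant-hour) and monthly totals (USD).
RATE_LOW, RATE_HIGH = 0.30, 0.50
COST_LOW, COST_HIGH = 375.0, 625.0

# Participant-hours per month implied by each end of the range.
hours_low = COST_LOW / RATE_LOW
hours_high = COST_HIGH / RATE_HIGH

# Implied participant-minutes per query at the projected volume.
minutes_per_query = hours_low * 60 / QUERIES_PER_MONTH

print(f"{hours_low:.0f} h, {hours_high:.0f} h, {minutes_per_query:.1f} min/query")
```

Both ends of the range imply the same 1 250 participant-hours per month, which works out to about 3 participant-minutes per query.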

Module layout

backend/app/
├── api/
│ ├── query.py # channel-dispatch → VoiceLLMOrchestrator
│ └── public_websocket.py # channel-dispatch → VoiceLLMOrchestrator
├── models/
│ └── schemas.py # QueryRequest.channel + detected_language
├── config.py # voice_llm_orchestrator_enabled (no-op, ADR-0051)
├── prompts.py # build_voice_llm_orchestrator_system_prompt
├── services/
│ ├── rag_service.py # untouched; called via search_hospital_kb
│ └── voice/
│ ├── voice_llm_orchestrator.py # THE integration seam (ADR-0051)
│ ├── voice_thin_pre_filter.py # classify_terminal + shared helpers
│ ├── voice_routing_dispatch.py # unified voice_routing_rules dispatcher (Sprint E / Wave A)
│ ├── voice_answer_shaper.py # TTS phone formatting + disclaimer prepender
│ ├── voice_faq_renderers.py # DB-driven FAQ answer renderers
│ ├── voice_pii_redaction.py # caller-ID pseudonymisation
│ ├── voice_turn_evaluator.py # per-turn LLM-as-judge scorer (structured_call)
│ ├── sip_concurrency_limiter.py # max-concurrent-calls limiter
│ ├── sip_rate_limiter.py # per-caller rate limiter (Redis)
│ └── tenant_overlays/ # multi-tenant FAQ + STT overlay package
│ ├── __init__.py # public get_overlay()
│ ├── loader.py # YAML loader + tenant resolution
│ ├── registry.py # in-process LRU
│ └── schema.py # Pydantic v2 overlay schema
│ └── value_framework/ # intent-to-category affinity rerank
│ ├── __init__.py
│ ├── affinity.py # apply_intent_category_affinity
│ ├── category_classifier.py # classify_chunk_category
│ ├── telemetry.py # record_category_mismatch
│ └── unit_mismatch.py # admit unit-mismatch gaps

Per-turn flow

The tool-call loop is bounded by voice_llm_orchestrator_max_tool_iterations (default 3). On overflow the orchestrator emits a fixed transfer text rather than continuing to spend tokens. A second short-circuit guards against the gibberish-rephrase loop pattern from the 2026-05-07 traffic: two consecutive search_hospital_kb calls returning found=False force-transfers to the helpdesk (voice_llm_orchestrator.py:519–556).
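The two guards can be sketched as a small loop. This is a simplified model under stated assumptions: `decide` stands in for one streamed GPT-4.1 tool-decision call, `search` stands in for search_hospital_kb, and `TRANSFER_TEXT` is placeholder wording; none of these names are taken from voice_llm_orchestrator.py.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_TOOL_ITERATIONS = 3      # documented default of the iteration cap
MAX_CONSECUTIVE_EMPTY = 2    # two found=False searches in a row
TRANSFER_TEXT = "I will transfer you to the helpdesk."  # placeholder wording


@dataclass
class SearchResult:
    found: bool
    answer: str = ""


History = List[Tuple[str, SearchResult]]


def run_turn(
    decide: Callable[[History], Tuple[str, str]],
    search: Callable[[str], SearchResult],
) -> str:
    """Run one turn; `decide` returns ("answer", text) or ("search", query)."""
    history: History = []
    consecutive_empty = 0
    for _ in range(MAX_TOOL_ITERATIONS):
        action, payload = decide(history)
        if action == "answer":
            return payload
        result = search(payload)
        history.append((payload, result))
        # A hit resets the streak; a miss extends it.
        consecutive_empty = 0 if result.found else consecutive_empty + 1
        if consecutive_empty >= MAX_CONSECUTIVE_EMPTY:
            return TRANSFER_TEXT  # gibberish-rephrase guard
    return TRANSFER_TEXT  # iteration cap reached without a final answer
```

Keeping both guards inside one loop means the empty-search guard can fire before the iteration cap, which is the point of having it as a second short-circuit.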

Latency budget per stage

The per-turn latency budget below is taken from local dev measurement; production pilot measurement is pending. * markers indicate stages whose timings have not yet been pinned to a histogram on the pilot.

| Stage | Local-dev p50 | Notes |
| --- | --- | --- |
| classify_terminal (regex) | < 1 ms | Pure-Python regex; deterministic |
| GPT-4.1 first-token (tool decision) | 300–600 ms* | OpenAI chat completion w/ tools; bounded by network + LLM provider |
| search_hospital_kb round-trip (RAG inner loop) | 600–1 200 ms* | Embedding + pgvector + BM25 + rerank + LLM stream |
| Safety post-filter (regex) | < 1 ms | Pure-Python regex |
| VoiceAnswerShaper.shape | 5–20 ms* | Six regex transforms + medical-content detector |
| ElevenLabs TTS first audio chunk | 200–400 ms* | ElevenLabs Multilingual v2 streaming |

* Per Beyer et al. 2016 §4 (Service Level Objectives), latency SLOs should be written at p95/p99, not the mean. A pilot measurement pass (Phase 5 of the readiness plan) will replace the dev p50s above with pilot p95s (see also Nielsen 1993).
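To make the p50 versus p95/p99 distinction concrete, the sketch below computes all three from a list of per-turn latencies using only the standard library. The function name and the sample data are illustrative assumptions; this is not how voice_slo_report.py is implemented.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative data: most turns are fast, a small tail is very slow.
samples = [400.0] * 95 + [3000.0] * 5
report = latency_percentiles(samples)
print(report, "mean:", statistics.mean(samples))
```

With a heavy tail like this, the p50 hides the slow turns entirely while the p99 exposes them, which is why a pilot SLO written only against the median would miss what callers experience on bad turns.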

Voice_agent filler ladder and grace tuning

The voice_agent runs a three-tier filler ladder concurrently with the backend RAG dispatch. Each tier asks the pure voice_agent.filler_gate module: "given current state, should I fire?" The decision is small (a few boolean conditions) but its correctness has bitten production multiple times — captured here for future maintainers.

Tier model

| Tier | Grace before firing | Purpose | Example phrase |
| --- | --- | --- | --- |
| Tier 1 | 1500 ms (was 800 ms before commit 38d7f6be) | Natural pause bridge: masks the gap between voice_turn_start and the first sentence event from backend | "Een ogenblikje", "Let me check that for you" |
| Tier 2 | 4 000 ms | Acknowledge ongoing search | "Almost there, just another second." |
| Tier 3 | 10 000 ms | Unusual delay: acknowledge once, then go silent | "Almost ready, I have nearly all the information." |

Each tier is cancelled the moment the first streaming sentence event arrives from backend — the gate's streamed_answer_spoken predicate flips True inside _on_streaming_chunk and subsequent tiers no-op.

Grace tuning chain (2026-05-22 → 2026-05-26)

The tier-1 grace logic was retuned three times in rapid succession after the LLM-first agentic pipeline (ADR-0053) shifted the latency distribution. Each step, including the original 800 ms spec, is preserved here because the reasoning chain is non-obvious from the constants alone:

| Commit | Date | Change | Reason |
| --- | --- | --- | --- |
| (pre-trust-LLM) | | 800 ms | Original Task 8 / 2026-05-22 spec: sat between the 600 ms human-noticed-silence floor (arXiv 2507.22352) and the ~1 s mark where the answer LLM typically produced its first complete sentence under the older two-call pattern |
| 38d7f6be | 2026-05-23 | 800 → 1500 ms | Pilot SIP call with Do you speak English? fired a spurious "Let me search that for you" filler before the actual Yes, I speak English response. Backend turn was 622 ms, but LiveKit + Twilio SIP transport adds 50-200 ms each way, landing _streamed_answer_spoken=True at ~700-1100 ms, right at the original threshold. The third clock (transport latency on the SIP path) was unaccounted for. |
| a54ce8de | 2026-05-23 | + language_switch_recent gate | The first turn after an NL→EN language switch has a ~3.3 s gap between voice_turn_start and the backend WS receipt: STT/TTS plugin reload eats the wait. The 1500 ms grace expired during the reload window. Solution: skip tier-1 entirely if a language switch happened within LANGUAGE_SWITCH_GRACE_WINDOW_S (default 5 s). Tier-2 at 4 s still fires for genuine stalls. |
| a0757271 | 2026-05-26 | Same-language probe no longer flags as a switch | Pilot trace 62321b74 (2026-05-26 11:46 UTC) showed Turn 1 of every call firing tier-2 at +7003 ms instead of tier-1 at +1500 ms. Root cause: the first-utterance language probe (on_user_turn_completed:~998) calls _switch_language(detected) even when detected == current to flip Deepgram STT from multi-mode to single-language mode. The old code set _last_language_switch_at unconditionally, so the language_switch_recent gate added in a54ce8de false-positived on every Turn 1, suppressing tier-1 for the entire 5 s grace window. Fix: gate the timestamp assignment inside the existing _previous_language != target_language block so the same-language Deepgram-mode flip no longer poses as a switch. Real cross-language switches still set the timestamp (TTS reload protection preserved). Saves ~5.5 s of perceived silence on the first turn of 100% of calls. |
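The a0757271 fix comes down to where the timestamp assignment sits. The sketch below reproduces that shape in isolation; the class name, attribute names, and the injectable `now` argument are assumptions for illustration, and only LANGUAGE_SWITCH_GRACE_WINDOW_S and its 5 s default come from this page.

```python
import time
from typing import Optional

LANGUAGE_SWITCH_GRACE_WINDOW_S = 5.0  # documented default


class LanguageState:
    """Tracks the active language and when it last really changed."""

    def __init__(self, language: str) -> None:
        self.language = language
        self.last_switch_at: Optional[float] = None

    def switch_language(self, target: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        previous = self.language
        self.language = target  # the STT mode flip would happen here regardless
        # Only a real cross-language change stamps the clock, so a
        # same-language probe does not pose as a switch.
        if previous != target:
            self.last_switch_at = now

    def language_switch_recent(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.last_switch_at is None:
            return False
        return (now - self.last_switch_at) < LANGUAGE_SWITCH_GRACE_WINDOW_S
```

Passing `now` explicitly keeps the logic testable without sleeping, in the same spirit as the pure filler_gate module.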

Four independent skip conditions

Tier-1 fires only when ALL of these are false:

  1. rag_task_done — backend has returned the full response (non-streaming path)
  2. streamed_answer_spoken — first sentence already arrived; agent is already speaking
  3. is_terminal_phrase — caller said farewell / handoff / appointment; backend short-circuits, a filler would auto-complete before the terminal response
  4. language_switch_recent — STT/TTS reload in progress; the rag_task hasn't even reached backend yet. Important nuance: this flag is set by _switch_language only when the language actually changed (per fix a0757271, 2026-05-26). The first-utterance probe that flips Deepgram from multi-mode to single-language mode with the SAME target language does not trip this gate, so Turn 1 of every call now fires tier-1 at +1500 ms instead of tier-2 at +7000 ms.

The function lives in voice_agent/filler_gate.py as a pure module — unit-tested without LiveKit fixtures, runs instantly. Adding a fifth skip condition would touch only the gate predicate + a new test; the dispatch loop in agent.py stays untouched.
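A pure gate of this kind might look like the sketch below. The four field names mirror the skip conditions listed above; the `GateState` dataclass and the `should_fire_tier1` function name are assumptions, not the actual contents of voice_agent/filler_gate.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateState:
    """Snapshot of the conditions the tier-1 gate inspects."""

    rag_task_done: bool = False
    streamed_answer_spoken: bool = False
    is_terminal_phrase: bool = False
    language_switch_recent: bool = False


def should_fire_tier1(state: GateState) -> bool:
    """Fire the tier-1 filler only when every skip condition is false."""
    return not (
        state.rag_task_done
        or state.streamed_answer_spoken
        or state.is_terminal_phrase
        or state.language_switch_recent
    )
```

Because the gate takes a plain value object and returns a bool, each skip condition can be unit-tested in isolation, and a new condition is one extra field plus one extra clause.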

Session-start cache warmup — disabled by default

HospitalVoiceAgent.warmup_cache() exists but is gated behind VOICE_WARMUP_CACHE_ENABLED (default false). When enabled, the agent fires the 4 most-common-question queries at the backend immediately after greeting, in parallel, to pre-populate the semantic cache. The intent was sub-100 ms first answers when the caller's question matched one of the warmup queries.

In practice the warmup queries serialise on the backend's OpenAI client / WS handler chain — each takes 4-7 s, sequentially — and the caller's actual first query waits 15-20 s behind the warmup batch. The voice_agent's own filler ladder then plays tier-2 and tier-3 fillers because the backend genuinely IS taking that long. The docstring at voice_agent/agent.py:warmup_cache calls this out explicitly: "caller hears 3 fillers and an offer to transfer before the agent speaks the actual answer."

The flag was re-enabled briefly during streaming-TTFT validation on 2026-05-22 and disabled again on 2026-05-23 after a pilot SIP test reproduced the exact failure mode the docstring described. The underlying serialisation has not been fixed; the flag stays at false until a backend WS / OpenAI-client concurrency investigation produces a fix.

Post-refit fixes (2026-05-26 evening)

Two more production fixes landed the same day as the a0757271 grace re-tune, on different layers but originating from the same pilot trace session. They are listed here rather than in the grace-tuning chain because neither touches the filler ladder — they belong to the wider "things that happen between STT and the LLM" surface.

| Commit | Layer | Change | Reason |
| --- | --- | --- | --- |
| eb5ebf66 | voice_agent / _is_short_clarification | Bare emphasis affirmations ("ja graag", "yes please", "graag", "ja alsjeblieft", "yes thank you") no longer trigger FAQ-followup context-carry. The _EMPHASIS_ONLY_NL_EN set in voice_agent/agent.py:446 short-circuits the carry; the turn falls through to the LLM as a fresh utterance. | Pilot trace 92e11ea3 turn 12 (2026-05-26) showed the agent stitching the previous question ("wat zou ik dat zijn…") with the caller's bare "ja graag" into a malformed compound query. Emphasis-only affirmations carry no new content to combine with the prior turn. |
| 1be07148 | backend ZOL tenant overlay (zol.yaml) | New public_service_numbers_lookup rule in voice_routing_rules matches 112 / 1733 / huisartsenwacht / noodnummer / spoednummer (plus EN/FR/IT equivalents) and returns deterministic text listing the three Belgian public-service numbers (112 EU emergency, 1733 huisartsenwacht / out-of-hours GP, 089 80 80 80 ZOL spoed). | Pilot voice call 62321b74 turns 22–23 (2026-05-26) showed the agent refusing to read out 1733 and 112 as "specifiek medisch advies". These are national public-service numbers, so refusing them is unsafe. The rule fires inside voice_routing_dispatch.dispatch() BEFORE the LLM sees the query. Regular doctor / department phone lookups still flow through RAG. |

The second fix sits on voice_routing_dispatch.py — the unified Sprint-E rule dispatcher whose schema (crisis → emergency → pii_refusal → identity → clarification → symptom_triage → out_of_scope_redirect → faq) is loaded from _yaml/_defaults.yaml plus per-tenant overlays. Any rule whose pattern matches short-circuits the LLM with the rule's response_text; the FALLTHROUGH set is what reaches the GPT-4.1 tool loop described above.
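The first-match-wins behaviour of such a dispatcher can be sketched as follows. The category order is the one listed above; the `Rule` dataclass, the two example rules, their patterns and response wording, and the placement of public_service_numbers_lookup under the faq category are all assumptions for illustration, not the contents of _defaults.yaml or zol.yaml.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

CATEGORY_ORDER = (
    "crisis", "emergency", "pii_refusal", "identity",
    "clarification", "symptom_triage", "out_of_scope_redirect", "faq",
)


@dataclass(frozen=True)
class Rule:
    name: str
    category: str
    pattern: re.Pattern
    response_text: str


def dispatch(query: str, rules: Sequence[Rule]) -> Optional[Rule]:
    """Return the first matching rule in category order, or None (fallthrough)."""
    ordered = sorted(rules, key=lambda rule: CATEGORY_ORDER.index(rule.category))
    for rule in ordered:
        if rule.pattern.search(query):
            return rule  # short-circuits the LLM with rule.response_text
    return None  # falls through to the GPT-4.1 tool loop


RULES = [
    Rule(
        name="public_service_numbers_lookup",
        category="faq",  # assumed category
        pattern=re.compile(
            r"\b(112|1733|huisartsenwacht|noodnummer|spoednummer)\b", re.IGNORECASE
        ),
        response_text="112 EU emergency, 1733 huisartsenwacht, 089 80 80 80 ZOL spoed.",
    ),
    Rule(
        name="bot_identity",  # hypothetical rule
        category="identity",
        pattern=re.compile(r"\bare you a (robot|human)\b", re.IGNORECASE),
        response_text="I am the hospital's automated assistant.",
    ),
]
```

Sorting by category before matching is what makes the order a safety property: a higher-priority category always wins over a lower one when both patterns match the same utterance.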

Cross-component sequence (voice_agent ↔ backend)

Two views of the same conversation are maintained side by side. The summary view is the right diagram when the audience needs to see the orchestrator–RAG split and the role of the Value Framework. The full view is the right diagram when the audience needs to see every external SaaS call, every short-circuit before the LLM, the tool loop, the streaming-audio-return pattern, and the post-turn evaluator — i.e. the operational reality of a single SIP call from dial to BYE.

Summary view — 6 participants

The Value Framework is a child of RAGService.query_stream, not a peer of the orchestrator. The orchestrator never calls apply_intent_category_affinity directly. This separation matters because the chat channel benefits from the same affinity rerank without requiring orchestrator changes.

Full view — 12 participants, dial through BYE

This view expands the summary in five ways:

  1. Two short-circuits before the LLM, classify_terminal and the unified voice_routing_dispatch (rules: crisis → emergency → pii_refusal → identity → clarification → symptom_triage → out_of_scope_redirect → faq, including the public_service_numbers_lookup rule added by 1be07148 for 112 / 1733 / spoed), together handle a large fraction of real-world traffic with zero LLM cost.
  2. The GPT-4.1 tool loop is the actual control flow of agentic-only voice (max 3 iterations, with a 2-consecutive-empty-search escalation guard).
  3. The RAG-internal sub-pipeline reached via the search_hospital_kb tool is drawn explicitly: semantic cache → intent classification → hybrid retrieval (pgvector + BM25 + taxonomy) → Stage 5b VF affinity rerank → Stage 5c doctor-list injection → context assembly → response generation.
  4. The external SaaS calls (Deepgram, OpenAI, ElevenLabs) get their own lanes because they dominate the latency budget.
  5. The concurrent filler ladder that masks backend latency on the caller side gets its own par block.

Honesty caveats on the full view. The diagram simplifies real behaviour in three places that a future maintainer should be aware of:

  1. The par block for the filler ladder depicts three concurrent timers, which is correct, but a real call only ever fires the tier(s) whose timeout actually expires before streamed_answer_spoken flips. So in practice you'll see one filler fire (or none if the backend is fast enough), not all three. The diagram shows concurrent dispatch, not typical observed behaviour.
  2. Block F (post-turn evaluator) is drawn as a sequential block after E for visual clarity, but in practice it runs as a fire-and-forget asyncio task — caller-perceived latency does not include it.
  3. The DB lane rolls up four very different stores: pgvector (semantic similarity), BM25 tsvector (lexical), taxonomy (relational), and telemetry tables (conversation_messages, category_mismatch_telemetry, voice_turn_evaluations). If a presentation needs to make the storage diversity explicit, split this lane.

Where voice differs from chat

| Concern | Chat channel | Voice channel |
| --- | --- | --- |
| Citations | Marker-based extractor ([1], [2], …) | Chunk-derived fallback: voice prompt strips inline markers (un-speakable); _qs_finalize derives citations from retrieved chunks directly |
| Language | Per-query detection, can switch any turn | Locked at first STT-confirmed utterance for the duration of the call (ADR-0052) |
| LLM model | gpt-4.1 (quality-first) | gpt-4.1 with tool-use (latency-conscious; max 3 tool iterations) |
| Follow-up suggestions | Rendered in widget | Suppressed: no UI to display them; voice shaper strips them |
| Markdown | Rendered by widget | Stripped: ElevenLabs speaks raw text |
| Answer length | Up to 500 tokens | Enforced ≤ 2 sentences by system prompt + VoiceAnswerShaper.max_sentences=2 |
| Error routing | HTTP 4xx / chat error message | transfer_to_helpdesk tool call → SIP REFER |
| STT | Not applicable | Deepgram Nova-3, Dutch-primary, language locked at first utterance |
| Disclaimer | Shown inline | Spoken; auto-detected from answer text by _detect_medical_content_in_answer() (Wave 2.C-tail D2) |
| Structured-output safety | structured_call thin helper (~190 LOC over AsyncOpenAI) at 8 LLM call sites (turn evaluator, intent classifier, etc.). Replaced Pydantic AI on 2026-05-12; see Decision-Cost Rubric | Same; the change applies channel-uniformly |

Feature-flag topology (current)

| Setting | Default | Status |
| --- | --- | --- |
| voice_llm_orchestrator_enabled | true | No-op, kept only so existing .env files don't trip extra="forbid". Will be removed in follow-up cleanup per ADR-0051. |
| voice_disclaimer_enabled | true | Spoken medical disclaimer prepended on detected medical content (post-LLM, post-shape). |
| voice_stt_ambiguity_guardrail_enabled | true | Dangling. The stt_ambiguity_guardrail.py module was deleted in 158d793; safety-refusal is now always-on inside classify_terminal(). The setting is read but has no consumer. Removal pending. |
| voice_escalation_confidence_threshold | 0.65 | Dangling. Consumer voice_safety_gate.py was deleted in 158d793. Removal pending. |
| voice_conversational_intent_llm_model | "gpt-4.1-nano" | Dangling. Consumer conversational_intent_resolver.py was deleted in 158d793. Removal pending. |
| voice_llm_orchestrator_max_tool_iterations | 3 | Tool-call iteration cap. On overflow the orchestrator emits a fixed transfer text. |

The three dangling settings, together with the no-op voice_llm_orchestrator_enabled, form a four-setting cleanup batch: their .env.example entries, Settings fields, and any docs referencing them as live should be removed in a follow-up sprint. The settings persist for now because removing them mid-deploy would trip extra="forbid" on existing operator .env files.

Per-channel LLM fallback

The llm_fallback_chain utility is still registered but the agentic path goes directly to GPT-4.1. On any tool/LLM exception, VoiceLLMOrchestrator returns conversational_intent="escalate" and the voice_agent SIP-transfers the caller to the ZOL helpdesk. There is no "fall back to thin" path — ADR-0051 removed it.

Structured-output safety

Eight LLM call sites — including IntentClassificationService.classify_intent, ConversationClassifierService, and VoiceTurnEvaluator — use a thin structured_call(prompt, output_model) helper that wraps the OpenAI client with response_format=json_object, validates via pydantic.BaseModel.model_validate_json(), and retries once on a ValidationError before raising a typed fallback. Malformed JSON never reaches downstream code, and the defensive json.loads try/except blocks that older call sites carried are gone. VoiceTurnEvaluator (the per-turn LLM-as-judge scorer) depends on this contract: a malformed score would otherwise silently degrade the diagnostic.
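The validate-and-retry contract can be sketched as below. This is not the ~190 LOC helper itself: the LLM call is injected as a plain async callable (`complete`) so the example runs without an OpenAI client, and `StructuredCallError`, `TurnScore`, and the `max_attempts` parameter are hypothetical names. It assumes Pydantic v2.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Type, TypeVar

from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)


class StructuredCallError(RuntimeError):
    """Typed failure raised when no attempt produced valid output (assumed name)."""


async def structured_call(
    prompt: str,
    output_model: Type[T],
    complete: Callable[[str], Awaitable[str]],  # stand-in for the OpenAI call
    max_attempts: int = 2,  # one initial call plus one retry
) -> T:
    last_error: Optional[ValidationError] = None
    for _ in range(max_attempts):
        raw = await complete(prompt)
        try:
            # Rejects both malformed JSON and JSON that violates the schema.
            return output_model.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
    raise StructuredCallError(f"invalid output after {max_attempts} attempts") from last_error


class TurnScore(BaseModel):
    """Hypothetical output model for a per-turn judge."""

    score: int
    rationale: str
```

A caller either receives a validated model instance or a single typed exception, which is what lets downstream code drop its own defensive JSON parsing.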

Voice ops infrastructure

A two-day calibration sprint ending 2026-05-23 hardened the voice path against the failure modes that surfaced in eight weeks of pilot SIP testing. Five changes shipped as a bundle:

| Change | File | Reason |
| --- | --- | --- |
| Rule 4.5: no repeated clarifications | prompts.py system prompt | Voice eval showed agents asking the same clarification 3 times across a single conversation; rule forbids it and commits to a search-or-handoff after one clarification |
| Temperature 0.3 → 0.0 | config.py:voice_llm_orchestrator_temperature | Removes per-turn answer variance; the agent's job is to commit to a single grounded response, not to sample creatively |
| Rule 6.5: list what the corpus lists | prompts.py system prompt | Procedure questions ("How does an MRI work?") now surface the corpus's actual content, not generic answers; turns procedure questions into search hits rather than reflexive disclaimers |
| Tier-0 ack removed + tier-1 session rate limit | voice_agent/agent.py, filler_gate.py | Pilot SIP calls accumulated "mhm" acks that read as nervous; tier-0 disabled, tier-1 capped at 1 per 3 turns |
| STT phonetic-recovery sweep (80 Belgian-Dutch medical terms) | intent_classification_service.py:_STT_NORMALIZATIONS + wired into voice_llm_orchestrator.query_stream | Dutch + Limburgs dialect STT mishearings (elektrocaduwraam → elektrocardiogram, kolonoskopi → colonoscopie, polismografie → polysomnografie, …) recovered before intent classification |

The bundle was verified against a 10-persona / 89-turn voice eval (Claude-as-judge): 88/89 turns at production quality. The one outlier (persona_03/T3 cancer-staging interpretation) was investigated and proven empirically (0/8 reproductions across replay + live re-runs) to be OpenAI temp=0 token-level non-determinism, not a deterministic prompt deficiency.
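The STT phonetic-recovery change in the bundle is essentially a lookup applied to the transcript before intent classification. A minimal sketch, assuming _STT_NORMALIZATIONS is a plain mishearing-to-term mapping and using only the three pairs quoted in the table; the real structure and the `normalize_transcript` helper name are assumptions.

```python
import re

# Only the three pairs quoted on this page; the real sweep covers 80 terms.
_STT_NORMALIZATIONS = {
    "elektrocaduwraam": "elektrocardiogram",
    "kolonoskopi": "colonoscopie",
    "polismografie": "polysomnografie",
}

# One alternation over all known mishearings, matched on word boundaries.
_MISHEARING = re.compile(
    r"\b(" + "|".join(map(re.escape, _STT_NORMALIZATIONS)) + r")\b",
    re.IGNORECASE,
)


def normalize_transcript(text: str) -> str:
    """Replace known STT mishearings with the intended medical term."""
    return _MISHEARING.sub(
        lambda match: _STT_NORMALIZATIONS[match.group(0).lower()], text
    )
```

Matching on word boundaries keeps the replacement from touching longer words that merely contain a mishearing as a substring.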

Diagnostic toolchain — trace, replay, SLO

A three-tool diagnostic chain landed in 3bda7f00 and is documented in backend/scripts/VOICE_OPERATOR_RUNBOOK.md. The discipline is enforcement-by-visible-artifact: do not patch a voice prompt rule without using these tools first.

| Tool | Question it answers | Where to run |
| --- | --- | --- |
| voice_trace.py <conv_id> | What actually happened on each turn? Events, telemetry, computed latencies. | docker exec zol-app on pilot |
| voice_replay.py <conv_id> --turn N --runs M | Does the current code produce a different answer on this exact input? Mocks RAG with the original payload so the LLM's DECISION is what's tested. | docker exec zol-app on pilot |
| voice_slo_report.py --since 24h | Is the modal call meeting time-to-first-audio / clarification-rate / filler-rate targets? | docker exec zol-app on pilot |

The first production use of this chain (2026-05-23) caught a phantom safety bug — a "no cancer staging" prompt rule that would have shipped on a single-eval-sample signal had the tools not been available. Replay refused 3/3 against the same input; five live re-runs refused 5/5. The proposed rule was withdrawn. See SLO Discipline First Win and the baseline at backend/docs/slo-baseline-2026-05-23.md for the artifacts.

References

  • backend/app/services/voice/voice_llm_orchestrator.py — the authoritative implementation
  • voice_agent/agent.py, voice_agent/filler_gate.py — the filler ladder + grace gates
  • ADR-0051: Agentic VoiceLLMOrchestrator is the Only Voice Path (2026-05-07)
  • ADR-0049: Thin Voice Architecture (2026-04-30) — superseded stepping stone; documents the 8-stage → 3-stage simplification rationale
  • ADR-0050: Twilio + LiveKit SIP Integration
  • ADR-0052: Voice Language Locked at First Utterance
  • ADR-0053: LLM-First Agentic Voice Pipeline (2026-05-22) — native streaming-with-tools dispatch pattern
  • Lewis et al. 2020 — original RAG architecture; the orchestrator's search_hospital_kb tool is the agentic equivalent of the RAG retriever stage
  • LiveKit Agents Documentation — the runtime that hosts voice_agent
  • Deepgram Nova-3 — production STT model
  • ElevenLabs Multilingual v2 — production TTS model
  • backend/scripts/VOICE_OPERATOR_RUNBOOK.md — trace/replay/SLO diagnostic toolchain (committed 2026-05-23 in 3bda7f00)
  • backend/docs/slo-baseline-2026-05-23.md — reference SLO snapshot for drift detection
  • Decision-Cost Rubric — pydantic-ai removal case study + SLO discipline first win