ADR-0049: Thin Voice Architecture — Collapse the Voice Pipeline Around RAG
Master record: `docs/ADR/0049-thin-voice-architecture.md`. The master is canonical; this Docusaurus rendering is for in-site navigation.
Date: 2026-04-30
Status: SUPERSEDED by ADR-0051 (2026-05-07) — the deterministic thin pipeline served its purpose as a stepping stone, but pilot calls surfaced rigidity around language-switching and gibberish-input handling that required an LLM-driven decision layer. The agentic VoiceLLMOrchestrator (introduced 2026-05-06) is now the only voice path.
Deciders: Tsunami-max (engineering), pending validation
Relates to: ADR-0033 (superseded by ADR-0048), the voice dialogue manager design, and ADR-0051 (the successor decision).
Context
The voice channel pipeline grew layer-by-layer over Q1–Q2 2026:
| # | Stage | Per-turn cost | Purpose |
|---|---|---|---|
| 1 | Legacy intent classifier (LLM) | ~3.5 s, ~2 500 tokens | Categorise + rewrite query |
| 2 | STT-ambiguity guardrail | ~0 ms | Force OUT_OF_SCOPE on advice-seeking |
| 3 | Conversational intent resolver | ~0–500 ms | Greeting / farewell / handoff (rules + LLM fallback) |
| 4 | Safety gate | ~0 ms | Threshold-based escalation |
| 5 | Terminal-intent shortcut | ~0 ms | farewell / appointment / language switch |
| 6 | Dialogue manager (LLM) | ~2.5 s, ~3 500 tokens | Pick 1 of 6 tools (lookup_faq / search_rag / clarify / repair / transfer / respond) |
| 7 | FAQ pre-check / FAQ-first cache | ~1 ms | Curated regex matches |
| 8 | CLAM preprocessor (LLM) | ~1.5 s | Clarify or rewrite |
| 9 | RAG: embed + retrieve + answer LLM | ~3.5 s | Actually answer the question |
| 10 | VoiceAnswerShaper | ~5 ms | TTS-friendly formatting |
Three to four sequential LLM calls fired before the answer LLM even started. Per-turn p50 latency before any token of the answer was ~10–13 s; the user experienced this as the three-filler ladder ("Even kijken…", "Ik ben nog aan het zoeken…", "Het duurt wat langer…").
The complexity was not abstract — it produced concrete drift incidents in the 24 hours before this ADR was written:
- 2026-04-30: The `docker-compose.yml` `environment:` block silently overrode the embedding provider set in `backend/.env`, routing queries through Ollama bge-m3 (1024-dim) against a 1536-dim corpus. Vector retrieval failed silently; FAQ generic answers leaked through for every department-scoped query. Three hours of dev-loop time to diagnose. (Resolved in ADR-0048.)
- 2026-04-30: Dialogue manager LLM mis-routed "Wat zijn de parkeertarieven?" to `search_rag` despite an explicit prompt example pointing to `lookup_faq`. Required a deterministic FAQ pre-check inside `_run_dialogue_manager`. The LLM-choice layer added latency (~2.5 s per turn) AND a routing-failure surface that regex covered for free.
- 2026-04-30: `VoiceAnswerShaper` (which converts "089 80 80 80" to "089, 80, 80, 80" so ElevenLabs Dutch voice reads it naturally) was wired into `query()` but not into `query_stream()` or the dialogue-manager dispatch path. Phone numbers in fallback templates reached TTS as raw digits.
- Earlier in session: `VOICE_DIALOGUE_MANAGER_ENABLED` env-flag drift between `.env` and the running container, the same compose-override class as #1.
- Earlier in session: `_classify_intent_and_rewrite` ran twice per voice turn (once in orchestrator step 1, once inside `rag_service` after the dialogue manager dispatched `search_rag`).
Pattern: each incident was the complexity itself failing. The architecture's surface area outgrew the team's capacity to keep all eight stages coherent.
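The first incident (a 1024-dim embedder querying a 1536-dim corpus) is the kind of mismatch a startup check can surface immediately instead of after hours of debugging. The following is a minimal sketch, not the project's actual code: `assert_embedding_dim`, `EmbeddingDimensionMismatch`, and the two fake embedders are hypothetical names introduced here for illustration.

```python
from typing import Callable, Sequence


class EmbeddingDimensionMismatch(RuntimeError):
    """Raised when the live embedder and the stored corpus disagree on vector size."""


def assert_embedding_dim(
    embed: Callable[[str], Sequence[float]],
    corpus_dim: int,
    probe_text: str = "dimension probe",
) -> int:
    """Embed one probe string and fail fast if its length differs from the corpus."""
    actual = len(embed(probe_text))
    if actual != corpus_dim:
        raise EmbeddingDimensionMismatch(
            f"embedder returns {actual}-dim vectors, corpus expects {corpus_dim}-dim"
        )
    return actual


# Stand-in embedders for illustration only (hypothetical, not the real providers).
def fake_bge_m3(text: str) -> list[float]:
    return [0.0] * 1024


def fake_openai_1536(text: str) -> list[float]:
    return [0.0] * 1536
```

Calling `assert_embedding_dim(fake_bge_m3, 1536)` at service start would raise rather than let retrieval degrade silently.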
The hospital surface area is comparatively narrow:
- Public website chatbot (no PHI, no identifying patient input)
- One language family (Dutch + EN/FR/IT secondary)
- ~5 800 corpus chunks at 1536-dim `text-embedding-3-large`
- Question types: department lookup, doctor lookup, condition info, treatment info, navigation/practical info, booking/contact, generic small-talk
Most of these are pure RAG questions (Lewis et al., 2020). The "intelligence" of the voice dialogue manager (multi-turn state, frustration tracking, tool selection) was over-fit for the actual question distribution. The Golden eval results (99.0 % pass) demonstrated that RAG answered the content questions reliably on its own.
Decision
Migrate to a "thin voice" architecture: collapse stages 1, 3, 6, 8 (the three LLM-driven routers + CLAM preprocessor) into a single RAG-with-conversation-history call, gated only by:
- A cheap regex pre-filter for terminal intents (greeting / farewell / handoff / safety-keyword refusal)
- The existing FAQ regex table for high-traffic generics (parking, hours, address, main phone) — short-circuit before RAG
- RAG: embed → retrieve → answer LLM (with history + safety + TTS shaping in the system prompt)
- VoiceAnswerShaper: idempotent post-process for TTS phone formatting
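To illustrate what "idempotent post-process" means for phone formatting, here is a small sketch. It is an assumption about how such a shaper could work, not the real `VoiceAnswerShaper`: the regex and the function name `shape_phone_numbers` are hypothetical, and the pattern only covers numbers written as three or more space-separated digit groups.

```python
import re

# Assumed pattern: a number written as 3+ digit groups separated by single spaces,
# e.g. "089 80 80 80". Already-shaped text ("089, 80, 80, 80") does not match,
# which is what makes a second pass a no-op.
_PHONE_RE = re.compile(r"\b\d{2,4}(?: \d{2,3}){2,}\b")


def shape_phone_numbers(text: str) -> str:
    """Insert commas between digit groups so the TTS voice pauses between them."""
    return _PHONE_RE.sub(lambda m: m.group(0).replace(" ", ", "), text)
```

Because a second pass leaves the text unchanged, the shaper can be applied on every output path (streaming, non-streaming, fallback templates) without risk of double formatting.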
Resulting pipeline:
voice query
→ regex pre-filter (~1 ms — terminal intents, safety keywords)
→ FAQ regex match (~1 ms — 5–7 curated entries)
→ RAG (single LLM) (~4 s — embed + retrieve + answer)
→ VoiceAnswerShaper (~5 ms — phone formatting)
→ TTS
Latency target: ≤ 4 s p50 to first chunk, vs current ~10–13 s. Drift surface: 1 system prompt + 1 FAQ table + 1 corpus, vs current ~5 prompt files + 8 stages.
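The flow above can be sketched as one function with three exits. This is a rough illustration under stated assumptions: the patterns, placeholder replies, and the names `thin_voice_turn`, `VoiceResult`, `TERMINAL_PATTERNS`, and `FAQ_TABLE` are all hypothetical, the RAG call is injected as a plain callable, and safety-keyword refusal is omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns and placeholder replies; the real tables are not shown in this ADR.
TERMINAL_PATTERNS = {
    "greeting": re.compile(r"^\s*(hallo|goedendag|hello)\b", re.I),
    "farewell": re.compile(r"\b(tot ziens|doei|bye)\b", re.I),
    "handoff": re.compile(r"\b(medewerker|operator)\b", re.I),
}
TERMINAL_REPLIES = {
    "greeting": "<greeting reply>",
    "farewell": "<farewell reply>",
    "handoff": "<handoff reply>",
}
FAQ_TABLE = [
    (re.compile(r"\bparkeer", re.I), "<parking answer>"),
    (re.compile(r"\bopeningsuren\b", re.I), "<opening hours answer>"),
]


@dataclass
class VoiceResult:
    text: str
    route: str  # "terminal:<intent>", "faq", or "rag"


def thin_voice_turn(
    query: str,
    history: list[str],
    rag_answer: Callable[[str, list[str]], str],
    shape: Callable[[str], str] = lambda s: s,
) -> VoiceResult:
    """Run one voice turn: regex pre-filter, then FAQ table, then a single RAG call."""
    for intent, pattern in TERMINAL_PATTERNS.items():
        if pattern.search(query):
            return VoiceResult(shape(TERMINAL_REPLIES[intent]), f"terminal:{intent}")
    for pattern, answer in FAQ_TABLE:
        if pattern.search(query):
            return VoiceResult(shape(answer), "faq")
    # Only this branch costs an LLM call.
    return VoiceResult(shape(rag_answer(query, history)), "rag")
```

Every branch passes through `shape`, so the TTS formatting cannot be skipped on one path and applied on another.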
Migration Plan
This is not a rip-out today. Three deliberate phases, gated on metrics.
Phase A — Define + harden the thin path (1 week)
- Write a thin voice orchestrator as a new code path under a feature flag (`VOICE_THIN_PIPELINE_ENABLED`), default off.
- Wire it as an A/B branch in `public_websocket.py`: 50 % of voice sessions get thin, 50 % get current. Track p50 / p95 time-to-first-chunk, filler-ladder fire rate, voice-turn evaluator scores, and caller-perceived completion (Golden eval voice subset).
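One way to get a stable 50/50 split is to hash the session ID into a bucket, so a session that reconnects stays in the same arm. The sketch below is an assumption about how the branch could be implemented; `assign_arm` and its parameters are hypothetical names, and `enabled` stands in for the `VOICE_THIN_PIPELINE_ENABLED` flag.

```python
import hashlib


def assign_arm(session_id: str, thin_fraction: float = 0.5, enabled: bool = True) -> str:
    """Return "thin" or "current" for a session, deterministically per session ID."""
    if not enabled:
        return "current"
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10 000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = int(thin_fraction * 10_000)
    return "thin" if bucket < threshold else "current"
```

Hashing rather than calling a random generator per connection keeps the assignment reproducible when analysing logs afterwards.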
Phase B — Validate (1 week of A/B traffic)
Pass criteria — thin must match or beat current on:
- Voice eval scores within ±2 percentage points
- Filler-ladder fire rate ≤ 50 % of current
- p50 latency ≤ 6 s (vs ~10 s baseline)
- Zero increase in safety incidents
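The four pass criteria can be expressed as a single gate function. This is a sketch under assumptions: `ArmMetrics` and `phase_b_failures` are hypothetical names, and "within ±2 percentage points" is read here as "thin may be at most 2 points below current".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmMetrics:
    eval_score_pct: float    # voice eval score, in percent
    filler_fire_rate: float  # fraction of turns that triggered the filler ladder
    p50_latency_s: float     # p50 time-to-first-chunk, in seconds
    safety_incidents: int    # count over the A/B window


def phase_b_failures(thin: ArmMetrics, current: ArmMetrics) -> list[str]:
    """Return the list of failed criteria; an empty list means Phase B passes."""
    failures: list[str] = []
    if thin.eval_score_pct < current.eval_score_pct - 2.0:
        failures.append("voice eval score more than 2 pp below current")
    if thin.filler_fire_rate > 0.5 * current.filler_fire_rate:
        failures.append("filler-ladder fire rate above 50 % of current")
    if thin.p50_latency_s > 6.0:
        failures.append("p50 latency above 6 s")
    if thin.safety_incidents > current.safety_incidents:
        failures.append("safety incidents increased")
    return failures
```

Returning the failed criteria rather than a bare boolean makes it easier to document which stage carried the load if the gate does not pass.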
Phase C — Migrate (1 week)
If Phase B passes: flip default to thin, mark dialogue manager + CLAM preprocessor stages as deprecated, then remove dead stages two releases later (after a 2-week stability window in production).
If Phase B fails: document which stage carried the load that thin couldn't replicate, retain that stage, collapse the others, and update this ADR before re-attempting.
Consequences
Positive
- ~60–70 % latency reduction (10–13 s → ~4 s) on every voice turn
- One system prompt is the source of routing truth — instead of five LLM-prompt files that can drift relative to each other
- Two LLM calls saved per turn: ~$0.001 saved per turn × 25 K turns/month ≈ $25/month (~$300/year) cost reduction. Trivial dollar saving, meaningful complexity saving.
- Drift surface collapses: every regression in the past 24 hours was a layer-mismatch problem. Fewer layers = fewer mismatches.
- The corpus becomes the answer. RAG with strong embeddings + history is what hospital public-website chatbots actually need — not a 6-tool dialogue manager designed for harder agent tasks.
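The latency and cost figures in this list can be checked with a few lines of arithmetic. The inputs are the estimates quoted in this ADR; the helper name `pct_reduction` is introduced here for illustration.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Latency: 10-13 s baseline down to roughly 4 s.
latency_low = pct_reduction(10.0, 4.0)
latency_high = pct_reduction(13.0, 4.0)

# Cost: about $0.001 saved per turn at 25 000 turns per month.
monthly_saving = 0.001 * 25_000
yearly_saving = monthly_saving * 12

print(f"latency reduction: {latency_low:.0f}-{latency_high:.0f} %")
print(f"cost saving: ${monthly_saving:.0f}/month, ${yearly_saving:.0f}/year")
```

Either way the dollar amount is small next to the complexity saving.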
Negative
- Loses explicit tool-selection visibility. The current dialogue manager emits `tool_called` in logs; a single RAG call doesn't separate intent from retrieval. Mitigation: log the regex-pre-filter outcome + retrieval similarity scores instead.
- Refusal logic shifts from a deterministic gate to a system prompt instruction. Mitigation: keep the legacy intent classifier's out-of-scope detection as a parallel post-check on the generated answer. Cheap (~50 ms), defense-in-depth.
- Multi-turn frustration tracking goes away. Mitigation: replicate it with a stateless per-turn signal counter in the regex pre-filter.
- Migration is non-trivial — A/B infrastructure + thin orchestrator + ~3 weeks of validated effort.
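The logging mitigation (pre-filter outcome plus retrieval similarity scores in place of `tool_called`) could take the form of one structured record per turn. The sketch below is illustrative only: `TurnTrace`, `log_turn`, the field names, and the logger name `voice.thin` are assumptions, not existing code.

```python
import json
import logging
from dataclasses import asdict, dataclass, field

logger = logging.getLogger("voice.thin")  # hypothetical logger name


@dataclass
class TurnTrace:
    session_id: str
    prefilter_outcome: str  # e.g. "none", "greeting", "faq"
    top_similarities: list[float] = field(default_factory=list)


def log_turn(trace: TurnTrace) -> str:
    """Emit one JSON line per voice turn and return it for inspection."""
    line = json.dumps(asdict(trace), sort_keys=True)
    logger.info(line)
    return line
```

A turn with `prefilter_outcome="none"` and low similarity scores would then stand out in logs in the way a suspicious `tool_called` value does today.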
Neutral
- Existing RAG service unchanged: the corpus, ingest pipeline, and golden eval are all preserved.
- `VoiceAnswerShaper` unchanged: already does the right TTS shaping job.
- FAQ table unchanged: already regex-driven; just moves earlier in the pipeline.
Alternatives Considered
Alternative 1: Keep current architecture, harden the seams
Rejected. Three drift incidents in six weeks; the cost of NOT simplifying compounded with every feature added. The hospital public-website surface didn't justify the architecture's surface area.
Alternative 2: Replace dialogue manager with native OpenAI tool-calling
Rejected for the content-answering surface. Tool-calling is the right answer for agent problems (multi-step actions over external services); voice search over a curated corpus is closer to "ask the documents" than "execute a workflow". (Note: this judgement was reversed by ADR-0051 one week later, after pilot calls surfaced gibberish-input and language-request scenarios that benefited from LLM-driven recovery.)
Alternative 3: Rip out everything but RAG today, no A/B
Rejected. No safety net. Hospital surface = bad place to experiment without controlled comparison.
References
- Drift incidents that motivated this ADR: 2026-04-30 dev-loop session (compose override / dialogue mis-routing / phone TTS / dim mismatch / redundant intent classifier). All five resolved separately, all five symptoms of the same architectural complexity.
- ADR-0033 (superseded): BGE-M3 via Ollama
- ADR-0048: `text-embedding-3-large` (foundation this ADR built on)
- ADR-0051: the successor decision that replaced the thin pipeline with the agentic `VoiceLLMOrchestrator`.
- Lewis et al., 2020: original RAG paper; cited as the architectural precedent the thin pipeline collapses to.