
ADR-0049: Thin Voice Architecture — Collapse the Voice Pipeline Around RAG

Master record: docs/ADR/0049-thin-voice-architecture.md. The master is canonical; this Docusaurus rendering is for in-site navigation.

Date: 2026-04-30

Status: SUPERSEDED by ADR-0051 (2026-05-07). The deterministic thin pipeline served its purpose as a stepping stone, but pilot calls surfaced rigidity around language-switching and gibberish-input handling that required an LLM-driven decision layer. The agentic VoiceLLMOrchestrator (introduced 2026-05-06) is now the only voice path.

Deciders: Tsunami-max (engineering), pending validation

Relates to: ADR-0033 (superseded by ADR-0048), the voice dialogue manager design, and ADR-0051 (the successor decision).

Context

The voice channel pipeline grew layer-by-layer over Q1–Q2 2026:

| # | Stage | Per-turn cost | Purpose |
|---|---|---|---|
| 1 | Legacy intent classifier (LLM) | ~3.5 s, ~2 500 tokens | Categorise + rewrite query |
| 2 | STT-ambiguity guardrail | ~0 ms | Force OUT_OF_SCOPE on advice-seeking |
| 3 | Conversational intent resolver | ~0–500 ms | Greeting / farewell / handoff (rules + LLM fallback) |
| 4 | Safety gate | ~0 ms | Threshold-based escalation |
| 5 | Terminal-intent shortcut | ~0 ms | farewell / appointment / language switch |
| 6 | Dialogue manager (LLM) | ~2.5 s, ~3 500 tokens | Pick 1 of 6 tools (lookup_faq / search_rag / clarify / repair / transfer / respond) |
| 7 | FAQ pre-check / FAQ-first cache | ~1 ms | Curated regex matches |
| 8 | CLAM preprocessor (LLM) | ~1.5 s | Clarify or rewrite |
| 9 | RAG: embed + retrieve + answer LLM | ~3.5 s | Actually answer the question |
| 10 | VoiceAnswerShaper | ~5 ms | TTS-friendly formatting |

Three to four sequential LLM calls fired before the answer LLM even started. Per-turn p50 latency before any token of the answer was ~10–13 s; the user experienced this as the three-filler ladder ("Even kijken…", "Ik ben nog aan het zoeken…", "Het duurt wat langer…").
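The latency figure can be sanity-checked by summing the nominal per-turn costs from the stage table. The sketch below does only that arithmetic. The stage keys are shortened labels chosen for illustration, stages listed at ~0 ms are omitted, and the sums are nominal figures rather than measured percentiles (STT, TTS and network overhead are not included).

```python
# Nominal per-turn cost in seconds as (low, high), copied from the stage table.
# Keys are shortened, illustrative labels; ~0 ms stages are left out.
STAGE_COST_S = {
    "intent_classifier": (3.5, 3.5),
    "intent_resolver": (0.0, 0.5),    # rules first, LLM fallback when rules miss
    "dialogue_manager": (2.5, 2.5),
    "faq_precheck": (0.001, 0.001),
    "clam_preprocessor": (1.5, 1.5),
    "rag_answer": (3.5, 3.5),
    "answer_shaper": (0.005, 0.005),
}


def total_cost(stages):
    """Return the (low, high) sum of nominal costs for the given stage names."""
    low = sum(STAGE_COST_S[name][0] for name in stages)
    high = sum(STAGE_COST_S[name][1] for name in stages)
    return low, high


# Everything that runs before the answer LLM call begins.
PRE_ANSWER = [n for n in STAGE_COST_S if n not in ("rag_answer", "answer_shaper")]

pre_low, pre_high = total_cost(PRE_ANSWER)
full_low, full_high = total_cost(list(STAGE_COST_S))
print(f"before RAG starts: {pre_low:.1f}-{pre_high:.1f} s")
print(f"full turn:         {full_low:.1f}-{full_high:.1f} s")
```

On these nominal numbers the routing layers alone account for roughly two thirds of the turn, which is the share the thin pipeline targets.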

The complexity was not abstract — it produced concrete drift incidents in the 24 hours before this ADR was written:

  1. 2026-04-30: docker-compose.yml environment: block silently overrode backend/.env's embedding provider, routing queries through Ollama bge-m3 (1024-dim) against a 1536-dim corpus. Vector retrieval failed silently; FAQ generic answers leaked through for every department-scoped query. Three hours of dev-loop time to diagnose. (Resolved in ADR-0048.)
  2. 2026-04-30: Dialogue manager LLM mis-routed "Wat zijn de parkeertarieven?" to search_rag despite an explicit prompt example pointing to lookup_faq. Required a deterministic FAQ pre-check inside _run_dialogue_manager. The LLM-choice layer added latency (~2.5 s per turn) AND a routing-failure surface that regex covered for free.
  3. 2026-04-30: VoiceAnswerShaper (which converts "089 80 80 80" to "089, 80, 80, 80" so ElevenLabs Dutch voice reads it naturally) was wired into query() but not into query_stream() or the dialogue-manager dispatch path. Phone numbers in fallback templates reached TTS as raw digits.
  4. Earlier in session: VOICE_DIALOGUE_MANAGER_ENABLED env-flag drift between .env and the running container — same compose-override class as #1.
  5. Earlier in session: _classify_intent_and_rewrite ran twice per voice turn (once in orchestrator step 1, once inside rag_service after the dialogue manager dispatched search_rag).

Pattern: each incident was the complexity itself failing. The architecture's surface area outgrew the team's capacity to keep all ten stages coherent.
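Incident 1 failed silently because nothing compared the live embedding provider against the indexed corpus. A fail-fast startup probe is one way to turn that class of drift into an immediate error. The sketch below is an assumption, not code from the repository: `assert_embedding_dim`, `EmbeddingDimensionMismatch` and the `embed` callable are hypothetical names.

```python
from typing import Callable, Sequence


class EmbeddingDimensionMismatch(RuntimeError):
    """The live embedding provider disagrees with the indexed corpus."""


def assert_embedding_dim(
    embed: Callable[[str], Sequence[float]],  # hypothetical: text -> vector
    corpus_dim: int,
) -> int:
    """Embed a probe string once at startup and compare vector lengths.

    Raises instead of letting retrieval degrade silently.
    """
    probe = embed("dimension probe")
    if len(probe) != corpus_dim:
        raise EmbeddingDimensionMismatch(
            f"provider returned {len(probe)}-dim vectors, "
            f"corpus is indexed at {corpus_dim}-dim"
        )
    return len(probe)
```

Called once during service startup with the configured provider and the corpus dimension (1536 here), a 1024-dim provider would stop the container from serving traffic rather than returning FAQ fallbacks for every query.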

The hospital surface area is comparatively narrow:

  • Public website chatbot (no PHI, no identifying patient input)
  • One primary language (Dutch), with EN/FR/IT as secondary languages
  • ~5 800 corpus chunks at 1536-dim text-embedding-3-large
  • Question types: department lookup, doctor lookup, condition info, treatment info, navigation/practical info, booking/contact, generic small-talk

Most of these are pure RAG questions (Lewis et al., 2020). The "intelligence" of the voice dialogue manager (multi-turn state, frustration tracking, tool selection) was over-fit for the actual question distribution. The Golden eval results (99.0 % pass) demonstrated that RAG answered the content questions reliably on its own.

Decision

Migrate to a "thin voice" architecture: collapse stages 1, 3, 6, 8 (the three LLM-driven routers + CLAM preprocessor) into a single RAG-with-conversation-history call, gated only by:

  1. A cheap regex pre-filter for terminal intents (greeting / farewell / handoff / safety-keyword refusal)
  2. The existing FAQ regex table for high-traffic generics (parking, hours, address, main phone) — short-circuit before RAG
  3. RAG: embed → retrieve → answer LLM (with history + safety + TTS shaping in the system prompt)
  4. VoiceAnswerShaper: idempotent post-process for TTS phone formatting
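Idempotence matters for step 4 because the shaper may run on text that an earlier path has already shaped. A minimal sketch of an idempotent phone formatter follows. It is not the actual VoiceAnswerShaper implementation: the regex and the function name `shape_phone_numbers` are assumptions, and the pattern only handles space-separated digit groups like the "089 80 80 80" example.

```python
import re

# Assumed pattern: two or more groups of 2-4 digits separated by single spaces.
_PHONE_RUN = re.compile(r"\b\d{2,4}(?: \d{2,4})+\b")


def shape_phone_numbers(text: str) -> str:
    """Insert comma pauses between digit groups so TTS reads them separately.

    Already-shaped numbers contain ", " between groups, so the pattern no
    longer matches them and a second pass leaves the text unchanged.
    """
    return _PHONE_RUN.sub(lambda m: m.group(0).replace(" ", ", "), text)


once = shape_phone_numbers("Bel 089 80 80 80.")
twice = shape_phone_numbers(once)
print(once)
print(once == twice)
```

Because the second pass is a no-op, the formatter can be applied at every exit point (query(), query_stream(), fallback templates) without tracking which path has already shaped the text.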

Resulting pipeline:

voice query
→ regex pre-filter (~1 ms — terminal intents, safety keywords)
→ FAQ regex match (~1 ms — 5–7 curated entries)
→ RAG (single LLM) (~4 s — embed + retrieve + answer)
→ VoiceAnswerShaper (~5 ms — phone formatting)
→ TTS
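The control flow of this pipeline can be sketched in a few lines. Everything below is illustrative: the patterns, replies and the name `thin_voice_turn` are assumptions, the FAQ answer is a placeholder, and the RAG call and shaper are injected as callables rather than taken from the real services.

```python
import re
from typing import Callable, List, Tuple

# Illustrative tables only; the real regex tables are not reproduced in this ADR.
TERMINAL_INTENTS = [
    ("farewell", re.compile(r"\b(tot ziens|doei)\b", re.I), "Tot ziens!"),
    ("handoff", re.compile(r"\b(medewerker|iemand spreken)\b", re.I), "Ik verbind u door."),
]
FAQ_TABLE = [
    ("parking", re.compile(r"\bparkeer", re.I), "<curated parking answer>"),
]


def thin_voice_turn(
    query: str,
    history: List[str],
    rag_answer: Callable[[str, List[str]], str],  # hypothetical RAG entry point
    shape: Callable[[str], str],                  # hypothetical TTS shaper
) -> Tuple[str, str]:
    """Return (route, tts_text). Every route passes through the shaper."""
    # 1. Terminal intents short-circuit before any LLM call.
    for intent, pattern, reply in TERMINAL_INTENTS:
        if pattern.search(query):
            return intent, shape(reply)
    # 2. Curated FAQ entries short-circuit before RAG.
    for _name, pattern, answer in FAQ_TABLE:
        if pattern.search(query):
            return "faq", shape(answer)
    # 3. Everything else is a single RAG call with conversation history.
    return "rag", shape(rag_answer(query, history))
```

Routing every return through `shape` is a deliberate choice in this sketch: it removes the possibility of a path reaching TTS unshaped, which is the failure described in incident 3.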

Latency target: ≤ 4 s p50 to first chunk, vs current ~10–13 s.

Drift surface: 1 system prompt + 1 FAQ table + 1 corpus, vs current ~5 prompt files + 10 stages.

Migration Plan

This is not a rip-out today. Three deliberate phases, gated on metrics.

Phase A — Define + harden the thin path (1 week)

  1. Write a thin voice orchestrator as a new code path under a feature flag (VOICE_THIN_PIPELINE_ENABLED), default off.
  2. Wire it as an A/B branch in public_websocket.py: 50 % of voice sessions get thin, 50 % get current. Track p50 / p95 time-to-first-chunk, filler-ladder fire rate, voice-turn evaluator scores, and caller-perceived completion (Golden eval voice subset).
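One way to implement the 50/50 split is deterministic bucketing on the session ID, so a caller stays on the same arm for the whole session and the assignment can be recomputed offline for analysis. The sketch below assumes the feature flag gates the whole experiment (flag off means every session takes the current path); that interaction, and the function names, are assumptions rather than the wiring in public_websocket.py.

```python
import hashlib
import os
from typing import Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Treat unset as off, matching the default-off requirement."""
    return env.get(name, "false").strip().lower() in {"1", "true", "yes", "on"}


def assign_pipeline(
    session_id: str,
    thin_share: float = 0.5,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return "thin" or "current" for a voice session.

    The SHA-256 digest of the session ID is mapped to [0, 1), so the same
    session always lands in the same arm across processes and restarts.
    """
    if not flag_enabled("VOICE_THIN_PIPELINE_ENABLED", env):
        return "current"
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "thin" if bucket < thin_share else "current"
```

Logging the returned arm alongside time-to-first-chunk and filler-ladder events keeps the Phase B comparison a simple group-by on one field.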

Phase B — Validate (1 week of A/B traffic)

Pass criteria — thin must match or beat current on:

  • Voice eval scores within ±2 percentage points
  • Filler-ladder fire rate ≤ 50 % of current
  • p50 latency ≤ 6 s (vs ~10 s baseline)
  • Zero increase in safety incidents
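The four criteria are mechanical enough to encode as a gate. The sketch below is one possible reading, not an agreed implementation: "within ±2 percentage points" is interpreted as "no more than 2 pp below current", and the `ArmMetrics` fields and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArmMetrics:
    eval_score_pct: float   # voice-turn evaluator score, in percent
    filler_rate: float      # share of turns that fired the filler ladder
    p50_latency_s: float    # p50 time-to-first-chunk, in seconds
    safety_incidents: int   # incidents recorded during the A/B window


def phase_b_failures(thin: ArmMetrics, current: ArmMetrics) -> List[str]:
    """Return the list of failed criteria; an empty list means Phase B passes."""
    failures = []
    if thin.eval_score_pct < current.eval_score_pct - 2.0:
        failures.append("eval score more than 2 pp below current")
    if thin.filler_rate > 0.5 * current.filler_rate:
        failures.append("filler-ladder rate above 50 % of current")
    if thin.p50_latency_s > 6.0:
        failures.append("p50 latency above 6 s")
    if thin.safety_incidents > current.safety_incidents:
        failures.append("safety incidents increased")
    return failures
```

Returning the failed criteria rather than a boolean supports the Phase B failure branch, which needs to know which requirement the thin path missed.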

Phase C — Migrate (1 week)

If Phase B passes: flip default to thin, mark dialogue manager + CLAM preprocessor stages as deprecated, then remove dead stages two releases later (after a 2-week stability window in production).

If Phase B fails: document which stage carried the load that thin couldn't replicate, retain that stage, collapse the others, and update this ADR before re-attempting.

Consequences

Positive

  • ~60–70 % latency reduction (10–13 s → ~4 s) on every voice turn
  • One system prompt is the source of routing truth — instead of five LLM-prompt files that can drift relative to each other
  • Two LLM calls saved per turn: ~$0.001 saved per turn × 25 K turns/month ≈ $25/month (≈ $300/year). Trivial dollar saving, meaningful complexity saving.
  • Drift surface collapses: every regression in the past 24 hours was a layer-mismatch problem. Fewer layers = fewer mismatches.
  • The corpus becomes the answer. RAG with strong embeddings + history is what hospital public-website chatbots actually need — not a 6-tool dialogue manager designed for harder agent tasks.

Negative

  • Loses explicit tool-selection visibility. The current dialogue manager emits tool_called in logs; a single RAG call doesn't separate intent from retrieval. Mitigation: log the regex-pre-filter outcome + retrieval similarity scores instead.
  • Refusal logic shifts from a deterministic gate to a system prompt instruction. Mitigation: keep the legacy intent classifier's out-of-scope detection as a parallel post-check on the generated answer. Cheap (~50 ms), defense-in-depth.
  • Multi-turn frustration tracking goes away. Mitigation: replicate it with a stateless per-turn signal counter in the regex pre-filter.
  • Migration is non-trivial — A/B infrastructure + thin orchestrator + ~3 weeks of validated effort.
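The frustration mitigation can be approximated without session state by counting frustration cues inside the current utterance only. The sketch below shows that idea under stated assumptions: the Dutch cue patterns, the threshold of 2 and both function names are illustrative and would need tuning against real call transcripts.

```python
import re

# Illustrative cue patterns only; not taken from the production pre-filter.
FRUSTRATION_SIGNALS = [
    re.compile(r"\bdat (vroeg|bedoel) ik niet\b", re.I),
    re.compile(r"\bik zei (al|net)\b", re.I),
    re.compile(r"\b(alweer|nog steeds niet)\b", re.I),
]


def frustration_score(utterance: str) -> int:
    """Count how many distinct cue patterns fire in this single utterance."""
    return sum(1 for pattern in FRUSTRATION_SIGNALS if pattern.search(utterance))


def should_offer_handoff(utterance: str, threshold: int = 2) -> bool:
    """Stateless decision: only the current turn is inspected."""
    return frustration_score(utterance) >= threshold
```

The trade-off is that frustration spread thinly across several turns goes undetected, which is the capability the dialogue manager's multi-turn tracking provided.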

Neutral

  • Existing RAG service unchanged — the corpus, ingest pipeline, and golden eval are all preserved.
  • VoiceAnswerShaper unchanged — already does the right TTS shaping job.
  • FAQ table unchanged — already regex-driven; just moves earlier in the pipeline.

Alternatives Considered

Alternative 1: Keep current architecture, harden the seams

Rejected. Five drift incidents surfaced in the 24 hours before this ADR (see Context); the cost of NOT simplifying compounded with every feature added. The hospital public-website surface didn't justify the architecture's surface area.

Alternative 2: Replace dialogue manager with native OpenAI tool-calling

Rejected for the content-answering surface. Tool-calling is the right answer for agent problems (multi-step actions over external services); voice search over a curated corpus is closer to "ask the documents" than "execute a workflow". (Note: this judgement was reversed by ADR-0051 one week later, after pilot calls surfaced gibberish-input and language-request scenarios that benefited from LLM-driven recovery.)

Alternative 3: Rip out everything but RAG today, no A/B

Rejected. No safety net. Hospital surface = bad place to experiment without controlled comparison.

References

  • Drift incidents that motivated this ADR: 2026-04-30 dev-loop session (embedding-provider compose override with dimension mismatch / dialogue mis-routing / phone TTS shaping gap / env-flag drift / redundant intent classifier). All five resolved separately, all five symptoms of the same architectural complexity.
  • ADR-0033 (superseded): BGE-M3 via Ollama
  • ADR-0048: text-embedding-3-large (foundation this ADR built on)
  • ADR-0051: the successor decision that replaced the thin pipeline with the agentic VoiceLLMOrchestrator.
  • Lewis et al., 2020 — original RAG paper; cited as the architectural precedent the thin pipeline collapses to.