ADR-0049: Thin Voice Architecture — Collapse the Voice Pipeline Around RAG

Master record: docs/ADR/0049-thin-voice-architecture.md. The master is canonical; this Docusaurus rendering is for in-site navigation.

Date: 2026-04-30
Status: SUPERSEDED by ADR-0051 (2026-05-07). The deterministic thin pipeline served its purpose as a stepping stone, but pilot calls surfaced rigidity around language-switching and gibberish-input handling that required an LLM-driven decision layer. The agentic VoiceLLMOrchestrator (introduced 2026-05-06) is now the only voice path.
Deciders: Tsunami-max (engineering), pending validation
Relates to: ADR-0033 (superseded by ADR-0048), the voice dialogue manager design, and ADR-0051 (the successor decision).

Context

The voice channel pipeline grew layer-by-layer over Q1–Q2 2026:

| # | Stage | Per-turn cost | Purpose |
|---|---|---|---|
| 1 | Legacy intent classifier (LLM) | ~3.5 s, ~2 500 tokens | Categorise + rewrite query |
| 2 | STT-ambiguity guardrail | ~0 ms | Force OUT_OF_SCOPE on advice-seeking |
| 3 | Conversational intent resolver | ~0–500 ms | Greeting / farewell / handoff (rules + LLM fallback) |
| 4 | Safety gate | ~0 ms | Threshold-based escalation |
| 5 | Terminal-intent shortcut | ~0 ms | farewell / appointment / language switch |
| 6 | Dialogue manager (LLM) | ~2.5 s, ~3 500 tokens | Pick 1 of 6 tools (`lookup_faq` / `search_rag` / `clarify` / `repair` / `transfer` / `respond`) |
| 7 | FAQ pre-check / FAQ-first cache | ~1 ms | Curated regex matches |
| 8 | CLAM preprocessor (LLM) | ~1.5 s | Clarify or rewrite |
| 9 | RAG: embed + retrieve + answer LLM | ~3.5 s | Actually answer the question |
| 10 | VoiceAnswerShaper | ~5 ms | TTS-friendly formatting |

Three to four sequential LLM calls fired before the answer LLM even started. Per-turn p50 latency before any token of the answer was ~10–13 s; the user experienced this as the three-filler ladder ("Even kijken…", "Ik ben nog aan het zoeken…", "Het duurt wat langer…").
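As a rough cross-check, the nominal per-stage figures from the table can be summed directly. The sketch below only adds up the listed numbers; it assumes the stages run strictly in sequence and ignores network, STT and TTS overhead, which is presumably why the observed p50 range is wider than the nominal sum.

```python
# Nominal per-stage latencies in seconds, copied from the stage table.
# Stage 3 is a range because its LLM fallback only fires when the rules miss.
PRE_ANSWER_LLM_STAGES = {
    "legacy intent classifier": (3.5, 3.5),
    "conversational intent resolver": (0.0, 0.5),
    "dialogue manager": (2.5, 2.5),
    "CLAM preprocessor": (1.5, 1.5),
}
RAG_STAGE = 3.5  # embed + retrieve + answer LLM

low = sum(lo for lo, _ in PRE_ANSWER_LLM_STAGES.values())
high = sum(hi for _, hi in PRE_ANSWER_LLM_STAGES.values())

print(f"before RAG starts: {low:.1f}-{high:.1f} s")
print(f"through RAG:       {low + RAG_STAGE:.1f}-{high + RAG_STAGE:.1f} s")
```

The nominal total lands at roughly 11 s, inside the ~10–13 s range reported from live turns.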

The complexity was not abstract — it produced concrete drift incidents in the 24 hours before this ADR was written:

1. 2026-04-30: docker-compose.yml environment: block silently overrode backend/.env's embedding provider, routing queries through Ollama bge-m3 (1024-dim) against a 1536-dim corpus. Vector retrieval failed silently; FAQ generic answers leaked through for every department-scoped query. Three hours of dev-loop time to diagnose. (Resolved in ADR-0048.)
2. 2026-04-30: Dialogue manager LLM mis-routed "Wat zijn de parkeertarieven?" to search_rag despite an explicit prompt example pointing to lookup_faq. Required a deterministic FAQ pre-check inside _run_dialogue_manager. The LLM-choice layer added both latency (~2.5 s per turn) and a routing-failure surface that regex covered for free.
3. 2026-04-30: VoiceAnswerShaper (which converts "089 80 80 80" to "089, 80, 80, 80" so the ElevenLabs Dutch voice reads it naturally) was wired into query() but not into query_stream() or the dialogue-manager dispatch path. Phone numbers in fallback templates reached TTS as raw digits.
4. Earlier in session: VOICE_DIALOGUE_MANAGER_ENABLED env-flag drift between .env and the running container, the same compose-override class as incident 1.
5. Earlier in session: _classify_intent_and_rewrite ran twice per voice turn (once in orchestrator step 1, once inside rag_service after the dialogue manager dispatched search_rag).
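The embedding-dimension incident is the kind of failure a fail-fast startup check can surface immediately instead of after hours of debugging. The following is a minimal sketch, not code from the repository: `assert_embedding_dim`, `EmbeddingDimensionMismatch`, the `embed` callable and `corpus_dim` are all hypothetical names standing in for whatever the backend resolves from its effective configuration.

```python
from typing import Callable, Sequence


class EmbeddingDimensionMismatch(RuntimeError):
    """Raised when the active embedder cannot query the stored corpus."""


def assert_embedding_dim(
    embed: Callable[[str], Sequence[float]],
    corpus_dim: int,
    probe_text: str = "dimension probe",
) -> int:
    """Embed one probe string and compare its length to the corpus dimension.

    `embed` and `corpus_dim` are hypothetical stand-ins for the provider and
    vector-store settings the running container actually ends up with.
    """
    actual = len(embed(probe_text))
    if actual != corpus_dim:
        raise EmbeddingDimensionMismatch(
            f"embedder returns {actual}-dim vectors, corpus stores {corpus_dim}-dim"
        )
    return actual
```

Running such a check once at process start, against the configuration the container really sees, would turn a silent retrieval failure into a loud boot failure.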

Pattern: each incident was the complexity itself failing. The architecture's surface area outgrew the team's capacity to keep all ten stages coherent.

The hospital surface area is comparatively narrow:

Public website chatbot (no PHI, no identifying patient input)
One primary language (Dutch), with EN/FR/IT as secondary languages
~5 800 corpus chunks at 1536-dim text-embedding-3-large
Question types: department lookup, doctor lookup, condition info, treatment info, navigation/practical info, booking/contact, generic small-talk

Most of these are pure RAG questions (Lewis et al., 2020). The "intelligence" of the voice dialogue manager (multi-turn state, frustration tracking, tool selection) was over-fit for the actual question distribution. The Golden eval results (99.0 % pass) demonstrated that RAG answered the content questions reliably on its own.

Decision

Migrate to a "thin voice" architecture: collapse stages 1, 3, 6, 8 (the three LLM-driven routers + CLAM preprocessor) into a single RAG-with-conversation-history call, gated only by:

A cheap regex pre-filter for terminal intents (greeting / farewell / handoff / safety-keyword refusal)
The existing FAQ regex table for high-traffic generics (parking, hours, address, main phone) — short-circuit before RAG
RAG: embed → retrieve → answer LLM (with history + safety + TTS shaping in the system prompt)
VoiceAnswerShaper: idempotent post-process for TTS phone formatting
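To make "idempotent" concrete for the phone-formatting step, here is a small sketch of a shaper that can safely run more than once on the same text. It is not the actual VoiceAnswerShaper; the function name and the regex (a 2–4 digit prefix followed by at least two space-separated groups of 2–3 digits) are assumptions chosen to fit the "089 80 80 80" example.

```python
import re

# Assumed phone shape: a 2-4 digit prefix followed by at least two
# space-separated groups of 2-3 digits, e.g. "089 80 80 80".
_PHONE = re.compile(r"(?<!\d)\d{2,4}(?: \d{2,3}){2,}(?!\d)")


def shape_phone_numbers(text: str) -> str:
    """Insert comma pauses between digit groups so TTS reads them separately.

    Already-shaped numbers contain ", " between groups, which the pattern
    does not match, so applying the function twice changes nothing.
    """
    return _PHONE.sub(lambda m: m.group(0).replace(" ", ", "), text)


once = shape_phone_numbers("Bel 089 80 80 80 voor info.")
twice = shape_phone_numbers(once)
print(once)
print(once == twice)
```

Idempotence matters here because it lets the shaper be applied on every output path (query, streaming, fallback templates) without tracking whether a given string has already been shaped.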

Resulting pipeline:

voice query
  → regex pre-filter        (~1 ms — terminal intents, safety keywords)
  → FAQ regex match         (~1 ms — 5–7 curated entries)
  → RAG (single LLM)        (~4 s — embed + retrieve + answer)
  → VoiceAnswerShaper       (~5 ms — phone formatting)
TTS

Latency target: ≤ 4 s p50 to first chunk, vs current ~10–13 s. Drift surface: 1 system prompt + 1 FAQ table + 1 corpus, vs current ~5 prompt files + 10 stages.
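The control flow of the resulting pipeline can be sketched in a few lines. This is an illustration under stated assumptions rather than the orchestrator that was built: `thin_voice_turn`, `VoiceResult`, the two pattern tables and the `rag_answer` / `shape` callables are hypothetical, and the regexes are placeholders for the real terminal-intent and FAQ tables, which this ADR does not reproduce.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Illustrative patterns only; the real tables are not reproduced in this ADR.
TERMINAL_PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("farewell", re.compile(r"\b(tot ziens|doei)\b", re.I)),
    ("handoff", re.compile(r"\b(medewerker|iemand spreken)\b", re.I)),
]
FAQ_TABLE: List[Tuple[re.Pattern, str]] = [
    (re.compile(r"\bparkeer", re.I), "FAQ: parkeertarieven"),
]


@dataclass
class VoiceResult:
    source: str  # "terminal", "faq" or "rag"
    text: str    # intent name for terminal turns, speakable text otherwise


def thin_voice_turn(
    query: str,
    history: List[str],
    rag_answer: Callable[[str, List[str]], str],
    shape: Callable[[str], str] = lambda s: s,
) -> VoiceResult:
    """Run one voice turn through the four thin-pipeline stages in order."""
    # 1. Regex pre-filter: terminal intents short-circuit everything else.
    for intent, pattern in TERMINAL_PATTERNS:
        if pattern.search(query):
            return VoiceResult("terminal", intent)
    # 2. FAQ regex table: curated answers skip retrieval entirely.
    for pattern, answer in FAQ_TABLE:
        if pattern.search(query):
            return VoiceResult("faq", shape(answer))
    # 3. Single RAG call with conversation history, then
    # 4. the shaper, applied on every path that produces speakable text.
    return VoiceResult("rag", shape(rag_answer(query, history)))
```

Passing `rag_answer` and `shape` in as callables keeps the sketch testable without a live LLM, and routing both the FAQ and RAG paths through `shape` avoids the class of bug described in the phone-formatting incident.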

Migration Plan

This is not a rip-out today. Three deliberate phases, gated on metrics.

Phase A — Define + harden the thin path (1 week)

Write a thin voice orchestrator as a new code path under a feature flag (VOICE_THIN_PIPELINE_ENABLED), default off.
Wire it as an A/B branch in public_websocket.py: 50 % of voice sessions get thin, 50 % get current. Track p50 / p95 time-to-first-chunk, filler-ladder fire rate, voice-turn evaluator scores, and caller-perceived completion (Golden eval voice subset).
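One way to implement the 50/50 split is to hash the session identifier, so that a session stays in the same arm for its whole lifetime. The sketch below is an assumption about how such a branch could look; `assign_voice_arm` and its parameters are hypothetical, and only the flag name VOICE_THIN_PIPELINE_ENABLED comes from this ADR (represented here by the `flag_enabled` argument).

```python
import hashlib


def assign_voice_arm(
    session_id: str,
    thin_share: float = 0.5,
    flag_enabled: bool = True,
) -> str:
    """Deterministically map a session to "thin" or "current".

    Hashing the session ID keeps a session in one arm for its whole
    lifetime, so turns within a call never switch pipelines.
    """
    if not flag_enabled:
        return "current"
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "thin" if bucket < thin_share else "current"
```

Compared with a per-turn random draw, deterministic bucketing also makes incidents reproducible: given a session ID from the logs, the arm it ran in can be recomputed.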

Phase B — Validate (1 week of A/B traffic)

Pass criteria — thin must match or beat current on:

Voice eval scores within ±2 percentage points
Filler-ladder fire rate ≤ 50 % of current
p50 latency ≤ 6 s (vs ~10–13 s baseline)
Zero increase in safety incidents
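The four criteria can be expressed as a single gate function, which makes the Phase B decision mechanical. This is a sketch with hypothetical names (`ArmMetrics`, `phase_b_failures`). It also makes one interpretive assumption explicit: "within ±2 percentage points" is read as "thin may trail current by at most 2 points", since a higher score would not be a reason to fail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArmMetrics:
    eval_score_pct: float    # voice eval pass rate, in percent
    filler_fire_rate: float  # fraction of turns that triggered a filler
    p50_latency_s: float     # p50 time-to-first-chunk, in seconds
    safety_incidents: int


def phase_b_failures(thin: ArmMetrics, current: ArmMetrics) -> List[str]:
    """Return the names of failed Phase B criteria (empty list means pass)."""
    failures = []
    # Assumption: thin may trail by at most 2 points; leading is fine.
    if thin.eval_score_pct < current.eval_score_pct - 2.0:
        failures.append("eval_score")
    if thin.filler_fire_rate > 0.5 * current.filler_fire_rate:
        failures.append("filler_rate")
    if thin.p50_latency_s > 6.0:
        failures.append("p50_latency")
    if thin.safety_incidents > current.safety_incidents:
        failures.append("safety")
    return failures
```

Returning the list of failed criteria, rather than a bare boolean, feeds directly into the Phase C failure branch, which asks which stage carried the load that thin could not replicate.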

Phase C — Migrate (1 week)

If Phase B passes: flip default to thin, mark dialogue manager + CLAM preprocessor stages as deprecated, then remove dead stages two releases later (after a 2-week stability window in production).

If Phase B fails: document which stage carried the load that thin couldn't replicate, retain that stage, collapse the others, and update this ADR before re-attempting.

Consequences

Positive

~60–70 % latency reduction (10–13 s → ~4 s) on every voice turn
One system prompt is the source of routing truth — instead of five LLM-prompt files that can drift relative to each other
Two LLM calls saved per turn: ~$0.001 saved per turn × 25 K turns/month ≈ $25/month (≈ $300/year) cost reduction. Trivial dollar saving, meaningful complexity saving.
Drift surface collapses: every regression in the past 24 hours was a layer-mismatch problem. Fewer layers = fewer mismatches.
The corpus becomes the answer. RAG with strong embeddings + history is what hospital public-website chatbots actually need — not a 6-tool dialogue manager designed for harder agent tasks.

Negative

Loses explicit tool-selection visibility. The current dialogue manager emits tool_called in logs; a single RAG call doesn't separate intent from retrieval. Mitigation: log the regex-pre-filter outcome + retrieval similarity scores instead.
Refusal logic shifts from a deterministic gate to a system prompt instruction. Mitigation: keep the legacy intent classifier's out-of-scope detection as a parallel post-check on the generated answer. Cheap (~50 ms), defense-in-depth.
Multi-turn frustration tracking goes away. Mitigation: approximate it with a per-turn signal counter in the regex pre-filter, which counts frustration cues within the current utterance and carries no state between turns.
Migration is non-trivial — A/B infrastructure + thin orchestrator + ~3 weeks of validated effort.
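The first mitigation, logging the pre-filter outcome and retrieval similarity scores in place of tool_called, could look roughly like the sketch below. The field names, the `voice.thin` logger name and both functions are assumptions for illustration; the ADR does not specify a log schema.

```python
import json
import logging
from typing import Optional, Sequence

logger = logging.getLogger("voice.thin")  # hypothetical logger name


def build_turn_log(
    prefilter_outcome: Optional[str],
    similarities: Sequence[float],
) -> dict:
    """Summarise one turn: which gate fired, and how confident retrieval was."""
    has_hits = len(similarities) > 0
    return {
        # None means no regex gate fired and the turn went on to RAG.
        "prefilter_outcome": prefilter_outcome or "pass_through",
        "retrieved": len(similarities),
        "top_similarity": round(max(similarities), 4) if has_hits else None,
        "mean_similarity": (
            round(sum(similarities) / len(similarities), 4) if has_hits else None
        ),
    }


def log_turn(prefilter_outcome: Optional[str], similarities: Sequence[float]) -> None:
    """Emit the summary as one JSON line so it can be filtered downstream."""
    logger.info(json.dumps(build_turn_log(prefilter_outcome, similarities)))
```

A low `top_similarity` on a turn that still produced an answer is the closest analogue to the old mis-routing signal, so it is a natural field to alert on.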

Neutral

Existing RAG service unchanged — the corpus, ingest pipeline, and golden eval are all preserved.
VoiceAnswerShaper unchanged — already does the right TTS shaping job.
FAQ table unchanged — already regex-driven; just moves earlier in the pipeline.

Alternatives Considered

Alternative 1: Keep current architecture, harden the seams

Rejected. Five drift incidents surfaced in a single dev-loop session; the cost of NOT simplifying compounded with every feature added. The hospital public-website surface didn't justify the architecture's surface area.

Alternative 2: Replace dialogue manager with native OpenAI tool-calling

Rejected for the content-answering surface. Tool-calling is the right answer for agent problems (multi-step actions over external services); voice search over a curated corpus is closer to "ask the documents" than "execute a workflow". (Note: this judgement was reversed by ADR-0051 one week later, after pilot calls surfaced gibberish-input and language-request scenarios that benefited from LLM-driven recovery.)

Alternative 3: Rip out everything but RAG today, no A/B

Rejected. No safety net. Hospital surface = bad place to experiment without controlled comparison.

References

Drift incidents that motivated this ADR: 2026-04-30 dev-loop session (compose override / dialogue mis-routing / phone TTS / dim mismatch / redundant intent classifier). All five resolved separately, all five symptoms of the same architectural complexity.
ADR-0033 (superseded): BGE-M3 via Ollama
ADR-0048: text-embedding-3-large (foundation this ADR built on)
ADR-0051: the successor decision that replaced the thin pipeline with the agentic VoiceLLMOrchestrator.
Lewis et al., 2020 — original RAG paper; cited as the architectural precedent the thin pipeline collapses to.
