Voice Answer Shaping
The problem
Text-channel RAG answers are optimized for visual reading: 6-sentence explanations with markdown emphasis, bullet lists, citation markers ([1]), URLs, abbreviations (ICU, SEH), and wall-clock times (14:00). A naïve Text-To-Speech (TTS) rendering of such an answer is unintelligible:
- Markdown → pronounced as "star star" on every **bold**.
- URLs → pronounced character-by-character, unbearable on the phone.
- Citations → pronounced as "open bracket one close bracket".
- Abbreviations → "eye see you" for ICU is ambiguous and unprofessional.
- Times → "one four colon zero zero" for 14:00.
- Sentence count → 6 spoken sentences exceed ~30 seconds, far beyond the human tolerance for an uninterrupted monologue on the phone.
The Voice Answer Shaper is a deterministic post-generation transform that converts text-shaped RAG answers into voice-shaped prose. No LLM is involved; all transforms are regex-based and unit-tested.
Voice cognition uses a thin pipeline (regex pre-filter → FAQ tool dispatch → RAG fallback) hosted by an agentic VoiceLLMOrchestrator. The legacy 8-stage pipeline (VoiceOrchestrator, dialogue manager, speculative-STT cache, etc.) was retired in commit 158d793 (2026-05-02). See ADR-0049 for the original thin-pipeline rationale and ADR-0051 for the current agentic-orchestrator decision. The shaper now runs as the last step inside VoiceLLMOrchestrator before returning the response.
Transform pipeline
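The six transforms run in strict order, because earlier transforms normalize input for later ones: (1) markdown strip, (2) URL strip, (3) citation strip, (4) abbreviation expansion, (5) time spell-out, (6) sentence cap. When medical intent is detected, a disclaimer is then prepended as step 7.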
Why order matters
Consider the input "Bel **ICU** voor afspraken vóór 14:00." (Dutch, medical information):
| Step | State after |
|---|---|
| 1 markdown strip | "Bel ICU voor afspraken vóór 14:00." |
| 2 URL strip | no change |
| 3 citation strip | no change |
| 4 abbreviation (word-boundary \bICU\b) | "Bel de intensieve zorgafdeling voor afspraken vóór 14:00." |
| 5 time spell (nl, 14:00 → "twee uur" via 24→12 hour map) | "Bel de intensieve zorgafdeling voor afspraken vóór twee uur." |
| 6 sentence count | no change (1 sentence) |
| 7 disclaimer (medical) | "Ter informatie, dit is geen medisch advies — Bel de intensieve zorgafdeling voor afspraken vóór twee uur." |
Had step 4 run before step 1, the abbreviation lookup would have missed **ICU** because the bold markers break the word boundary. Had step 5 run before step 4, 14:00 would have been spelled but ICU would still be in the text being counted for sentence length. The ordering invariant is tested by the combined-transforms regression test in test_voice_answer_shaper.py.
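As a rough illustration of the first three steps, the sketch below strips emphasis markers, URLs, and citation markers in the documented order and counts what it removed. The regex patterns and the `strip_markup` name are assumptions made for this example; the module's actual patterns are not reproduced in this document.

```python
import re

# Hypothetical patterns for illustration only; the real module may differ.
EMPHASIS = re.compile(r"\*{1,2}([^*]+)\*{1,2}")   # **bold** or *italic*
URL = re.compile(r"(?:https?://|www\.)\S+")        # bare URLs
CITATION = re.compile(r"\s*\[\d+\]")               # markers such as [1]


def strip_markup(text: str) -> tuple[str, dict]:
    """Run steps 1-3 in order and report how many URLs and citations were removed."""
    text = EMPHASIS.sub(r"\1", text)               # step 1: keep the emphasized words
    text, urls = URL.subn("", text)                # step 2: drop URLs
    text, citations = CITATION.subn("", text)      # step 3: drop citation markers
    text = re.sub(r"\s{2,}", " ", text).strip()    # collapse gaps left by removals
    return text, {"urls_stripped": urls, "citations_stripped": citations}


shaped, counters = strip_markup("Bel **ICU** voor afspraken vóór 14:00.")
print(shaped)     # the bold markers are gone, ICU and 14:00 are untouched
print(counters)
```

Abbreviation expansion and time spell-out would then operate on the cleaned string, which is why they come later in the order.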
Module specifics
Module: backend/app/services/voice/voice_answer_shaper.py.
The public interface is a single dataclass with one method:
```python
from dataclasses import dataclass


@dataclass(slots=True)
class VoiceAnswerShaper:
    max_sentences: int = 3

    def shape(
        self,
        answer: str,
        language: str = "nl",
        medical_intent_detected: bool = False,
    ) -> tuple[str, bool, dict]:
        """Returns (shaped_answer, voice_shape_compliant, diagnostics)."""
```
The diagnostics dict carries per-call counters useful for telemetry and eval investigation:
```python
{
    "abbreviations_expanded": 2,
    "urls_stripped": 0,
    "citations_stripped": 3,
    "sentences_truncated": False,
}
```
voice_shape_compliant is a boolean safety-check flag — True iff the final output contains no residual http, no www., no [, no **, and the period count is at most max_sentences + 1 (the +1 accounts for an abbreviation dot in names like "St. Luke's"). Low compliance on real traffic surfaces as a drop in the rag_voice_shape_compliance Prometheus histogram — an operator monitors this metric to detect LLM outputs drifting away from voice-shaped behavior.
Per-language abbreviation tables
| Language | Entries |
|---|---|
| nl | ICU → de intensieve zorgafdeling, SEH → de spoedeisende hulp, OK → de operatiekamer |
| en | ICU → the intensive care unit, ER → the emergency room |
| fr | USI → l'unité de soins intensifs, URG → les urgences |
| it | UTI → l'unità di terapia intensiva, PS → il pronto soccorso |
Notable omission: English OR (operating room) is excluded from the table despite being the sibling of Dutch OK. During adversarial review, the substring-match lookup was replaced with \b{abbr}\b word-boundary matching, but word boundaries do not help here, because OR is also a standalone English disjunction ("Either this OR that"). An OR entry would turn "Call the surgeon OR the nurse" into "Call the surgeon the operating room the nurse". Since callers say "operating theater" or "surgical suite" far more naturally than "OR" on the phone anyway, dropping OR from the table is the pragmatic choice.
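The word-boundary lookup can be sketched roughly as follows. The table contents for nl and en are taken from the section above; the `ABBREVIATIONS` and `expand_abbreviations` names, the single alternation regex, and the case-sensitive matching are assumptions for this example rather than a description of the module's internals.

```python
import re

# Entries from the tables above (nl and en only). OR is deliberately absent
# from "en", per the omission note.
ABBREVIATIONS = {
    "nl": {
        "ICU": "de intensieve zorgafdeling",
        "SEH": "de spoedeisende hulp",
        "OK": "de operatiekamer",
    },
    "en": {
        "ICU": "the intensive care unit",
        "ER": "the emergency room",
    },
}


def expand_abbreviations(text: str, language: str) -> tuple[str, int]:
    """Expand known abbreviations; return the new text and the expansion count."""
    table = ABBREVIATIONS.get(language, {})
    if not table:
        return text, 0
    # \b on both sides means ICUx or XICU never match (assumed case-sensitive).
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.subn(lambda m: table[m.group(1)], text)


print(expand_abbreviations("Bel ICU voor afspraken.", "nl"))
print(expand_abbreviations("Call the surgeon OR the nurse", "en"))
```

The second call returns its input unchanged with a count of 0, since the sketch has no OR entry to match.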
Number and time spell-out
Times in HH:MM format are matched by the regex (?<![\d:])(\d{1,2}):(\d{2})(?![\d:]). The leading/trailing [\d:] exclusions prevent matches inside multi-colon timestamps like 2023:12:31, which surfaced as a critical bug during code review. Invalid hours (≥ 24) or minutes (≥ 60) return the original string unchanged — better to have a stubborn 24:00 in the output than a nonsense spelled-out "twenty-four in the morning".
For the hour word, the Dutch and English tables map 0–23 to either the 12-hour form (13 → "één" / "one", 14 → "twee" / "two") or the 24-hour form where appropriate. Minute coverage is deliberately selective: :00 maps to the hour-only form ("twee uur" / "two in the afternoon"), the quarter-hour positions (:15, :30, :45) map to named minutes ("vijftien", "dertig", "vijfenveertig"), and every other minute value leaves the original HH:MM intact. The principle is that a partial spell-out ("nine zero five" for 9:05) is worse than no spell-out at all.
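A minimal sketch of the matching and fall-through behavior, using the regex quoted above. It covers only the Dutch :00 case. Quarter-hour phrasing is left out because this document does not specify how hour and minute words are combined, so in this sketch every non-zero minute stays as digits. The `HOUR_WORDS_NL` list and the `spell_times_nl` name are hypothetical.

```python
import re

# Regex as quoted in this section: no digit or colon directly before or after.
TIME_RE = re.compile(r"(?<![\d:])(\d{1,2}):(\d{2})(?![\d:])")

# Hypothetical 12-hour word list, indexed by hour % 12 (0 and 12 -> "twaalf").
HOUR_WORDS_NL = [
    "twaalf", "één", "twee", "drie", "vier", "vijf",
    "zes", "zeven", "acht", "negen", "tien", "elf",
]


def spell_times_nl(text: str) -> str:
    """Spell out HH:00 times in Dutch; leave everything else untouched."""

    def replace(match: re.Match) -> str:
        hour, minute = int(match.group(1)), int(match.group(2))
        if hour >= 24 or minute >= 60:
            return match.group(0)      # invalid time: keep the original digits
        if minute != 0:
            return match.group(0)      # not covered in this sketch: no partial spell-out
        return f"{HOUR_WORDS_NL[hour % 12]} uur"

    return TIME_RE.sub(replace, text)


print(spell_times_nl("Bel vóór 14:00."))
print(spell_times_nl("Logregel 2023:12:31 en tijd 9:05"))
```

Returning `match.group(0)` from the replacement callback is what keeps the invalid and uncovered cases intact without a second pass over the text.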
The compliance check
The final compliance check is intentionally conservative:
```python
# Substring checks on the final shaped text, plus a period-count cap.
compliant = (
    "http" not in text.lower()
    and "www." not in text.lower()
    and "[" not in text
    and "**" not in text
    and text.count(".") <= self.max_sentences + 1
)
```
A failure here doesn't block the response — the shaped answer still goes out — but voice_shape_compliant=False propagates to the QueryResponse and surfaces in the rag_voice_shape_compliance Prometheus histogram. Operators can set an alert on a compliance-rate drop, which would catch cases where the LLM starts emitting markdown or URLs the transforms didn't anticipate.
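To experiment with the check outside the class, the same conditions can be wrapped in a standalone function. This is only a sketch: the `is_voice_shape_compliant` name and the default of 3 for `max_sentences` (mirroring the dataclass default) are choices made for this example.

```python
def is_voice_shape_compliant(text: str, max_sentences: int = 3) -> bool:
    """Standalone version of the compliance conditions shown above."""
    lowered = text.lower()
    return (
        "http" not in lowered
        and "www." not in lowered
        and "[" not in text
        and "**" not in text
        # +1 leaves room for one abbreviation dot, as in "St. Luke's"
        and text.count(".") <= max_sentences + 1
    )


# Three sentences plus one abbreviation dot: four periods, still compliant.
print(is_voice_shape_compliant("Visit St. Luke's today. Call ahead. Bring ID."))
# Residual bold markers fail the check.
print(is_voice_shape_compliant("Zie **onze** website."))
```

Because the check counts periods rather than parsing sentences, a text with two or more abbreviation dots and three sentences would be flagged as non-compliant, which is consistent with the conservative intent.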
Caller-facing sample
Input (text-channel RAG output, hypothetical):
**Bezoekuren in ZOL Genk:**
De bezoekuren zijn van maandag tot vrijdag, van 14:00 tot 20:00 uur [1]. Voor de ICU gelden andere regels: bezoek is mogelijk op afspraak. Zie onze website https://zol.be/bezoek voor details.
After voice shaping (Dutch, medical intent detected):
Ter informatie, dit is geen medisch advies — Bezoekuren in ZOL Genk. De bezoekuren zijn van maandag tot vrijdag, van twee uur tot acht uur 's avonds. Voor de intensieve zorgafdeling gelden andere regels: bezoek is mogelijk op afspraak.
Changes: disclaimer prepended, bold stripped, URL removed, citation marker removed, 14:00 → twee uur, 20:00 → acht uur 's avonds, ICU → intensieve zorgafdeling, 4th sentence truncated (3-sentence cap). All transforms run in the documented order.
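The 3-sentence cap in the sample can be approximated with a naive splitter. This is a sketch under the assumption that sentences end in ".", "!" or "?" followed by whitespace. The module's actual splitting rule is not described here, and `cap_sentences` is a hypothetical name.

```python
import re


def cap_sentences(text: str, max_sentences: int = 3) -> tuple[str, bool]:
    """Keep at most max_sentences sentences; report whether anything was cut."""
    # Naive split: end punctuation followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) <= max_sentences:
        return text.strip(), False
    return " ".join(sentences[:max_sentences]), True


capped, truncated = cap_sentences("One. Two. Three. Four.")
print(capped, truncated)
```

A splitter this simple would also break after abbreviation dots such as "St.", so a production version would need an exception list or a smarter boundary rule.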
Test coverage
test_voice_answer_shaper.py holds 15 tests:
- 11 per-transform tests (markdown strip, URL strip, citation strip, abbreviation expand, time spell, sentence count, compliance true on clean answer, disclaimer prepend, no disclaimer when non-medical, em-dash preservation).
- 4 regression tests locking in the tightening passes: word-boundary abbreviation (no ICUx match), OR not expanded, timestamp 2023:12:31 not transformed, non-zero minutes not leaking digits.
All 15 pass, and the shaper now runs live inside VoiceLLMOrchestrator on the post-generation path (the legacy VoiceOrchestrator it once shadowed was deleted in commit 158d793 on 2026-05-02 — see ADR-0049 and ADR-0051).
References
- Plank & van Noord, "Abbreviation detection for Dutch natural language processing" (2010) — the precision/recall asymmetry on abbreviation expansion is the basis for the word-boundary + per-language approach.
- Taylor, "Text-to-Speech Synthesis" (2009) §6 — the argument for normalizing text before handing to TTS rather than relying on the TTS engine's own text-normalization, which is engine-specific and brittle.
- IBM Developer, "Accessible Voice Interfaces" (2021) — the short-answer and disclaimer-first principles as applied to healthcare voice UX.