Voice Safety Architecture
This page describes the safety controls active on the voice channel today. It supersedes the earlier "triple-defense" framing: two of the three modules referenced in the original document — stt_ambiguity_guardrail.py and voice_safety_gate.py — were deleted with the legacy 8-stage VoiceOrchestrator pipeline in commit 158d793 (2026-05-02). The voice path now consists of a regex pre-filter, an agentic LLM with tool-grounded retrieval, a regex post-filter, and a re-activated post-LLM disclaimer prepender (Wave 2.C Decision 2, 2026-05-09). Decision lineage: ADR-0049 (thin-pipeline rationale) and ADR-0051 (agentic-only orchestrator).
{/* TODO Wave 2.D: redraw safety architecture diagram to show the new disclaimer step (post-LLM, post-shape) */}
The threat-model framing throughout this page draws on the OWASP LLM Top 10 (@owasp_llm_top10) practitioner taxonomy — in particular LLM01 (prompt injection), LLM03 (training-data-induced misinformation), and LLM06 (sensitive information disclosure) — adapted to the voice-medical-advice surface.
Why voice is a different threat surface
The text channel's safety story benefits from three properties that voice removes:
- A visible disclaimer is re-readable. A spoken caveat is single-shot: once it is mis-heard, drowned out, or skipped over by an impatient or hard-of-hearing caller, it can never be re-checked. Elderly callers, a primary demographic, are also disproportionately likely to miss a fast-spoken caveat.
- STT errors can invert intent. Dutch "Hoe wordt migraine behandeld?" (third-person passive, clearly informational) and "Behandel ik migraine?" (first-person interrogative, clearly advice-seeking) differ by two phonemes. A Flemish-tuned STT model can resolve such a close call the wrong way, so a caller asking the safe form can have their transcript rewritten into the advice-seeking form before it reaches the classifier.
- The caller cannot inspect citations. The text channel's safety story benefits from caller-auditable source URLs. Voice removes that audit trail: if the agent hallucinates, the caller has no way to detect it. Citation discipline therefore moves from a user-visible safeguard to an audit-trail-only one (still recorded in app.conversation_messages.citations but never rendered to the speaker).
What is actually deployed
Stage 1 — Pre-LLM regex pre-filter
Module: backend/app/services/voice/voice_thin_pre_filter.py, function classify_terminal().
Every caller utterance is run through a deterministic regex classifier before the LLM is invoked. The classifier returns one of seven TerminalClass values; the ones referenced on this page are SAFETY_REFUSAL, HANDOFF_REQUEST, and FALLTHROUGH.
Safety-refusal patterns
The SAFETY_REFUSAL class is the safety-critical branch. It absorbs the role formerly played by the standalone stt_ambiguity_guardrail.py and voice_safety_gate.py modules. Its patterns target medical-advice phrasings narrowly enough that benign navigational queries (such as "where can I find information about X?") fall through to the agentic path.
The Dutch matcher in voice_thin_pre_filter.py:170 (_SAFETY_PATTERNS["nl"]):
re.compile(
r"\bhoeveel\b.{0,40}\b(?:nemen|innemen)\b"
r"|\bwelke\s+(?:medicatie|pil|medicijn|dosis)\b"
r"|\bwelk\s+medicijn\b"
r"|\bwelke\s+dosis\b",
re.IGNORECASE,
)
Equivalent patterns exist for English, French, and Italian (covering "how much should I take", "combien dois-je prendre", "quanto devo prendere", and similar). The 40-character bounded gap on the Dutch hoeveel … nemen pattern caps blast radius — a sentence with 41+ characters between hoeveel and nemen is more likely to be a navigational question than a dosage ask.
When SAFETY_REFUSAL fires, the orchestrator returns the language-matched fixed response from _SAFETY_RESPONSES in voice_llm_orchestrator.py:226, offering the helpdesk transfer plus a fallback to the caller's GP / out-of-hours service / 112. The LLM is never invoked. This is the system's hardest safety guarantee: a recognised dosage ask cannot reach the LLM at all.
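The short-circuit described above can be sketched as follows. This is a minimal illustration that assumes only the Dutch pattern shown in this section. The two-member TerminalClass enum, classify_terminal_sketch, and handle_utterance are hypothetical stand-ins, not the contents of voice_thin_pre_filter.py, and the refusal text is a placeholder because the real _SAFETY_RESPONSES strings are not reproduced on this page.

```python
import re
from enum import Enum
from typing import Callable


class TerminalClass(Enum):
    # Only two of the seven values are modelled in this sketch.
    SAFETY_REFUSAL = "safety_refusal"
    FALLTHROUGH = "fallthrough"


# Dutch pattern copied from the listing in this section.
SAFETY_NL = re.compile(
    r"\bhoeveel\b.{0,40}\b(?:nemen|innemen)\b"
    r"|\bwelke\s+(?:medicatie|pil|medicijn|dosis)\b"
    r"|\bwelk\s+medicijn\b"
    r"|\bwelke\s+dosis\b",
    re.IGNORECASE,
)

# Placeholder: the real language-matched template is not shown in this document.
REFUSAL_PLACEHOLDER = "<fixed safety response, nl>"


def classify_terminal_sketch(utterance: str) -> TerminalClass:
    """Return SAFETY_REFUSAL when the Dutch safety pattern matches."""
    if SAFETY_NL.search(utterance):
        return TerminalClass.SAFETY_REFUSAL
    return TerminalClass.FALLTHROUGH


def handle_utterance(utterance: str, call_llm: Callable[[str], str]) -> str:
    """Short-circuit on a safety match so the LLM is never invoked."""
    if classify_terminal_sketch(utterance) is TerminalClass.SAFETY_REFUSAL:
        return REFUSAL_PLACEHOLDER
    return call_llm(utterance)
```

With this pattern, a gap of exactly 40 characters between hoeveel and nemen still matches, while 41 characters falls through to the agentic path.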
Stage 2 — Agentic LLM with tool-grounded retrieval
Module: backend/app/services/voice/voice_llm_orchestrator.py, class VoiceLLMOrchestrator (ADR-0051).
If the pre-filter returns FALLTHROUGH, the orchestrator calls GPT-4.1 with a system prompt and a three-tool schema:
| Tool | Purpose | Safety role |
|---|---|---|
| search_hospital_kb | Wraps RAGService.query and returns a voice-shaped answer + citations + a found boolean | Forces the agent's claims to be grounded in retrieved chunks rather than invented from training data |
| transfer_to_helpdesk | Short-circuits to a SIP REFER-style escalation | Allows the agent to bail out when it can't safely answer |
| end_call | Closes the call on a farewell | Does not produce content; only ends the turn |
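For orientation, a three-tool schema of this kind might be declared as below, using the function-tool layout of the chat-completions API. Only the three tool names come from this page. The function_tool helper, the descriptions, and the single query parameter are assumptions made for illustration; the real schema in voice_llm_orchestrator.py may differ.

```python
from typing import Any, Dict, List, Optional


def function_tool(
    name: str,
    description: str,
    properties: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one function-tool entry for the `tools` list of a request."""
    props = properties or {}
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": props,
                "required": list(props),  # every declared property is required
            },
        },
    }


# Tool names are from the table above; everything else is hypothetical.
VOICE_TOOLS: List[Dict[str, Any]] = [
    function_tool(
        "search_hospital_kb",
        "Search the hospital knowledge base and return a grounded answer.",
        {"query": {"type": "string", "description": "The caller's question."}},
    ),
    function_tool("transfer_to_helpdesk", "Escalate the call to the helpdesk."),
    function_tool("end_call", "Close the call after a farewell."),
]
```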
Three controls keep the LLM honest:
- System-prompt invariants. The system prompt (build_voice_llm_orchestrator_system_prompt in app.prompts) establishes three hard rules: never answer ZOL-specific facts from training data, never give medical advice, and keep responses to one or two sentences. Per the OWASP LLM Top 10, this is LLM01 (prompt injection) mitigation by design: explicit invariants restated at every turn.
- Tool-grounded retrieval. tool_choice="auto" does not by itself force a tool call, so the prompt is shaped to require search_hospital_kb for any factual claim; the LLM's answer is then expected to trace back to a chunk in app.document_chunks. When the search returns found=False twice consecutively, the orchestrator force-transfers to the helpdesk (voice_llm_orchestrator.py:519-556); this defends against the gibberish-rephrase loop pattern from the 2026-05-07 traffic.
- Iteration cap. The tool loop is bounded by voice_llm_orchestrator_max_tool_iterations. On overflow the orchestrator emits a fixed transfer text rather than continuing to spend tokens.
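The last two controls can be sketched as a bounded loop. This is an illustrative sketch, not the orchestrator's code: LLMStep, ToolCall, run_tool_loop, the default cap of 4, and both fixed texts are hypothetical. The LLM is stubbed as a plain callable so the control flow can be exercised without a network call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

TRANSFER_TEXT = "<fixed transfer text>"  # placeholder, real text not shown here
FAREWELL_TEXT = "<fixed farewell text>"  # placeholder, real text not shown here


@dataclass
class ToolCall:
    name: str
    query: str = ""


@dataclass
class LLMStep:
    text: Optional[str] = None
    tool_call: Optional[ToolCall] = None


def run_tool_loop(
    next_step: Callable[[List[Dict]], LLMStep],
    search: Callable[[str], Dict],
    max_iterations: int = 4,  # stands in for voice_llm_orchestrator_max_tool_iterations
    max_consecutive_misses: int = 2,
) -> str:
    history: List[Dict] = []
    misses = 0
    for _ in range(max_iterations):
        step = next_step(history)
        if step.tool_call is None:
            # The model produced a final answer.
            return step.text or TRANSFER_TEXT
        if step.tool_call.name == "end_call":
            return FAREWELL_TEXT
        if step.tool_call.name != "search_hospital_kb":
            # transfer_to_helpdesk, or a tool this sketch does not model.
            return TRANSFER_TEXT
        result = search(step.tool_call.query)
        misses = 0 if result["found"] else misses + 1
        if misses >= max_consecutive_misses:
            # Two consecutive empty searches: force the transfer.
            return TRANSFER_TEXT
        history.append(result)
    # Iteration cap reached: emit the fixed transfer text.
    return TRANSFER_TEXT
```

A successful search resets the miss counter, so only back-to-back empty results trigger the forced transfer.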
Stage 3 — Post-LLM regex safety post-filter
Module: backend/app/services/voice/voice_llm_orchestrator.py, method _safety_post_filter() (line 844).
After the LLM has produced its final text response (or the loop has fallen through to a fallback transfer), the post-filter runs the response through _MEDICAL_ADVICE_RE (defined in voice_llm_orchestrator.py:181-215). The regex covers three classes of medical-advice slip across Dutch, English, French, and Italian:
| Class | Example pattern (Dutch) | What it catches |
|---|---|---|
| Diagnosis commitment | `\bje\s+hebt\s+(?:waarschijnlijk\s+)?(?:de\s+)?(?:griep\|covid\|diabetes\|kanker\|astma)\b` | "je hebt (waarschijnlijk) X": the agent committing to a specific diagnosis |
| Dosage / drug recommendation | `\b\d+\s*(?:mg\|milligram\|gram\|ml)\b`, `\bneem(?:t)?\s+(?:\d+\|een\|twee\|drie)\s+(?:keer\|tablet\|pil\|capsule)\b` | Numeric dosages; intake imperatives such as "neem een tablet" |
| First-aid prescription | `\b(?:druk\|leg\|spuit)\s+(?:stevig\s+)?(?:op\|over\|tegen)\s+(?:de\s+)?(?:wond\|huid)\b` | First-aid imperatives commanding the caller to act on themselves |
On any match, the post-filter logs voice_llm_post_filter_triggered at WARNING and replaces the LLM output with the language-matched _SAFETY_RESPONSES template (the same template used by Stage 1's safety-refusal branch). This is the belt-and-braces guarantee: even if the LLM ignores the system prompt, the offending response is replaced wholesale before it reaches TTS.
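A minimal sketch of this replace-on-match behaviour, using only the Dutch example patterns from the table above. The function name, logger name, and placeholder response are hypothetical. The combined regex is assembled here for illustration and is not the full _MEDICAL_ADVICE_RE.

```python
import logging
import re

logger = logging.getLogger("voice_llm_orchestrator_sketch")

# Dutch example patterns from the table above, joined into one alternation.
MEDICAL_ADVICE_NL = re.compile(
    r"\bje\s+hebt\s+(?:waarschijnlijk\s+)?(?:de\s+)?"
    r"(?:griep|covid|diabetes|kanker|astma)\b"
    r"|\b\d+\s*(?:mg|milligram|gram|ml)\b"
    r"|\bneem(?:t)?\s+(?:\d+|een|twee|drie)\s+(?:keer|tablet|pil|capsule)\b"
    r"|\b(?:druk|leg|spuit)\s+(?:stevig\s+)?(?:op|over|tegen)\s+"
    r"(?:de\s+)?(?:wond|huid)\b",
    re.IGNORECASE,
)

# Placeholder: the real _SAFETY_RESPONSES strings are not shown in this document.
SAFETY_RESPONSE_PLACEHOLDER = "<fixed safety response, nl>"


def safety_post_filter_sketch(llm_text: str) -> str:
    """Replace the whole response when a medical-advice pattern matches."""
    if MEDICAL_ADVICE_NL.search(llm_text):
        logger.warning("voice_llm_post_filter_triggered")
        return SAFETY_RESPONSE_PLACEHOLDER
    return llm_text
```

One property worth checking against the production regex: as transcribed here, the trailing \b after tablet means the singular "neem een tablet" matches, but the plural "neem twee tabletten" does not.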
Layer interaction matrix
The two regex stages and the LLM stage defend against partially overlapping failure modes:
| Failure mode | Stage 1 (pre-filter) | Stage 2 (agentic LLM) | Stage 3 (post-filter) |
|---|---|---|---|
| Caller asks for dosage / prescription | catches via _SAFETY_PATTERNS (no LLM call) | — | catches if Stage 1 missed and the LLM tried to comply |
| Caller asks "should I do X" with medical entity | catches when phrased with the dosage / drug regex | system prompt + prompt-level tool requirement keep the answer navigational | catches diagnostic commitments in the output |
| Caller asks for a doctor recommendation | falls through to LLM | tool-grounded answer cites doctor-list chunks; navigational by construction | — |
| LLM hallucinates a dosage from training data | — | tool requirement makes the hallucination a high-friction path | catches via numeric-dosage regex |
| LLM commits to a diagnosis | — | system prompt explicitly forbids it | catches via diagnosis-commitment regex |
| Caller asks "verbind me door" | catches as HANDOFF_REQUEST (escalates without LLM) | — | — |
| Adversarial prompt-injection attempt in transcript | partial — only safety / handoff / farewell phrasings are pinned | OWASP LLM01 mitigation via system-prompt invariants and tool grounding | partial — only catches medical-advice-shaped output |
Notably absent vs the legacy "triple-defense" model: there is no independent confidence-threshold gate. The deleted voice_safety_gate.py module short-circuited LLM generation when min(intent_confidence, retrieval_confidence) < 0.80. In the current pipeline retrieval confidence is implicit (handled inside RAGService and reflected in the found boolean of the search_hospital_kb tool), and intent confidence is no longer computed because the agentic LLM does not depend on a discrete intent classification.
Stage 4 — Post-LLM medical-content disclaimer prepender
Module: backend/app/services/voice/voice_answer_shaper.py, helper _detect_medical_content_in_answer() (introduced 2026-05-09).
voice_answer_shaper.py carries the per-language disclaimer prepend (Ter informatie, dit is geen medisch advies — … and the en/fr/it equivalents in app.prompts.get_voice_disclaimer). Wave 2.C Decision 2 re-activated the prepender after the bug audit on 2026-05-09 surfaced that voice_llm_orchestrator.py:659 was hard-coding medical_intent_detected=False on every voice turn — a wire left dangling when commit 158d793 deleted the legacy intent classifier.
The mechanism is post-LLM answer-text inspection. The shaper looks at the assistant's actual answer (after RAG, after the LLM, after markdown / URL / citation strip) and decides whether to prepend the disclaimer. This is structurally stronger than pre-LLM intent guessing — the regex evaluates what the system is about to say, not what we predicted the caller meant.
Detection mechanism
Each language carries its own pattern pack covering six clusters of medical vocabulary:
- Body / condition / disease / injury: aandoening, kanker, infectie, breuk (nl) and equivalents.
- Symptoms: koorts, hoofdpijn, pijn, migraine, vermoeidheid (nl) and equivalents.
- Treatment / therapy / medication / surgery: behandeling, medicatie, operatie, revalidatie (nl) and equivalents.
- Diagnostic vocabulary: diagnose, MRI, CT, bloedonderzoek (nl) and equivalents.
- Specialist roles: cardioloog, chirurg, huisarts, pediater (nl) and equivalents (bare arts excluded as too ambiguous).
- Care-domain names: cardiologie, oncologie, neurologie, pediatrie, gynaecologie (nl) and the full Dutch hospital-domain set; equivalents in en/fr/it.
A separate anti-list of purely-navigational vocabulary (parkeren, bezoekuren, telefoonnummer, openingstijden, route per language) is preserved in _NAVIGATIONAL_PATTERNS for documentation and possible future use by a stricter classifier. The current logic is intentionally medical-dominant: any medical-pattern hit triggers the disclaimer regardless of co-occurring navigational vocabulary. Mixed answers ("Cardiology is on floor four, parking is in P3") fire the disclaimer — the safe direction. The regulatory cost of under-disclaim (AI Act Article 50(2)) far outweighs the UX cost of over-disclaim.
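The medical-dominant rule can be sketched with the Dutch vocabulary listed above. This is an assumption-laden illustration: the flat word list, the _compile helper, and detect_medical_content_sketch are hypothetical, and the real pattern packs in voice_answer_shaper.py are larger and cover four languages.

```python
import re
from typing import List

# Dutch sample vocabulary taken from the six clusters listed above.
MEDICAL_TERMS_NL: List[str] = [
    "aandoening", "kanker", "infectie", "breuk",
    "koorts", "hoofdpijn", "pijn", "migraine", "vermoeidheid",
    "behandeling", "medicatie", "operatie", "revalidatie",
    "diagnose", "MRI", "CT", "bloedonderzoek",
    "cardioloog", "chirurg", "huisarts", "pediater",
    "cardiologie", "oncologie", "neurologie", "pediatrie", "gynaecologie",
]
NAVIGATIONAL_TERMS_NL: List[str] = [
    "parkeren", "bezoekuren", "telefoonnummer", "openingstijden", "route",
]


def _compile(terms: List[str]) -> "re.Pattern[str]":
    """Whole-word, case-insensitive alternation over the given terms."""
    return re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )


MEDICAL_NL = _compile(MEDICAL_TERMS_NL)
# Kept for documentation only; the decision below never consults it.
NAVIGATIONAL_NL = _compile(NAVIGATIONAL_TERMS_NL)


def detect_medical_content_sketch(answer: str) -> bool:
    """Medical-dominant: any medical hit wins, whatever else is present."""
    return MEDICAL_NL.search(answer) is not None
```

Because the navigational pack is never consulted, a mixed answer that mentions both cardiologie and parkeren is still flagged.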
Operator signal
Each invocation logs the disclaimer decision at INFO so operators can spot over-firing or under-firing from logs alone:
voice_disclaimer_decision language=nl detected=True prepend=True
This is the R1 silent-failure-discipline log line: the implicit "did we prepend?" decision, treated like a collection-returning function, emits its result on every turn instead of failing quietly. Expect roughly one line per LLM completion; that volume is fine for the pilot.
Caller-asserted override
The shape() method retains the medical_intent_detected: bool = False parameter for backward compatibility. Its semantics changed on 2026-05-09:
- False (default): "I don't know; you decide." The shaper auto-detects via the regex packs.
- True: "I know this is medical; prepend it." The shaper honors the assertion and skips auto-detection.
The orchestrator's call site (voice_llm_orchestrator.py:656-662) now passes the default and lets the shaper decide.
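The override semantics and the per-turn log line can be sketched together. Everything here is illustrative: shape_sketch, the four-word detector, and the logger name are hypothetical, the disclaimer string is only the opening clause quoted earlier (the remainder is elided in this document), and the value logged for detected when the caller asserts True is an assumption.

```python
import logging
import re

logger = logging.getLogger("voice_answer_shaper_sketch")

# Opening clause only; the full disclaimer text is elided in this document.
DISCLAIMER_NL = "Ter informatie, dit is geen medisch advies."

# Tiny stand-in for the Dutch medical pattern pack.
_MEDICAL_NL = re.compile(
    r"\b(?:behandeling|medicatie|cardiologie|koorts)\b", re.IGNORECASE
)


def shape_sketch(
    answer: str,
    language: str = "nl",
    medical_intent_detected: bool = False,
) -> str:
    """Prepend the disclaimer when asserted by the caller or auto-detected."""
    if medical_intent_detected:
        # Caller assertion wins; auto-detection is skipped.
        detected = True
    else:
        # Default: the shaper decides from the answer text itself.
        detected = _MEDICAL_NL.search(answer) is not None
    prepend = detected
    logger.info(
        "voice_disclaimer_decision language=%s detected=%s prepend=%s",
        language, detected, prepend,
    )
    return f"{DISCLAIMER_NL} {answer}" if prepend else answer
```

Passing the default from the call site keeps the decision in one place, so a dangling hard-coded False can no longer suppress the disclaimer.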
Layer-matrix update
| Failure mode | Stage 1 (pre-filter) | Stage 2 (LLM) | Stage 3 (post-filter) | Stage 4 (disclaimer) |
|---|---|---|---|---|
| LLM produces a navigational answer about a condition | — | navigational-by-construction | — | prepends disclaimer |
| LLM produces a treatment overview | — | tool-grounded by RAG citation | — | prepends disclaimer |
| LLM produces a parking / hours answer | — | navigational | — | silent (no medical pattern) |
| LLM commits to a diagnosis | — | system-prompt invariant | catches via diagnosis-commitment regex | (replaced by refusal template) |
Evaluation
The voice golden seed (backend/app/evaluation/data/voice_golden_seed.json, 30 questions) contains explicit out-of-scope-medical-advice entries per non-nl language plus STT-ambiguity traps. The VoiceEvaluator reports per-class outcomes (safety_refusal should match for medical-advice seeds; escalate should match for handoff seeds). A regression on the regex pre-filter surfaces as a spike in safety_refusal false-negatives on the seed run.
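A per-class tally of the kind described above might look like the sketch below. The seed field names (question, expected_class) and both helper functions are assumptions; the actual schema of voice_golden_seed.json and the VoiceEvaluator API are not shown on this page.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Seed = Dict[str, str]  # hypothetical shape: {"question": ..., "expected_class": ...}


def per_class_outcomes(
    seeds: List[Seed], classify: Callable[[str], str]
) -> "Counter[Tuple[str, str]]":
    """Count (expected_class, 'match' | 'miss') pairs over the seed set."""
    outcomes: "Counter[Tuple[str, str]]" = Counter()
    for seed in seeds:
        ok = classify(seed["question"]) == seed["expected_class"]
        outcomes[(seed["expected_class"], "match" if ok else "miss")] += 1
    return outcomes


def safety_false_negatives(
    seeds: List[Seed], classify: Callable[[str], str]
) -> List[Seed]:
    """Seeds that should have been refused but were classified otherwise."""
    return [
        seed
        for seed in seeds
        if seed["expected_class"] == "safety_refusal"
        and classify(seed["question"]) != "safety_refusal"
    ]
```

A non-empty safety_false_negatives list on a seed run is the regression signal for the Stage 1 patterns.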
The voice_llm_post_filter_triggered log line is the operational signal for Stage 3. Per-tenant Prometheus counters are in scope for a Phase B follow-up; the per-call structured logs are sufficient for the pilot.
References
- OWASP Foundation. OWASP Top 10 for Large Language Model Applications. See @owasp_llm_top10 — practitioner taxonomy of LLM-application threats; we apply LLM01 (prompt injection), LLM03 (training-data-induced misinformation), and LLM06 (sensitive information disclosure) to the medical-advice safety surface.
- Clark & Brennan, "Grounding in Communication" (1991) — the audit-trail argument that citations are a safety artifact, not a display detail.
- European Hospital Federation, "Patient Safety in Telephone Triage" (2019) — empirical basis for treating voice medical advice as a distinct risk surface.
- Regulation (EU) 2024/1689, "AI Act" Article 50(2): transparency-for-voice obligations are discharged both at the greeting layer (the "we are an information assistant" statement) and per-turn via the Stage 4 disclaimer prepender, which fires on any answer matching the medical-content pattern packs in _detect_medical_content_in_answer() (backend/app/services/voice/voice_answer_shaper.py).