Voice Safety Architecture

This page describes the safety controls active on the voice channel today. It supersedes the earlier "triple-defense" framing: two of the three modules referenced in the original document — stt_ambiguity_guardrail.py and voice_safety_gate.py — were deleted with the legacy 8-stage VoiceOrchestrator pipeline in commit 158d793 (2026-05-02). The voice path now consists of a regex pre-filter, an agentic LLM with tool-grounded retrieval, a regex post-filter, and a re-activated post-LLM disclaimer prepender (Wave 2.C Decision 2, 2026-05-09). Decision lineage: ADR-0049 (thin-pipeline rationale) and ADR-0051 (agentic-only orchestrator).

{/* TODO Wave 2.D: redraw safety architecture diagram to show the new disclaimer step (post-LLM, post-shape) */}

The threat-model framing throughout this page draws on the OWASP LLM Top 10 (@owasp_llm_top10) practitioner taxonomy — in particular LLM01 (prompt injection), LLM03 (training-data-induced misinformation), and LLM06 (sensitive information disclosure) — adapted to the voice-medical-advice surface.

Why voice is a different threat surface

The text channel's safety story benefits from three properties that voice removes:

A visible disclaimer is re-readable. A spoken caveat is single-shot: once it is mis-heard, drowned out, or skipped over by an impatient or hard-of-hearing caller, it can never be re-checked. Elderly callers, a primary demographic, are also disproportionately likely to miss a fast-spoken caveat.
STT errors can invert intent. Dutch "Hoe wordt migraine behandeld?" (third-person passive, clearly informational) and "Behandel ik migraine?" (first-person active, clearly advice-seeking) differ by two phonemes. A Flemish-tuned STT model can resolve such a close call the wrong way, so a caller asking the safe form can have their transcript rewritten into the advice-seeking form before it reaches the classifier.
The caller cannot inspect citations. The text channel's safety story benefits from caller-auditable source URLs. Voice removes that audit trail — if the agent hallucinates, the caller has no way to detect it. Citation discipline therefore moves from a user-visible safeguard to an audit-trail-only one (still recorded in app.conversation_messages.citations but never rendered to the speaker).

What is actually deployed

Stage 1 — Pre-LLM regex pre-filter

Module: backend/app/services/voice/voice_thin_pre_filter.py, function classify_terminal().

Every caller utterance is run through a deterministic regex classifier before the LLM is invoked. The classifier returns one of seven TerminalClass values; the ones referenced on this page are SAFETY_REFUSAL, HANDOFF_REQUEST, and FALLTHROUGH.

Safety-refusal patterns

The SAFETY_REFUSAL class is the safety-critical branch. It absorbs the role formerly played by the standalone stt_ambiguity_guardrail.py and voice_safety_gate.py modules. Its patterns target medical-advice phrasings narrowly enough that benign navigational queries (such as "where can I find information about X?") fall through to the agentic path.

The Dutch matcher in voice_thin_pre_filter.py:170 (_SAFETY_PATTERNS["nl"]):

re.compile(
    r"\bhoeveel\b.{0,40}\b(?:nemen|innemen)\b"
    r"|\bwelke\s+(?:medicatie|pil|medicijn|dosis)\b"
    r"|\bwelk\s+medicijn\b"
    r"|\bwelke\s+dosis\b",
    re.IGNORECASE,
)

Equivalent patterns exist for English, French, and Italian (covering "how much should I take", "combien dois-je prendre", "quanto devo prendere", and similar). The 40-character bounded gap on the Dutch hoeveel … nemen pattern caps blast radius — a sentence with 41+ characters between hoeveel and nemen is more likely to be a navigational question than a dosage ask.

When SAFETY_REFUSAL fires, the orchestrator returns the language-matched fixed response from _SAFETY_RESPONSES in voice_llm_orchestrator.py:226, offering the helpdesk transfer plus a fallback to the caller's GP / out-of-hours service / 112. The LLM is never invoked. This is the system's hardest safety guarantee: a recognised dosage ask cannot reach the LLM at all.
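
As a rough illustration of the Stage 1 behaviour described above, the sketch below wraps the Dutch pattern quoted on this page in a cut-down classifier. The enum lists only the three class names this page mentions (the real TerminalClass has seven values). The function name classify_terminal_sketch, the handoff pattern, and the check order are assumptions made for illustration, not the contents of voice_thin_pre_filter.py.

```python
import re
from enum import Enum


class TerminalClass(Enum):
    # Only the members named on this page; the real enum has seven values.
    SAFETY_REFUSAL = "safety_refusal"
    HANDOFF_REQUEST = "handoff_request"
    FALLTHROUGH = "fallthrough"


# Dutch safety pattern as quoted on this page.
SAFETY_NL = re.compile(
    r"\bhoeveel\b.{0,40}\b(?:nemen|innemen)\b"
    r"|\bwelke\s+(?:medicatie|pil|medicijn|dosis)\b"
    r"|\bwelk\s+medicijn\b"
    r"|\bwelke\s+dosis\b",
    re.IGNORECASE,
)

# Hypothetical handoff pattern, for illustration only.
HANDOFF_NL = re.compile(r"\bverbind\s+me\s+door\b", re.IGNORECASE)


def classify_terminal_sketch(utterance: str) -> TerminalClass:
    """Classify one utterance; the safety check runs first so it always wins."""
    if SAFETY_NL.search(utterance):
        return TerminalClass.SAFETY_REFUSAL
    if HANDOFF_NL.search(utterance):
        return TerminalClass.HANDOFF_REQUEST
    # Anything else goes on to the agentic LLM path.
    return TerminalClass.FALLTHROUGH
```

In this sketch "Hoeveel paracetamol mag ik nemen?" is classified as a safety refusal, while a sentence with more than 40 characters between hoeveel and nemen falls through, which is the bounded-gap behaviour described above.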

Stage 2 — Agentic LLM with tool-grounded retrieval

Module: backend/app/services/voice/voice_llm_orchestrator.py, class VoiceLLMOrchestrator (ADR-0051).

If the pre-filter returns FALLTHROUGH, the orchestrator calls GPT-4.1 with a system prompt and a three-tool schema:

| Tool | Purpose | Safety role |
| --- | --- | --- |
| `search_hospital_kb` | Wraps `RAGService.query` and returns a voice-shaped answer + citations + a `found` boolean | Forces the agent's claims to be grounded in retrieved chunks rather than invented from training data |
| `transfer_to_helpdesk` | Short-circuits to a SIP REFER-style escalation | Allows the agent to bail out when it can't safely answer |
| `end_call` | Closes the call on a farewell | Does not produce content; only ends the turn |

Three controls keep the LLM honest:

System-prompt invariants. The system prompt (build_voice_llm_orchestrator_system_prompt in app.prompts) establishes three hard rules: never answer ZOL-specific facts from training data, never give medical advice, and keep responses to one or two sentences. Per the OWASP LLM Top 10, this is LLM01 (prompt injection) mitigation by design — explicit invariants restated at every turn.
Tool-grounded retrieval. tool_choice is set to "auto" and the prompt requires search_hospital_kb for any factual claim, so the LLM's answer is expected to trace back to a chunk in app.document_chunks. When the search returns found=False twice consecutively, the orchestrator force-transfers to the helpdesk (voice_llm_orchestrator.py:519-556). This defends against the gibberish-rephrase loop pattern from the 2026-05-07 traffic.
Iteration cap. The tool loop is bounded by voice_llm_orchestrator_max_tool_iterations. On overflow the orchestrator emits a fixed transfer text rather than continuing to spend tokens.
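
The second and third controls can be pictured as a small bounded loop. The sketch below is a minimal model of that control flow under stated assumptions: Step, run_tool_loop, the callable signatures, the default of four iterations, and the placeholder transfer text are all hypothetical, and the real orchestrator talks to the model API rather than to a scripted callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder; the real fixed transfer text is not quoted on this page.
TRANSFER_TEXT = "<fixed transfer text>"


@dataclass
class Step:
    """One model turn: a tool call (tool is set) or a final answer (text is set)."""
    tool: Optional[str] = None
    query: str = ""
    text: str = ""


def run_tool_loop(
    next_step: Callable[[Optional[dict]], Step],
    search: Callable[[str], dict],
    max_iterations: int = 4,  # stand-in for voice_llm_orchestrator_max_tool_iterations
) -> str:
    consecutive_misses = 0
    last_result: Optional[dict] = None
    for _ in range(max_iterations):
        step = next_step(last_result)
        if step.tool is None:
            return step.text  # the model produced its final answer
        if step.tool == "transfer_to_helpdesk":
            return TRANSFER_TEXT
        if step.tool == "end_call":
            return step.text
        # Remaining tool in the three-tool schema: search_hospital_kb.
        last_result = search(step.query)
        if last_result["found"]:
            consecutive_misses = 0
        else:
            consecutive_misses += 1
            if consecutive_misses >= 2:
                return TRANSFER_TEXT  # two empty searches in a row
    return TRANSFER_TEXT  # iteration cap reached
```

Resetting the miss counter on a successful search keeps the force-transfer tied to consecutive misses, as described above, rather than to the total number of misses in a call.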

Stage 3 — Post-LLM regex safety post-filter

Module: backend/app/services/voice/voice_llm_orchestrator.py, method _safety_post_filter() (line 844).

After the LLM has produced its final text response (or the loop has fallen through to a fallback transfer), the post-filter runs the response through _MEDICAL_ADVICE_RE (defined in voice_llm_orchestrator.py:181-215). The regex covers three classes of medical-advice slip across Dutch, English, French, and Italian:

| Class | Example pattern (Dutch) | What it catches |
| --- | --- | --- |
| Diagnosis commitment | `\bje\s+hebt\s+(?:waarschijnlijk\s+)?(?:de\s+)?(?:griep\|covid\|diabetes\|kanker\|astma)\b` | "je hebt (waarschijnlijk) X": the agent committing to a specific diagnosis |
| Dosage / drug recommendation | `\b\d+\s*(?:mg\|milligram\|gram\|ml)\b`, `\bneem(?:t)?\s+(?:\d+\|een\|twee\|drie)\s+(?:keer\|tablet\|pil\|capsule)\b` | Numeric dosages and intake instructions such as "neem een tablet" |
| First-aid prescription | `\b(?:druk\|leg\|spuit)\s+(?:stevig\s+)?(?:op\|over\|tegen)\s+(?:de\s+)?(?:wond\|huid)\b` | First-aid imperatives commanding the caller to act on themselves |

On any match, the post-filter logs voice_llm_post_filter_triggered at WARNING and replaces the LLM output with the language-matched _SAFETY_RESPONSES template (the same template used by Stage 1's safety-refusal branch). This is the belt-and-braces guarantee: even if the LLM ignores the system prompt, the whole response is replaced by the refusal template before it reaches TTS.
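
A minimal sketch of this replace-on-match behaviour, using only the Dutch example patterns from the table above (with the table's pipe escapes removed). The names MEDICAL_ADVICE_NL, SAFETY_RESPONSE_NL, and safety_post_filter_sketch are hypothetical, the refusal text is a placeholder, and the real _MEDICAL_ADVICE_RE covers four languages and may contain more alternatives than shown here.

```python
import logging
import re

logger = logging.getLogger("voice_llm_orchestrator_sketch")

# Dutch example patterns from the table, joined into one alternation.
MEDICAL_ADVICE_NL = re.compile(
    r"\bje\s+hebt\s+(?:waarschijnlijk\s+)?(?:de\s+)?(?:griep|covid|diabetes|kanker|astma)\b"
    r"|\b\d+\s*(?:mg|milligram|gram|ml)\b"
    r"|\bneem(?:t)?\s+(?:\d+|een|twee|drie)\s+(?:keer|tablet|pil|capsule)\b"
    r"|\b(?:druk|leg|spuit)\s+(?:stevig\s+)?(?:op|over|tegen)\s+(?:de\s+)?(?:wond|huid)\b",
    re.IGNORECASE,
)

# Placeholder; the real template lives in _SAFETY_RESPONSES and is not quoted here.
SAFETY_RESPONSE_NL = "<Dutch safety-refusal template>"


def safety_post_filter_sketch(llm_text: str) -> str:
    """Return the LLM text unchanged, or the refusal template on any match."""
    if MEDICAL_ADVICE_NL.search(llm_text):
        logger.warning("voice_llm_post_filter_triggered")
        return SAFETY_RESPONSE_NL  # the whole answer is replaced, not edited
    return llm_text
```

Replacing the entire answer, rather than cutting out the matched span, avoids speaking a half-sentence whose remaining words still imply the advice.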

Layer interaction matrix

The two regex stages and the LLM stage defend against partially-overlapping failure modes:

| Failure mode | Stage 1 (pre-filter) | Stage 2 (agentic LLM) | Stage 3 (post-filter) |
| --- | --- | --- | --- |
| Caller asks for dosage / prescription | catches via `_SAFETY_PATTERNS` (no LLM call) | — | catches if Stage 1 missed and the LLM tried to comply |
| Caller asks "should I do X" with medical entity | catches when phrased with the dosage / drug regex | system prompt + prompt-level tool requirement keep the answer navigational | catches diagnostic commitments in the output |
| Caller asks for a doctor recommendation | falls through to LLM | tool-grounded answer cites doctor-list chunks; navigational by construction | — |
| LLM hallucinates a dosage from training data | — | tool requirement makes the hallucination a high-friction path | catches via numeric-dosage regex |
| LLM commits to a diagnosis | — | system prompt explicitly forbids it | catches via diagnosis-commitment regex |
| Caller asks "verbind me door" | catches as `HANDOFF_REQUEST` (escalates without LLM) | — | — |
| Adversarial prompt-injection attempt in transcript | partial: only safety / handoff / farewell phrasings are pinned | OWASP LLM01 mitigation via system-prompt invariants and tool grounding | partial: only catches medical-advice-shaped output |

Notably absent vs the legacy "triple-defense" model: there is no independent confidence-threshold gate. The deleted voice_safety_gate.py module short-circuited LLM generation when min(intent_confidence, retrieval_confidence) < 0.80. In the current pipeline retrieval confidence is implicit (handled inside RAGService and reflected in the found boolean of the search_hospital_kb tool), and intent confidence is no longer computed because the agentic LLM does not depend on a discrete intent classification.

Stage 4 — Post-LLM medical-content disclaimer prepender

Module: backend/app/services/voice/voice_answer_shaper.py, helper _detect_medical_content_in_answer() (introduced 2026-05-09).

voice_answer_shaper.py carries the per-language disclaimer prepend (Ter informatie, dit is geen medisch advies — … and the en/fr/it equivalents in app.prompts.get_voice_disclaimer). Wave 2.C Decision 2 re-activated the prepender after the bug audit on 2026-05-09 surfaced that voice_llm_orchestrator.py:659 was hard-coding medical_intent_detected=False on every voice turn — a wire left dangling when commit 158d793 deleted the legacy intent classifier.

The mechanism is post-LLM answer-text inspection. The shaper looks at the assistant's actual answer (after RAG, after the LLM, after markdown / URL / citation strip) and decides whether to prepend the disclaimer. This is structurally stronger than pre-LLM intent guessing — the regex evaluates what the system is about to say, not what we predicted the caller meant.

Detection mechanism

Each language carries its own pattern pack covering six clusters of medical vocabulary:

Body / condition / disease / injury — aandoening, kanker, infectie, breuk (nl) and equivalents.
Symptoms — koorts, hoofdpijn, pijn, migraine, vermoeidheid (nl) and equivalents.
Treatment / therapy / medication / surgery — behandeling, medicatie, operatie, revalidatie (nl) and equivalents.
Diagnostic vocabulary — diagnose, MRI, CT, bloedonderzoek (nl) and equivalents.
Specialist roles — cardioloog, chirurg, huisarts, pediater (nl) and equivalents (bare arts excluded — too ambiguous).
Care-domain names — cardiologie, oncologie, neurologie, pediatrie, gynaecologie (nl) and the full Dutch hospital-domain set; equivalents in en/fr/it.

A separate anti-list of purely-navigational vocabulary (parkeren, bezoekuren, telefoonnummer, openingstijden, route per language) is preserved in _NAVIGATIONAL_PATTERNS for documentation and possible future use by a stricter classifier. The current logic is intentionally medical-dominant: any medical-pattern hit triggers the disclaimer regardless of co-occurring navigational vocabulary. Mixed answers ("Cardiology is on floor four, parking is in P3") fire the disclaimer — the safe direction. The regulatory cost of under-disclaim (AI Act Article 50(2)) far outweighs the UX cost of over-disclaim.
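
The medical-dominant rule can be sketched as follows. The vocabulary is only the Dutch sample listed above, not the full pattern packs, and the names MEDICAL_NL, NAVIGATIONAL_NL, and detect_medical_content_sketch are hypothetical stand-ins for what _detect_medical_content_in_answer() does internally.

```python
import re

# Dutch sample of the six clusters listed above; the real packs are larger.
MEDICAL_NL = re.compile(
    r"\b(?:aandoening|kanker|infectie|breuk"         # condition / injury
    r"|koorts|hoofdpijn|pijn|migraine|vermoeidheid"  # symptoms
    r"|behandeling|medicatie|operatie|revalidatie"   # treatment
    r"|diagnose|mri|ct|bloedonderzoek"               # diagnostics
    r"|cardioloog|chirurg|huisarts|pediater"         # specialist roles
    r"|cardiologie|oncologie|neurologie|pediatrie|gynaecologie)\b",  # care domains
    re.IGNORECASE,
)

# Navigational anti-list: kept for documentation, not consulted by the decision.
NAVIGATIONAL_NL = re.compile(
    r"\b(?:parkeren|bezoekuren|telefoonnummer|openingstijden|route)\b",
    re.IGNORECASE,
)


def detect_medical_content_sketch(answer: str) -> bool:
    """Medical-dominant: a navigational hit never suppresses a medical hit."""
    return MEDICAL_NL.search(answer) is not None
```

With this rule a mixed answer that mentions both cardiologie and parkeren is treated as medical, while an answer that only mentions parkeren and bezoekuren is not.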

Operator signal

Each invocation logs the disclaimer decision at INFO so operators can spot over-firing or under-firing from logs alone:

voice_disclaimer_decision language=nl detected=True prepend=True

This is the R1 silent-failure-discipline log line — a collection-returning function (the implicit "did we prepend?" decision) that emits its result on every turn instead of failing quietly. Per-turn cardinality is ~1 per LLM completion; the volume is fine for the pilot.

Caller-asserted override

The shape() method retains the medical_intent_detected: bool = False parameter for backward compatibility. Its semantics changed on 2026-05-09:

False (default) — "I don't know; you decide." The shaper auto-detects via the regex packs.
True — "I know this is medical; prepend it." The shaper honors the assertion and skips auto-detection.

The orchestrator's call site (voice_llm_orchestrator.py:656-662) now passes the default and lets the shaper decide.
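
The override semantics, together with the per-turn log line from the Operator signal section, might look like the sketch below. The function shape_sketch, the tiny pattern, and the placeholder disclaimer string are hypothetical. Logging detected=None when auto-detection is skipped is an assumption of this sketch; this page does not say what the real shaper logs in that case.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("voice_answer_shaper_sketch")

# Placeholder; the real text comes from app.prompts.get_voice_disclaimer.
DISCLAIMER_NL = "<Dutch disclaimer> "

# Tiny stand-in for the Dutch medical pattern pack.
MEDICAL_NL = re.compile(r"\b(?:behandeling|migraine|neurologie)\b", re.IGNORECASE)


def shape_sketch(
    answer: str,
    language: str = "nl",
    medical_intent_detected: bool = False,
) -> str:
    detected: Optional[bool]
    if medical_intent_detected:
        # Caller asserts medical content: honor it and skip auto-detection.
        detected = None
        prepend = True
    else:
        # Default: "I don't know; you decide" via the regex pack.
        detected = MEDICAL_NL.search(answer) is not None
        prepend = detected
    logger.info(
        "voice_disclaimer_decision language=%s detected=%s prepend=%s",
        language, detected, prepend,
    )
    return DISCLAIMER_NL + answer if prepend else answer
```

Passing the default from the call site means a dangling False can no longer silently disable the disclaimer, which is the failure the 2026-05-09 audit found.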

Layer-matrix update

| Failure mode | Stage 1 (pre-filter) | Stage 2 (LLM) | Stage 3 (post-filter) | Stage 4 (disclaimer) |
| --- | --- | --- | --- | --- |
| LLM produces a navigational answer about a condition | — | navigational-by-construction | — | prepends disclaimer |
| LLM produces a treatment overview | — | tool-grounded by RAG citation | — | prepends disclaimer |
| LLM produces a parking / hours answer | — | navigational | — | silent (no medical pattern) |
| LLM commits to a diagnosis | — | system-prompt invariant | catches via diagnosis-commitment regex | (replaced by refusal template) |

Evaluation

The voice golden seed (backend/app/evaluation/data/voice_golden_seed.json, 30 questions) contains explicit out-of-scope-medical-advice entries per non-nl language plus STT-ambiguity traps. The VoiceEvaluator reports per-class outcomes (safety_refusal should match for medical-advice seeds; escalate should match for handoff seeds). A regression on the regex pre-filter surfaces as a spike in safety_refusal false-negatives on the seed run.
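
To make the regression signal concrete, the sketch below counts per-class false negatives over a list of seed results. The row shape (dicts with expected and predicted keys) and the function name are assumptions for illustration; this page does not describe the schema of voice_golden_seed.json or the VoiceEvaluator report format.

```python
from collections import Counter
from typing import Iterable, Mapping


def false_negatives_by_class(results: Iterable[Mapping[str, str]]) -> Counter:
    """Count seeds whose expected class was not the class actually produced."""
    misses: Counter = Counter()
    for row in results:
        if row["predicted"] != row["expected"]:
            # Attribute the miss to the class that should have fired.
            misses[row["expected"]] += 1
    return misses
```

A seed run could then compare misses["safety_refusal"] against the previous run and flag any increase as a possible pre-filter regression.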

The voice_llm_post_filter_triggered log line is the operational signal for Stage 3. Per-tenant Prometheus counters are in scope for a Phase B follow-up; the per-call structured logs are sufficient for the pilot.

References

OWASP Foundation. OWASP Top 10 for Large Language Model Applications. See @owasp_llm_top10 — practitioner taxonomy of LLM-application threats; we apply LLM01 (prompt injection), LLM03 (training-data-induced misinformation), and LLM06 (sensitive information disclosure) to the medical-advice safety surface.
Clark & Brennan, "Grounding in Communication" (1991) — the audit-trail argument that citations are a safety artifact, not a display detail.
European Hospital Federation, "Patient Safety in Telephone Triage" (2019) — empirical basis for treating voice medical advice as a distinct risk surface.
Regulation (EU) 2024/1689, "AI Act" Article 50(2): transparency-for-voice obligations are discharged both at the greeting layer (the "we are an information assistant" statement) and per-turn via the Stage 4 disclaimer prepender, which fires on any answer matching the medical-content pattern packs in _detect_medical_content_in_answer() (backend/app/services/voice/voice_answer_shaper.py).
