Safety Architecture

Critical Constraint

The ZOL Intelligent Search system must NEVER provide medical advice. This is not a preference or a guideline — it is an absolute, non-negotiable safety requirement that permeates every layer of the architecture. The system is positioned as a search tool (zoektool), not a clinical decision support system; this distinction is load-bearing for both medical-ethics duty of care and EU regulatory classification (see EU AI Act Compliance).

The safety imperative

Operating an AI system in the healthcare domain carries asymmetric risk. The cost function is non-linear: a poor product recommendation costs minutes; a mis-stated dosage can cost a life. Four failure classes drive the design:

Failure	Mechanism	Regulatory consequence
Delayed care	A patient receives reassuring-sounding information about symptoms that warrant urgent assessment, and defers contacting a clinician.	Possible breach of duty of care; tort exposure under Belgian medical-liability law.
Mis-direction	Incorrect department or doctor information sends a patient to the wrong specialist.	Service-quality complaint; reputational.
Implicit advice	A response uses imperative ("you should take..."), declarative ("this condition requires..."), or dosage language that a layperson reads as instruction.	Possible MDR (Medical Device Regulation, Regulation (EU) 2017/745) re-classification as software-as-a-medical-device, requiring CE marking.
Liability accretion	The hospital, as the data controller and channel operator, bears legal responsibility for content delivered through its digital surfaces under both general tort law and sector-specific regulation.	Direct hospital exposure independent of vendor liability.

The system answers these by enforcing defense in depth — the principle that no single mechanism is trusted to be correct on its own — adapted to the LLM-application threat model documented in the OWASP LLM Top 10, in particular LLM01 (prompt injection), LLM06 (sensitive information disclosure), and LLM09 (over-reliance). The OWASP framing maps the technical controls to attacker capabilities rather than hand-waving "best practices"; the layered architecture below is the concrete realisation.

Multi-layer architecture

The layers are independent: each is implemented in its own module, configured by an independent feature flag, and tested with its own regression suite. The independence requirement is intentional — a coupled stack collapses to its weakest link, and OWASP LLM01 (prompt injection) attacks specifically target single-point-of-trust designs.

Layer 1: intent classification guard

Implementation: backend/app/services/intent_classification_service.py. Voice-channel equivalent: backend/app/services/voice/voice_thin_pre_filter.py::classify_terminal().

The first line of defence operates before any retrieval or generation. The intent classifier (Tier 2 model) analyses every incoming query and assigns one of twelve intent categories. Four trigger an immediate safety block: out_of_scope_medical_advice, off_topic, other_hospital, vague_input. The remaining eight intents proceed to retrieval and generation.
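The routing decision described above can be sketched as a simple set lookup. This is a minimal illustration, not the contents of intent_classification_service.py: the four blocking intent names come from the text, while `Route` and `route_intent` are hypothetical names, and the eight non-blocking categories are not enumerated in this section, so they are left unspecified.

```python
from enum import Enum

# The four blocking intents named in this section. The remaining eight
# categories are not listed here, so they are deliberately not spelled out.
BLOCKING_INTENTS = frozenset({
    "out_of_scope_medical_advice",
    "off_topic",
    "other_hospital",
    "vague_input",
})


class Route(Enum):
    BLOCK = "block"      # return a canned message, skip retrieval and generation
    PROCEED = "proceed"  # continue to retrieval and generation


def route_intent(intent: str) -> Route:
    """Hypothetical router: decide the path before any retrieval happens."""
    return Route.BLOCK if intent in BLOCKING_INTENTS else Route.PROCEED
```

Because the decision is made before retrieval, a blocked query never incurs retrieval or generation cost.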

Trade-offs

Alternative considered	Why rejected
Single regex layer (no LLM classifier)	Regex is fast (sub-millisecond) but cannot resolve the lexical ambiguity that drives most medical-advice slips. "Hoe wordt migraine behandeld?" (third-person passive, navigational) and "Behandel ik migraine?" (first-person interrogative, advice-seeking) share the same verb and condition noun and are lexically very close; only an LLM with semantic context distinguishes the two reliably. We keep the regex (Layer 2) as a complement, not a replacement.
Single LLM safety judge (no early classifier)	A single late-stage judge means every benign query pays the LLM-judge latency tax (~500–800 ms), and unsafe queries have already consumed retrieval and generation compute whose output is then discarded. Early classification short-circuits both costs.
Block all symptom-mentioning queries unconditionally	False-positive rate would be catastrophic. "Waar is de afdeling cardiologie?" mentions a condition class but is purely navigational. Patients would be funnelled to the helpdesk for routine wayfinding.

Safety message

When medical advice is detected, the system returns a standardised Dutch message:

"Ik kan geen medisch advies geven. Voor medische vragen kunt u contact opnemen met uw huisarts of met ZOL via 089 32 50 50. Ik help u graag met informatie over onze afdelingen, artsen, en diensten."

(I cannot provide medical advice. For medical questions, contact your general practitioner or ZOL at 089 32 50 50. I am happy to help with information about our departments, doctors, and services.)

The message: states the limitation explicitly; redirects to a clinician (general practitioner first, hospital phone as fallback); offers continued help on informational queries; cites the hospital's main phone number as a verifiable escape hatch. Per GDPR Art. 22 (automated decision-making), the redirect is structured as a helpful re-route rather than a denial of service — the user is never left without a path forward.

Refuse-and-link (June 2026 — policy reversal)

A live jury demonstration on 2026-06-01 surfaced a real failure of the original refuse-everything policy: asked for an 11-year-old's ibuprofen dose, the system fabricated one by extrapolating past the end of a brochure's dosing table — and the answer carried a citation, so a citation-presence audit had rated it grounded. It was not grounded; it was invented from a real source's structure.

The medical-advice policy was reversed accordingly. The block branch no longer dead-ends; it refuses the advice and links the authoritative brochure:

The patient is never given a dose. They are given the document they are entitled to read, plus the clinician to consult.

This is enforced by a hybrid medical-claim guard (app/services/safety_service.py, behind medical_claim_classifier_enabled):

A regex floor catches numeric dose claims deterministically (no false negatives on the dosing patterns it covers).
A gpt-4.1-nano classifier, gated to fire only on medical-adjacent intents, catches the prose failures regex cannot — causation ("X causes Y"), diagnosis, and false-reassurance ("many patients in your situation find that…"). It never runs on navigational traffic.
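The two-part guard above can be sketched as follows. This is an illustration under stated assumptions, not the code in safety_service.py: `DOSE_FLOOR` is a hypothetical pattern, `has_medical_claim` is a hypothetical name, and the classifier is injected as a plain callable so that no model API is implied.

```python
import re
from typing import Callable

# Hypothetical floor pattern: a number followed by a dose unit.
DOSE_FLOOR = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:mg|ml|g|tablet(?:ten)?|capsules?)\b",
    re.IGNORECASE,
)


def has_medical_claim(
    answer: str,
    intent_is_medical_adjacent: bool,
    classify_prose: Callable[[str], bool],
) -> bool:
    """Return True when the answer must be refused."""
    if DOSE_FLOOR.search(answer):
        return True   # deterministic floor: no model call needed
    if not intent_is_medical_adjacent:
        return False  # the classifier never runs on navigational traffic
    return classify_prose(answer)  # prose failures: causation, diagnosis, reassurance
```

Gating the classifier on intent keeps the model call off the navigational path, so only medical-adjacent traffic pays for it.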

A whole-branch adversarial review caught two further gaps: a streaming leak, where unsafe tokens reached the client before the final frame could override them (fixed by a final-frame override that preserves streaming for safe answers), and a cache bypass, where a previously cached unsafe answer skipped the new guard. The lesson is load-bearing: a citation is necessary but not sufficient for grounding. Grounding must be checked against the cited table's domain, not the document's existence.
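The difference between "a citation exists" and "the answer lies inside the cited table's domain" can be made concrete with a small check. This is a sketch only: `TableDomain`, `grounding_verdict`, and the age bounds used in the example are hypothetical, and the real brochure's range is not stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableDomain:
    """Age range (in years) that a cited dosing table actually covers."""
    min_age_years: int
    max_age_years: int


def grounding_verdict(
    has_citation: bool, requested_age_years: int, domain: TableDomain
) -> str:
    """Classify an answer as ungrounded, out_of_domain, or grounded."""
    if not has_citation:
        return "ungrounded"
    in_domain = domain.min_age_years <= requested_age_years <= domain.max_age_years
    if not in_domain:
        # A citation is present, but the table says nothing about this age:
        # any value produced here would be an extrapolation.
        return "out_of_domain"
    return "grounded"
```

A citation-presence audit corresponds to checking only the first condition; the second condition is the one the June 2026 incident showed to be missing.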

Input-side intent lanes (June 2026)

A new class of control entered the architecture in June 2026: input-side intent lanes that fire before retrieval, parallel to Layer 1, short-circuiting generation entirely for a small, well-defined set of high-stakes inputs. Each lane follows the same three-part shape:

deterministic floor  →  narrow nano classifier  →  localised deterministic renderer
(patterned inputs)      (prose-only, fail-safe)     (no LLM in the output path)

The renderer is fully deterministic, so the response on these lanes can never itself drift into advice. Two lanes are live:

Lane	Module	Fires on	Routes to
Emergency escalation	`app/services/safety/emergency_triage.py`	Acute red-flag presentations (chest pain, stroke signs, self-harm, …)	Immediate escalation to emergency services / helpline, never a brochure lookup
Sensitive-identity correction	`app/services/safety/identity_correction.py`	A protected identity (sexual orientation / gender identity) framed as an illness to be cured	A non-pathologising correction → Psychologie / Seksuologie / Gendercentrum — never psychiatry-by-default; never names a doctor

The design principle across both: deterministic where the input is patterned, a tiny classifier only where the input is prose, and always failing in the safe direction. The classifiers are gated narrowly (the identity classifier only runs when the query mentions a protected identity at all) and the floors fire only on grammatical predication, not mere co-occurrence — "I'm gay and I have an anxiety disorder" must not trip the identity lane. See Sensitive-Identity Correction for the full lane, including the self-vs-other classifier and the co-occurrence negatives that gate it.

Both lanes are independently flag-gated (emergency_escalation_enabled, identity_correction_enabled / identity_correction_voice_enabled) and run on both the chat and voice channels.
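The three-part lane shape can be sketched for the emergency lane as below. The pattern, the placeholder response text, and the function name are hypothetical. The choice that a classifier failure escalates is an assumption about what "the safe direction" means for this lane; the section does not spell it out per lane.

```python
import re
from typing import Callable, Optional

# Hypothetical deterministic floor for patterned red-flag inputs.
EMERGENCY_FLOOR = re.compile(r"\b(?:pijn op de borst|chest pain)\b", re.IGNORECASE)

# Placeholder for the localised, fixed response text (no LLM in the output path).
ESCALATION_TEXT = "<localised emergency escalation text>"


def emergency_lane(
    query: str, classify_prose: Callable[[str], bool]
) -> Optional[str]:
    """Return the rendered response if the lane fires, otherwise None."""
    if EMERGENCY_FLOOR.search(query):
        return ESCALATION_TEXT  # floor hit: the classifier is not consulted
    try:
        fires = classify_prose(query)  # narrow classifier for prose-only inputs
    except Exception:
        fires = True  # assumption: on classifier failure, escalate rather than pass
    return ESCALATION_TEXT if fires else None
```

Returning a constant string from the renderer is what guarantees that the lane's output cannot drift into advice.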

Layer 2: post-generation regex safety validation

Implementation: backend/app/services/safety_service.py.

Even when the intent is classified as safe, the generated response itself could inadvertently contain medical-advice language — either because the retrieval surfaced a chunk that was permissive or because the generation drifted from the system prompt. Layer 2 scans the generated text for Dutch medical-advice patterns using compiled regular expressions:

Pattern category	Examples
Dosage language	"neem 2 tabletten", "dosering", "mg per dag", "maximaal 2 capsules"
Treatment directives	"u moet", "u dient", "het is noodzakelijk"
Diagnostic language	"u heeft waarschijnlijk", "dit wijst op", "diagnose"
Medication references	"medicatie", "voorschrift", "bijwerkingen"
Medication dosage adjustment	"verhoog uw dosis", "verlaag uw dosis"
Urgency language	"ga onmiddellijk naar", "bel direct", "spoedeisend"

Patterns were developed in consultation with the hospital's communication team to capture the linguistic markers of medical advice in Flemish Dutch. The validation is enabled by default (safety_validation_enabled=true); for a hospital system, safety must be opt-out, never opt-in.

Contextual sensitivity

Some terms (such as "medicatie") appear in legitimate navigational content — "Bring a list of your current medication to your appointment". The regex set therefore matches on phrasings that combine a medication noun with an imperative verb form, rather than on noun mentions alone. False-positive rate is monitored as a deployment KPI (target < 5 %).
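A minimal sketch of the noun-plus-directive idea follows. The verb and noun lists are hypothetical stand-ins chosen for illustration; the production pattern set is not reproduced in this section.

```python
import re

# Hypothetical lists: directive verbs that turn a medication mention into an
# instruction, and the medication nouns they must co-occur with.
DIRECTIVE_VERBS = r"(?:neem|slik|verhoog|verlaag|stop met)"
MEDICATION_NOUNS = r"(?:medicatie|dosis|tabletten|capsules)"

# Verb and noun must appear in the same sentence, at most 40 characters apart.
ADVICE_PATTERN = re.compile(
    rf"\b{DIRECTIVE_VERBS}\b[^.!?]{{0,40}}\b{MEDICATION_NOUNS}\b",
    re.IGNORECASE,
)


def is_medication_advice(text: str) -> bool:
    """True only when a directive verb governs a medication noun."""
    return ADVICE_PATTERN.search(text) is not None
```

Under these assumptions, "Verhoog uw dosis" matches, while the navigational sentence "Breng een lijst van uw huidige medicatie mee naar uw afspraak" does not, because it contains a medication noun but none of the listed directive verbs.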

Why regex and LLM, not just LLM

A latency / coverage trade-off, made explicit:

Option	p50 latency	Recall on paraphrased advice	Cost per query
Regex only	< 1 ms	High on canonical phrasings; misses paraphrases	Effectively zero
LLM-as-judge only	500–800 ms	High on paraphrases; depends on prompt rigor	$0.0005–0.002
Regex + LLM-as-judge	< 1 ms (most queries) + 500–800 ms when LLM fires	Highest combined recall	Sum of both

Regex catches the high-volume, canonical phrasings cheaply; the LLM judge catches paraphrased and subtle advisory language that pattern matching misses. Running both costs essentially the same as running just the judge, because the regex tier completes long before the LLM call returns.
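One possible arrangement of the two tiers is a cascade in which the regex runs first and the judge is awaited only when the regex finds nothing. The sketch below assumes that arrangement; the phrases are taken from the pattern table above, and `is_unsafe` and the injected `llm_judge` callable are hypothetical names.

```python
import asyncio
import re
from typing import Awaitable, Callable

# A few canonical phrasings from the pattern table above.
CANONICAL = re.compile(
    r"\b(?:u moet|u dient|verhoog uw dosis|verlaag uw dosis)\b",
    re.IGNORECASE,
)


async def is_unsafe(
    response: str, llm_judge: Callable[[str], Awaitable[bool]]
) -> bool:
    """Cheap regex tier first; the slower judge only sees what regex passes."""
    if CANONICAL.search(response):
        return True  # sub-millisecond path, the judge is never called
    return await llm_judge(response)
```

With this ordering a canonical violation is rejected without any model call, and the judge's latency is paid only on responses the regex could not decide.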

Layer 2b: LLM-as-judge safety validation

Implementation: backend/app/services/safety_service.py::validate_response_llm().

A second-pass safety review where an LLM (the Tier 2 model) judges whether the generated response contains medical advice. This catches paraphrased and subtle advisory language that regex pattern matching cannot reliably detect — for example, "many patients in your situation find that ..." which avoids any directive vocabulary but functions as advice.

Aspect	Detail
Model	Tier 2 (Anthropic Claude 3 / OpenAI GPT-4.1-mini, configurable)
Mode	Async, fail-closed — timeout or API error blocks the response as a safety precaution
Default	Enabled (`safety_llm_validation_enabled=true`) on voice; configurable on text channel
Configurable	Toggleable via Settings API or `.env` for demos and incident response

Fail-closed guardrails (April 2026)

The guardrails safety check (Llama Guard 3) operates in fail-closed mode: when the guardrails service is enabled but the check fails (timeout, API error, ambiguous response), the query is refused rather than allowed through. This was changed from the previous fail-open behaviour, which could allow unsafe content through during transient API failures. The regex and LLM-as-judge layers were already fail-safe by design; the change brings the guardrails layer to the same standard.

The decision to fail-closed reflects an explicit safety-vs-availability trade-off: under transient outage, the system degrades to a more restrictive posture rather than a more permissive one. The user-visible cost is a small number of false-refusals during outage windows; the cost we avoid is medical content slipping through during the exact periods when monitoring quality is also degraded.
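Fail-closed behaviour amounts to treating every failure of the check as a positive finding. The sketch below shows the pattern with the standard library's `asyncio.wait_for`; the function name and the two-second timeout are hypothetical, and the check itself is injected as a callable rather than tied to any particular model API.

```python
import asyncio
from typing import Awaitable, Callable


async def must_block(
    response: str,
    safety_check: Callable[[str], Awaitable[bool]],
    timeout_s: float = 2.0,  # hypothetical value, not taken from the source
) -> bool:
    """Return True when the response must be blocked."""
    try:
        return await asyncio.wait_for(safety_check(response), timeout=timeout_s)
    except Exception:
        # Timeout, API error, or unparseable verdict: refuse rather than
        # let unchecked content through.
        return True
```

A fail-open variant would return False in the except branch; that single line is the whole difference between the two postures.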

LLM judge prompt rigor

The judge uses a zero-tolerance classification prompt that distinguishes six violation categories: dosage information, specific medication names presented as treatment options, diagnostic statements, treatment plans, triage advice, and self-care instructions. A critical prompt rule is that a disclaimer like "raadpleeg uw arts" does NOT make medical content safe — the judge flags dosage information even when a disclaimer is present, because users read the dose and may ignore the disclaimer.

Layer 3: quality gate

Implementation: see Quality Evaluation.

The quality gate serves a dual purpose: it measures response quality and acts as a confidence check. Responses below the calibrated similarity threshold (0.45 on the current embedding model — see ADR-0048) are more likely to contain hallucinated or poorly grounded content. Low-confidence responses are replaced with a safe fallback that acknowledges the question without attempting an answer.

The quality gate is the explicit over-reliance mitigation per OWASP LLM09: rather than allowing the LLM to attempt every question regardless of grounding, the gate forces the system to admit it doesn't know.
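In code, the gate reduces to a threshold comparison. The 0.45 threshold is the value quoted above; the fallback text is a placeholder, since the actual wording is not given in this section, and `apply_quality_gate` is a hypothetical name.

```python
SIMILARITY_THRESHOLD = 0.45  # calibrated value quoted in the text (ADR-0048)

# Placeholder: the real fallback wording is not reproduced in this section.
SAFE_FALLBACK = "<safe fallback message>"


def apply_quality_gate(response: str, similarity: float) -> str:
    """Replace low-confidence responses with the safe fallback."""
    if similarity < SIMILARITY_THRESHOLD:
        return SAFE_FALLBACK
    return response
```

The comparison is strict, matching "responses below the threshold": a score of exactly 0.45 passes.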

Layer 4: mandatory disclaimers

Every response that passes the previous layers receives a mandatory disclaimer appended at the application level (not by the LLM), so the disclaimer cannot be suppressed by prompt engineering or training-data drift:

"Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50."

(This is not medical advice. For medical questions, contact your general practitioner or call ZOL at 089 32 50 50.)

The injection uses exact-text deduplication — if the disclaimer is already present in the response, it is not appended again. This makes the operation idempotent across multiple processing stages.
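The idempotence property can be illustrated in a few lines. The disclaimer text is the one quoted above; the function name and the blank-line separator are assumptions made for the sketch.

```python
DISCLAIMER = (
    "Dit is geen medisch advies. Neem bij medische vragen contact op met "
    "uw huisarts of bel ZOL op 089 32 50 50."
)


def append_disclaimer(response: str) -> str:
    """Append the disclaimer unless the exact text is already present."""
    if DISCLAIMER in response:
        return response  # exact-text match: nothing to add
    return f"{response.rstrip()}\n\n{DISCLAIMER}"
```

Applying the function twice yields the same result as applying it once, so any number of processing stages can call it safely.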

Why an injected disclaimer rather than a prompt instruction

The system prompt instructs the LLM to include a disclaimer, and the application appends one regardless. The redundancy is deliberate: prompt instructions are best-effort under OWASP LLM01 (prompt injection) — an attacker who succeeds in prompt injection can suppress them. The injected disclaimer is a control-plane enforcement that operates outside the LLM call boundary and therefore cannot be suppressed by prompt-level attacks.

Zero-tolerance metrics

Metric	Target	Monitoring
Medical advice incidents	ZERO	Automated pattern detection + manual audit
Intent classification accuracy	> 95 %	Evaluation against labelled test set
Safety filter false-positive rate	< 5 %	User feedback analysis
Disclaimer presence	100 %	Automated response scanning

Defense-in-depth verification matrix

The defense-in-depth principle holds that each layer operates independently, so the failure of any single layer does not compromise overall safety. The table below verifies the property by walking representative scenarios:

Scenario	Layer 1	Layer 2 (Regex)	Layer 2b (LLM judge)	Layer 3	Layer 4	Outcome
Clear medical advice query	Blocks	--	--	--	--	Safe
Safe query, unsafe response	Passes	Blocks	--	--	--	Safe
Subtle advisory language	Passes	Passes	Blocks	--	--	Safe
Safe query, low confidence	Passes	Passes	Passes	Blocks	--	Safe
Safe query, good response	Passes	Passes	Passes	Passes	Adds disclaimer	Safe
Classifier + regex fail	Fails	Fails	Blocks	--	--	Safe
All filters fail	Fails	Fails	Fails	Fails	Adds disclaimer	Mitigated

Even in the worst case — where intent classification misses a medical-advice query, both regex and LLM validation fail, and the quality gate scores the response above threshold — the mandatory disclaimer still provides a baseline safety net. This is not a substitute for upstream defences; it is the residual control when those defences fail simultaneously.

Regulatory mapping

The safety architecture aligns with the relevant European regulatory frameworks. Specific articles cited:

Regulation	Specific provision	How the architecture addresses it
EU AI Act (Regulation (EU) 2024/1689)	Art. 50 (transparency obligations for AI interacting with natural persons); Art. 13 (transparency); Art. 14 (human oversight)	AI-system-disclosure component in chat UI; mandatory disclaimer on every response; helpdesk fallback as documented human-oversight path
Medical Device Regulation (Regulation (EU) 2017/745)	Art. 2(1) (definition of medical device); Annex VIII Rule 11 (software classification)	System explicitly does NOT provide clinical decisions — positioned as an information retrieval tool (zoektool); see classification analysis in EU AI Act Compliance
GDPR (Regulation (EU) 2016/679)	Art. 5 (principles); Art. 25 (data protection by design); Art. 32 (security of processing)	PII detection; audit logging; data minimisation (no patient health data stored); see DPIA
Belgian eHealth framework	Belgian-specific health-IT requirements	Dutch (Flemish) language support; hospital helpdesk integration; channel positioning consistent with other patient-information surfaces

The decision to position the system as an information retrieval tool rather than a clinical decision support system is the load-bearing classification: it avoids MDR re-classification as software-as-a-medical-device, while the multi-layer architecture provides the transparency and auditability that the AI Act requires for AI systems operating in sensitive domains.

Canonical texts: GDPR (Regulation (EU) 2016/679), EU AI Act (Regulation (EU) 2024/1689), EU MDR (Regulation (EU) 2017/745).

Voice channel — same invariants, different surface

The voice channel implements the same zero-medical-advice invariant through a structurally similar but stage-merged architecture: pre-LLM regex pre-filter → agentic LLM with tool-grounded retrieval → post-LLM regex safety post-filter → disclaimer prepender. See Voice Safety Architecture for the full detail; the high-level mapping is:

Text-channel layer	Voice-channel equivalent
Layer 1: intent classification	`voice_thin_pre_filter.classify_terminal()` returning a `TerminalClass` (SAFETY_REFUSAL is the safety-critical branch)
Layer 2 + 2b: regex + LLM judge	Combined into the agentic orchestrator's tool-grounded retrieval (`VoiceLLMOrchestrator`, ADR-0051) plus a post-LLM regex post-filter
Layer 3: quality gate	RAG retrieval's `found=False` consecutive count, escalating to helpdesk after two strikes
Layer 4: disclaimer	Post-LLM disclaimer prepender (re-activated 2026-05-09, Wave 2.C)
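The voice-channel quality-gate equivalent, a consecutive `found=False` count with escalation after two strikes, can be sketched as a small counter. The class and method names are hypothetical; resetting the count on a successful retrieval is inferred from the word "consecutive".

```python
class NotFoundStrikes:
    """Tracks consecutive retrieval misses for one voice session."""

    def __init__(self, limit: int = 2) -> None:
        self.limit = limit
        self.count = 0

    def record(self, found: bool) -> bool:
        """Record one retrieval result; return True when escalation is due."""
        # A hit resets the streak, since only consecutive misses count.
        self.count = 0 if found else self.count + 1
        return self.count >= self.limit
```

For example, the sequence miss, hit, miss, miss triggers escalation only on the final miss.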

References

@owasp_llm_top10 — LLM01 (prompt injection), LLM06 (sensitive information disclosure), LLM09 (over-reliance) framings used throughout this page.
ADR-0036 — adversarial-input hardening (perplexity-based anomaly detection, LLM-as-judge, retraction enforcement).
ADR-0049 / ADR-0051 — voice-channel orchestrator lineage.
Voice Safety Architecture — voice-channel safety detail.
DPIA — GDPR Art. 35 risk assessment.
EU AI Act Compliance — AI Act conformity assessment.
Amann, J., Blasimme, A., Vayena, E., Frey, D., & Madai, V. I. (2020). Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20, 310. https://doi.org/10.1186/s12911-020-01332-6
