Release Notes: May 31, 2026 (Final Sprint)
This note called itself "final," and at the time it was. Hours later a live jury demo surfaced a real medical-advice leak, opening a closing fortnight of safety hardening. The true final note is June 1–8, 2026 (Project Close).
Final Consolidation · ~184 commits · 3 days · ~12 PRs · the loose ends become a finished system
~184 commits since the May 29 note | ~12 merged PRs | 5 pilot deploys | 6 features behind eval/safety gates | 0 medical-advice incidents
This is the final release note for the ZOL Intelligent Search graduation project. Where earlier windows each chased one defect, this sprint was about closing every open thread at once — turning a pile of flag-gated experiments and half-wired features into a coherent, deployable whole. The work falls into seven streams:
- Multi-tenant chat goes real: a tenant is now resolved per-request from a URL slug, fail-closed, with Romanian as the first non-Dutch locale (code + safety only).
- The synthesis safety boundary gets a principled gate: the brittle topic_key regex is replaced by an embedding-similarity gate (calibrated, default ON).
- Grounded medical-dosing answers finally cite their sources: five distinct citation bugs across both medical render paths, fixed and merged.
- The cache learns about channels: a voice answer can no longer be served to a web user, and operators get per-channel visibility and a real clear button.
- Deterministic routing is unified: chat and voice now share one capability registry; the concurrent-session bug that produced HTTP 500s is resolved.
- The chat voice is de-contaminated: chat stops borrowing the voice orchestrator's terse prompt; answers get richer with identical safety.
- Voice speaks the caller's language: turn-1 language locking, per-language schedule/answer localisation, and natural prose rendering.
Plus a corpus-wide documentation pass (canonical Glossary, SNOMED golden-pages, a six-group sidebar, a Docusaurus v3 admonition fix across 155+ blocks) and the public chat's first real product polish (history persistence + mobile layout).
The transferable theme of the whole sprint is at the bottom: almost everything here shipped behind a flag and a gate, and the gates did their job.
1 · Multi-tenant chat — Layer A wiring + Romanian Layer B
PRs: #90 (Layer A) · #100 (Layer B Romanian) · Status: merged + deployed; Romanian dormant pending native-RO safety review
The architecture was always multi-tenant; this sprint made it operable from a URL.
- Slug → tenant resolution. A new globally-unique hospitals.chat_slug column (migration 080) plus a startup-cached resolver (46adb98b, 70d47ff8, 10dfdab9) means a request to /chat-zol1 resolves to the ZOL tenant fail-closed: an unknown slug returns 404/422, never a default-tenant leak. The frontend routes /:chatSlug to the backend (8da20289), guards invalid slugs at the route level (ce1ac533, e2636958), and admins get a Chat-URL field with live preview and inline 409/422 errors on a uniqueness race (43eca9e1, d76d1e1b, 2decabbe).
- Per-tenant ingest language filter (b4c29b98): hospitals.language now scopes the nightly crawl so a tenant only ingests its own language.
- Romanian Layer B (34f05b81, 4623eb5c, a66e2415, 11c0e1fc): ro language gates on the chat path, Romanian post-generation medical-advice patterns at safety layer 3 (parity with the Dutch set), one Romanian diagnosis example in both intent-classifier builders, and a ro.json public-chat locale registered in i18n.
No Romanian tenant is seeded, so /chat-ro1 is dormant by design. Per the project's no-hardcoded-tenant-data rule, tenant facts come from the DB; only phonetic-recovery overlays live in YAML. Before any RO tenant goes live, the Romanian _SAFETY_PATTERNS need a native-speaker review — regex safety patterns cannot be trusted from machine translation alone.
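The fail-closed behaviour described above can be illustrated with a minimal sketch. It assumes a resolver that is loaded once at startup and has no default tenant; the names `Tenant`, `SlugResolver`, and `UnknownSlugError`, and the sample `hospital_id`, are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass


class UnknownSlugError(LookupError):
    """Raised when a chat slug maps to no tenant (the API layer would turn this into a 404)."""


@dataclass(frozen=True)
class Tenant:
    hospital_id: int
    chat_slug: str
    language: str


class SlugResolver:
    """Hypothetical startup-cached resolver: slug -> tenant, with no default tenant."""

    def __init__(self, rows):
        # In the real system the rows would be read from the hospitals table (assumption).
        self._by_slug = {}
        for row in rows:
            if row.chat_slug in self._by_slug:
                # Mirrors the globally-unique constraint on the slug column.
                raise ValueError(f"duplicate chat_slug: {row.chat_slug!r}")
            self._by_slug[row.chat_slug] = row

    def resolve(self, slug: str) -> Tenant:
        tenant = self._by_slug.get(slug)
        if tenant is None:
            # Fail closed: an unknown slug is an error, never a fallback tenant.
            raise UnknownSlugError(slug)
        return tenant


# Illustrative data only; the id and language values are made up.
resolver = SlugResolver([Tenant(hospital_id=1, chat_slug="chat-zol1", language="nl")])
```

With no Romanian tenant seeded, `resolver.resolve("chat-ro1")` raises instead of silently serving another tenant's data, which is the property the dormant route relies on.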
2 · Semantic synthesis gate — embeddings replace a brittle regex
PR: #109 · Commits: 678fa1d3 (default ON) · 631bc3dc (threshold 0.72) · Status: LIVE + VERIFIED on pilot, flag default ON
The gate that decides whether a blocked medical query may be synthesised from grounded brochure content (A2) versus raw-dumped (F2) was previously keyed on a topic_key regex — and punctuation alone could flip the verdict, which is exactly the kind of brittleness you cannot have on a safety boundary.
It is now an embedding cosine-similarity gate: the query is compared against a set of vetted, tenant-agnostic exemplars of legitimate informational synthesis, and only clears the gate above a calibrated threshold of 0.72 (5991644c, 282206b1, 629409e9). A calibration script (41f0004c) established the threshold as separable with a ~0.08 margin between the must-synthesise and must-refuse clusters. The gate is fail-safe — any embedding error refuses rather than synthesises (de9ef1dc makes is_strongly_grounded async; a2961e2e computes the A2 query embedding once and awaits the gate).
Live probe result: every must-refuse query refused (no dose leak), and both phrasings of the "nurofen" query now consistently A2-synthesise (0.807 / 0.8013). Kill switch: SEMANTIC_SYNTHESIS_GATE_ENABLED=false.
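As a rough illustration of the gate's shape, the sketch below compares a query embedding against exemplar embeddings and refuses on any error. Only the 0.72 threshold comes from this note; the function names, the use of the best-matching exemplar, and the strict `>` comparison are assumptions, and real embeddings would come from the embedding service rather than hand-written vectors.

```python
import math

THRESHOLD = 0.72  # calibrated value quoted in this note


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def may_synthesise(query_vec, exemplar_vecs, threshold=THRESHOLD):
    """Return True only when the query clears the gate; any error means refuse."""
    try:
        best = max(cosine(query_vec, exemplar) for exemplar in exemplar_vecs)
    except Exception:
        # Fail-safe: a broken embedding or an empty exemplar set refuses.
        return False
    return best > threshold
```

The design point is that every failure path returns the same answer as a low score, so an outage of the embedding backend can only make the system more conservative.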
3 · Grounded medical-dosing — synthesis and citations, end-to-end
PRs: #104–#108 · Status: all merged + deployed
Two things landed here. First, A2 grounded synthesis in the blocked branch (9d91d97f, 341c101b, a0e22792) — when a medical query is grounded in ZOL's own pre-vetted brochures with a valid citation, the system may surface that content rather than a bare refusal. The safety line moved from "no dose ever" to "no LLM-speculated content," with an is_strongly_grounded gate and a critical insulin-leak fix (bae76bba): the A2 synthesis allowlist is restricted to pediatric_medication only; default/urgent intents still refuse.
Second, the answers finally cite their sources. Five distinct bugs across the two medical render paths:
| # | Bug | Fix | Commit |
|---|---|---|---|
| 1 | A2 built citations incorrectly | _build_dosing_citations(hits, answer) dedups by URL + remaps [N] | 8703fd90 |
| 2 | Final WS frame hard-coded citations=None (frontend reads the FINAL frame) | forward error_chunk.citations | 7913605e |
| 3 | F2 path attached no citations | reuse _build_dosing_citations | 04b088c9 |
| 4 | F2 markers rendered as links [N](url) (no hover) | strip to bare [N] → hover chips | be6e83e8 |
| 5 | Medical answers had no follow-ups (+ DRY smell) | centralised _qs_emit_follow_ups | adffd37b |
Plus brochure-noise stripping in the F2 dump — page-footers, Col2, section-numbers (b1fc9bd5, 9ccd677e). Lesson: trace UI bugs to the rendering frame, and remember there are two medical paths (A2 synthesise + F2 raw-dump) — a fix to one is not a fix to both.
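Fixes 1 and 4 in the table can be sketched as follows. This is not the project's `_build_dosing_citations`; it is a hypothetical stand-in that assumes each hit is a dict with `url` and `title` keys and that marker `[N]` refers to the N-th hit.

```python
import re

_MARKER = re.compile(r"\[(\d+)\]")
_LINKED_MARKER = re.compile(r"\[(\d+)\]\([^)]*\)")


def build_citations(hits, answer):
    """Dedup citations by URL and renumber the [N] markers in order of first use."""
    url_to_index = {}
    citations = []

    def remap(match):
        old = int(match.group(1))
        if not 1 <= old <= len(hits):
            return ""  # drop markers that point at no hit
        hit = hits[old - 1]
        if hit["url"] not in url_to_index:
            url_to_index[hit["url"]] = len(citations) + 1
            citations.append(
                {"index": url_to_index[hit["url"]], "url": hit["url"], "title": hit["title"]}
            )
        return f"[{url_to_index[hit['url']]}]"

    return _MARKER.sub(remap, answer), citations


def strip_marker_links(text):
    """Turn markdown-link markers like [2](https://...) into bare [2] markers."""
    return _LINKED_MARKER.sub(r"[\1]", text)
```

Because both render paths would call the same helper, a citation fix made here applies to A2 and F2 together, which is the point of the lesson above.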
4 · Cache channel-scoping + operator control
Status: deployed
A semantic-cache entry produced for a voice call could previously be served to a web user — different formatting rules, different citation handling. Migration 079 adds channel to the semantic_query_cache key (8600f80e), and read/write are now channel-scoped (9e3ae2d2, 0f30c73f) — an R3 cross-component contract fix.
Operators also get real visibility: a unified cache panel with per-channel counts and a ConfirmBar clear (60307c4a, d9cde793, 4942eeaa), a real intent-cache count and true hit-rate percentage in the admin UI (21a3002c), and a Clear-Cache button that actually clears the Redis intent cache (5f945aff) — previously it cleared only the semantic cache.
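The channel-scoping idea can be shown with a small in-memory sketch. The real cache matches on embedding similarity and lives in the database; this hypothetical `ChannelScopedCache` uses normalised query text as a stand-in so that the role of the channel in the key is easy to see.

```python
from collections import Counter


class ChannelScopedCache:
    """Toy cache whose key includes the channel (assumed shape, not the real schema)."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(tenant_id, channel, query):
        # Channel is part of the key, so a voice entry can never match a web lookup.
        return (tenant_id, channel, " ".join(query.lower().split()))

    def put(self, tenant_id, channel, query, answer):
        self._entries[self._key(tenant_id, channel, query)] = answer

    def get(self, tenant_id, channel, query):
        return self._entries.get(self._key(tenant_id, channel, query))

    def counts_by_channel(self):
        """Per-channel entry counts, as an operator panel might display them."""
        return dict(Counter(key[1] for key in self._entries))

    def clear(self, channel=None):
        """Delete entries for one channel, or all entries; return how many were removed."""
        doomed = [k for k in self._entries if channel is None or k[1] == channel]
        for key in doomed:
            del self._entries[key]
        return len(doomed)
```

Returning the number of deleted entries from `clear` gives the operator UI something concrete to confirm, rather than a button that reports success regardless of what was removed.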
5 · Unified capability registry + the concurrent-session fix
Capability registry — chat shipped (parity-preserving), voice flag-off. Chat and voice had separate, divergent deterministic routing for doctor_schedule / billing. A new intent-keyed registry under app/services/capabilities/ (37851ad4, 41070a58) now backs both: chat's Stage 2c/2d route through it parity-preserving (dc7d55d9, 1c61a1ba), and a classify-first voice gate sits behind capability_registry_voice_enabled (default off; 495f7429). A 30-turn behavioral catalog gate passed on pilot (0 mis-captures; e3456b07).
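An intent-keyed registry of this kind might look like the sketch below. The decorator, handler signature, and return strings are all hypothetical; the only details taken from this note are the two intents and the idea that chat and voice dispatch through one table.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str, str], str]  # (query, channel) -> answer

_REGISTRY: Dict[str, Handler] = {}


def capability(intent: str):
    """Register a deterministic handler for one classified intent."""

    def register(fn: Handler) -> Handler:
        if intent in _REGISTRY:
            raise ValueError(f"intent already registered: {intent}")
        _REGISTRY[intent] = fn
        return fn

    return register


@capability("doctor_schedule")
def doctor_schedule(query: str, channel: str) -> str:
    return f"[{channel}] schedule lookup for: {query}"


@capability("billing")
def billing(query: str, channel: str) -> str:
    return f"[{channel}] billing lookup for: {query}"


def dispatch(intent: str, query: str, channel: str) -> Optional[str]:
    """Route to the registered handler, or return None when no capability matches."""
    handler = _REGISTRY.get(intent)
    if handler is None:
        return None  # assumption: the caller then falls through to its normal pipeline
    return handler(query, channel)
```

Keying on the classified intent rather than on per-channel regexes is what lets the two channels stop diverging: there is one place to add or change a deterministic route.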
Doctor-schedule LLM routing (deployed). "wanneer werkt Dr. X" now routes to the existing schedule lookup via LLM intent rather than a too-narrow regex (9cc8e647), and cites the source document on the short-circuit (13955be9).
Concurrent-session / speculative-retrieval bug — RESOLVED. F2/A2 brochure retrieval raced speculative retrieval on the same DB session, producing HTTP 500s under concurrent load. Fixed by running speculative retrieval on its own DB session (flag ON; 669d9961, f3c99c08) plus drain-not-cancel (d8be88db awaits _qs_cancel_speculative). 354 orphaned "bezig met verwerking" stubs were finalised/reconciled (2249ada9, 4581eeae).
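The two parts of the fix can be illustrated with a simplified asyncio sketch. `FakeSession` is a hypothetical stand-in that raises when one session is used by two coroutines at once, and `drain_speculative` shows the drain-not-cancel idea of awaiting the speculative task rather than cancelling it; neither is the project's actual code.

```python
import asyncio


class FakeSession:
    """Stand-in for a DB session that must not be used concurrently."""

    def __init__(self, name):
        self.name = name
        self.in_use = False

    async def query(self, text):
        if self.in_use:
            raise RuntimeError("concurrent use of one session")
        self.in_use = True
        try:
            await asyncio.sleep(0)  # yield, as a real DB round-trip would
            return f"{self.name}:{text}"
        finally:
            self.in_use = False


async def drain_speculative(task):
    """Wait for the speculative task to finish instead of cancelling it."""
    try:
        return await task
    except Exception:
        return None  # speculative work is best-effort; its failure is not fatal


async def answer(query, shared_session=False):
    main = FakeSession("main")
    # The fix: speculative retrieval gets its own session unless we simulate the bug.
    speculative = main if shared_session else FakeSession("speculative")
    task = asyncio.create_task(speculative.query(query))
    grounded = await main.query(query)
    return grounded, await drain_speculative(task)
```

Running `asyncio.run(answer("q"))` completes both retrievals, while `asyncio.run(answer("q", shared_session=True))` reproduces the race in miniature: the speculative query fails because it shares a session with the grounded one.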
6 · Chat-prompt de-contamination
PRs: #102 · #103 · Status: merged + deployed, flag default ON
Chat answers had gone terse. The cause: a 2026-05-22 agentic rewrite had handed chat a voice-orchestrator prompt — full of tool references chat can't call and terse few-shot examples tuned for speech. The fix is a purpose-built, de-voiced chat RAG system prompt (8a26bd60) plus a shared SAFETY_RULES_BLOCK constant (ebad6239, 8f26ff9b) so both channels enforce identical safety from one source. Flag-gated (0e9ff6a1), routed (83173aba), and defaulted ON after a 99-question pilot benchmark (f9635f53): +30% answer detail, +68% citations, safety byte-identical. R3 contract tests pin "safety present in both arms" (eb55372a).
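The single-source safety arrangement can be sketched as below. The rule text is a placeholder and the prompt builders are hypothetical; the sketch only shows the structure in which both prompts embed one constant and a contract test pins that fact.

```python
# Placeholder wording; the project's real safety rules are not reproduced here.
SAFETY_RULES_BLOCK = (
    "SAFETY RULES (placeholder)\n"
    "- placeholder rule 1\n"
    "- placeholder rule 2"
)


def build_chat_prompt() -> str:
    """Chat prompt: detailed, citation-friendly, no voice tool references."""
    return "You answer questions in a web chat with detailed, cited answers.\n\n" + SAFETY_RULES_BLOCK


def build_voice_prompt() -> str:
    """Voice prompt: terse, tuned for speech."""
    return "You answer phone callers in short spoken sentences.\n\n" + SAFETY_RULES_BLOCK


def test_safety_present_in_both_arms() -> None:
    """Contract test: both channels must embed the identical safety block."""
    for prompt in (build_chat_prompt(), build_voice_prompt()):
        assert SAFETY_RULES_BLOCK in prompt
```

Because the style text and the safety text are concatenated from separate sources, a future change to either channel's tone cannot alter the safety wording by accident.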
7 · Voice speaks the caller's language
A cluster of voice-language correctness fixes:
- Turn-1 language flip (b59eec1a): Deepgram multi-mode code-switched Flemish was flipping a whole call to English on a 1-vote lexical margin. A switch_margin=2 hysteresis (default 1 = backward-compatible) holds the default language unless a new one wins decisively. Track-2 prefix-warm (0fa23b94) warms the orchestrator prompt-prefix during the greeting.
- IT/FR turn-1 lock (53d1e1c9): lexicon-based turn-1 language lock for Italian and French, with cover-filler on switch and denser mid-call fillers.
- Doctor-schedule localised (b6742d1c): schedule answers now render in the caller's language (nl/en/fr/it), not always Dutch.
- Schedule as prose (589fd11e): the full doctor schedule renders as natural prose, not VM/NM table shorthand.
- Dispatcher#2 collapse (6293405b, c9e86183): gated to Dutch turns with a clock-time naturalisation rule, behind a behavioral catalog + leak detector (5389ffaa).
- Barge-in telemetry (PR #98; 0df74ac2, 1337ab16): all session.say() routed through _say(), tolerant of a closing session, with barge-in telemetry.
- Spoken-number leak (PR #91/#92; 23376f2b, 984e0bdf): channel-agnostic fallbacks now use digit-form phone numbers on the web channel, not the spoken form.
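The switch_margin hysteresis from the first item can be sketched as a small vote comparison. The function name and the shape of the `votes` mapping are assumptions; only the parameter name and its values 1 and 2 come from this note.

```python
def pick_language(votes, current, switch_margin=1):
    """Keep `current` unless another language leads it by at least `switch_margin` votes.

    `votes` maps a language code to its lexical vote count for the turn (assumed shape).
    """
    if not votes:
        return current
    challenger, best = max(votes.items(), key=lambda item: item[1])
    if challenger == current:
        return current
    lead = best - votes.get(current, 0)
    return challenger if lead >= switch_margin else current
```

With `votes={"nl": 2, "en": 3}` and `current="nl"`, the default margin of 1 flips the call to English, while `switch_margin=2` holds Dutch until English leads by two votes or more.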
8 · Web chat correctness, persistence & mobile
- Answer in the user's language (592d2123): chat answers in the detected language, not Dutch by default.
- Detect on the RAW query (eb23ca20, 18528594): language detection runs on the raw query, not the STT-normalised text, fixing a ro→nl regression; raw query threaded through the no-history classify path too.
- Capability short-circuit language (9798c531): uses the classification language, not unset state.
- Follow-up chips in the conversation language (d1636289) and responsive on mobile (8e63b786).
- History persistence (7dcadb3e): public chat persists conversation history in localStorage with a 30-day TTL.
- Mobile (5eb44827, 3b68f534): public-chat clipping fixed, safe-area insets honoured, feedback buttons no longer hidden behind the bottom bar.
9 · Documentation & evaluation infrastructure
- Canonical Glossary (ea64cfff, 8857e5e3): one definition per concept with stable deep-link anchors; today's pass expanded the BM25 entry (TF/IDF/length-normalisation, the scoring formula in plain text, k₁/b, a worked example) and linked the Glossary in the top navbar (d43af489).
- Knowledge-graph docs (356b63ec): Golden Pages page, the SNOMED 5-tier matcher, the two-layer taxonomy diagram; SNOMED concept count corrected (280K → 356K).
- Sidebar reorganised into six groups (cd0f6c11); historical asides moved to an appendix; stale Pydantic-AI-as-current purged (bdfeea90).
- Docusaurus v3 admonition fix (6ca10480): :::type Title → :::type[Title] across 155+ admonitions that were leaking as literal text; plus the query-rewriting canonical page and enrichment de-dup.
- Mermaid hardening (744c0105): <br/> removed from all diagrams; sequence-Note parse-breakers fixed.
- Prompt-engineering standards (222ef4e5, 24b9fcba): the SP-0…SP-3 program, with a P1–P7 rubric page and with forensic metadata and benchmark-ID breadcrumbs stripped from the intent classifier, chat SAFETY_RULES_BLOCK, and voice safety rules.
- Golden-eval defaults to Claude-as-judge (34d372ad): no LLM-judge token spend on every run.
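The admonition rewrite in the list above lends itself to a mechanical fix. The sketch below is one way such a pass could be written, not the script actually used in 6ca10480; it does not skip fenced code blocks, so a real run would need that guard.

```python
import re

# Matches an opening admonition with a space-separated title, e.g. ":::tip Pro tip".
# Untitled openers (":::note"), closers (":::"), and bracketed titles do not match.
_ADMONITION = re.compile(r"^([ \t]*:::)([a-zA-Z]+)[ \t]+(\S.*?)[ \t]*$", re.MULTILINE)


def fix_admonitions(markdown: str) -> str:
    """Rewrite ':::type Title' openers to the ':::type[Title]' form."""
    return _ADMONITION.sub(r"\1\2[\3]", markdown)
```

The pattern requires whitespace between the type and the title, so running the function twice is harmless: already-converted admonitions no longer match.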
The transferable lesson — gates earn their keep
Nearly everything in this sprint shipped behind a flag and a measurable gate, and across the window the gates repeatedly did their job:
- The semantic synthesis gate replaced a regex whose verdict flipped on punctuation — but only after a calibration script proved the threshold separable with margin, not because the embedding approach "felt" better.
- The A2 synthesis allowlist was tightened to pediatric_medication only the moment a probe showed an insulin-dose leak path; the gate caught it before it reached a caller.
- Chat-prompt v2 flipped to default-ON only after a 99-question benchmark showed +30% detail with byte-identical safety.
- Romanian Layer B is code-complete and deployed, yet deliberately dormant behind a native-speaker safety review, because a machine-translated safety regex is a hypothesis, not a guard.
A production system aimed at zero medical-advice incidents cannot be built on intuition about safety. It is built on flags you can flip back, gates you can measure, and the discipline to leave a feature off until the measurement agrees. That discipline — a plausible fix is a hypothesis until the data agrees — is the through-line of this entire project, and it is why the headline metric held: medical-advice incidents: zero.
This note closed the consolidation sprint; the project's true closing note is June 1–8, 2026 (Project Close). See Effort Estimation for the full 17-week timeline and the Glossary for the canonical definitions of every term used above.