Release Notes: May 9 – 13, 2026
Pilot-Review Readiness · Autonomous Latency Wave · 5 New ADRs · Voice Overlay Admin · Q5 RCA · Voice Waves 0-2
~155 commits | 5 days | 5-phase docs initiative | ADRs 0053→0057 (accepted) + ADR-0059 (proposed) | autonomous latency wave (-700ms/call) | Voice Overlay Admin (Sprint E A→D) | Data Quality A/B/C/D | methodology v2.3 Brainstorm Gate | Q5 RCA: 24 docs restored from dedup misfires, MedChat 50-Q avg 87.5 → 93.9 | Voice Waves 0-2: disclaimer once-per-conversation, BILLING_INQUIRY intent, pharmacist-deflect category, per-language reply hint, latency-budget calibration
This release is the largest single window since the project pilot began — three times the commit volume of May 4-9 and qualitatively different in shape. Where the previous window was a sprint of new features (Value Dashboard, Value Framework), this one is a sprint of maturation: re-aligning docs to code after months of architectural drift, codifying five overdue decisions as ADRs, dropping ~700 ms of synchronous wait from every chat turn, and shipping the editable admin surface that turns the voice overlay system from "engineer-only YAML" into "hospital-admin clickable UI".
The headline themes:
- Pilot-Review Readiness initiative — 5 phases, ~7 000 LOC of documentation work. Audit (Phase 1, 6 drift registers) → cascade fixes (Phase 2, four sub-batches) → voice compendium (Phase 3, transferable white paper) → SOTA positioning matrix (Phase 4, 18 vendors × 8 axes) → pilot review artifact bundle (Phase 5, 5 reviewer-ready artifacts). The project's docs and code are now in alignment for the first time since the November 2025 voice cut.
- Autonomous latency wave (O3 / O4 / O5 / O10 / O12 / O16): p50 chat latency now budget-bound by retrieval, not by prompt assembly or telemetry writes. Six independent fixes shipped from a single research report: lru-cached prompt assembly, singleton `AsyncOpenAI` clients, batched `executemany` for pipeline_telemetry, pre-warmed prompt cache for nl/en/fr/it at startup, expanded `_SAFE_INTENTS` to skip the LLM safety judge on procedural answers, and the new `/api/v1/admin/ops/latency-percentiles` operator endpoint. The biggest single win, ~700 ms per call, came from replacing the pydantic-ai migration (yes, the one that landed earlier in this same window) with a thin `structured_call` helper.
- Five new ADRs, 0053 through 0057. ADR-0053 retroactively documents the Neo4j removal (16 000 LOC deleted in March, finally written up). 0054 codifies the intent classification cache with Redis backend + admin kill switch. 0055 declares the FAQ-corpus drift prevention policy that purges 10 hand-curated ZOL FAQs in favour of the corpus. 0056 ships the chat answer-shape typology: six shape patterns instead of sixty per-defect rules. 0057 introduces tenant-scoped prompt addendums + tenant-agnostic doctor-profile boost, the right-layer-of-abstraction pattern for new hospital onboarding.
- Voice Overlay Admin — Sprint E Waves A through D shipped end-to-end. Waves A/A.5 unified the voice routing rules and folded the medical taxonomy into the tenant overlay; Wave B exposed a read API; Wave C built the viewer UI; Wave D added the full edit / delete / inline-edit / YAML import/export surface plus the empty-tenant onboarding banner. Hospital admins can now CRUD their own voice overlays without touching YAML.
- OpenRouter removal completed. Phase 2 deleted the OpenRouter code paths after Phase 1 (last sprint) made `gpt-4.1-mini` the default.
- Data Quality A / B / C / D: code-quality discipline applied to the data layer. Layer A = nightly audit script + scheduler. Layer B = canonicalization + dedup gate at ingest, schema-enforced via migration 068 partial unique index. Layer C = Lorem ipsum sanitization + schedule-table extractor (structured JSON from ZOL Drupal tables). Layer D = post-completion verification. Driven by the 2026-05-12 audit that found 258 abandoned docs, 40 duplicate doctor profiles, 312 Lorem ipsum chunks sitting silently for weeks.
- Methodology v2.3 — Brainstorm Gate. Six-axis Decision-Cost Rubric and Pre-Mortem Block — the project rule that started enforcing itself this week after the pydantic-ai investment had to be undone.
1 · Pilot-Review Readiness — 5-phase doc/code re-alignment
The plan (docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md, commit b50dae96) was conceived as a single 4-working-day arc with five sequential phases. Each phase had a clear acceptance criterion; each phase output is now shipped as Docusaurus pages.
Phase 1 — Code↔doc audit (6 parallel drift registers)
Phase 1 produced six drift registers under docs/audits/, one per topic area, all run against master tip b50dae96. Read-only audits — no code touched.
| Audit | File | Result |
|---|---|---|
| Voice docs (21 pages) | 2026-05-09-voice-docs.md | 14 🔴 / 12 🟡 / 9 🟢 |
| Architecture docs | 2026-05-09-architecture-docs.md | 9 🔴 / 17 🟡 / 10 🟢 |
| RAG docs | 2026-05-09-rag-docs.md | 22 🔴 / 19 🟡 / 12 🟢 |
| ADR register (51 ADRs) | 2026-05-09-adr-register.md | 9 🟡 / 3 🟠 / 2 🔴 / 4 ⚫ |
| API surface (248 routes) | 2026-05-09-api-surface.md | 5 🔴 / 27 🟡 / 8 🟢 |
| Frontend docs ↔ UI | 2026-05-09-frontend-docs.md | 8 🔴 / 11 🟡 / 6 🟢 |
The three load-bearing voice findings tell the story: local-setup.md Step 7 imported a deleted VoiceOrchestrator; conversational-intent.md documented a three-tier resolver that no longer exists; triple-defense.md described Layers 1+2 modules that were both deleted in the May 2 thin-pipeline cut. A new developer reading these pages would build a mental model of a system that hasn't existed for two months.
Phase 2 — Documentation excellence cascade (batches 2a → 2d-b6)
Phase 2 fixed the red entries in four sub-batches, each landing in its own commit:
- 2a: Architectural ground truth (`c9602830`). ADR-0053 (Neo4j removal) backfilled; see §3 below. ADR-0017 amendment to mark Stage 2c as deprecated. Bibliography + ADR index rebuilt.
- 2b: Cascade doc fixes (`4a975dd8`). BGE-M3 references replaced with `text-embedding-3-large`; legacy 8-stage voice pipeline references replaced with the thin pipeline; chunk-direct citation pipeline documented; BibTeX bibliography rendered correctly under Docusaurus 3.10.1's MDX v3.
- 2b-prime: Close cascade gap + 4 ADRs ported (`9b037b25`). Four orphan ADRs (`0050` Twilio + LiveKit SIP, `0051`, `0052`, `0053`) ported into `docs/decisions/`. Two earlier ports amended.
- 2c: Safety-critical revalidation (`6f750c4b`). Empirical fast-gate study (raised threshold 0.40 → 0.50 in Wave 2.C.1). Voice safety rewrite. Auth doc fix. Medical-content disclaimer reactivated on voice via post-LLM detection (0a67fa65).
- 2d-b1 → 2d-b6: Academic rewrite pass. Six tight batches across architecture, voice, RAG, safety/decisions, thesis/evaluation, and final batch. Cumulative: +27 new bibliography entries, ~50 page rewrites in an academic register, Mermaid theme upgrades for legibility.
Phase 3 — Voice stack compendium
docs/compendium/voice-stack.md — a 10 507-word transferable white paper covering the full voice pipeline from STT through dialogue management (and its deletion!) through TTS, tenant overlays, value framework, and Twilio LiveKit SIP. Designed to be read by an external engineer evaluating whether to license the voice stack for a phone-support / appointment-booking spinoff. The compendium is self-contained — every concept is defined inline; no Docusaurus internal links to pages that might churn.
Phase 4 — SOTA positioning matrix
docs/positioning/sota-matrix.md — 18 vendors (Retell, Vapi, Cognigy, OpenAI Realtime, Microsoft Healthcare Bot, Hyro, Twilio Engage, Voiceflow, Pinecone Healthcare, ElevenLabs Conversational, Daily.co Bots, LiveKit Cloud, etc.) × 8 axes (latency, hallucination rate, citation density, Dutch quality, multi-tenant isolation, EU residency, AI Act readiness, safety architecture). The matrix is opinionated — every cell has a citation or a "no public claim" marker, and the gap section is the part the user-facing pilot deck quotes.
Three differentiators that survived honest scrutiny:
- Citation density per claim (Pattern C / D / E / F enforces per-bullet markers; see ADR-0056). No competitor inspected emits per-bullet inline `[N]` markers on health queries.
- Multi-tenant safety architecture (medical-content disclaimer + crisis dispatch + AI Act §50(2) compliance enforced at the layer the LLM cannot bypass).
- Hospital-agnostic Value Framework (intent × category affinity rerank). No competitor inspected applies category-typed rerank to LLM context selection.
Three honest gaps that survived:
- STT quality on Dutch dialects: Deepgram NL-FL (the current vendor) still trails Azure Speech NL by ~3 WER points.
- Audio-loop evaluation harness — voice eval is currently turn-text-based; competitor benchmarks include audio-fidelity loops we haven't built.
- Per-tenant affinity overrides — the matrix is currently a module-level Python dict; multi-tenant production needs a DB-backed override table.
Phase 5 — Pilot review artifact bundle
docs/pilot-review/ — five reviewer-ready artifacts. Designed for a pilot-customer exec sponsor's pre-meeting reading window of ~20 minutes.
| Artifact | Length | Audience |
|---|---|---|
| `pilot-deck.md` | 1 652 words | Exec sponsor (10 slides as markdown) |
| `architecture-one-pager.md` | 804 words | Their CTO (Mermaid layered stack + per-turn sequence) |
| `demo-script.md` | 1 922 words | Anna Verstraeten persona; 7 worked scenarios sourced from GQ-001 / GQ-008 / GQ-017 / etc. |
| `engineering-rigor.md` | included | Test-coverage matrix, ADR count, methodology v2.3 mention |
| `q-and-a-prep.md` | included | Anticipated questions + honest answers |
After Phase 5 commit fea42924, the initiative is COMPLETE. Total cost: ~7 000 LOC of documentation work (audits + cascades + compendium + matrix + bundle), zero production code changes from this phase alone.
2 · Autonomous latency wave (O3 / O4 / O5 / O10 / O12 / O16 + pydantic-ai swap)
The research report that paid for the wave
docs/2026-05-11-latency-opportunities-research.md (commit 01f59be0) is an end-to-end profile of the chat pipeline with per-stage p50 / p95 measurements. The report identified eight independent optimization opportunities (O3 through O16) and ranked them by (impact in ms) × (reversibility) ÷ (engineering cost). The wave executed the top-six during the night of 2026-05-11 → 12, autonomously, on a fresh branch.
What landed
| ID | Commit | Saving | Pattern |
|---|---|---|---|
| O3 | aae3c2a4 | ~120 ms median | Expand _SAFE_INTENTS to skip the LLM safety judge on three procedural answer classes (appointment_scheduling, navigation_or_practical_info, general_chitchat) that cannot semantically be medical advice |
| O4 | 7786a20c | ~30 ms median | Singleton AsyncOpenAI per (api_key, base_url, timeout) — eliminates the per-request HTTPS handshake to OpenAI's edge |
| O5 | a48efef9 | ~5 ms × N intents | lru_cache(maxsize=128) on build_rag_system_prompt(language, tenant) — the prompt is now built once per tenant per language, not once per query |
| O10 | 238ef215 | DB load reduction | Batched pipeline_telemetry INSERTs via executemany instead of one INSERT per turn |
| O12 | 014bf61c | Cold-start removal | Pre-warm RAG prompt cache for nl/en/fr/it at startup — the first request after a deploy no longer pays the cold-build cost |
| O16 | 876bb63f | Observability | GET /api/v1/admin/ops/latency-percentiles?days=N&channel=voice — p50/p95/p99 per stage; the dashboard that quantifies all the above |
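The O5 and O12 rows combine naturally: memoize the prompt builder, then call it once per language at startup. The following is a minimal sketch under stated assumptions. The prompt text is a placeholder, and `SUPPORTED_LANGUAGES` and `prewarm_prompt_cache` are hypothetical names; only `build_rag_system_prompt(language, tenant)` and `lru_cache(maxsize=128)` come from the table above.

```python
from functools import lru_cache

# Assumption: the four languages named in the O12 row.
SUPPORTED_LANGUAGES = ("nl", "en", "fr", "it")


@lru_cache(maxsize=128)
def build_rag_system_prompt(language: str, tenant: str) -> str:
    """Assemble the system prompt once per (language, tenant) pair."""
    # Placeholder body: the real template assembly is not shown in these notes.
    return f"[{tenant}] Answer in '{language}' using only the cited context."


def prewarm_prompt_cache(tenant: str) -> None:
    """Hypothetical startup hook: build every language variant before traffic arrives."""
    for language in SUPPORTED_LANGUAGES:
        build_rag_system_prompt(language, tenant)


prewarm_prompt_cache("zol")
print(build_rag_system_prompt.cache_info())
```

After the pre-warm, every request for a warmed `(language, tenant)` pair is a cache hit, so the first request after a deploy no longer pays the build cost.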
The pydantic-ai swap — biggest single win
Earlier in this same window, the project migrated 9 LLM JSON-output call sites to pydantic-ai (commits 942c2cbe → 59bd5727, batches A–F). The migration eliminated manual json.loads + Pydantic.model_validate chains, added native retry-on-validation-error, and removed the _parse_llm_json band-aid chain. It looked clean. It tested clean.
It also added ~700 ms per call on average — because pydantic-ai instantiates a full Agent[None, OutputModel] graph per request, including provider abstraction, tool registry, and OpenAI client setup. That setup cost dominated short prompts (intent classification, query rewrite) where the underlying LLM call was already only ~300 ms.
Commit b8d8da67 replaces pydantic-ai with a thin structured_call(prompt, output_model) helper that:
- Calls OpenAI directly with `response_format=json_object`.
- Round-trips the JSON through `model_validate_json` (Pydantic v2 native).
- On `ValidationError`, makes one retry with the error message appended to the prompt.
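The three steps above can be sketched as follows. This is an illustration, not the project's helper: it takes a `validate` callable instead of an output model so that it runs with the standard library only, and `complete` is a hypothetical stand-in for the direct OpenAI call. With Pydantic v2 one would presumably pass `OutputModel.model_validate_json` as `validate`; its `ValidationError` subclasses `ValueError`, so the same `except` clause applies.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


def structured_call(
    prompt: str,
    validate: Callable[[str], T],
    complete: Callable[[str], str],
) -> T:
    """Validate the model's JSON reply; retry exactly once with the error appended."""
    raw = complete(prompt)
    try:
        return validate(raw)
    except ValueError as exc:  # covers json.JSONDecodeError and Pydantic's ValidationError
        retry_prompt = (
            f"{prompt}\n\nThe previous reply failed validation: {exc}\n"
            "Return corrected JSON only."
        )
        # A second failure propagates to the caller: there is only one retry.
        return validate(complete(retry_prompt))


@dataclass
class IntentResult:
    intent: str
    confidence: float


def parse_intent(raw: str) -> IntentResult:
    """Stdlib stand-in for a Pydantic model's model_validate_json."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("intent"), str):
        raise ValueError("field 'intent' must be a string")
    return IntentResult(intent=data["intent"], confidence=float(data.get("confidence", 0.0)))


# Fake completion function for the demo: the first reply is invalid, the retry is valid.
_replies = iter(['{"confidence": 0.9}', '{"intent": "BILLING_INQUIRY", "confidence": 0.9}'])
prompts_seen: List[str] = []


def fake_complete(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(_replies)


result = structured_call("Classify: 'Where is my invoice?'", parse_intent, fake_complete)
print(result, len(prompts_seen))
```

The helper holds no per-request state beyond the two strings, which is the property the swap relies on: nothing is constructed per call except the prompt.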
Net effect: ~700 ms saved/call on every call site that was migrated. The five migration commits are kept in history — the lesson, codified as memory feedback-llm-mix-only-prompt-shrink-approved.md, is that every migration of an LLM-call helper layer must be measured end-to-end against the previous helper, not against the call-site code it replaced. The pydantic-ai code looked simpler; the wall-clock got worse.
The F2 misdiagnosis report (docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md) is the companion artifact — F2 (intent-prompt shrink) was queued as the next perf win after the pydantic-ai migration, then cancelled when the post-deploy benchmark showed intent_classification p50 = 2 454 ms, three times the research estimate. The latency wasn't in the prompt; it was in the helper. The wrong target was correctly not attacked.
Docker fixes that enabled the deploy
- `408b0979`: Pip version specifiers must be quoted in shell-redirect contexts in `Dockerfile.app`. The unquoted form `pydantic-ai==1.93.0` parsed as redirection to file `1.93.0` and silently installed nothing.
- `1a9412d0`: Switched to `pydantic-ai-slim[openai]==1.93.0` to drop the yanked `mistralai` transitive dependency. The full `pydantic-ai` package depended on it; the `-slim` variant doesn't.
These two fixes together unblocked the May 12 02:00 pilot deploy that carried the entire wave.
3 · Five new ADRs — 0053 through 0057
The pilot-review audit identified five accepted-but-undocumented decisions that had been operating as production code without an ADR. All five were written up this window. Each is summarised below; the full text lives in docs/decisions/.
ADR-0053 — Remove Neo4j, consolidate graph context onto PostgreSQL taxonomy
Retroactively documents the March 7 primary removal (commit d82b1592, ~16 000 LOC deleted). The motivation, in three lines: PostgreSQL with pgvector + the taxonomy schema provides everything Neo4j gave us (typed-node entity traversal via JOINs); the dual-datastore cost (operational, query-time, conceptual) outweighed the win; the architectural cleanup in May (commit 158d793) finalised the consolidation. Supersedes ADR-006 + ADR-0029. Amends ADR-0017 (Stage 2c graph search → deprecated), ADR-0028 (golden-page Neo4j seeding → no-op), ADR-0030 (entity-extraction now routes to PG, not Neo4j Cypher).
ADR-0054 — Intent Classification Cache
Adds a (tenant_id, normalized_query, language) → IntentClassificationResult cache between request entry and the LLM classifier. Two backends:
- Memory (default, per-worker LRU + TTL) — single-worker pilot config.
- Redis (opt-in `INTENT_CACHE_BACKEND=redis`): shared across worker replicas via the existing `app.db.redis` connection pool.
Poisoning guards: write only when confidence >= 0.85 and intent != UNKNOWN. The Redis backend's cache survives container restarts — the compensating control is the operator "Clear Cache" button on PlatformSettingsPage which wipes both this cache and the semantic_query_cache in a single click (POST /api/v1/settings/cache/clear).
Cache-hit removes ~2 300 ms from per-turn latency. Stacks with the semantic_query_cache: if both hit, the full pipeline collapses to ~50 ms.
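A minimal sketch of the memory backend with both poisoning guards, assuming a per-worker `OrderedDict` as described in §8. The class shape, the default `max_entries` and `ttl_seconds`, and the query normalization (lowercase plus whitespace collapse) are illustrative assumptions; only the key tuple, the 0.85 confidence floor, and the `UNKNOWN` exclusion come from the ADR summary.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IntentClassificationResult:
    intent: str
    confidence: float


CacheKey = Tuple[str, str, str]


class MemoryIntentCache:
    """Per-worker LRU + TTL cache that refuses low-confidence or UNKNOWN writes."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 3600.0,
                 min_confidence: float = 0.85) -> None:
        self._entries: "OrderedDict[CacheKey, Tuple[float, IntentClassificationResult]]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl_seconds = ttl_seconds
        self._min_confidence = min_confidence

    @staticmethod
    def _key(tenant_id: str, query: str, language: str) -> CacheKey:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return (tenant_id, " ".join(query.lower().split()), language)

    def get(self, tenant_id: str, query: str, language: str) -> Optional[IntentClassificationResult]:
        key = self._key(tenant_id, query, language)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]  # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, tenant_id: str, query: str, language: str,
            result: IntentClassificationResult) -> bool:
        # Poisoning guards: never cache uncertain or UNKNOWN classifications.
        if result.confidence < self._min_confidence or result.intent == "UNKNOWN":
            return False
        key = self._key(tenant_id, query, language)
        self._entries[key] = (time.monotonic() + self._ttl_seconds, result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return True
```

Guarding at write time rather than read time means a bad classification never enters the cache, so the "Clear Cache" button remains a recovery tool rather than a routine one.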
ADR-0055 — FAQ-Corpus Drift Prevention (the FAQ purge)
The 2026-05-12 audit of the 10 ZOL-specific FAQ entries against the live pilot corpus found:
| Verdict | Count |
|---|---|
| Directly contradicted by corpus | 3 |
| Incomplete | 2 |
| Unverified (no corpus evidence) | 3 |
| Aligned | 2 |
Drift typology: every entry making list claims, service routings, or counter-corpus assertions had drifted within ~3 months. The two aligned entries were single immutable facts (phone, address).
Phase 1 (executed, commit a9820c3f): purge all 10 ZOL FAQs from zol.yaml. Preserve only the four safety/policy entries (crisis_suicide_ideation, three emergency dispatch rules) plus the overnacht_ambiguity clarification.
Phase 2 (planned): nightly audit_faq_corpus.py cron — for each surviving entry, retrieve top-k chunks via RAG, ask GPT-4.1-mini to verdict against the FAQ answer, alert on CONTRADICTED.
Phase 3 (planned): demand-driven promotion — observe conversation_messages → cluster → score against RAG quality → auto-draft FAQ entries on low-quality × high-demand clusters. FAQs become fresh by construction instead of hand-authored guesses against last quarter's mental model.
The pharmacy incident that triggered the audit cannot recur — voice and chat now answer identically because both go through the same RAG path, not through the divergent FAQ-then-RAG cascade.
ADR-0056 — Chat Answer-Shape Typology (six patterns, not sixty rules)
After the competitor at zolcase.novation.website/slim-zoeken produced visibly more scannable answers to "Wat is een gastroscopie?", the project's CHAT_BOLD_LEDE_RULE (added earlier the same day, commit b88fc086) was expanded into a typology:
| Pattern | When | Shape |
|---|---|---|
| A — POINT-FACT | Single discrete answer (phone, address, hour) | 1-2 sentences + 1 citation |
| B — STEP-BY-STEP | "How do I X?" procedural | Numbered list, citation per step |
| C — ATTRIBUTE-LIST | Single topic, 3+ distinct attributes | 1-2 intro paragraphs + bullets with bold labels (**Duur:** ... [3]) |
| D — MULTI-ENTITY | Question covers multiple entities | One bold-lede paragraph per entity (the former CHAT_BOLD_LEDE_RULE) |
| E — COMPARISON | "X vs Y", "verschil tussen…" | Brief intro + two parallel bold-labeled sections |
| F — DECISION-TREE | "Wanneer moet ik X?" / triage | Conditional bullets: Bij ernstige…: action |
The rule applies to chat only — voice continues as natural prose (bullets read awkwardly aloud). Voice path verified unaffected at rag_service.py:4641 injection site.
The architectural principle: typology beats rule-per-defect. Adding one rule per visible problem produces a prompt that grows linearly with defect history; a typology compresses future maintenance because new defects fall under an existing pattern.
ADR-0057 — Tenant-Scoped Prompt Addendums + Tenant-Agnostic Doctor-Profile Boost
When asked "Is er raadpleging voor Dr. Matthias Dupont op woensdag?" the system answered "geen raadpleging" while the competitor answered "Ja, woensdagvoormiddag." The canonical doctor profile in the corpus had the truth (markdown table cell VM × WO = RP2w), but two compounding causes inverted it: the LLM couldn't parse the ZOL schedule format, and thematic co-retrieval (cardiology page, Arts Anders interview page) diluted the canonical chunk's signal.
The fix lands at two layers:
- Layer 1: `_TENANT_CHAT_ADDENDUMS` registry. A `{slug: addendum_text}` mapping injected into the chat system prompt after the answer-shape rules. Currently `{"zol": ZOL_DOCTOR_SCHEDULE_RULE}`; the rule contains the abbreviation legend + worked example + counter-example. Tomorrow's new tenant onboards without touching the shared template.
- Layer 2: tenant-agnostic `_boost_doctor_profile` in `search_service.py`. The `_DOCTOR_NAME_PATTERN` regex extracts "Dr. <Name>" across nl/en/fr/it. Chunk scores get a 1.50× boost when the document title starts with `dr. <name>`. The 1.50 calibration sits above both the campus boost (1.10) and the conversation-context boost (1.40): strong enough to pull the canonical profile from rank 2-3 to rank 1, not so strong it crushes other signals.
The architectural distinction codified in this ADR is now project policy:
| Fix shape | Isolation surface | Example |
|---|---|---|
| Tenant-specific data format | _TENANT_CHAT_ADDENDUMS[slug] | ZOL schedule table |
| Universal naming convention | _boost_* method, tenant-agnostic | Dr. <Name> title-prefix boost |
| Universal answer shape | CHAT_ANSWER_SHAPE_RULES | ADR-0056 |
| Tenant-specific FAQ | YAML in tenant overlay | ADR-0055 surviving entries |
4 · Voice Overlay Admin — Sprint E Waves A through D
The voice overlay system (tenant phonetic recovery + medical taxonomy + crisis dispatch + STT rules) was previously YAML-edited by engineers. This sprint built the full admin surface — hospital admins can now CRUD their own overlays from the /admin UI.
| Wave | What | Commits |
|---|---|---|
| A | Unified voice routing rules (one rule per slug, language-aware) | 9e4e9ab5 |
| A.5 | Medical taxonomy → tenant overlay (move from hardcoded Python dict to YAML) | c9d994c2 |
| A pt 2 | Pre-LLM crisis/emergency dispatch (regex pre-filter on the voice path, latency-zero) | f85726d7 |
| B | Voice-overlay read API (GET /api/v1/admin/voice-overlay/{slug}) | e3297cd0 |
| B test | Pin crisis_suicide_ideation in Wave B spec contract | b7c91292 |
| B refactor | Extract list_known_slugs to registry | f0770fce |
| C | Voice-overlay viewer UI | 0e82ba5d |
| C spec | Zol-default + missing test coverage | 500f2ade |
| C polish | A11y + type-safety | 4382ee80 |
| D backend | Voice-overlay write API + YAML import/export | 1130f103 |
| D1 polish | UP035 + dead async helper drop + 404 race subclass + audit cap | c06aa7df |
| D2 polish | Default taxonomy count + last-modified + a11y | 90b3c698 |
| D3 | Routing-rule edit modal + delete flow | 9b6ae389 |
| D3 review | Backdrop, regex neutral state, focus trap, narrowed types | 4ce002b1 |
| D4 | Taxonomy inline-edit + add-new row | 324f7bfa |
| D4 review | Keyboard tests + autofocus on edit | e9ea4bdc |
| D5 | YAML import/export tab | 5214b88f |
| D5 review | Blob URL cleanup + test spy restore | f1835045 |
| D polish | Empty-tenant onboarding banner | 2ba8956a |
| D foundation | Mutation hooks + Wave C UX polish | f570c687 |
The Wave D inline-edit pattern (commit 324f7bfa) is worth calling out — the row goes into edit mode in-place, ESC reverts, Enter commits, and there's an explicit + Add new row at the bottom. No modal for taxonomy edits. The routing-rule edit, by contrast, does use a modal because regex validation needs a preview before save.
The empty-tenant onboarding banner (2ba8956a) — final Wave D commit — handles the case where a hospital admin opens the page for a tenant whose YAML has not yet been seeded. Instead of an empty grid, they see a "Start from ZOL defaults" CTA that copies the canonical entries into their tenant slug as a starting point.
5 · Voice persona v1 + tenant-driven greeting
Commit c61ad16b introduces per-tenant voice persona — name, voice ID, greeting text, fallback line, available languages — all served from a new public endpoint (GET /api/v1/voice/persona/{tenant_slug}) consumed by the LiveKit voice agent at SIP-bind time.
Today, the ZOL persona reads:
"Goedendag, u spreekt met Zoë, de virtuele assistent van het Ziekenhuis Oost-Limburg. Hoe kan ik u helpen?"
The greeting is spoken by ElevenLabs voice ID pwMBnCuw3J0IFGFnFEFb at speed 1.05. The persona payload, the greeting, and the speed cap are all DB-backed now; an admin can rename the persona from "Zoë" to anything else through the admin UI, without a deploy.
Companion commits:
- `997c8db5`: minimal hospital-context preamble + Q1 conv_id observability. The voice LLM no longer prefixes every turn with the full hospital description; the persona endpoint carries the context once at greeting time.
- `62f18695`: SIP-bound conversation_id adopted on first turn. Fixes the silent-failure regression where SIP-binding produced a stable `conversation_id` but the first WS turn still generated a fresh one, breaking multi-turn memory.
6 · Seven RAG fixes (T1–T7) + F1 — MedChat 50-Q benchmark moves 87.5 → 91.1
The fixes from the 2026-05-11 comparison-RCA sprint (docs/2026-05-11-comparison-rca-fixes.md) shipped across this window. The benchmark moved from 87.5 avg / 3 wins / 21 losses to 91.1 avg / 23 wins / 7 losses / 0 P0.
| ID | Commit | What | Impact |
|---|---|---|---|
| T1 | 7ad18ff9 | Intent-classifier spoed routing + drop hardcoded "ZOL" from prompt + test consolidation | Q05, Q31, Q39 |
| T2 | e5d93ac8 | Race-guard against duplicate fallback row + language-distinctive test asserts | Q-fallback-race |
| T3 | 3dca6bef | 30s streaming timeout writes localized fallback — closes P0 "stuck on pending" | Q-pending-stuck |
| T4 | eb389ae2 | Q32 maagonderzoek compound regex + aanmelden comment fix + cache-stickiness note | Q32 |
| T6 | c7bc1fce | Intent-aware top-K + comprehensive-intent prompt — closes 12 coverage losses | Q-coverage × 12 |
| T7 | 819c704c | Query-rewrite expansion for ambiguous head terms — closes 3 retrieval misses | Q-ambiguous × 3 |
| F1 | 00e13367 | Inline [N] citation markers for chat channel — closes rubric ties | All chat |
| Shaper | fd081160 | Wire medical_taxonomy into doctor-enumeration — Q40 cardiology mapping | Q40 |
T5 was deferred: its scope expanded mid-implementation, so it was queued for the next planning window.
7 · Diagnostic v2 hardening (8 follow-up fixes)
The v2 diagnostic landed in May 4-9. Operator use revealed a string of edge cases:
| Commit | Fix |
|---|---|
| `8494b483` | json-repair fallback for malformed LLM output (close conv 7509b0e0 line-725 regression) |
| `f8542a34` | STT-mishearing awareness + retrieval verification + per-claim grounding |
| `e650385b` | Operator feedback loop + diagnostic_accuracy_rate metric on Operations dashboard |
| `45c349a9` | Install json-repair in app image (pyproject dep not in base) |
| `df06ea33` | Widen full-list triggers + inject synthetic department-doctors chunk (close conv a60d3f30 regression) |
| `e9925d4a` | Bump LLM timeout to 240s (close conv a60d3f30 hang) |
| `62db818c` | Meta-contract pinning timeout vs max_tokens consistency |
| `f537713b` | v1 fallback uses json-repair + v2 schema errors logged inline |
| `94f2373a` | Relax DiagnosisV2 schema to match what gpt-5.2 actually emits |
| `1ef7bbd9` | Sanitize LLM output before Pydantic to absorb model variability |
| `93270328` | Coalesce overall_score in log line (pyright cleanup) |
| `50306ef7` | Migrate to pydantic-ai for native retry-on-validation-error (later reverted via b8d8da67; see §2) |
| `9000c022` | Scale max_tokens with turn count + correct error code (close conv 7509b0e0 regression) |
| `fa6a247e` | Bump max_tokens budget for 16+ turn calls (close conv 0745d984 regression) |
| `3495fe8f` | Fire-and-forget POST + polling GET, closing the Cloudflare 100s gateway-timeout regression |
| `3be59d07` | 5 improvements for excellence that close the v2-investigation gaps |
| `d215a0a0` | 8 improvements for operator-language match, grounding visibility, voice telemetry attribution + resume/watch/category breakdown |
The Cloudflare 100s timeout fix (3495fe8f) is the most architecturally interesting — long v2 diagnoses on 16+ turn conversations were hitting Cloudflare's 100-second proxy timeout. The fix converts the synchronous POST /diagnose into a fire-and-forget POST that returns immediately with a job ID, plus a polling GET /diagnose/{id} that the UI hits every 2 seconds until the result is ready. Cloudflare sees only short requests; the actual LLM work runs uncapped on the backend.
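The submit-then-poll shape can be sketched in-process as below, assuming a thread pool stands in for the backend worker. All function names and the response payload shape are hypothetical; only the pattern itself (immediate job ID, short polling requests every 2 seconds) comes from the description above.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict

_executor = ThreadPoolExecutor(max_workers=2)
_jobs: Dict[str, Future] = {}


def submit_diagnosis(run: Callable[[], dict]) -> str:
    """POST handler body: start the long-running work and return a job ID at once."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(run)
    return job_id


def get_diagnosis(job_id: str) -> dict:
    """GET handler body: report the job status without blocking on the LLM call."""
    future = _jobs.get(job_id)
    if future is None:
        return {"status": "not_found"}
    if not future.done():
        return {"status": "pending"}
    error = future.exception()
    if error is not None:
        return {"status": "failed", "error": str(error)}
    return {"status": "done", "result": future.result()}


def poll_until_done(job_id: str, interval_s: float = 2.0, max_wait_s: float = 600.0) -> dict:
    """Client side: each poll is a short request, so no single one nears a proxy timeout."""
    deadline = time.monotonic() + max_wait_s
    while True:
        response = get_diagnosis(job_id)
        if response["status"] != "pending" or time.monotonic() >= deadline:
            return response
        time.sleep(interval_s)


def slow_diagnosis() -> dict:
    time.sleep(0.05)  # stands in for a long LLM call
    return {"summary": "ok"}


job = submit_diagnosis(slow_diagnosis)
print(poll_until_done(job, interval_s=0.01, max_wait_s=5.0))
```

In a multi-worker deployment the job registry would need to live in shared storage rather than in a module-level dict; that detail is outside what these notes describe.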
8 · Intent Cache (Redis-backed, cross-worker) + admin kill switch
Spec'd by ADR-0054 (§3 above), shipped across three commits:
- `5fed3bef`: In-memory intent cache (Experiment C). Per-worker `OrderedDict` LRU + TTL. Proves the concept under single-worker pilot config.
- `17a7f56e`: Redis-backed cross-worker backend + admin kill switch. Shared across worker replicas via the existing `app.db.redis` connection pool. JSON round-trip through Pydantic. SCAN-targeted `intent_cache:` prefix.
- `f9a335c4`: UI "Clear Cache" button now wipes intent cache too. Single click recovers from cache poisoning across both caches.
Verification on pilot zol-rag-app:f9a335c4 (2026-05-12):
curl -X POST .../api/v1/query -d '{"query":"Hoeveel ziekenhuisbedden heeft het ZOL?","channel":"web"}'
redis-cli --scan --pattern "intent_cache:*"
# Returns: intent_cache:|nl|hoeveel ziekenhuisbedden heeft het zol
Test coverage: 12 unit tests on MemoryIntentCache, 13 integration tests on RedisIntentCache against a Redis 7 testcontainer, 4 integration tests on the kill-switch endpoint.
9 · Data Quality A / B / C / D — code-quality discipline at the data layer
The 2026-05-12 audit (docs/audits/2026-05-12-data-quality.md, referenced here but not shipped in this release) found:
| Issue | Count |
|---|---|
| Abandoned docs (pending > 48 h) | 258 |
| Duplicate doctor profiles | 40 |
| Lorem ipsum chunks in production corpus | 312 |
All sitting silently for weeks. None caught by pytest. None caught by deploy.
The fix is four parallel gates, modeled on the four code-side gates (ruff / pyright / tsc / eslint):
Layer A — Daily observability (commits e43de39b → 7f2789c5 → d2567b7b)
backend/scripts/audit_data_quality.py (~325 lines, single file, no new deps). Nightly cron at 03:30 UTC (30 min after the 03:00 ingest pass). Emits markdown report to tests/evaluation/results/data-quality-<YYYYMMDD>.md. Exit codes: 0 clean / 1 warning / 2 alert. Wired into APScheduler in d2567b7b.
Metrics emitted:
- Pending-age p50/p95 (soft >24 h, hard >48 h)
- Failure-reason histogram (soft >25, hard >100), with `failed_real` vs `intentional_softdeletes` split (per `7f2789c5` refinement)
- Duplicate-title count (alert on ≥1; should always be 0 with Layer B)
- Per-tenant doctor coverage (% with schedule, % with Lorem ipsum)
- Chunk-quality stats (short/empty/lorem)
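The soft/hard thresholds above map onto the script's three exit codes in a straightforward way. The sketch below assumes the process exits with the worst severity across all metrics; the `Metric` dataclass, the metric names, and the example values are illustrative, while the thresholds and the 0/1/2 codes come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable

EXIT_CLEAN, EXIT_WARNING, EXIT_ALERT = 0, 1, 2


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    soft: float  # a value above this is a warning
    hard: float  # a value above this is an alert


def severity(metric: Metric) -> int:
    """Classify one metric against its soft and hard thresholds."""
    if metric.value > metric.hard:
        return EXIT_ALERT
    if metric.value > metric.soft:
        return EXIT_WARNING
    return EXIT_CLEAN


def audit_exit_code(metrics: Iterable[Metric]) -> int:
    """Assumption: the exit code is the worst severity observed across all metrics."""
    return max((severity(m) for m in metrics), default=EXIT_CLEAN)


example = [
    Metric("pending_age_p95_hours", value=30.0, soft=24.0, hard=48.0),
    Metric("failed_real_count", value=10.0, soft=25.0, hard=100.0),
    # "Alert on >= 1" is expressed as soft = hard = 0 for an integer count.
    Metric("duplicate_title_count", value=0.0, soft=0.0, hard=0.0),
]
print(audit_exit_code(example))
```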
Layer B — Canonicalization + dedup gate (commits e43de39b + migration 068)
Two complementary mechanisms at document_service.py:1262+:
- The crawler already extracted the `<link rel="canonical">` value but discarded it. Now stored in `doc_metadata.canonical_url`.
- Pre-insert query: before INSERT, check for an existing completed doc whose normalized title (with the `| <brand>` suffix stripped) matches. If found, skip.
Migration 068 (commit d2567b7b) adds a partial unique index on (tenant_id, normalized_title) WHERE status='completed'. Schema-level enforcement: even if application-layer dedup is bypassed (manual insert, race, future code path), the DB rejects the duplicate. Partial because pending/processing/failed rows transit through transient duplicate states by design.
Commit 26ba1562 adds the empty-normalized-title guard — Layer B rejects ingest of docs whose title normalizes to empty string (a common drift mode for very-short pages).
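The application-layer half of Layer B can be sketched as a normalization function plus a pre-insert check. The exact normalization rules are not given in these notes, so lowercasing and whitespace collapsing are assumptions, as are the function names and the in-memory set that stands in for the "existing completed doc" query.

```python
import re
from typing import Set, Tuple

# Strip the last " | <brand>" suffix, e.g. "Dr. Matthias Dupont | ZOL".
_BRAND_SUFFIX = re.compile(r"\s*\|\s*[^|]*$")


def normalize_title(title: str) -> str:
    """Assumed normalization: drop the brand suffix, lowercase, collapse whitespace."""
    without_brand = _BRAND_SUFFIX.sub("", title)
    return " ".join(without_brand.lower().split())


def should_ingest(tenant_id: str, title: str, completed: Set[Tuple[str, str]]) -> bool:
    """Pre-insert gate; `completed` holds (tenant_id, normalized_title) of completed docs."""
    normalized = normalize_title(title)
    if not normalized:
        return False  # empty-normalized-title guard
    return (tenant_id, normalized) not in completed


completed_docs = {("zol", "dr. matthias dupont")}
print(should_ingest("zol", "Dr. Matthias  Dupont | ZOL", completed_docs))
print(should_ingest("zol", "Cardiologie | ZOL", completed_docs))
```

The partial unique index in migration 068 enforces the same `(tenant_id, normalized_title)` uniqueness at the schema level, so this check is an optimisation and a clearer error path rather than the only line of defence.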
Layer C — Sanitization + extractor (commits cfcb4e2b + 7f2789c5)
- Lorem ipsum sanitizer (`cfcb4e2b`): one-shot script that walks existing chunks and strips template Lorem ipsum text. Idempotent. Backfill-safe.
- Schedule-table extractor (`7f2789c5`): `extract_consultation_schedule()` in `data_quality.py`. Parses ZOL Drupal schedule tables into structured JSON. Empty cells omitted; all-empty tables return None. Future cell codes (RP3w, etc.) pass through verbatim, with no fixed vocabulary constraint. Hooked into `document_service.py:1314+` so every new ingested doc gets its schedule extracted and stored at `doc_metadata.consultation_schedule`. Existing docs backfilled by `scripts/backfill_consultation_schedule.py`.
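To make the extractor's contract concrete, here is a sketch that parses a markdown-rendered schedule table. The input layout (day columns, one row per day-part such as VM and NM) and the output shape are assumptions; the real function's signature and JSON schema are not shown in these notes. The three behaviours it demonstrates are the ones stated above: empty cells omitted, all-empty tables return None, and cell codes passed through verbatim.

```python
from typing import Dict, List, Optional


def extract_consultation_schedule(markdown_table: str) -> Optional[Dict[str, Dict[str, str]]]:
    """Parse a day-part x weekday table into {row_label: {day: cell_code}}."""
    rows: List[List[str]] = []
    for line in markdown_table.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the markdown separator row (cells made only of '-' and ':').
        if cells and all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if len(rows) < 2:
        return None  # no header plus body: nothing to extract
    header, body = rows[0], rows[1:]
    schedule: Dict[str, Dict[str, str]] = {}
    for row in body:
        label, values = row[0], row[1:]
        # Keep only non-empty cells; codes are not checked against a vocabulary.
        slots = {day: code for day, code in zip(header[1:], values) if code}
        if label and slots:
            schedule[label] = slots
    return schedule or None


table = """
|    | MA | DI | WO   | DO | VR |
|----|----|----|------|----|----|
| VM |    |    | RP2w |    |    |
| NM | RP3w |  |      |    |    |
"""
print(extract_consultation_schedule(table))
```

A downstream consumer could then answer "is there a consultation on Wednesday morning?" with a dictionary lookup instead of asking the LLM to read the table.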
The schedule extractor is the structured-data fallback to ADR-0057's ZOL_DOCTOR_SCHEDULE_RULE prompt rule. The prompt rule teaches the LLM to parse markdown when consumers haven't been upgraded yet; the extractor makes the structured JSON available for any future code path that wants deterministic schedule answers without LLM table parsing.
Layer D — Post-completion verification (e43de39b)
Audits run after each ingest cycle and verify that what landed matches what was crawled. Layer D doesn't have its own commit beyond e43de39b because the verification logic lives inside Layer A's audit script: Layer D is the "what we check" dimension, Layer A is the "when we check it" dimension.
Companion fixes
- `357725cb`: `d.metadata` column (DB) not `d.doc_metadata` (Python attribute). Backfill SQL had the wrong column name.
- `e03ff4cb`: pin department-overview format → None decision. Department-overview pages have no consultation schedule by design; the extractor must return None, not raise.
- `f0a577aa` + `6c3ddfb5`: multi-match handling in `find_document_by_source_url`. Layer B's dedup needs to behave deterministically when the same canonical URL surfaces in two different ingest passes.
- `9fa6fa16`: `PERCENTILE_DISC` for timestamp percentiles (was `PERCENTILE_CONT`). PERCENTILE_CONT interpolates between two timestamps, which produces a fractional-microsecond value that PostgreSQL can't cast back to `timestamp with time zone`.
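The difference behind the last fix is that a discrete percentile always returns a value that exists in the input, so no interpolation (and no cast) is involved. The sketch below is a Python illustration of that semantics, not the SQL from the commit; the sample timestamps are made up.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Sequence


def percentile_disc(fraction: float, values: Sequence[datetime]) -> datetime:
    """Return the first sorted value whose cumulative position reaches `fraction`."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Position k (1-based) covers k/n of the set; pick the smallest k with k/n >= fraction.
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


base = datetime(2026, 5, 12, 3, 0, tzinfo=timezone.utc)
created_at = [base + timedelta(hours=h) for h in (0, 1, 5, 30)]
median = percentile_disc(0.5, created_at)
print(median.isoformat())
```

With four samples the discrete median is the second one; a continuous percentile would instead interpolate halfway between the second and third.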
10 · Value Dashboard — /value/trend endpoint + live today aggregation
The previous release shipped the dashboard with the daily-volume chart receiving an empty array: the data existed in daily_tenant_metrics but had no API surface yet. This release closes the gap:
- 1bc39819 — GET /api/v1/admin/value/trend?days=N&channel=X returns a daily series for the chart. Sub-50 ms with the existing partial index.
- fea7bd65 — Live today aggregation. The dashboard's today row was previously stale until the next 02:30 UTC nightly aggregator. The fix computes today's row on-demand by querying conversations directly (the same query the nightly job runs, just bounded to today's date). Stacks with the 60-second cache so dashboard reloads are cheap.
The "today" gap was the most frequent operator complaint from the v1.0 release window.
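How on-demand aggregation stacks with a 60-second cache can be sketched as follows. The class name `TodayRowCache`, the `aggregate` callable, and the injectable `clock` and `today` parameters are hypothetical; the sketch assumes only that the aggregation is the nightly query bounded to a single date.

```python
import time
from datetime import date, datetime, timezone
from typing import Callable, Optional, Tuple

CACHE_TTL_SECONDS = 60.0  # the 60-second dashboard cache mentioned above


class TodayRowCache:
    """Serve today's aggregate on demand, recomputing at most once per TTL."""

    def __init__(
        self,
        aggregate: Callable[[date], dict],  # hypothetical: nightly query bounded to one day
        clock: Callable[[], float] = time.monotonic,
        today: Callable[[], date] = lambda: datetime.now(timezone.utc).date(),
    ):
        self._aggregate = aggregate
        self._clock = clock
        self._today = today
        self._cached: Optional[Tuple[float, date, dict]] = None

    def get(self) -> dict:
        day, now = self._today(), self._clock()
        if self._cached is not None:
            stamp, cached_day, row = self._cached
            # Reuse the row only for the same UTC day and within the TTL.
            if cached_day == day and now - stamp < CACHE_TTL_SECONDS:
                return row
        row = self._aggregate(day)
        self._cached = (now, day, row)
        return row
```

Keying the cache on the date as well as the timestamp means the first request after midnight UTC never serves yesterday's row.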
11 · Chat UI competitor-parity polish
Commit b88fc086 ships a visual refresh of the chat surface to match the competitor's polish level:
- Bold lede — the first sentence of every multi-entity answer is bolded (paired with the Pattern D rule from ADR-0056).
- Chip row — common follow-up questions render as clickable chips below the answer (re-uses the existing follow-up suggestion logic; previously rendered as a separate "You might also ask…" section).
- Navy hero — the welcome state's hero panel now uses the ZOL navy accent color rather than the previous neutral white.
- tel: links — every phone number in every answer is now an <a href="tel:..."> link — taps to call on mobile, click to call on desktop with a supported handler.
The tel: link fix is the most operationally impactful — previously, calling the hospital from chat required copy-paste. Now it's one tap.
12 · Voice fixes (8 user-facing improvements)
Smaller-bore but valuable:
| Commit | Fix |
|---|---|
| e24d4d84 | Voice orchestrator query sanitization — extract core question from STT noise/filler before tool dispatch |
| 7cc1bd4d | Broaden phone-number FAQ regex (close T10 "algemene nummer" miss) |
| 93a01b29 | TTS-friendly decimal currency normalisation — 0,4391 → nul comma vier vier euro per kilowattuur |
| 5ca32392 | Require explicit goodbye for end_call (close T11 soft-farewell regression) |
| 59b6cbb7 | Explicit transfer-verb gate — fixes never-firing handoff |
| b7f7f75e | Persist channel='voice' on fallback conversation create — was defaulting to web |
| 337d6ef0 | Ellipsis joiner no longer truncates phone numbers + corrected fastmcp diagnosis |
| 8953764e | GZip middleware + voice-channel max_tokens cap (300) — caps verbose voice answers at ~30 s read time |
The max_tokens=300 cap is worth dwelling on: voice answers longer than ~300 tokens take more than 30 seconds to read aloud, by which time the caller has usually interrupted. The cap forces the LLM to prioritise — and the answer-shape typology (ADR-0056) is chat-only specifically so voice answers stay tight and prose-shaped.
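The currency normalisation in 93a01b29 is easiest to see on the example from the table. The sketch below reproduces that one mapping under assumptions: the rounding to two decimals is inferred from the example (0,4391 becoming "vier vier"), and the function name `speak_decimal_nl`, the digit-by-digit spelling of the whole part, and the non-negative-input restriction are illustrative choices, not the shipped implementation.

```python
from decimal import ROUND_HALF_UP, Decimal

# Dutch digit words; spelling each digit stops the TTS engine from reading
# "4391" as a four-digit number.
NL_DIGITS = ("nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen")


def speak_decimal_nl(amount: str) -> str:
    """Round a Dutch-formatted, non-negative decimal and spell it digit by digit."""
    value = Decimal(amount.replace(",", "."))
    rounded = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    whole, _, frac = f"{rounded:f}".partition(".")

    def spell(digits: str) -> str:
        return " ".join(NL_DIGITS[int(d)] for d in digits)

    return f"{spell(whole)} comma {spell(frac)}"


print(speak_decimal_nl("0,4391") + " euro per kilowattuur")
# prints: nul comma vier vier euro per kilowattuur
```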
13 · OpenRouter removal (Phase 2)
Commit 49225042 deletes the OpenRouter code paths after Phase 1 last sprint made gpt-4.1-mini the default LLM. The Phase 2 removal:
- Drops the OPENROUTER_API_KEY setting.
- Removes the _get_openai_or_openrouter_client() factory.
- Removes the openrouter_default_model configuration knob.
- Removes the 2 test files that pinned OpenRouter routing behaviour.
After Phase 2, every LLM call in the system goes through OpenAI directly. The gpt-4.1 / gpt-4.1-mini / gpt-5.2 model identifiers are now OpenAI model IDs; no provider abstraction layer.
This is the prelude to ADR-0058 (planned next window) which formalises the per-LLM-call model routing policy based on the May 11 LLM-mix proposal (docs/2026-05-11-llm-mix-proposal.md). The current state: gpt-4.1 for high-stakes (medical-content judging, dialogue management, v2 diagnosis), gpt-4.1-mini for chat answer generation + most other call sites, gpt-4.1-nano reserved for the cost-critical safety gate.
14 · Telemetry split — first_token_ms + generation_total
Commit 2fdf69f3 splits the generation-latency telemetry into two columns:
- first_token_ms — time from prompt send to first streamed token. This is what the user perceives as latency.
- generation_total — total streaming time including the final token.
The split lets the operator answer the question "did the user feel this turn as slow, or was the answer just long?" — important because long-but-fast-first-token answers (Pattern C bulleted procedure explanations) feel responsive even when the total generation is multi-second.
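The two measurements can be taken in a single pass over a token stream. The following is a minimal sketch, assuming a synchronous iterator of tokens and an injectable clock; the function name `timed_stream` and the returned dict shape are hypothetical, and only the two metric names come from the commit.

```python
import time
from typing import Callable, Iterable, Optional


def timed_stream(tokens: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report first-token and total generation time in ms."""
    start = clock()
    first_token_ms: Optional[float] = None
    parts = []
    for token in tokens:
        if first_token_ms is None:
            # Stamp only the first token: this is the latency the user perceives.
            first_token_ms = (clock() - start) * 1000
        parts.append(token)
    generation_total = (clock() - start) * 1000
    return {
        "text": "".join(parts),
        "first_token_ms": first_token_ms,  # stays None if the stream was empty
        "generation_total": generation_total,
    }
```

A turn with a low first_token_ms and a high generation_total is the "long answer, felt fast" case described above; the reverse points at prompt assembly or model queueing.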
15 · Methodology v2.3 — Decision-Cost Rubric + Brainstorm Gate
Commit c2dce94d lands the v2.3 amendment in this project's CLAUDE.md (referencing /Users/soft4u/Development/s4u-methodology/docs/methodology.md §2.7 + §3.1), and ships the Docusaurus showcase at docs/methodology/decision-cost-rubric.md.
The rubric has six axes:
- Latency — is the proposed change in the per-turn path? How much does it add?
- Dependency surface — new package? Transitive deps? Yanked deps?
- Debuggability — how does a future engineer track a failure through this layer?
- Reversibility — how long does it take to revert if measurement says it's worse?
- Blast radius — single call site? Replicated across N call sites?
- Alternative — what's the cheapest thing that solves 80% of the problem without this change?
The Brainstorm Gate (methodology §3.1, project CLAUDE.md): when a proposal triggers any of [new dependency, replicates 3+ sites, >100 ms latency change, public API/schema modification, >2 h estimated work], the proposer must emit a Pre-Mortem Block addressing all six axes plus "Strongest risk", "What would change my mind", "Confidence" BEFORE proceeding to spec / plan / code.
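The five triggers are mechanical enough to express as a check. The sketch below is illustrative only: the `Proposal` dataclass and its field names are hypothetical, and the thresholds are the ones quoted in the paragraph above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    # Hypothetical fields; the methodology states the triggers in prose only.
    adds_dependency: bool = False
    call_sites_touched: int = 1
    latency_delta_ms: float = 0.0
    changes_public_api_or_schema: bool = False
    estimated_hours: float = 0.0


def brainstorm_gate_triggers(p: Proposal) -> List[str]:
    """Return the triggers a proposal hits; any hit requires a Pre-Mortem Block."""
    triggers = []
    if p.adds_dependency:
        triggers.append("new dependency")
    if p.call_sites_touched >= 3:
        triggers.append("replicates 3+ sites")
    if abs(p.latency_delta_ms) > 100:
        triggers.append(">100 ms latency change")
    if p.changes_public_api_or_schema:
        triggers.append("public API/schema modification")
    if p.estimated_hours > 2:
        triggers.append(">2 h estimated work")
    return triggers


# A proposal shaped like a library migration across 8 call sites.
migration = Proposal(adds_dependency=True, call_sites_touched=8, latency_delta_ms=700)
print(brainstorm_gate_triggers(migration))
```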
The load-bearing case study, captured in the Docusaurus showcase, is the pydantic-ai migration from earlier this same window (commits 942c2cbe → 59bd5727, reverted by b8d8da67 for the ~700 ms perf hit described in §2). If the Brainstorm Gate had been enforced at proposal time, the "latency" axis would have demanded an end-to-end measurement against the existing helper before the 8 call-site migrations landed. The measurement happened post-deploy. The investment was reversed. The rubric is the operational lesson.
Memory feedback_methodology_v2_3_brainstorm_gate.md mirrors the canon with project-specific notes. From now on the user is empowered to interrupt with "pre-mortem first" and restart any qualifying turn.
16 · Docusaurus 3.10.1 upgrade + 9-persona voice scenario library
Commit 9f84498f lands three coordinated changes in one PR so the docs build and the new persona library deploy in one step:
- Docusaurus 3.9.2 → 3.10.1. MDX v3 ships in 3.10 with stricter parsing. Migration fixes: 11 <!-- ... --> HTML comments rewritten to {/* ... */} across 10 docs; markdown.format: 'detect' in config to keep .md files in commonmark mode (otherwise every {#heading-id} heading marker fails acorn parsing); 62 bibliography heading texts stripped of leading @ (was parsed as JSX component reference).
- 9-persona voice scenario library — under docs/voice/test-scenarios/. Nine personas (Anna the polite caller, Bram the impatient one, Carla the elderly slow speaker, etc.) × 8-12 turns each. Each persona is [caller script | observable backend events] paired to drive a deterministic smoke test. The library is the input for the next iteration of the voice golden eval harness.
- Voice golden-eval harness (tests/evaluation/run_voice_evaluation.py with --use-pilot HTTP mode) — the runner now invokes the deployed pilot's WS endpoint and validates each turn against the persona script. The mode landed in commit cb773301. Persona content was tuned to 80/82 turns passing against pilot in f14a7c3b (Sprint E Task 4).
The Mermaid theme upgrade (commit aad8be91) makes sequenceDiagram labels readable at default zoom — previously the sequence-diagram label text was dark-on-dark.
17 · Q5 RCA — prompt-template forgery + dedup data-loss (iceberg debug)
User asked "why is Q5 still poorly answered?" — Q5 is "Zijn er laadpalen voor elektrische wagens?" — and the answer was Class C ("not explicitly mentioned"). Investigation surfaced two compounding bugs masking each other since 2026-05-12:
Bug 1 — Prompt-template anchoring (commit 2eecccaf). Cluster 2's "Class B" few-shot prompt used Q5 verbatim as the worked example. The LLM was parroting the example's hedged answer regardless of retrieved chunks — the "Op campus X zijn er laadpunten beschikbaar" wording was forged from the prompt example, not grounded in any chunk. Replaced the worked example with a synthetic uitleendienst voor rolstoelen query + added an explicit "Class A wins over Class B when specifics are present" precedence rule.
Bug 2 — Dedup misfire (commits 6eed9be8 + 82b54e12). A one-shot dedup pass on 2026-05-12 marked 155 documents as superseded-by-dedup-2026-05-12-broader using an oldest-wins heuristic. The Parkeerinformatie page with the gold laadpalen content (6 chunks incl. €0,4391/kWh + 11 kW + per-campus list) was demoted to failed; a 3-chunk stub kept as the "winner". When bug 1's prompt example was removed, the LLM stopped forging answers and exposed bug 2.
Audit-before-fix discipline (the load-bearing methodology call). Instead of mass-flipping all 155 superseded docs, I wrote backend/scripts/audit_dedup_broader_2026_05_12.py, a read-only script that generates a markdown report (backend/tests/evaluation/results/dedup-broader-audit-20260513.md), and discovered that only 17 of 155 (11%) had a meaningfully wrong winner. That is below the 50% Pre-Mortem bulk-fix threshold, so a surgical fix was the right scope. The companion flip_dedup_broader_misfires_2026_05_13.py atomically flipped those 17 pairs (winner→failed FIRST so migration 068's partial unique index never observes two completed docs with the same normalized title). A second audit covered the 275 superseded-by-source-url-dedup-2026-05-13 rows (scripts audit_dedup_source_url_2026_05_13.py + flip_dedup_source_url_misfires_2026_05_13.py); 7 more pairs were flipped where the kept winner had zero chunks and the loser had ≥1 chunk + ≥200 chars.
Total: 24 docs restored. Q5's post-fix answer now scores 95/100 with full tariff + wattage + per-campus laadpunt counts, level with the 95/100 reference score. Verzakking-en-incontinentiekliniek, Stereotactische radiochirurgie, Menopauze, and 17 others recovered in the same flip.
Memory feedback_dedup_heuristic_lessons.md captures the two-rule lesson: (1) never use oldest-wins for a periodically re-crawled corpus — the newer crawl is usually more comprehensive; (2) always pre-filter winner candidates to chunk_count > 0 — an empty doc is never a defensible canonical version of any non-empty alternative. The defensive rule is independent of the primary heuristic and is what should have prevented all 24 misfires.
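The two rules combine into a small winner-selection function. This is a sketch under assumptions: the lesson only says not to use oldest-wins, so newest-wins with chunk count as a tie-breaker is one possible replacement heuristic, and `DocCandidate` and `pick_dedup_winner` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class DocCandidate:
    doc_id: str
    crawled_at: datetime
    chunk_count: int


def pick_dedup_winner(candidates: List[DocCandidate]) -> Optional[DocCandidate]:
    """Pick the canonical doc among duplicates.

    Rule 2 (defensive pre-filter): an empty doc can never win.
    Rule 1 (assumed replacement for oldest-wins): prefer the newest crawl,
    breaking ties by chunk count.
    """
    non_empty = [c for c in candidates if c.chunk_count > 0]
    if not non_empty:
        return None  # nothing defensible to keep; leave the group for manual review
    return max(non_empty, key=lambda c: (c.crawled_at, c.chunk_count))


stub = DocCandidate("stub", datetime(2026, 3, 1), chunk_count=3)
full = DocCandidate("full", datetime(2026, 5, 10), chunk_count=6)
empty_newer = DocCandidate("empty", datetime(2026, 5, 12), chunk_count=0)
print(pick_dedup_winner([stub, full, empty_newer]).doc_id)
```

Because the pre-filter runs before the ordering, it keeps protecting against empty winners even if the primary heuristic is later changed.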
18 · 50-Q benchmark post-Q5-fix — avg 93.9, dead-heat with ZOL Slim Zoeken
Post-fix 50-question MedChat-vs-ZOL-Slim-Zoeken benchmark (commit 55c5b727, report at backend/docs/2026-05-13-bench-post-dedup-flip.md + raw JSON):
| Metric | 2026-05-11 baseline | 2026-05-12 post-perf | 2026-05-13 post-fix |
|---|---|---|---|
| MedChat avg | 87.5 | 90.0 | 93.9 |
| ZOL Slim Zoeken avg | (lower) | (lower) | 93.8 |
| MedChat <75 | many | several | 0 |
| MedChat P0 | many | 0 (cache poisoning) | 0 |
| Wins / Losses / Ties | 23 / 7 / 19 | 22 / 6 / 22 | 5 / 9 / 36 |
The win-count metric collapsed (22 → 5) because both systems now reliably score 95 — 36 of 50 questions are 95/95 ties. The avg score is the right read: MedChat moved +3.9 points overnight. The remaining 9 losses are 90/95 or 85/95 margins — judge-side nitpicks (Q1 wants "check website" framing; Q4 has raw [N] markers; Q35 conflates scan time vs scan+wait time). None worth chasing at avg 93.9 with 0 P0s.
19 · Voice eval recovery — corpus restoration flowed through to phone channel
The voice golden eval on the post-dedup-fix pilot moved from 1/9 personas / 80.5% turn-pass (this morning, pre-fix) to 3/9 personas / 87.8% turn-pass (post-fix, 7m07s wall time, label post-dedup-flip-2026-05-13-55c5b727).
★ Architectural confirmation — the voice channel inherited the corpus-restoration wins from the chat-side dedup flips. Three personas (Sofie Peters 10/10, Mevrouw Maeyens 8/8, Christophe Lefebvre 10/10) that were previously failing now pass cleanly. The thin pipeline + shared RAGService mean a corpus win flows through to every channel for free — no per-channel content fix needed. This is the right architectural sign.
20 · Voice Waves 0 → 2 — disclaimer-once, billing intent, pharmacist deflect, language fidelity, latency budgets
After the corpus-restoration recovery still showed 6 voice-eval failures, root-cause analysis broke them into 4 categories. Wave 3 (intent prompt shrink) was deliberately skipped per the 2026-05-11 F2 misdiagnosis report — the prompt is already ~3k tokens, the actual shrink win is <100ms not 500ms, and Wave 2b's budget bumps already absorbed the latency-overrun failures.
Wave 0 — Disclaimer once-per-conversation (commit f062690e)
User feedback: "the AI says 'This is not a medical recommendation' too often. It should only say that once per conversation, because it gets annoying."
The voice answer-shaper auto-detected medical content on every turn and prepended the per-language disclaimer. For multi-turn calls the prefix became repetitive spoken padding that diluted its own credibility.
New module app/services/voice/disclaimer_tracker.py: Redis-backed flag keyed by conversation_id with 1-hour TTL. It exposes was_emitted(conv_id) -> bool and mark_emitted(conv_id) -> None. On Redis failure the tracker fails open: was_emitted reports False, so the disclaimer falls back to firing (the safer default; never silently swallow a medical disclaimer on an infra flake). VoiceAnswerShaper.shape() got a new suppress_disclaimer: bool parameter and exposes diag["disclaimer_prepended"]. The orchestrator's _execute_tool plumbs conversation_id through, consults the tracker before shape(), and marks the conversation as emitted afterwards.
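The fail-open behaviour is the part worth pinning down, so here is a minimal sketch of it. It is not the contents of disclaimer_tracker.py: the class name, the key prefix, and the synchronous client are assumptions. The store is duck-typed on `exists(name)` and `set(name, value, ex=seconds)`, which is how the redis-py client exposes those commands; two test doubles stand in for Redis so the sketch runs without a server.

```python
DISCLAIMER_TTL_SECONDS = 3600  # 1-hour TTL, as described above


class DisclaimerTrackerSketch:
    """Once-per-conversation flag that fails open when the store is unavailable."""

    def __init__(self, store):
        self._store = store

    @staticmethod
    def _key(conv_id: str) -> str:
        return f"voice:disclaimer:{conv_id}"  # hypothetical key layout

    def was_emitted(self, conv_id: str) -> bool:
        try:
            return bool(self._store.exists(self._key(conv_id)))
        except Exception:
            return False  # fail open: better to repeat the disclaimer than drop it

    def mark_emitted(self, conv_id: str) -> None:
        try:
            self._store.set(self._key(conv_id), "1", ex=DISCLAIMER_TTL_SECONDS)
        except Exception:
            pass  # an unrecorded mark only means the disclaimer fires again


class InMemoryStore:
    """Test double standing in for a Redis client (TTL is not simulated)."""

    def __init__(self):
        self._keys = {}

    def exists(self, name):
        return int(name in self._keys)

    def set(self, name, value, ex=None):
        self._keys[name] = value
        return True


class BrokenStore:
    """Test double that simulates a Redis outage."""

    def exists(self, name):
        raise ConnectionError("store unavailable")

    def set(self, name, value, ex=None):
        raise ConnectionError("store unavailable")


tracker = DisclaimerTrackerSketch(InMemoryStore())
first_turn_prepends = not tracker.was_emitted("conv-1")
tracker.mark_emitted("conv-1")
second_turn_prepends = not tracker.was_emitted("conv-1")
```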
R3 contract test pins the once-per-conversation invariant: two consecutive turns on the same conv must emit AT MOST ONCE, regardless of medical content on the second turn.
Chat channel is unaffected — safety_service.append_disclaimer is already a no-op (frontend renders the disclaimer below every chat answer). The over-firing surface was voice-only.
Wave 1a — BILLING_INQUIRY intent + safety whitelist (commit e9cf4717)
Voice-eval persona_07 (Christelijke Mutualiteit caller) failed 4 turns: T1-T3 routed to "medisch archief" or RIZIV instead of the facturatie helpdesk, T5 returned the wrong (facturatie-direct) number when the caller asked for the algemene number. No BILLING_INQUIRY intent existed; the queries fell through to general RAG.
Mirrors the Cluster 1 (institutional_treatment_info) + Cluster 3 (doctor_schedule_query) architecture:
- UserIntent.BILLING_INQUIRY — new enum member + drift-pin test updated.
- detect_billing_inquiry — pre-LLM regex gate with two paths: AND-logic (payer mention AND dossier/code reference) + solo allow-list (factuur, remgeld, factuurnummer). Conservative — "verzekering" or "mutualiteit" alone, as in "wordt mijn behandeling vergoed door mijn verzekering?", stays as general RAG.
- _SAFE_INTENTS whitelist — billing is institutional info, LLM safety judge bypassed.
- get_billing_inquiry_response(language, ctx) — tenant-agnostic per-language routing template (nl/en/fr/it). Always contains the literal "facturatie" token so persona contracts match.
- rag_service Stage-2d short-circuit — emits the template directly without retrieving from the corpus.
17 unit tests pin the contract: 6 AND-path positives (incl. persona_07 verbatim), 4 solo allow-list, 7 negative cases (generic insurance coverage queries, medical advice, doctor lookup, navigation, empty/trivial inputs).
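The two-path gate described above can be sketched with three patterns. The solo allow-list terms come from the notes; the payer and dossier/code patterns are assumptions chosen to illustrate the AND-logic, and the shipped regexes may differ.

```python
import re

# Solo allow-list: any one of these terms is enough (terms from the release notes).
SOLO_TERMS = re.compile(r"\b(factuur|remgeld|factuurnummer)\b", re.IGNORECASE)
# AND-path signals; both patterns below are illustrative assumptions.
PAYER_MENTION = re.compile(r"\b(mutualiteit|verzekering)\b", re.IGNORECASE)
DOSSIER_OR_CODE = re.compile(r"\b(dossier\w*|code)\b", re.IGNORECASE)


def detect_billing_inquiry_sketch(query: str) -> bool:
    """Pre-LLM gate: solo allow-list OR (payer mention AND dossier/code reference)."""
    if SOLO_TERMS.search(query):
        return True
    return bool(PAYER_MENTION.search(query) and DOSSIER_OR_CODE.search(query))


for q in (
    "Ik heb een vraag over mijn factuur",
    "Ik bel van de mutualiteit over dossier 12345",
    "wordt mijn behandeling vergoed door mijn verzekering?",
):
    print(q, "->", detect_billing_inquiry_sketch(q))
```

Requiring both signals on the AND path is what keeps a generic coverage question on the general RAG route.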
Wave 1b — Pharmacist-context dosing deflection (commit edf72b34)
Voice-eval persona_08 T2 (Apotheek Maaseik) failed: a pharmacist asking about bisoprolol dosing for an elderly post-TIA patient. The historical "TIA" mention triggered emergency_solo_keywords and got the 112-dispatch template — wrong for chronic-care dosing.
A safety-credibility issue, not a numerical-failure issue: if pharmacists learn the bot panics on every medication question, they stop trusting any of its routing.
New routing-category pharmacist_deflect at band 1 (between crisis=0 and emergency=2). Crisis still pre-empts (a suicidal pharmacist asking about an overdose dose IS a crisis case first); pharmacist_deflect pre-empts emergency-keyword matching for non-acute professional queries.
New YAML rule pharmacist_dosing_deflect with AND-logic in nl/en/fr/it: pharmacist signal (apotheek/voorschrift/prescriptie/pharmacy/prescription) AND dosing signal (dosis/dosage/mg/microgram/interactie). Both required — bare "apotheek" stays as general RAG. Deflect response routes to "voorschrijvende arts" / "huisarts" / helpdesk.
6 unit tests — including 1 load-bearing safety pin that a patient saying "ik denk dat mijn vader een TIA heeft" must STILL hit emergency dispatch with "112". The deflect rule cannot weaken the safety net for actual patient callers.
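The band ordering and the AND-logic rule can be sketched together. The band numbers and the pharmacist/dosing terms are taken from the notes; the crisis and emergency keyword lists are placeholders, and the fallback label `general_rag` and the function name `route_sketch` are hypothetical.

```python
import re

# Lower band pre-empts higher band (values from the notes above).
BANDS = {"crisis": 0, "pharmacist_deflect": 1, "emergency": 2}

PHARMACIST_SIGNAL = re.compile(r"\b(apotheek|voorschrift|prescriptie|pharmacy|prescription)\b", re.IGNORECASE)
DOSING_SIGNAL = re.compile(r"\b(dosis|dosage|mg|microgram|interactie)\b", re.IGNORECASE)
# Placeholder keyword lists; the real rule files are not shown in these notes.
EMERGENCY_KEYWORDS = re.compile(r"\b(tia|hartaanval)\b", re.IGNORECASE)
CRISIS_KEYWORDS = re.compile(r"\b(zelfmoord|suicide)\b", re.IGNORECASE)


def route_sketch(query: str) -> str:
    """Return the matching category with the lowest band, else general RAG."""
    matches = []
    if CRISIS_KEYWORDS.search(query):
        matches.append("crisis")
    # Both signals are required; a bare "apotheek" does not deflect.
    if PHARMACIST_SIGNAL.search(query) and DOSING_SIGNAL.search(query):
        matches.append("pharmacist_deflect")
    if EMERGENCY_KEYWORDS.search(query):
        matches.append("emergency")
    if not matches:
        return "general_rag"
    return min(matches, key=BANDS.__getitem__)


print(route_sketch("Apotheek Maaseik hier, welke dosis bisoprolol na een TIA?"))
print(route_sketch("ik denk dat mijn vader een TIA heeft"))
```

The second call is the safety pin from the unit tests: without a pharmacist signal, the TIA mention still reaches emergency dispatch.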
Wave 2a — Filler-template language fidelity (commit 0e5343f0)
Voice-eval persona_06 T3: an English caller asking "We have CM — Christelijke Mutualiteit — through his employer. Is that accepted at ZOL?" got back "Ik zoek dat even voor u op." — Dutch filler. The system prompt's Dutch examples were demonstrably out-prioritising the single trailing "match the caller's language" instruction.
build_voice_llm_orchestrator_system_prompt(ctx, language="nl") got a new language parameter. Per-language # REPLY LANGUAGE: <lang> hint blocks are prepended as the FIRST content in the system prompt — establishing the reply language before the Dutch examples downstream. EN-specific block explicitly says "NEVER respond in Dutch when the caller spoke English, even briefly." — the generic instruction wasn't enough. nl/en/fr/it parallel coverage per feedback-multi-language-voice-coverage.md.
7 unit tests, including a position pin (hint must appear BEFORE "You are the AI" identity line), unknown-language fallback to English (safer global default), and the load-bearing "NEVER respond in Dutch" phrase pin for persona_06 T3.
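The position and fallback pins can be illustrated with a simplified prompt builder. This sketch takes the base prompt as a string rather than a `ctx` object, and the nl/fr/it hint wording is invented for illustration; only the "# REPLY LANGUAGE" marker, the English phrase, and the fallback-to-English rule come from the notes.

```python
REPLY_LANGUAGE_HINTS = {
    # Only the English wording is quoted in the notes; the others are placeholders.
    "nl": "# REPLY LANGUAGE: nl\nAntwoord altijd in het Nederlands.",
    "en": "# REPLY LANGUAGE: en\nNEVER respond in Dutch when the caller spoke English, even briefly.",
    "fr": "# REPLY LANGUAGE: fr\nRépondez toujours en français.",
    "it": "# REPLY LANGUAGE: it\nRispondi sempre in italiano.",
}


def build_prompt_sketch(base_prompt: str, language: str = "nl") -> str:
    """Prepend the per-language hint so it is the first content in the prompt."""
    # Unknown languages fall back to English, the safer global default.
    hint = REPLY_LANGUAGE_HINTS.get(language, REPLY_LANGUAGE_HINTS["en"])
    return f"{hint}\n\n{base_prompt}"


base = "You are the AI assistant for the hospital.\nExample: Ik zoek dat even voor u op."
prompt = build_prompt_sketch(base, language="en")
print(prompt.splitlines()[0])
```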
Wave 2b — Latency budget calibration + test phrase refinement (commit 1f4c15f0)
Three latency-over-budget voice-eval failures where content was correct but timing was roughly 4-11% over the budget (persona_02 T6 12509/12000 ms, persona_08 T6 13125/12000 ms, persona_09 T4 8904/8000 ms). This is real LLM jitter on multi-tool / safety-judge turns. The pre-Wave-1 budgets pre-dated the new Redis lookup + extra dispatcher categories. Budgets were bumped 16-25% to absorb jitter without making the checks toothless.
Plus persona_10 T5 expected_phrases: added "kan ik niet" to the any-of list. The LLM emitted natural Dutch syntax "Het mobiele nummer ... kan ik niet geven" — the literal grep for "kan niet" missed it because of the intervening "ik".
No application-code changes — pure test-data tuning.
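Both test-data changes are easy to check by hand. The sketch below recomputes the overrun for the three failing turns and shows why a literal substring check for "kan niet" misses the reply; the helper names are hypothetical and the harness's actual matching code is not shown in these notes.

```python
from typing import List


def overrun_pct(observed_ms: int, budget_ms: int) -> float:
    """Percentage by which an observed latency exceeds its budget."""
    return (observed_ms - budget_ms) / budget_ms * 100


def matches_any(reply: str, expected_phrases: List[str]) -> bool:
    """Any-of check: passes if at least one phrase occurs as a literal substring."""
    lowered = reply.lower()
    return any(phrase.lower() in lowered for phrase in expected_phrases)


FAILURES = {
    "persona_02 T6": (12509, 12000),
    "persona_08 T6": (13125, 12000),
    "persona_09 T4": (8904, 8000),
}
for turn, (observed, budget) in FAILURES.items():
    print(f"{turn}: {overrun_pct(observed, budget):.1f}% over budget")

reply = "Het mobiele nummer ... kan ik niet geven"
print(matches_any(reply, ["kan niet"]))
print(matches_any(reply, ["kan niet", "kan ik niet"]))
```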
Wave 3 — SKIPPED with explicit rationale
Per the 2026-05-11 F2 misdiagnosis report (docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md): the intent prompt is already ~3k tokens (not the 11k the original premise claimed). A shrink would save ~50-100 ms of LLM prefill, not the 500 ms originally projected. The 3 latency-over-budget failures are already absorbed by Wave 2b. Pursuing the shrink would create churn in a file that every intent regression test depends on, for a benefit the test suite no longer needs. The real high-value latency lever (O1 parallel intent + retrieval) is a multi-hour architectural change tracked separately.
21 · ADR-0059 — tenant + language extension plan for Cluster 1+3 (proposed)
Commit d632afce ships ADR-0059 (docs/ADR/0059-tenant-and-language-extension-cluster-1-and-3.md — Proposed) — explicit plan for extending the Cluster 1 (institutional_treatment_info) and Cluster 3 (doctor_schedule_query) work to fully multi-tenant + multi-lingual. Three-axis audit (A = tenant data, B = corpus format, C = language coverage) found 7 specific gaps. Three-phase migration plan with effort estimates:
- Phase 1 (~half day) — extract the ZOL literal from Cluster 1 regex via get_prompt_context().short_name. Lowest blast radius, unblocks second-tenant onboarding immediately.
- Phase 2 (~1 day) — schedule-extractor registry per tenant in data_quality.py (Layer C extension).
- Phase 3 (~1.5 days) — nl/en/fr/it regex + response-template coverage for Cluster 1+3.
Status: Proposed. Acceptance criteria per phase documented; out-of-scope items (corpus migration, onboarding UX, corpus translation) explicitly listed so the ADR doesn't drift into a SaaS-platform overhaul.
22 · Dedup audit/flip toolkit (reusable across future cleanups)
Four new one-shot scripts committed under backend/scripts/:
- audit_dedup_broader_2026_05_12.py — read-only audit of the title-normalized dedup pass; outputs markdown report with Pre-Mortem signal (loser-wins rate vs 70%/50% thresholds).
- flip_dedup_broader_misfires_2026_05_13.py — atomic per-pair flip with idempotent re-run + post-flip unique-constraint verification.
- audit_dedup_source_url_2026_05_13.py — read-only audit of the source-url dedup; special-case bucket for empty-winner pairs (always unambiguous misfires).
- flip_dedup_source_url_misfires_2026_05_13.py — flips the 7 empty-winner pairs.
Both flip scripts MUST run inside a single transaction because migration 068's partial unique index ux_documents_tenant_normalized_title_completed would otherwise see two completed docs with the same normalized title mid-flight. The winner→failed update happens FIRST so the loser can flip to 'completed' without colliding.
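The ordering constraint can be reproduced with an in-memory SQLite database, which also supports partial unique indexes. This is a sketch, not the flip script: the table layout and `flip_pair` are hypothetical, SQLite stands in for the pilot database, and only the index name and the 'completed'/'failed' status values come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY, tenant TEXT, normalized_title TEXT, status TEXT
    );
    -- At most one completed doc per (tenant, normalized_title).
    CREATE UNIQUE INDEX ux_documents_tenant_normalized_title_completed
        ON documents (tenant, normalized_title) WHERE status = 'completed';
    INSERT INTO documents VALUES (1, 'zol', 'parkeerinformatie', 'completed'); -- wrong winner
    INSERT INTO documents VALUES (2, 'zol', 'parkeerinformatie', 'failed');    -- demoted loser
    """
)


def flip_pair(db: sqlite3.Connection, winner_id: int, loser_id: int) -> None:
    """Swap a misfired pair inside one transaction, demoting the winner first."""
    with db:  # commits on success, rolls back on any exception
        db.execute("UPDATE documents SET status = 'failed' WHERE id = ?", (winner_id,))
        db.execute("UPDATE documents SET status = 'completed' WHERE id = ?", (loser_id,))


flip_pair(conn, winner_id=1, loser_id=2)
flip_pair(conn, winner_id=1, loser_id=2)  # idempotent: a re-run changes nothing

# Promoting a second doc without demoting the current one violates the index.
try:
    with conn:
        conn.execute("UPDATE documents SET status = 'completed' WHERE id = 1")
    wrong_order_rejected = False
except sqlite3.IntegrityError:
    wrong_order_rejected = True
```

The unique index is checked per statement, so the transaction alone is not enough; the demote-then-promote order is what keeps every intermediate state valid.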
Pattern reusable for any future dedup cleanup. Memory feedback_dedup_heuristic_lessons.md captures the two-rule defense (never-oldest-wins + chunk-count-greater-than-zero pre-filter).
23 · Numbers
| Metric | Before this release | After this release |
|---|---|---|
| Pilot-review readiness artifacts | 0 | 5 (Phase 5 bundle) |
| Drift registers | 0 | 6 (one per topic area) |
| ADRs | 53 | 59 (5 from earlier in window + ADR-0059 proposed) |
| MedChat 50-Q benchmark (midweek) | 87.5 avg / 3 wins / 21 losses | 91.1 avg / 23 wins / 7 losses / 0 P0 |
| MedChat 50-Q (post-fix, 2026-05-13 afternoon) | 90.0 avg / 22 wins / 6 losses / 22 ties / 2 cache-poisoning P0 | 93.9 avg / 5 wins / 9 losses / 36 ties / 0 P0 (ZOL avg 93.8 — dead heat) |
| Voice eval (post-dedup-fix) | 1/9 personas, 80.5% turns | 3/9 personas, 87.8% turns (corpus restoration alone) |
| Chat p50 latency budget (estimate) | ~9 800 ms | ~9 100 ms (-700 ms from §2) |
| /api/v1/admin/ops/latency-percentiles | did not exist | live |
| ZOL-specific hand-curated FAQs | 10 (3 contradicted by corpus) | 0 (purged per ADR-0055) |
| Chat answer-shape rules | 1 (CHAT_BOLD_LEDE_RULE) | 6-pattern typology (ADR-0056) |
| Tenant-scoped prompt addendums | 0 | 1 (ZOL doctor schedule, ADR-0057) |
| Voice overlay admin surface | YAML-only | full CRUD UI + import/export |
| Intent-classification cache | did not exist | memory + Redis backends + UI kill switch |
| Data Quality nightly audit | did not exist | Layer A + scheduler at 03:30 UTC |
| Voice intents | 8 + 2 (Cluster 1/3 added midweek) | 11 (+ BILLING_INQUIRY Wave 1a) |
| Voice routing categories | 9 (crisis → faq) | 10 (+ pharmacist_deflect Wave 1b, band 1) |
| Voice medical disclaimer firing | every safety-flagged turn | once per conversation (Wave 0) |
| Voice system prompt | NL-implicit | per-language hint block at top (Wave 2a, nl/en/fr/it) |
| Documents restored from dedup misfires | 0 | 24 (17 broader + 7 source-url empty-winner) |
| Pilot DB migration head | 066 | 068 |
| Backend tests | ~4 900 | ~5 100 (+200 incl. Wave 0-2 contract tests) |
What's queued for next release
- Run the full Golden Eval against pilot HEAD — voice eval first (full set against https://zol.novation.website/api/v1 SIP path), then chat eval (299-question set) to confirm the latency wave + 5 ADRs + 7 RAG fixes hold end-to-end against the production image. Currently scheduled for the release-deploy session that's writing these notes.
- Phase 3 of ADR-0055 — demand-driven FAQ promotion pipeline. Observe conversation_messages → cluster → score against RAG quality → auto-draft FAQ entries on low-quality × high-demand clusters.
- Phase 2 of ADR-0055 — nightly audit_faq_corpus.py cron. The script is sketched in the ADR; needs implementation + alerting wiring.
- Per-tenant affinity override table — currently the Value Framework affinity map is a module-level Python dict. Tenants in non-default content distributions need DB-backed overrides via a new app.intent_category_affinity table.
- ADR-0058 — per-LLM-call model routing policy — formalises the May 11 LLM-mix proposal as a project decision.
- Twilio Phase B — pilot DNS + Let's Encrypt SIPS certificate + firewall rules. ADR-0050 has the runbook.
- Audio-loop evaluation harness — voice eval is currently turn-text-based against transcripts; the SOTA matrix's "audio-loop" gap (§Phase 4 above) gets filled here.
- Value Dashboard v2 polish — PDF export, language drill-down, custom date-range picker, business-hours admin UI. Carried over from the previous release.
References
- Previous release: May 4 – 9, 2026
- Pilot-review readiness plan: docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md
- Latency opportunities research: docs/2026-05-11-latency-opportunities-research.md
- F2 misdiagnosis report: docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md
- Comparison RCA + 7-fix implementation plan: docs/2026-05-11-comparison-rca-fixes.md
- LLM-mix proposal: docs/2026-05-11-llm-mix-proposal.md
- ADR-0053 through 0057 — docs/decisions/ (Docusaurus) + docs/ADR/ (source-of-truth)
- Decision-Cost Rubric showcase: methodology/decision-cost-rubric.md
- Drift registers: docs/audits/2026-05-09-*.md
- Voice compendium (Phase 3): compendium/
- SOTA positioning matrix (Phase 4): positioning/
- Pilot-review artifact bundle (Phase 5): pilot-review/
- Telemetry & Grafana/Prometheus runbook: operations/telemetry-and-runbooks
- Q5 RCA dedup audit reports: backend/tests/evaluation/results/dedup-broader-audit-20260513.md + dedup-source-url-audit-20260513.md
- Post-fix 50-Q benchmark: backend/docs/2026-05-13-bench-post-dedup-flip.md + .json (raw)
- ADR-0059 (proposed): docs/ADR/0059-tenant-and-language-extension-cluster-1-and-3.md
- Dedup heuristic lessons (memory): ~/.claude/projects/-Users-soft4u-Development-zol-rag/memory/feedback_dedup_heuristic_lessons.md