Skip to main content

Release Notes: May 9 – 13, 2026

Pilot-Review Readiness · Autonomous Latency Wave · 5 New ADRs · Voice Overlay Admin · Q5 RCA · Voice Waves 0-2

~155 commits | 5 days | 5-phase docs initiative | ADRs 0053→0057 (accepted) + ADR-0059 (proposed) | autonomous latency wave (-700ms/call) | Voice Overlay Admin (Sprint E A→D) | Data Quality A/B/C/D | methodology v2.3 Brainstorm Gate | Q5 RCA: 24 docs restored from dedup misfires, MedChat 50-Q avg 87.5 → 93.9 | Voice Waves 0-2: disclaimer once-per-conversation, BILLING_INQUIRY intent, pharmacist-deflect category, per-language reply hint, latency-budget calibration

This release is the largest single window since the project pilot began — three times the commit volume of May 4-9 and qualitatively different in shape. Where the previous window was a sprint of new features (Value Dashboard, Value Framework), this one is a sprint of maturation: re-aligning docs to code after months of architectural drift, codifying five overdue decisions as ADRs, dropping ~700 ms of synchronous wait from every chat turn, and shipping the editable admin surface that turns the voice overlay system from "engineer-only YAML" into "hospital-admin clickable UI".

The headline themes:

  1. Pilot-Review Readiness initiative — 5 phases, ~7 000 LOC of documentation work. Audit (Phase 1, 6 drift registers) → cascade fixes (Phase 2, four sub-batches) → voice compendium (Phase 3, transferable white paper) → SOTA positioning matrix (Phase 4, 18 vendors × 8 axes) → pilot review artifact bundle (Phase 5, 5 reviewer-ready artifacts). The project's docs and code are now in alignment for the first time since the November 2025 voice cut.
  2. Autonomous latency wave (O3 / O4 / O5 / O10 / O12 / O16) — p50 chat latency now budget-bound by retrieval, not by prompt assembly or telemetry writes. Six independent fixes shipped from a single research report: lru-cached prompt assembly, singleton AsyncOpenAI clients, batched executemany for pipeline_telemetry, pre-warmed prompt cache for nl/en/fr/it at startup, expanded _SAFE_INTENTS to skip the LLM safety judge on procedural answers, and the new /api/v1/admin/ops/latency-percentiles operator endpoint. The biggest single win — ~700 ms per call — came from replacing the pydantic-ai migration (yes, the one that landed earlier in this same window) with a thin structured_call helper.
  3. Five new ADRs — 0053 through 0057. ADR-0053 retroactively documents the Neo4j removal (16 000 LOC deleted in March, finally written up). 0054 codifies the intent classification cache with Redis backend + admin kill switch. 0055 declares the FAQ-corpus drift prevention policy that purges 10 hand-curated ZOL FAQs in favour of the corpus. 0056 ships the chat answer-shape typology — six shape patterns instead of sixty per-defect rules. 0057 introduces tenant-scoped prompt addendums + tenant-agnostic doctor-profile boost — the right-layer-of-abstraction pattern for new hospital onboarding.
  4. Voice Overlay Admin — Sprint E Waves A through D shipped end-to-end. Waves A/A.5 unified the voice routing rules and folded the medical taxonomy into the tenant overlay; Wave B exposed a read API; Wave C built the viewer UI; Wave D added the full edit / delete / inline-edit / YAML import/export surface plus the empty-tenant onboarding banner. Hospital admins can now CRUD their own voice overlays without touching YAML.
  5. OpenRouter removal completed. Phase 2 deleted the OpenRouter code paths after Phase 1 (last sprint) made gpt-4.1-mini the default.
  6. Data Quality A / B / C / D — code-quality discipline applied to the data layer. Layer A = nightly audit script + scheduler. Layer B = canonicalization + dedup gate at ingest, schema-enforced via migration 068 partial unique index. Layer C = Lorem ipsum sanitization + schedule-table extractor (structured JSON from ZOL Drupal tables). Layer D = post-completion verification. Driven by the 2026-05-12 audit that found 258 abandoned docs, 40 duplicate doctor profiles, 312 Lorem ipsum chunks sitting silently for weeks.
  7. Methodology v2.3 — Brainstorm Gate. Six-axis Decision-Cost Rubric and Pre-Mortem Block — the project rule that started enforcing itself this week after the pydantic-ai investment had to be undone.

1 · Pilot-Review Readiness — 5-phase doc/code re-alignment

The plan (docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md, commit b50dae96) was conceived as a single 4-working-day arc with five sequential phases. Each phase had a clear acceptance criterion; each phase output is now shipped as Docusaurus pages.

Phase 1 — Code↔doc audit (6 parallel drift registers)

Phase 1 produced six drift registers under docs/audits/, one per topic area, all run against master tip b50dae96. Read-only audits — no code touched.

AuditFileResult
Voice docs (21 pages)2026-05-09-voice-docs.md14 🔴 / 12 🟡 / 9 🟢
Architecture docs2026-05-09-architecture-docs.md9 🔴 / 17 🟡 / 10 🟢
RAG docs2026-05-09-rag-docs.md22 🔴 / 19 🟡 / 12 🟢
ADR register (51 ADRs)2026-05-09-adr-register.md9 🟡 / 3 🟠 / 2 🔴 / 4 ⚫
API surface (248 routes)2026-05-09-api-surface.md5 🔴 / 27 🟡 / 8 🟢
Frontend docs ↔ UI2026-05-09-frontend-docs.md8 🔴 / 11 🟡 / 6 🟢

The three load-bearing voice findings tell the story: local-setup.md Step 7 imported a deleted VoiceOrchestrator; conversational-intent.md documented a three-tier resolver that no longer exists; triple-defense.md described Layers 1+2 modules that were both deleted in the May 2 thin-pipeline cut. A new developer reading these pages would build a mental model of a system that hasn't existed for two months.

Phase 2 — Documentation excellence cascade (batches 2a → 2d-b6)

Phase 2 fixed the red entries in four sub-batches, each landing in its own commit:

  • 2a — Architectural ground truth (c9602830). ADR-0053 (Neo4j removal) backfilled — see §3 below. ADR-0017 amendment to mark Stage 2c as deprecated. Bibliography + ADR index rebuilt.
  • 2b — Cascade doc fixes (4a975dd8). BGE-M3 references replaced with text-embedding-3-large; legacy 8-stage voice pipeline references replaced with the thin pipeline; chunk-direct citation pipeline documented; BibTeX bibliography rendered correctly under Docusaurus 3.10.1's MDX v3.
  • 2b-prime — Close cascade gap + 4 ADRs ported (9b037b25). Four orphan ADRs (0050 Twilio + LiveKit SIP, 0051, 0052, 0053) ported into docs/decisions/. Two earlier ports amended.
  • 2c — Safety-critical revalidation (6f750c4b). Empirical fast-gate study (raised threshold 0.40 → 0.50 in Wave 2.C.1). Voice safety rewrite. Auth doc fix. Medical-content disclaimer reactivated on voice via post-LLM detection (0a67fa65).
  • 2d-b1 → 2d-b6 — Academic rewrite pass. Six tight batches across architecture, voice, RAG, safety/decisions, thesis/evaluation, and final batch. Cumulative: +27 new bibliography entries, ~50 page rewrites in an academic register, Mermaid theme upgrades for legibility.

Phase 3 — Voice stack compendium

docs/compendium/voice-stack.md — a 10 507-word transferable white paper covering the full voice pipeline from STT through dialogue management (and its deletion!) through TTS, tenant overlays, value framework, and Twilio LiveKit SIP. Designed to be read by an external engineer evaluating whether to license the voice stack for a phone-support / appointment-booking spinoff. The compendium is self-contained — every concept is defined inline; no Docusaurus internal links to pages that might churn.

Phase 4 — SOTA positioning matrix

docs/positioning/sota-matrix.md — 18 vendors (Retell, Vapi, Cognigy, OpenAI Realtime, Microsoft Healthcare Bot, Hyro, Twilio Engage, Voiceflow, Pinecone Healthcare, ElevenLabs Conversational, Daily.co Bots, LiveKit Cloud, etc.) × 8 axes (latency, hallucination rate, citation density, Dutch quality, multi-tenant isolation, EU residency, AI Act readiness, safety architecture). The matrix is opinionated — every cell has a citation or a "no public claim" marker, and the gap section is the part the user-facing pilot deck quotes.

Three differentiators that survived honest scrutiny:

  • Citation density per claim (Pattern C / D / E / F enforces per-bullet markers — see ADR-0056). No competitor inspected emits per-bullet inline [N] markers on health queries.
  • Multi-tenant safety architecture (medical-content disclaimer + crisis dispatch + AI Act §50(2) compliance enforced at the layer the LLM cannot bypass).
  • Hospital-agnostic Value Framework (intent × category affinity rerank). No competitor inspected applies category-typed rerank to LLM context selection.

Three honest gaps that survived:

  • STT quality on Dutch dialects still trails Deepgram NL-FL (current vendor) by ~3 WER points vs Azure Speech NL.
  • Audio-loop evaluation harness — voice eval is currently turn-text-based; competitor benchmarks include audio-fidelity loops we haven't built.
  • Per-tenant affinity overrides — the matrix is currently a module-level Python dict; multi-tenant production needs a DB-backed override table.

Phase 5 — Pilot review artifact bundle

docs/pilot-review/ — five reviewer-ready artifacts. Designed for a pilot-customer exec sponsor's pre-meeting reading window of ~20 minutes.

ArtifactLengthAudience
pilot-deck.md1 652 wordsExec sponsor (10 slides as markdown)
architecture-one-pager.md804 wordsTheir CTO (Mermaid layered stack + per-turn sequence)
demo-script.md1 922 wordsAnna Verstraeten persona; 7 worked scenarios sourced from GQ-001 / GQ-008 / GQ-017 / etc.
engineering-rigor.mdincludedTest-coverage matrix, ADR count, methodology v2.3 mention
q-and-a-prep.mdincludedAnticipated questions + honest answers

After Phase 5 commit fea42924, the initiative is COMPLETE. Total cost: ~7 000 LOC of documentation work (audits + cascades + compendium + matrix + bundle), zero production code changes from this phase alone.


2 · Autonomous latency wave (O3 / O4 / O5 / O10 / O12 / O16 + pydantic-ai swap)

The research report that paid for the wave

docs/2026-05-11-latency-opportunities-research.md (commit 01f59be0) is an end-to-end profile of the chat pipeline with per-stage p50 / p95 measurements. The report identified eight independent optimization opportunities (O3 through O16) and ranked them by (impact in ms) × (reversibility) ÷ (engineering cost). The wave executed the top-six during the night of 2026-05-11 → 12, autonomously, on a fresh branch.

What landed

IDCommitSavingPattern
O3aae3c2a4~120 ms medianExpand _SAFE_INTENTS to skip the LLM safety judge on three procedural answer classes (appointment_scheduling, navigation_or_practical_info, general_chitchat) that cannot semantically be medical advice
O47786a20c~30 ms medianSingleton AsyncOpenAI per (api_key, base_url, timeout) — eliminates the per-request HTTPS handshake to OpenAI's edge
O5a48efef9~5 ms × N intentslru_cache(maxsize=128) on build_rag_system_prompt(language, tenant) — the prompt is now built once per tenant per language, not once per query
O10238ef215DB load reductionBatched pipeline_telemetry INSERTs via executemany instead of one INSERT per turn
O12014bf61cCold-start removalPre-warm RAG prompt cache for nl/en/fr/it at startup — the first request after a deploy no longer pays the cold-build cost
O16876bb63fObservabilityGET /api/v1/admin/ops/latency-percentiles?days=N&channel=voice — p50/p95/p99 per stage; the dashboard that quantifies all the above

The pydantic-ai swap — biggest single win

Earlier in this same window, the project migrated 9 LLM JSON-output call sites to pydantic-ai (commits 942c2cbe59bd5727, batches A–F). The migration eliminated manual json.loads + Pydantic.model_validate chains, added native retry-on-validation-error, and removed the _parse_llm_json band-aid chain. It looked clean. It tested clean.

It also added ~700 ms per call on average — because pydantic-ai instantiates a full Agent[None, OutputModel] graph per request, including provider abstraction, tool registry, and OpenAI client setup. That setup cost dominated short prompts (intent classification, query rewrite) where the underlying LLM call was already only ~300 ms.

Commit b8d8da67 replaces pydantic-ai with a thin structured_call(prompt, output_model) helper that:

  • Calls OpenAI directly with response_format=json_object.
  • Round-trips the JSON through model_validate_json (Pydantic v2 native).
  • On ValidationError, makes one retry with the error message appended to the prompt.

Net effect: ~700 ms saved/call on every call site that was migrated. The five migration commits are kept in history — the lesson, codified as memory feedback-llm-mix-only-prompt-shrink-approved.md, is that every migration of an LLM-call helper layer must be measured end-to-end against the previous helper, not against the call-site code it replaced. The pydantic-ai code looked simpler; the wall-clock got worse.

The F2 misdiagnosis report (docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md) is the companion artifact — F2 (intent-prompt shrink) was queued as the next perf win after the pydantic-ai migration, then cancelled when the post-deploy benchmark showed intent_classification p50 = 2 454 ms, three times the research estimate. The latency wasn't in the prompt; it was in the helper. The wrong target was correctly not attacked.

Docker fixes that enabled the deploy

  • 408b0979 — Pip version specifiers must be quoted in shell-redirect contexts in Dockerfile.app. The unquoted form pydantic-ai==1.93.0 parsed as redirection to file 1.93.0 and silently installed nothing.
  • 1a9412d0 — Switched to pydantic-ai-slim[openai]==1.93.0 to drop the yanked mistralai transitive dependency. The full pydantic-ai package depended on it; the -slim variant doesn't.

These two fixes together unblocked the May 12 02:00 pilot deploy that carried the entire wave.


3 · Five new ADRs — 0053 through 0057

The pilot-review audit identified five accepted-but-undocumented decisions that had been operating as production code without an ADR. All five were written up this window. Each is summarised below; the full text lives in docs/decisions/.

ADR-0053 — Remove Neo4j, consolidate graph context onto PostgreSQL taxonomy

Retroactively documents the March 7 primary removal (commit d82b1592, ~16 000 LOC deleted). The motivation, in three lines: PostgreSQL with pgvector + the taxonomy schema provides everything Neo4j gave us (typed-node entity traversal via JOINs); the dual-datastore cost (operational, query-time, conceptual) outweighed the win; the architectural cleanup in May (commit 158d793) finalised the consolidation. Supersedes ADR-006 + ADR-0029. Amends ADR-0017 (Stage 2c graph search → deprecated), ADR-0028 (golden-page Neo4j seeding → no-op), ADR-0030 (entity-extraction now routes to PG, not Neo4j Cypher).

ADR-0054 — Intent Classification Cache

Adds a (tenant_id, normalized_query, language)IntentClassificationResult cache between request entry and the LLM classifier. Two backends:

  • Memory (default, per-worker LRU + TTL) — single-worker pilot config.
  • Redis (opt-in INTENT_CACHE_BACKEND=redis) — shared across worker replicas via the existing app.db.redis connection pool.

Poisoning guards: write only when confidence >= 0.85 and intent != UNKNOWN. The Redis backend's cache survives container restarts — the compensating control is the operator "Clear Cache" button on PlatformSettingsPage which wipes both this cache and the semantic_query_cache in a single click (POST /api/v1/settings/cache/clear).

Cache-hit removes ~2 300 ms from per-turn latency. Stacks with the semantic_query_cache: if both hit, the full pipeline collapses to ~50 ms.

ADR-0055 — FAQ-Corpus Drift Prevention (the FAQ purge)

The 2026-05-12 audit of the 10 ZOL-specific FAQ entries against the live pilot corpus found:

VerdictCount
Directly contradicted by corpus3
Incomplete2
Unverified (no corpus evidence)3
Aligned2

Drift typology: every entry making list claims, service routings, or counter-corpus assertions had drifted within ~3 months. The two aligned entries were single immutable facts (phone, address).

Phase 1 (executed, commit a9820c3f): purge all 10 ZOL FAQs from zol.yaml. Preserve only the four safety/policy entries (crisis_suicide_ideation, three emergency dispatch rules) plus the overnacht_ambiguity clarification.

Phase 2 (planned): nightly audit_faq_corpus.py cron — for each surviving entry, retrieve top-k chunks via RAG, ask GPT-4.1-mini to verdict against the FAQ answer, alert on CONTRADICTED.

Phase 3 (planned): demand-driven promotion — observe conversation_messages → cluster → score against RAG quality → auto-draft FAQ entries on low-quality × high-demand clusters. FAQs become fresh by construction instead of hand-authored guesses against last quarter's mental model.

The pharmacy incident that triggered the audit cannot recur — voice and chat now answer identically because both go through the same RAG path, not through the divergent FAQ-then-RAG cascade.

ADR-0056 — Chat Answer-Shape Typology (six patterns, not sixty rules)

After the competitor at zolcase.novation.website/slim-zoeken produced visibly more scannable answers to "Wat is een gastroscopie?" the project's CHAT_BOLD_LEDE_RULE (added earlier the same day, commit b88fc086) was expanded into a typology:

PatternWhenShape
A — POINT-FACTSingle discrete answer (phone, address, hour)1-2 sentences + 1 citation
B — STEP-BY-STEP"How do I X?" proceduralNumbered list, citation per step
C — ATTRIBUTE-LISTSingle topic, 3+ distinct attributes1-2 intro paragraphs + bullets with bold labels (**Duur:** ... [3])
D — MULTI-ENTITYQuestion covers multiple entitiesOne bold-lede paragraph per entity (the former CHAT_BOLD_LEDE_RULE)
E — COMPARISON"X vs Y", "verschil tussen…"Brief intro + two parallel bold-labeled sections
F — DECISION-TREE"Wanneer moet ik X?" / triageConditional bullets: Bij ernstige…: action

The rule applies to chat only — voice continues as natural prose (bullets read awkwardly aloud). Voice path verified unaffected at rag_service.py:4641 injection site.

The architectural principle: typology beats rule-per-defect. Adding one rule per visible problem produces a prompt that grows linearly with defect history; a typology compresses future maintenance because new defects fall under an existing pattern.

ADR-0057 — Tenant-Scoped Prompt Addendums + Tenant-Agnostic Doctor-Profile Boost

When asked "Is er raadpleging voor Dr. Matthias Dupont op woensdag?" the system answered "geen raadpleging" while the competitor answered "Ja, woensdagvoormiddag." The canonical doctor profile in the corpus had the truth (markdown table cell VM × WO = RP2w), but two compounding causes inverted it: the LLM couldn't parse the ZOL schedule format, and retrieval thematic co-retrieval (cardiology page, Arts Anders interview page) diluted the canonical chunk's signal.

The fix lands at two layers:

  • Layer 1 — _TENANT_CHAT_ADDENDUMS registry. {slug: addendum_text} mapping injected into the chat system prompt after the answer-shape rules. Currently {"zol": ZOL_DOCTOR_SCHEDULE_RULE} — the rule contains the abbreviation legend + worked example + counter-example. Tomorrow's new tenant onboards without touching the shared template.
  • Layer 2 — tenant-agnostic _boost_doctor_profile in search_service.py. _DOCTOR_NAME_PATTERN regex extracts "Dr. <Name>" across nl/en/fr/it. Chunk scores get a 1.50× boost when the document title starts with dr. <name>. The 1.50 calibration sits between the campus boost (1.10) and the conversation-context boost (1.40); strong enough to pull the canonical profile from rank 2-3 to rank 1, not so strong it crushes other signals.

The architectural distinction codified in this ADR is now project policy:

Fix shapeIsolation surfaceExample
Tenant-specific data format_TENANT_CHAT_ADDENDUMS[slug]ZOL schedule table
Universal naming convention_boost_* method, tenant-agnosticDr. <Name> title-prefix boost
Universal answer shapeCHAT_ANSWER_SHAPE_RULESADR-0056
Tenant-specific FAQYAML in tenant overlayADR-0055 surviving entries

4 · Voice Overlay Admin — Sprint E Waves A through D

The voice overlay system (tenant phonetic recovery + medical taxonomy + crisis dispatch + STT rules) was previously YAML-edited by engineers. This sprint built the full admin surface — hospital admins can now CRUD their own overlays from the /admin UI.

WaveWhatCommits
AUnified voice routing rules (one rule per slug, language-aware)9e4e9ab5
A.5Medical taxonomy → tenant overlay (move from hardcoded Python dict to YAML)c9d994c2
A pt 2Pre-LLM crisis/emergency dispatch (regex pre-filter on the voice path, latency-zero)f85726d7
BVoice-overlay read API (GET /api/v1/admin/voice-overlay/{slug})e3297cd0
B testPin crisis_suicide_ideation in Wave B spec contractb7c91292
B refactorExtract list_known_slugs to registryf0770fce
CVoice-overlay viewer UI0e82ba5d
C specZol-default + missing test coverage500f2ade
C polishA11y + type-safety4382ee80
D backendVoice-overlay write API + YAML import/export1130f103
D1 polishUP035 + dead async helper drop + 404 race subclass + audit capc06aa7df
D2 polishDefault taxonomy count + last-modified + a11y90b3c698
D3Routing-rule edit modal + delete flow9b6ae389
D3 reviewBackdrop, regex neutral state, focus trap, narrowed types4ce002b1
D4Taxonomy inline-edit + add-new row324f7bfa
D4 reviewKeyboard tests + autofocus on edite9ea4bdc
D5YAML import/export tab5214b88f
D5 reviewBlob URL cleanup + test spy restoref1835045
D polishEmpty-tenant onboarding banner2ba8956a
D foundationMutation hooks + Wave C UX polishf570c687

The Wave D inline-edit pattern (commit 324f7bfa) is worth calling out — the row goes into edit mode in-place, ESC reverts, Enter commits, and there's an explicit + Add new row at the bottom. No modal for taxonomy edits. The routing-rule edit, by contrast, does use a modal because regex validation needs a preview before save.

The empty-tenant onboarding banner (2ba8956a) — final Wave D commit — handles the case where a hospital admin opens the page for a tenant whose YAML has not yet been seeded. Instead of an empty grid, they see a "Start from ZOL defaults" CTA that copies the canonical entries into their tenant slug as a starting point.


5 · Voice persona v1 + tenant-driven greeting

Commit c61ad16b introduces per-tenant voice persona — name, voice ID, greeting text, fallback line, available languages — all served from a new public endpoint (GET /api/v1/voice/persona/{tenant_slug}) consumed by the LiveKit voice agent at SIP-bind time.

Today, the ZOL persona reads:

"Goedendag, u spreekt met Zoë, de virtuele assistent van het Ziekenhuis Oost-Limburg. Hoe kan ik u helpen?"

— spoken by ElevenLabs voice ID pwMBnCuw3J0IFGFnFEFb at speed 1.05. The persona payload, the greeting, and the speed cap are all DB-backed now; an admin UI changes the persona's name from "Zoë" to anything else without a deploy.

Companion commits:

  • 997c8db5 — minimal hospital-context preamble + Q1 conv_id observability. The voice LLM no longer prefixes every turn with the full hospital description; the persona endpoint carries the context once at greeting time.
  • 62f18695 — SIP-bound conversation_id adopted on first turn. Fixes the silent-failure regression where SIP-binding produced a stable conversation_id but the first WS turn still generated a fresh one, breaking multi-turn memory.

6 · Seven RAG fixes (T1–T7) + F1 — MedChat 50-Q benchmark moves 87.5 → 91.1

The fixes from the 2026-05-11 comparison-RCA sprint (docs/2026-05-11-comparison-rca-fixes.md) shipped across this window. The benchmark moved from 87.5 avg / 3 wins / 21 losses to 91.1 avg / 23 wins / 7 losses / 0 P0.

IDCommitWhatImpact
T17ad18ff9Intent-classifier spoed routing + drop hardcoded "ZOL" from prompt + test consolidationQ05, Q31, Q39
T2e5d93ac8Race-guard against duplicate fallback row + language-distinctive test assertsQ-fallback-race
T33dca6bef30s streaming timeout writes localized fallback — closes P0 "stuck on pending"Q-pending-stuck
T4eb389ae2Q32 maagonderzoek compound regex + aanmelden comment fix + cache-stickiness noteQ32
T6c7bc1fceIntent-aware top-K + comprehensive-intent prompt — closes 12 coverage lossesQ-coverage × 12
T7819c704cQuery-rewrite expansion for ambiguous head terms — closes 3 retrieval missesQ-ambiguous × 3
F100e13367Inline [N] citation markers for chat channel — closes rubric tiesAll chat
Shaperfd081160Wire medical_taxonomy into doctor-enumeration — Q40 cardiology mappingQ40

T5 was deferred — its scope expanded mid-implementation and was queued for the next planning window.


7 · Diagnostic v2 hardening (8 follow-up fixes)

The v2 diagnostic landed in May 4-9. Operator use revealed a string of edge cases:

CommitFix
8494b483json-repair fallback for malformed LLM output (close conv 7509b0e0 line-725 regression)
f8542a34STT-mishearing awareness + retrieval verification + per-claim grounding
e650385bOperator feedback loop + diagnostic_accuracy_rate metric on Operations dashboard
45c349a9Install json-repair in app image (pyproject dep not in base)
df06ea33Widen full-list triggers + inject synthetic department-doctors chunk (close conv a60d3f30 regression)
e9925d4aBump LLM timeout to 240s (close conv a60d3f30 hang)
62db818cMeta-contract pinning timeout vs max_tokens consistency
f537713bv1 fallback uses json-repair + v2 schema errors logged inline
94f2373aRelax DiagnosisV2 schema to match what gpt-5.2 actually emits
1ef7bbd9Sanitize LLM output before Pydantic to absorb model variability
93270328Coalesce overall_score in log line (pyright cleanup)
50306ef7Migrate to pydantic-ai for native retry-on-validation-error (later reverted via b8d8da67 — see §2)
9000c022Scale max_tokens with turn count + correct error code (close conv 7509b0e0 regression)
fa6a247eBump max_tokens budget for 16+ turn calls (close conv 0745d984 regression)
3495fe8fFire-and-forget POST + polling GET — close Cloudflare 100s gateway-timeout regression
3be59d075 improvements for excellence — close the v2-investigation gaps
d215a0a08 improvements for operator-language match, grounding visibility, voice telemetry attribution + resume/watch/category breakdown

The Cloudflare 100s timeout fix (3495fe8f) is the most architecturally interesting — long v2 diagnoses on 16+ turn conversations were hitting Cloudflare's 100-second proxy timeout. The fix converts the synchronous POST /diagnose into a fire-and-forget POST that returns immediately with a job ID, plus a polling GET /diagnose/{id} that the UI hits every 2 seconds until the result is ready. Cloudflare sees only short requests; the actual LLM work runs uncapped on the backend.


8 · Intent Cache (Redis-backed, cross-worker) + admin kill switch

Spec'd by ADR-0054 (§3 above), shipped across three commits:

  • 5fed3bef — In-memory intent cache (Experiment C). Per-worker OrderedDict LRU + TTL. Proves the concept under single-worker pilot config.
  • 17a7f56e — Redis-backed cross-worker backend + admin kill switch. Shared across worker replicas via the existing app.db.redis connection pool. JSON round-trip through Pydantic. SCAN-targeted intent_cache: prefix.
  • f9a335c4 — UI "Clear Cache" button now wipes intent cache too. Single click recovers from cache poisoning across both caches.

Verification on pilot zol-rag-app:f9a335c4 (2026-05-12):

curl -X POST .../api/v1/query -d '{"query":"Hoeveel ziekenhuisbedden heeft het ZOL?","channel":"web"}'
redis-cli --scan --pattern "intent_cache:*"
# Returns: intent_cache:|nl|hoeveel ziekenhuisbedden heeft het zol

Test coverage: 12 unit tests on MemoryIntentCache, 13 integration tests on RedisIntentCache against a Redis 7 testcontainer, 4 integration tests on the kill-switch endpoint.


9 · Data Quality A / B / C / D — code-quality discipline at the data layer

The 2026-05-12 audit (docs/audits/2026-05-12-data-quality.md — not in scope this release, referenced) found:

IssueCount
Abandoned docs (pending > 48 h)258
Duplicate doctor profiles40
Lorem ipsum chunks in production corpus312

All sitting silently for weeks. None caught by pytest. None caught by deploy.

The fix is four parallel gates, modeled on the four code-side gates (ruff / pyright / tsc / eslint):

Layer A — Daily observability (commits e43de39b7f2789c5d2567b7b)

backend/scripts/audit_data_quality.py (~325 lines, single file, no new deps). Nightly cron at 03:30 UTC (30 min after the 03:00 ingest pass). Emits markdown report to tests/evaluation/results/data-quality-<YYYYMMDD>.md. Exit codes: 0 clean / 1 warning / 2 alert. Wired into APScheduler in d2567b7b.

Metrics emitted:

  • Pending-age p50/p95 (soft >24 h, hard >48 h)
  • Failure-reason histogram (soft >25, hard >100), with failed_real vs intentional_softdeletes split (per 7f2789c5 refinement)
  • Duplicate-title count (alert on ≥1; should always be 0 with Layer B)
  • Per-tenant doctor coverage (% with schedule, % with Lorem ipsum)
  • Chunk-quality stats (short/empty/lorem)

Layer B — Canonicalization + dedup gate (commits e43de39b + migration 068)

Two complementary mechanisms at document_service.py:1262+:

  1. The crawler already extracted the <link rel="canonical"> value but discarded it. Now stored in doc_metadata.canonical_url.
  2. Pre-insert query: before INSERT, check for an existing completed doc whose normalized title (with | <brand> suffix stripped) matches. If found, skip.

Migration 068 (commit d2567b7b) adds a partial unique index on (tenant_id, normalized_title) WHERE status='completed'. Schema-level enforcement: even if application-layer dedup is bypassed (manual insert, race, future code path), the DB rejects the duplicate. Partial because pending/processing/failed rows transit through transient duplicate states by design.

Commit 26ba1562 adds the empty-normalized-title guard — Layer B rejects ingest of docs whose title normalizes to empty string (a common drift mode for very-short pages).

Layer C — Sanitization + extractor (commits cfcb4e2b + 7f2789c5)

  • Lorem ipsum sanitizer (cfcb4e2b) — one-shot script that walks existing chunks and strips template Lorem ipsum text. Idempotent. Backfill-safe.
  • Schedule-table extractor (7f2789c5) — extract_consultation_schedule() in data_quality.py. Parses ZOL Drupal schedule tables into structured JSON. Empty cells omitted; all-empty tables return None. Future cell codes (RP3w, etc.) pass through verbatim — no fixed vocabulary constraint. Hooked into document_service.py:1314+ so every new ingested doc gets its schedule extracted and stored at doc_metadata.consultation_schedule. Existing docs backfilled by scripts/backfill_consultation_schedule.py.

The schedule extractor is the structured-data fallback to ADR-0057's ZOL_DOCTOR_SCHEDULE_RULE prompt rule. The prompt rule teaches the LLM to parse markdown when consumers haven't been upgraded yet; the extractor makes the structured JSON available for any future code path that wants deterministic schedule answers without LLM table parsing.

Layer D — Post-completion verification (e43de39b)

Audits run post-ingest cycle and verify what landed matches what was crawled. Layer D doesn't have its own commit beyond e43de39b because the verification logic lives inside Layer A's audit script — Layer D is the what we check dimension, Layer A is the when we check it dimension.

Companion fixes

  • 357725cbd.metadata column (DB) not d.doc_metadata (Python attribute). Backfill SQL had the wrong column name.
  • e03ff4cb — pin department-overview format → None decision. Department-overview pages have no consultation schedule by design; the extractor must return None, not raise.
  • f0a577aa + 6c3ddfb5 — multi-match handling in find_document_by_source_url. Layer B's dedup needs to behave deterministically when the same canonical URL surfaces in two different ingest passes.
  • 9fa6fa16PERCENTILE_DISC for timestamp percentiles (was PERCENTILE_CONT). PERCENTILE_CONT interpolates between two timestamps, which produces a fractional-microsecond value that PostgreSQL can't cast back to timestamp with time zone.

10 · Value Dashboard — /value/trend endpoint + live today aggregation

The previous release shipped the dashboard with the daily-volume chart receiving an empty array — the data point existed in daily_tenant_metrics but no API surface yet. This release closes the gap:

  • 1bc39819GET /api/v1/admin/value/trend?days=N&channel=X returns a daily series for the chart. Sub-50 ms with the existing partial index.
  • fea7bd65 — Live today aggregation. The dashboard's today row was previously stale until the next 02:30 UTC nightly aggregator. The fix computes today's row on-demand by querying conversations directly (the same query the nightly job runs, just bounded to today's date). Stacks with the 60-second cache so dashboard reloads are cheap.

The "today" gap was the most-requested operator complaint from the v1.0 release window.


11 · Chat UI competitor-parity polish

Commit b88fc086 ships a visual refresh of the chat surface to match the competitor's polish level:

  • Bold lede — the first sentence of every multi-entity answer is bolded (paired with the Pattern D rule from ADR-0056).
  • Chip row — common follow-up questions render as clickable chips below the answer (re-uses the existing follow-up suggestion logic; previously rendered as a separate "You might also ask…" section).
  • Navy hero — the welcome state's hero panel now uses the ZOL navy accent color rather than the previous neutral white.
  • tel: links — every phone number in every answer is now a <a href="tel:..."> — taps to call on mobile, click to call on desktop with a supported handler.

The tel: link fix is the most operationally impactful — previously, calling the hospital from chat required copy-paste. Now it's one tap.


12 · Voice fixes (8 user-facing improvements)

Smaller-bore but valuable:

CommitFix
e24d4d84Voice orchestrator query sanitization — extract core question from STT noise/filler before tool dispatch
7cc1bd4dBroaden phone-number FAQ regex (close T10 "algemene nummer" miss)
93a01b29TTS-friendly decimal currency normalisation — 0,4391nul comma vier vier euro per kilowattuur
5ca32392Require explicit goodbye for end_call (close T11 soft-farewell regression)
59b6cbb7Explicit transfer-verb gate — fixes never-firing handoff
b7f7f75ePersist channel='voice' on fallback conversation create — was defaulting to web
337d6ef0Ellipsis joiner no longer truncates phone numbers + corrected fastmcp diagnosis
8953764eGZip middleware + voice-channel max_tokens cap (300) — caps verbose voice answers at ~30 s read time

The max_tokens=300 cap is worth dwelling on: voice answers longer than ~300 tokens take more than 30 seconds to read aloud, by which time the caller has usually interrupted. The cap forces the LLM to prioritise — and the answer-shape typology (ADR-0056) is chat-only specifically so voice answers stay tight and prose-shaped.


13 · OpenRouter removal (Phase 2)

Commit 49225042 deletes the OpenRouter code paths after Phase 1 last sprint made gpt-4.1-mini the default LLM. The Phase 2 removal:

  • Drops the OPENROUTER_API_KEY setting.
  • Removes the _get_openai_or_openrouter_client() factory.
  • Removes the openrouter_default_model configuration knob.
  • Removes the 2 test files that pinned OpenRouter routing behaviour.

After Phase 2, every LLM call in the system goes through OpenAI directly. The gpt-4.1 / gpt-4.1-mini / gpt-5.2 model identifiers are now OpenAI model IDs; no provider abstraction layer.

This is the prelude to ADR-0058 (planned next window) which formalises the per-LLM-call model routing policy based on the May 11 LLM-mix proposal (docs/2026-05-11-llm-mix-proposal.md). The current state: gpt-4.1 for high-stakes (medical-content judging, dialogue management, v2 diagnosis), gpt-4.1-mini for chat answer generation + most other call sites, gpt-4.1-nano reserved for the cost-critical safety gate.


14 · Telemetry split — first_token_ms + generation_total

Commit 2fdf69f3 splits the generation-latency telemetry into two columns:

  • first_token_ms — time from prompt send to first streamed token. This is what the user perceives as latency.
  • generation_total — total streaming time including the final token.

The split lets the operator answer the question "did the user feel this turn as slow, or was the answer just long?" — important because long-but-fast-first-token answers (Pattern C bulleted procedure explanations) feel responsive even when the total generation is multi-second.


15 · Methodology v2.3 — Decision-Cost Rubric + Brainstorm Gate

Commit c2dce94d lands the v2.3 amendment in this project's CLAUDE.md (referencing /Users/soft4u/Development/s4u-methodology/docs/methodology.md §2.7 + §3.1), and ships the Docusaurus showcase at docs/methodology/decision-cost-rubric.md.

The rubric is six axes:

  1. Latency — is the proposed change in the per-turn path? How much does it add?
  2. Dependency surface — new package? Transitive deps? Yanked deps?
  3. Debuggability — how does a future engineer track a failure through this layer?
  4. Reversibility — how long does it take to revert if measurement says it's worse?
  5. Blast radius — single call site? Replicated across N call sites?
  6. Alternative — what's the cheapest thing that solves 80% of the problem without this change?

The Brainstorm Gate (methodology §3.1, project CLAUDE.md): when a proposal triggers any of [new dependency, replicates 3+ sites, >100 ms latency change, public API/schema modification, >2 h estimated work], the proposer must emit a Pre-Mortem Block addressing all six axes plus "Strongest risk", "What would change my mind", "Confidence" BEFORE proceeding to spec / plan / code.

The load-bearing case study, captured in the Docusaurus showcase, is the pydantic-ai migration from earlier this same window (commits 942c2cbe59bd5727, reverted by b8d8da67 for the ~700 ms perf hit described in §2). If the Brainstorm Gate had been enforced at proposal time, the "latency" axis would have demanded an end-to-end measurement against the existing helper before the 8 call-site migrations landed. The measurement happened post-deploy. The investment was reversed. The rubric is the operational lesson.

Memory feedback_methodology_v2_3_brainstorm_gate.md mirrors the canon with project-specific notes. From now on the user is empowered to interrupt with "pre-mortem first" and restart any qualifying turn.


16 · Docusaurus 3.10.1 upgrade + 9-persona voice scenario library

Commit 9f84498f lands three coordinated changes in one PR so the docs build and the new persona library deploy in one step:

  • Docusaurus 3.9.2 → 3.10.1. MDX v3 ships in 3.10 with stricter parsing. Migration fixes: 11 <!-- ... --> HTML comments rewritten to {/* ... */} across 10 docs; markdown.format: 'detect' in config to keep .md files in commonmark mode (otherwise every {#heading-id} heading marker fails acorn parsing); 62 bibliography heading texts stripped of leading @ (was parsed as JSX component reference).
  • 9-persona voice scenario library — under docs/voice/test-scenarios/. Nine personas (Anna the polite caller, Bram the impatient one, Carla the elderly slow speaker, etc.) × 8-12 turns each. Each persona is [caller script | observable backend events] paired to drive a deterministic smoke test. The library is the input for the next iteration of the voice golden eval harness.
  • Voice golden-eval harness (tests/evaluation/run_voice_evaluation.py with --use-pilot HTTP mode) — the runner now invokes the deployed pilot's WS endpoint and validates each turn against the persona script. The mode landed in commit cb773301. Persona content was tuned to 80/82 turns passing against pilot in f14a7c3b (Sprint E Task 4).

The Mermaid theme upgrade (commit aad8be91) makes sequenceDiagram labels readable at default zoom — previously the sequence-diagram label text was dark-on-dark.


17 · Q5 RCA — prompt-template forgery + dedup data-loss (iceberg debug)

User asked "why is Q5 still poorly answered?" — Q5 is "Zijn er laadpalen voor elektrische wagens?" — and the answer was Class C ("not explicitly mentioned"). Investigation surfaced two compounding bugs masking each other since 2026-05-12:

Bug 1 — Prompt-template anchoring (commit 2eecccaf). Cluster 2's "Class B" few-shot prompt used Q5 verbatim as the worked example. The LLM was parroting the example's hedged answer regardless of retrieved chunks — the "Op campus X zijn er laadpunten beschikbaar" wording was forged from the prompt example, not grounded in any chunk. Replaced the worked example with a synthetic uitleendienst voor rolstoelen query + added an explicit "Class A wins over Class B when specifics are present" precedence rule.

Bug 2 — Dedup misfire (commits 6eed9be8 + 82b54e12). A one-shot dedup pass on 2026-05-12 marked 155 documents as superseded-by-dedup-2026-05-12-broader using an oldest-wins heuristic. The Parkeerinformatie page with the gold laadpalen content (6 chunks incl. €0,4391/kWh + 11 kW + per-campus list) was demoted to failed; a 3-chunk stub kept as the "winner". When bug 1's prompt example was removed, the LLM stopped forging answers and exposed bug 2.

Audit-before-fix discipline (the load-bearing methodology call). Instead of mass-flipping all 155 superseded docs, I wrote backend/scripts/audit_dedup_broader_2026_05_12.py — read-only, generates a markdown report (backend/tests/evaluation/results/dedup-broader-audit-20260513.md) — and discovered only 17 of 155 (11%) had a meaningfully wrong winner. Below the 50% Pre-Mortem bulk-fix threshold; surgical fix was the right scope. The companion flip_dedup_broader_misfires_2026_05_13.py atomically flipped those 17 pairs (winner→failed FIRST so migration 068's partial unique index never observes two completed docs with the same normalized title). A second audit covered the 275 superseded-by-source-url-dedup-2026-05-13 rows (commits audit_dedup_source_url_2026_05_13.py + flip_dedup_source_url_misfires_2026_05_13.py); 7 more pairs flipped where the kept winner had zero chunks and the loser had ≥1 chunk + ≥200 chars.

Total: 24 docs restored. Q5's post-fix answer now scores 95/100 with full tariff + wattage + per-campus laadpunt counts — better than the 95/100 reference. Verzakking-en-incontinentiekliniek, Stereotactische radiochirurgie, Menopauze, and 17 others recovered in the same flip.

Memory feedback_dedup_heuristic_lessons.md captures the two-rule lesson: (1) never use oldest-wins for a periodically re-crawled corpus — the newer crawl is usually more comprehensive; (2) always pre-filter winner candidates to chunk_count > 0 — an empty doc is never a defensible canonical version of any non-empty alternative. The defensive rule is independent of the primary heuristic and is what should have prevented all 24 misfires.


18 · 50-Q benchmark post-Q5-fix — avg 93.9, dead-heat with ZOL Slim Zoeken

Post-fix 50-question MedChat-vs-ZOL-Slim-Zoeken benchmark (commit 55c5b727, report at backend/docs/2026-05-13-bench-post-dedup-flip.md + raw JSON):

Metric2026-05-11 baseline2026-05-12 post-perf2026-05-13 post-fix
MedChat avg87.590.093.9
ZOL Slim Zoeken avg(lower)(lower)93.8
MedChat <75manyseveral0
MedChat P0many0 (cache poisoning)0
Wins / Losses / Ties23 / 7 / 1922 / 6 / 225 / 9 / 36

The win-count metric collapsed (22 → 5) because both systems now reliably score 95 — 36 of 50 questions are 95/95 ties. The avg score is the right read: MedChat moved +3.9 points overnight. The remaining 9 losses are 90/95 or 85/95 margins — judge-side nitpicks (Q1 wants "check website" framing; Q4 has raw [N] markers; Q35 conflates scan time vs scan+wait time). None worth chasing at avg 93.9 with 0 P0s.


19 · Voice eval recovery — corpus restoration flowed through to phone channel

The voice golden eval on the post-dedup-fix pilot moved from 1/9 personas / 80.5% turn-pass (this morning, pre-fix) to 3/9 personas / 87.8% turn-pass (post-fix, 7m07s wall time, label post-dedup-flip-2026-05-13-55c5b727).

★ Architectural confirmation — the voice channel inherited the corpus-restoration wins from the chat-side dedup flips. Three personas (Sofie Peters 10/10, Mevrouw Maeyens 8/8, Christophe Lefebvre 10/10) that were previously failing now pass cleanly. The thin pipeline + shared RAGService mean a corpus win flows through to every channel for free — no per-channel content fix needed. This is the right architectural sign.


20 · Voice Waves 0 → 2 — disclaimer-once, billing intent, pharmacist deflect, language fidelity, latency budgets

After the corpus-restoration recovery still showed 6 voice-eval failures, root-cause analysis broke them into 4 categories. Wave 3 (intent prompt shrink) was deliberately skipped per the 2026-05-11 F2 misdiagnosis report — the prompt is already ~3k tokens, the actual shrink win is <100ms not 500ms, and Wave 2b's budget bumps already absorbed the latency-overrun failures.

Wave 0 — Disclaimer once-per-conversation (commit f062690e)

User feedback: "the AI says 'This is not a medical recommendation' too often. It should only say that once per conversation, because it gets annoying."

The voice answer-shaper auto-detected medical content on every turn and prepended the per-language disclaimer. For multi-turn calls the prefix became repetitive spoken padding that diluted its own credibility.

New module app/services/voice/disclaimer_tracker.py: Redis-backed flag keyed by conversation_id with 1-hour TTL. was_emitted(conv_id) -> bool and mark_emitted(conv_id) -> None. On Redis failure both return False (disclaimer falls back to firing — safer default; never silently swallow a medical disclaimer on infra flake). VoiceAnswerShaper.shape() got a new suppress_disclaimer: bool parameter and exposes diag["disclaimer_prepended"]. The orchestrator's _execute_tool plumbs conversation_id through, consults the tracker before shape(), and marks-emitted after.

R3 contract test pins the once-per-conversation invariant: two consecutive turns on the same conv must emit AT MOST ONCE, regardless of medical content on the second turn.

Chat channel is unaffected — safety_service.append_disclaimer is already a no-op (frontend renders the disclaimer below every chat answer). The over-firing surface was voice-only.

Wave 1a — BILLING_INQUIRY intent + safety whitelist (commit e9cf4717)

Voice-eval persona_07 (Christelijke Mutualiteit caller) failed 4 turns: T1-T3 routed to "medisch archief" or RIZIV instead of the facturatie helpdesk, T5 returned the wrong (facturatie-direct) number when the caller asked for the algemene number. No BILLING_INQUIRY intent existed; the queries fell through to general RAG.

Mirrors the Cluster 1 (institutional_treatment_info) + Cluster 3 (doctor_schedule_query) architecture:

  1. UserIntent.BILLING_INQUIRY — new enum member + drift-pin test updated.
  2. detect_billing_inquiry — pre-LLM regex gate with two paths: AND-logic (payer mention AND dossier/code reference) + solo allow-list (factuur, remgeld, factuurnummer). Conservative — "verzekering" or "mutualiteit" alone, as in "wordt mijn behandeling vergoed door mijn verzekering?", stays as general RAG.
  3. _SAFE_INTENTS whitelist — billing is institutional info, LLM safety judge bypassed.
  4. get_billing_inquiry_response(language, ctx) — tenant-agnostic per-language routing template (nl/en/fr/it). Always contains the literal "facturatie" token so persona contracts match.
  5. rag_service Stage-2d short-circuit — emits the template directly without retrieving from the corpus.

17 unit tests pin the contract: 6 AND-path positives (incl. persona_07 verbatim), 4 solo allow-list, 7 negative cases (generic insurance coverage queries, medical advice, doctor lookup, navigation, empty/trivial inputs).

Wave 1b — Pharmacist-context dosing deflection (commit edf72b34)

Voice-eval persona_08 T2 (Apotheek Maaseik) failed: a pharmacist asking about bisoprolol dosing for an elderly post-TIA patient. The historical "TIA" mention triggered emergency_solo_keywords and got the 112-dispatch template — wrong for chronic-care dosing.

A safety-credibility issue, not a numerical-failure issue: if pharmacists learn the bot panics on every medication question, they stop trusting any of its routing.

New routing-category pharmacist_deflect at band 1 (between crisis=0 and emergency=2). Crisis still pre-empts (a suicidal pharmacist asking about an overdose dose IS a crisis case first); pharmacist_deflect pre-empts emergency-keyword matching for non-acute professional queries.

New YAML rule pharmacist_dosing_deflect with AND-logic in nl/en/fr/it: pharmacist signal (apotheek/voorschrift/prescriptie/pharmacy/prescription) AND dosing signal (dosis/dosage/mg/microgram/interactie). Both required — bare "apotheek" stays as general RAG. Deflect response routes to "voorschrijvende arts" / "huisarts" / helpdesk.

6 unit tests — including 1 load-bearing safety pin that a patient saying "ik denk dat mijn vader een TIA heeft" must STILL hit emergency dispatch with "112". The deflect rule cannot weaken the safety net for actual patient callers.

Wave 2a — Filler-template language fidelity (commit 0e5343f0)

Voice-eval persona_06 T3: an English caller asking "We have CM — Christelijke Mutualiteit — through his employer. Is that accepted at ZOL?" got back "Ik zoek dat even voor u op." — Dutch filler. The system prompt's Dutch examples were demonstrably out-prioritising the single trailing "match the caller's language" instruction.

build_voice_llm_orchestrator_system_prompt(ctx, language="nl") got a new language parameter. Per-language # REPLY LANGUAGE: <lang> hint blocks are prepended as the FIRST content in the system prompt — establishing the reply language before the Dutch examples downstream. EN-specific block explicitly says "NEVER respond in Dutch when the caller spoke English, even briefly." — the generic instruction wasn't enough. nl/en/fr/it parallel coverage per feedback-multi-language-voice-coverage.md.

7 unit tests, including a position pin (hint must appear BEFORE "You are the AI" identity line), unknown-language fallback to English (safer global default), and the load-bearing "NEVER respond in Dutch" phrase pin for persona_06 T3.

Wave 2b — Latency budget calibration + test phrase refinement (commit 1f4c15f0)

Three latency-over-budget voice-eval failures where content was correct but timing was 5-15% over the budget (persona_02 T6 12509/12000 ms, persona_08 T6 13125/12000 ms, persona_09 T4 8904/8000 ms). Real LLM-jitter on multi-tool / safety-judge turns. Pre-Wave-1 budgets pre-dated the new Redis lookup + extra dispatcher categories. Bumped 16-25% to absorb jitter without making checks toothless.

Plus persona_10 T5 expected_phrases: added "kan ik niet" to the any-of list. The LLM emitted natural Dutch syntax "Het mobiele nummer ... kan ik niet geven" — the literal grep for "kan niet" missed it because of the intervening "ik".

No application-code changes — pure test-data tuning.

Wave 3 — SKIPPED with explicit rationale

Per the 2026-05-11 F2 misdiagnosis report (docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md): the intent prompt is already ~3k tokens (not the 11k the original premise claimed). A shrink would save ~50-100 ms of LLM prefill, not the 500 ms originally projected. The 3 latency-over-budget failures are already absorbed by Wave 2b. Pursuing the shrink would create churn in a file that lands every intent regression test, for a benefit the test suite no longer needs. The real high-value latency lever — O1 parallel intent + retrieval — is a multi-hour architectural change tracked separately.


21 · ADR-0059 — tenant + language extension plan for Cluster 1+3 (proposed)

Commit d632afce ships ADR-0059 (docs/ADR/0059-tenant-and-language-extension-cluster-1-and-3.md — Proposed) — explicit plan for extending the Cluster 1 (institutional_treatment_info) and Cluster 3 (doctor_schedule_query) work to fully multi-tenant + multi-lingual. Three-axis audit (A = tenant data, B = corpus format, C = language coverage) found 7 specific gaps. Three-phase migration plan with effort estimates:

  • Phase 1 (~half day) — extract the ZOL literal from Cluster 1 regex via get_prompt_context().short_name. Lowest blast radius, unblocks second-tenant onboarding immediately.
  • Phase 2 (~1 day) — schedule-extractor registry per tenant in data_quality.py (Layer C extension).
  • Phase 3 (~1.5 days) — nl/en/fr/it regex + response-template coverage for Cluster 1+3.

Status: Proposed. Acceptance criteria per phase documented; out-of-scope items (corpus migration, onboarding UX, corpus translation) explicitly listed so the ADR doesn't drift into a SaaS-platform overhaul.


22 · Dedup audit/flip toolkit (reusable across future cleanups)

Four new one-shot scripts committed under backend/scripts/:

  • audit_dedup_broader_2026_05_12.py — read-only audit of the title-normalized dedup pass; outputs markdown report with Pre-Mortem signal (loser-wins rate vs 70%/50% thresholds).
  • flip_dedup_broader_misfires_2026_05_13.py — atomic per-pair flip with idempotent re-run + post-flip unique-constraint verification.
  • audit_dedup_source_url_2026_05_13.py — read-only audit of the source-url dedup; special-case bucket for empty-winner pairs (always unambiguous misfires).
  • flip_dedup_source_url_misfires_2026_05_13.py — flips the 7 empty-winner pairs.

Both flip scripts MUST run inside a single transaction because migration 068's partial unique index ux_documents_tenant_normalized_title_completed would otherwise see two completed docs with the same normalized title mid-flight. The winner→failed update happens FIRST so the loser can flip to 'completed' without colliding.

Pattern reusable for any future dedup cleanup. Memory feedback_dedup_heuristic_lessons.md captures the two-rule defense (never-oldest-wins + chunk-count-greater-than-zero pre-filter).


23 · Numbers

MetricBefore this releaseAfter this release
Pilot-review readiness artifacts05 (Phase 5 bundle)
Drift registers06 (one per topic area)
ADRs5359 (5 from earlier in window + ADR-0059 proposed)
MedChat 50-Q benchmark (midweek)87.5 avg / 3 wins / 21 losses91.1 avg / 23 wins / 7 losses / 0 P0
MedChat 50-Q (post-fix, 2026-05-13 afternoon)90.0 avg / 22 wins / 6 losses / 22 ties / 2 cache-poisoning P093.9 avg / 5 wins / 9 losses / 36 ties / 0 P0 (ZOL avg 93.8 — dead heat)
Voice eval (post-dedup-fix)1/9 personas, 80.5% turns3/9 personas, 87.8% turns (corpus restoration alone)
Chat p50 latency budget (estimate)~9 800 ms~9 100 ms (-700 ms from §2)
/api/v1/admin/ops/latency-percentilesdid not existlive
ZOL-specific hand-curated FAQs10 (3 contradicted by corpus)0 (purged per ADR-0055)
Chat answer-shape rules1 (CHAT_BOLD_LEDE_RULE)6-pattern typology (ADR-0056)
Tenant-scoped prompt addendums01 (ZOL doctor schedule, ADR-0057)
Voice overlay admin surfaceYAML-onlyfull CRUD UI + import/export
Intent-classification cachedid not existmemory + Redis backends + UI kill switch
Data Quality nightly auditdid not existLayer A + scheduler at 03:30 UTC
Voice intents8 + 2 (Cluster 1/3 added midweek)11 (+ BILLING_INQUIRY Wave 1a)
Voice routing categories9 (crisis → faq)10 (+ pharmacist_deflect Wave 1b, band 1)
Voice medical disclaimer firingevery safety-flagged turnonce per conversation (Wave 0)
Voice system promptNL-implicitper-language hint block at top (Wave 2a, nl/en/fr/it)
Documents restored from dedup misfires024 (17 broader + 7 source-url empty-winner)
Pilot DB migration head066068
Backend tests~4 900~5 100 (+200 incl. Wave 0-2 contract tests)

What's queued for next release

  • Run the full Golden Eval against pilot HEAD — voice eval first (full set against https://zol.novation.website/api/v1 SIP path), then chat eval (299-question set) to confirm the latency wave + 5 ADRs + 7 RAG fixes hold end-to-end against the production image. Currently scheduled for the release-deploy session that's writing these notes.
  • Phase 3 of ADR-0055 — demand-driven FAQ promotion pipeline. Observe conversation_messages → cluster → score against RAG quality → auto-draft FAQ entries on low-quality × high-demand clusters.
  • Phase 2 of ADR-0055 — nightly audit_faq_corpus.py cron. The script is sketched in the ADR; needs implementation + alerting wiring.
  • Per-tenant affinity override table — currently the Value Framework affinity map is a module-level Python dict. Tenants in non-default content distributions need DB-backed overrides via a new app.intent_category_affinity table.
  • ADR-0058 — per-LLM-call model routing policy — formalises the May 11 LLM-mix proposal as a project decision.
  • Twilio Phase B — pilot DNS + Let's Encrypt SIPS certificate + firewall rules. ADR-0050 has the runbook.
  • Audio-loop evaluation harness — voice eval is currently turn-text-based against transcripts; the SOTA matrix's "audio-loop" gap (§Phase 4 above) gets filled here.
  • Value Dashboard v2 polish — PDF export, language drill-down, custom date-range picker, business-hours admin UI. Carried over from the previous release.

References

  • Previous release: May 4 – 9, 2026
  • Pilot-review readiness plan: docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md
  • Latency opportunities research: docs/2026-05-11-latency-opportunities-research.md
  • F2 misdiagnosis report: docs/2026-05-11-f2-intent-prompt-shrink-misdiagnosis.md
  • Comparison RCA + 7-fix implementation plan: docs/2026-05-11-comparison-rca-fixes.md
  • LLM-mix proposal: docs/2026-05-11-llm-mix-proposal.md
  • ADR-0053 through 0057 — docs/decisions/ (Docusaurus) + docs/ADR/ (source-of-truth)
  • Decision-Cost Rubric showcase: methodology/decision-cost-rubric.md
  • Drift registers: docs/audits/2026-05-09-*.md
  • Voice compendium (Phase 3): compendium/
  • SOTA positioning matrix (Phase 4): positioning/
  • Pilot-review artifact bundle (Phase 5): pilot-review/
  • Telemetry & Grafana/Prometheus runbook: operations/telemetry-and-runbooks
  • Q5 RCA dedup audit reports: backend/tests/evaluation/results/dedup-broader-audit-20260513.md + dedup-source-url-audit-20260513.md
  • Post-fix 50-Q benchmark: backend/docs/2026-05-13-bench-post-dedup-flip.md + .json (raw)
  • ADR-0059 (proposed): docs/ADR/0059-tenant-and-language-extension-cluster-1-and-3.md
  • Dedup heuristic lessons (memory): ~/.claude/projects/-Users-soft4u-Development-zol-rag/memory/feedback_dedup_heuristic_lessons.md