Release Notes: May 4 – 9, 2026
v2 Diagnostic, Value Dashboard, Value Framework
~50 commits | 5 days | New customer-facing analytics surface | New hospital-agnostic retrieval-steering framework | Voice citation pipeline structurally repaired | Per-dimension diagnostic dashboard finally renders
This release is the first since the May 4 voice-pipeline cut. Five days of intensive work that closes three classes of bugs at the architectural level (no per-tenant patching, no prompt babysitting) and ships the customer-facing analytics surface the pilot has been waiting for.
The headline themes:
- Voice citation pipeline structurally repaired (May 7) — three downstream helpers all silently assumed inline
[N]markers; voice answers strip them. Result: every voice turn was persistingcitations=NULLand the v2 diagnostic had no chunk evidence to validate. Three commits, one cascade closed, generic fix. - v2 Diagnostic redesigned end-to-end (May 7) — per-dimension scoring, citation validator, adversarial counter-evidence pass. The "feels brittle" report on May 9 was traced to schema-validation silent fallback; fixed in one line with
response_format=json_object. - Graduated medical-advice refusal (May 7) — the safety layer no longer says "call 112" for everything; it walks the helpdesk → GP → out-of-hours GP → 112 ladder.
- Value Dashboard v1.0 shipped (May 7-9) — 23-task plan, customer-facing
/valuepage +/platform/valuecross-tenant rollup. Belgian healthcare cost model with 4-tier fallback resolver. Migration 065. Email digest deferred to a future release. - Value Framework — hospital-agnostic retrieval steering (May 9) — closes the wheelchair-conflation class of bug across every future tenant. Intent-to-category affinity rerank, primary-category prompt guard, unit-mismatch admission, per-turn
category_mismatch_ratetelemetry. New packageapp.services.value_framework+ migration 066. 76 framework unit tests + an 8-test wheelchair regression that pins the morning-of-2026-05-09 pilot bug. - Operations dashboard for retrieval health (May 9) —
/api/v1/admin/ops/category-mismatchendpoint surfaces the new metric on the existing/analytics/systempage. - Methodology shift (May 9) — every fix lands with at least one integration test that pins the regression. We no longer manually re-test what's already coded and tested.
v2 Diagnostic — full redesign across four phases
The v1 diagnostic was a single LLM blob — one prompt, one verdict, one markdown bullet list. It was right most of the time, wrong on roughly one in five cases, and the failures were unaccountable: there was no way to see which dimension of the answer the LLM judged broken. By May 7 it was clear the pilot needed a richer, more accountable diagnostic.
Phase A — Per-turn chunk evidence + v2 schema scaffold (46e95895)
The v2 schema (app/schemas/diagnosis_v2.py) introduces:
- A per-call dimension rollup:
correctness / safety / memory / tool_use / latency, each withscore: 0..1,verdict: pass/partial/fail,evidence: [...]. - A per-turn breakdown: each turn carries the same five dimensions plus
claims,citations_quoted, andcounter_evidenceslots. - A
model_usedfield — load-bearing because OpenRouter routing makes self-reported model unreliable.
The Phase A loader path also added (document_id, chunk_index) pair lookup in _collect_chunk_evidence so the diagnostic can find the actual chunks the answer was grounded on.
Phase B — v2 prompt + citation validator + runner (828c1341)
run_v2_diagnosis orchestrates: build prompt → call LLM → tolerant JSON parse → validate against DiagnosisV2 Pydantic schema → run citation validator → return (DiagnosisV2, meta). The citation validator strips claims whose quotes don't actually appear in the cited chunks (case + whitespace tolerant matching) and downgrades dimension scores per invalid citation.
Phase C — Adversarial counter-evidence pass (716bcd27)
The first-pass v2 diagnosis is biased toward false negatives — the LLM tends to flag the answer as wrong when it had a slight cue of doubt. Phase C runs a second LLM call that tries to disprove each first-pass negative claim. If the disprove pass succeeds, the dimension score is boosted. This asymmetric bias correction closes the residual ~20% accuracy gap from first-pass alone. Per-call cost: ~$0.03.
Phase D — Frontend rendering for v2 dashboard (e3bdcf87)
The InvestigationViewV2.tsx component renders the per-dimension cards, per-turn drill-down, counter-evidence panel, and the citation-validated chunk evidence. Falls back to the v1 markdown if dimensional_scores is empty.
Phase D follow-up — response_format=json_object (c1cfa026, May 9)
A "feels brittle" report from the operator on the morning of May 9 traced to a silent failure: gpt-5.2 (via OpenRouter) was emitting v2-style PROSE, not v2 JSON. The Pydantic schema validation raised ValueError, the admin route caught it, and the fallback persisted dimensional_scores={} — invisible to the v2 UI in perpetuity.
Fix is one line: pin response_format={"type": "json_object"} on the ChatRequest. The model now reliably emits parseable JSON and the v2 cards finally render.
Lesson: silent
try/exceptaround schema validation for "best-effort fallback" hid a load-bearing regression for two days. The fix is the constraint, not the catch. Future v2 prompts should also useresponse_formatexplicitly.
Voice citation pipeline — three-helper cascade closed
A 2026-05-07 morning smoke test revealed every voice turn persisted with citations=NULL. The cause was a cascade across three helpers in _qs_finalize, each individually correct, all making the same assumption:
| Step | What it did | Voice answer = no [N] markers → |
|---|---|---|
_extract_all_citations_from_response | Finds inline [N] markers, builds Citation list | Returns [] |
_deduplicate_and_renumber_citations | Walks [N] markers to remap them | Wipes whatever was there |
_qs_cache_response | Caches the (empty) result | Future hits return empty cache |
Voice answers contain no inline markers because the voice system prompt strips them — they're un-speakable. Chat answers do have markers, so the chat path was fine; voice was structurally broken.
Fix (three commits):
d130df74— voice fallback in_qs_finalize: when channel=voice + no markers + chunks retrieved, derive 5 citations from the top retrieved chunks directly.3cd5cc2f— skip dedup whenre.search(r"\[\d+\]", response)is empty. The dedup helper was unconditionally wiping fallback citations.11a51ab2— drop the temporary diagnostic log line after root-cause confirmed.
Operator runbook addendum: always flush app.semantic_query_cache after any change to the citation pipeline. Stale cache entries from before the fix poison new traffic for hours otherwise.
The contract test for this is in memory feedback-voice-citation-pipeline.md. Verified end-to-end on pilot image zol-rag-app:11a51ab2: two consecutive voice turns each persisted with 5 citations including real (document_id, chunk_index) pairs.
Graduated medical-advice refusal (013f7402)
Pre-fix, the safety layer's refusal copy was the same for every medical-advice query: "Bel uw huisarts of bel 112." That works for an emergency; it's needlessly alarming for "I have a small headache." The refusal was rebuilt as a graduated escalation:
- First touch: "Voor persoonlijk medisch advies kunt u terecht bij uw huisarts."
- Out-of-hours: "Buiten de kantooruren is er de huisartsenwachtdienst op 1733."
- Emergency: "Bij een dringend probleem belt u 112."
The escalation is selected by intent-classifier signal (general medical question → ladder 1; urgency cue in query → ladder 2 or 3). Cross-language: nl/en/fr/it all updated in parallel per the project rule.
Value Dashboard v1.0 — customer-facing cost-reduction analytics
The 23-task plan landed across May 7-9. Branch: feat/value-dashboard-v1, ~30 commits. Spec: docs/superpowers/specs/2026-05-06-value-dashboard-design.md. Plan: docs/superpowers/plans/2026-05-07-value-dashboard.md.
What ships
- Per-tenant
/valuepage — three headline tiles (calls handled / minutes deflected / after-hours coverage), period filter (cumulative / today / week / month / quarter / ytd), channel filter (all / voice / chat), daily-volume stacked area chart, cost-model attribution banner. - Platform-owner
/platform/valuepage — cross-tenant rollup with churn warnings, per-tenant € saved, aggregate row. - Cost-model resolver — 4-tier fallback chain (customer override → country industry default → 30-day self-baseline → global fallback), 60s in-process cache. Belgian healthcare default: €0.62/min × 3.2 min from BFHW 2024.
- Customer override API —
POST /api/v1/admin/cost-modelfor hospital admins to set their own cost numbers. - Nightly aggregator — APScheduler at 02:30 UTC, three phases: backfill conversation outcome/is_after_hours/duration → per-channel daily upsert →
_allchannel rollup. Idempotent. 12-month historical backfill script.
Migration 065 — schema additions
| Object | Purpose |
|---|---|
app.tenants.country_code (new column) | Drives the country industry default lookup |
app.tenant_business_hours | Per-tenant operational hours; drives "after-hours" classification |
app.tenant_cost_models | Append-only history, is_active partial unique index |
app.daily_tenant_metrics | Pre-aggregated rollup; the dashboard reads ONLY from this table |
app.conversations (4 new columns) | duration_seconds, outcome, is_after_hours, dominant_intent |
The dashboard's read path goes exclusively through daily_tenant_metrics. Any single-row aggregation cost is paid by the nightly cron, not by the user-facing page load. Sub-500ms p95 dashboard load.
Email digest — deferred
The original spec included a Resend-based monthly digest (v1.5). It was dropped from this release. Reasoning captured in the spec: the dashboard alone is enough proof-of-value for the pilot, and email brings deliverability + GDPR compliance work that isn't on the critical path. Re-evaluate post-pilot if engagement metrics indicate passive-dashboard usage is dropping.
What we learned
Three plan-level corrections were caught only at the integration boundary, not at the unit-test boundary:
Tenant.country_codewas missing from the ORM — only the migration had it. Tests usingBase.metadata.create_allnever noticed. Pilot deploy withINSERT INTO tenants(country_code,...)would have failed. Caught during the resolver task.include_citations: Truewas the default — but the affinity rerank was clobbering them. Caught during smoke testing with non-cached queries.token.subdoesn't exist — the project usestoken.user_id. Caught during the cost-model POST integration test.
Each of these is now permanently pinned in an integration test.
Value Framework — closing the wheelchair-conflation class of bugs
A May 9 morning smoke test exposed a regression that no automated test had caught: a caller asked about wheelchair availability at the entrance, and the agent answered with orthopedic-device reimbursement information. The v2 diagnostic correctly flagged it as a generation failure with citation evidence. The user pushed back on a hospital-specific prompt patch — the project is multi-tenant SaaS, and patching one tenant's corpus through prompts would just defer the failure to the next tenant.
The bug is one instance of a recurring class: cross-category contamination at retrieval+generation time. A query of intent type X retrieves chunks of mixed content categories (some genuinely X-typed, some lexically related but category-mismatched). The LLM, given all chunks in context, fuses them.
The framework
Five mechanisms, all data-agnostic, all live as of 79f517bd:
1. Six canonical content categories
A keyword-based classifier in app/services/value_framework/category_classifier.py assigns one of six labels to each retrieved chunk:
practical · clinical_info · regulatory · appointments · legal_admin · general
Multi-language keyword sets (nl/en/fr/it). Hospital-agnostic — no specific filenames, no specific document paths, just linguistic signals (parkeren / behandeling / terugbetaling / afspraak / etc.).
2. Intent-to-category affinity rerank (Fix A — the architectural keystone)
For each query intent class, an affinity matrix multiplies chunk scores by category-specific multipliers. Defaults:
| Intent | practical | clinical_info | regulatory | appointments | legal_admin | general |
|---|---|---|---|---|---|---|
| navigation_or_practical_info | 1.30 | 0.65 | 0.55 | 1.05 | 0.85 | 1.00 |
| medical_information | 0.75 | 1.25 | 1.05 | 0.95 | 0.85 | 1.00 |
| appointment_scheduling | 1.05 | 0.80 | 0.75 | 1.30 | 0.95 | 1.00 |
| billing_or_insurance | 0.85 | 0.85 | 1.30 | 0.95 | 1.10 | 1.00 |
| ... | ... | ... | ... | ... | ... | ... |
Pilot evidence: a chunk scoring 0.85 (orthopedic, regulatory) and another at 0.65 (parking, practical) flip after the navigation rerank to 0.47 vs 0.85 — exactly the inversion needed.
3. Primary-category prompt constraint (Fix B)
After rerank, the dominant category of the top-K is computed and injected into the system prompt as one line:
De gevonden bronnen behoren hoofdzakelijk tot de categorie 'X'. Beantwoord de vraag uitsluitend vanuit deze categorie. Vermeng GEEN informatie uit andere categorieën, zelfs niet als ze toevallig hetzelfde sleutelwoord bevatten.
The LLM enforces category coherence at generation time. Hospital-agnostic.
4. Unit-mismatch admission (Fix C)
A linguistic detector (detect_unit_mismatch) flags when the user query specifies a unit (per minuut, per kg, per dag, per kWh) and no retrieved chunk contains that exact unit. The system prompt is then augmented with:
De gebruiker vraagt naar een tarief 'per minute', maar de bronnen vermelden enkel per kwh. Zeg dit expliciet en geef het tarief in de eenheid die WEL bekend is — substitueer NIET stilzwijgend.
Caught the silent substitution bug from this morning's call.
5. category_mismatch_rate telemetry (Fix E)
Migration 066 added app.category_mismatch_telemetry. Every voice/chat turn now writes one row capturing intent class, primary category after rerank, mismatch rate (fraction of top-K outside the preferred set), chunk counts, query preview.
The wheelchair regression test
backend/tests/integration/services/test_value_framework_wheelchair_regression.py — 8 tests across 3 tiers:
- Tier 1 (mechanism, deterministic): orthopedics → REGULATORY, parking → PRACTICAL, affinity flips ordering, primary_category = PRACTICAL after rerank, unit-mismatch detected for
per minuutvsper kWh, low mismatch rate after correct rerank. - Tier 2 (telemetry, testcontainer DB): end-to-end framework call writes one row to
category_mismatch_telemetry. - Tier 3 (prompt-string contract): the framework guidance block contains the right primary-category instruction + unit-mismatch admission text.
Verified end-to-end on pilot image zol-rag-app:79f517bd: query "Zijn er rolstoelen ter beschikking aan de ingang?" → answer "Ja, in de inkomhal en op de bezoekersparking ... waarborg van één euro die u na gebruik terugkrijgt" — the correct answer, with mismatch_rate=0.000 and primary_category=practical recorded in telemetry.
Operations dashboard — retrieval health visualization
GET /api/v1/admin/ops/category-mismatch?days=N returns a daily rollup with avg + p95 mismatch rate, top off-category, top intent. The chart component (CategoryMismatchTrend.tsx) renders on /analytics/system (Operations) under the Costs tab — operators viewing cost metrics see retrieval health alongside.
When the avg line spikes for a tenant or an intent, the affinity map needs retuning. Currently the map is a Python module constant; the next iteration will move per-tenant overrides into app.intent_category_affinity (table not yet created).
Methodology shift — codified 2026-05-09
Going forward, every fix or improvement must be covered by a meaningful integration test. We no longer manually re-test what we have already discovered and addressed in code. The wheelchair regression alone added 76 framework unit tests, 8 wheelchair-specific assertions, and 2 operations endpoint integration tests.
Two embarrassing-but-instructive misses in the same session were caught only by pilot deploy and not by pytest:
- The migration file (
066_category_mismatch_telemetry.py) was on disk but nevergit added. - The entire
app/services/value_framework/package was on disk but nevergit added.
Local pytest passed because pytest reads from disk regardless of git state, and autouse table-creation fixtures bypassed the migration. Pilot's clean git pull + docker build + alembic upgrade head is the first environment that catches this.
Pre-flight from now on includes git status confirming zero untracked production files before deploy.
Numbers
| Metric | Before | After |
|---|---|---|
| v2 diagnostic dashboard cards rendered | never (always v1 markdown fallback) | rendered on every investigation |
Voice turn citations column | always NULL | populated with top retrieved chunks |
| Wheelchair-question correctness | wrong (orthopedic reimbursement answer) | correct (loan wheelchairs at entrance) |
| Per-turn telemetry rows | 0 | 1 per RAG turn |
| Customer-facing cost-reduction surface | did not exist | /value + /platform/value live |
| Backend tests | ~4,800 | ~4,900 (+100 framework + dashboard) |
| Pilot DB migration head | 064 | 066 |
What's queued for next release
- Per-tenant affinity override table — currently the map is module-level; tenants in fields with non-default content distributions need DB-backed overrides.
- Email digest pipeline — Resend integration, monthly cron, opt-out, S4U platform-owner digest. Was v1.5; deferred.
- Trend endpoint for the dashboard chart —
DailyVolumeTrendcurrently receives an empty array (the data point exists indaily_tenant_metricsbut no API surface yet). - Value Dashboard v2 polish — PDF export, language drill-down, custom date-range picker, business-hours admin UI.
- Twilio Phase B — pilot DNS + Let's Encrypt SIPS + firewall rules. ADR-0050 has the runbook.
- Voice smoke-test script automation — the 12-turn manual test shipped in this release should eventually run as an automated audio-loop evaluation.