Release Notes: May 4 – 9, 2026

v2 Diagnostic, Value Dashboard, Value Framework

This release is the first since the May 4 voice-pipeline cut. Five days of intensive work that closes three classes of bugs at the architectural level (no per-tenant patching, no prompt babysitting) and ships the customer-facing analytics surface the pilot has been waiting for.

The headline themes:

Voice citation pipeline structurally repaired (May 7) — three downstream helpers all silently assumed inline [N] markers; voice answers strip them. Result: every voice turn was persisting citations=NULL and the v2 diagnostic had no chunk evidence to validate. Three commits, one cascade closed, generic fix.
v2 Diagnostic redesigned end-to-end (May 7) — per-dimension scoring, citation validator, adversarial counter-evidence pass. The "feels brittle" report on May 9 was traced to schema-validation silent fallback; fixed in one line with response_format=json_object.
Graduated medical-advice refusal (May 7) — the safety layer no longer says "call 112" for everything; it walks the helpdesk → GP → out-of-hours GP → 112 ladder.
Value Dashboard v1.0 shipped (May 7-9) — 23-task plan, customer-facing /value page + /platform/value cross-tenant rollup. Belgian healthcare cost model with 4-tier fallback resolver. Migration 065. Email digest deferred to a future release.
Value Framework — hospital-agnostic retrieval steering (May 9) — closes the wheelchair-conflation class of bug across every future tenant. Intent-to-category affinity rerank, primary-category prompt guard, unit-mismatch admission, per-turn category_mismatch_rate telemetry. New package app.services.value_framework + migration 066. 76 framework unit tests + an 8-test wheelchair regression that pins the morning-of-2026-05-09 pilot bug.
Operations dashboard for retrieval health (May 9) — /api/v1/admin/ops/category-mismatch endpoint surfaces the new metric on the existing /analytics/system page.
Methodology shift (May 9) — every fix lands with at least one integration test that pins the regression. We no longer manually re-test what's already coded and tested.

v2 Diagnostic — full redesign across four phases

The v1 diagnostic was a single LLM blob — one prompt, one verdict, one markdown bullet list. It was right most of the time, wrong on roughly one in five cases, and the failures were unaccountable: there was no way to see which dimension of the answer the LLM judged broken. By May 7 it was clear the pilot needed a richer, more accountable diagnostic.

Phase A — Per-turn chunk evidence + v2 schema scaffold (`46e95895`)

The v2 schema (app/schemas/diagnosis_v2.py) introduces:

A per-call dimension rollup: correctness / safety / memory / tool_use / latency, each with score: 0..1, verdict: pass/partial/fail, evidence: [...].
A per-turn breakdown: each turn carries the same five dimensions plus claims, citations_quoted, and counter_evidence slots.
A model_used field — load-bearing because OpenRouter routing makes self-reported model unreliable.

The Phase A loader path also added (document_id, chunk_index) pair lookup in _collect_chunk_evidence so the diagnostic can find the actual chunks the answer was grounded on.

Phase B — v2 prompt + citation validator + runner (`828c1341`)

run_v2_diagnosis orchestrates: build prompt → call LLM → tolerant JSON parse → validate against DiagnosisV2 Pydantic schema → run citation validator → return (DiagnosisV2, meta). The citation validator strips claims whose quotes don't actually appear in the cited chunks (case + whitespace tolerant matching) and downgrades dimension scores per invalid citation.

Phase C — Adversarial counter-evidence pass (`716bcd27`)

The first-pass v2 diagnosis is biased toward false negatives — the LLM tends to flag the answer as wrong when it had a slight cue of doubt. Phase C runs a second LLM call that tries to disprove each first-pass negative claim. If the disprove pass succeeds, the dimension score is boosted. This asymmetric bias correction closes the residual ~20% accuracy gap from first-pass alone. Per-call cost: ~$0.03.

Phase D — Frontend rendering for v2 dashboard (`e3bdcf87`)

The InvestigationViewV2.tsx component renders the per-dimension cards, per-turn drill-down, counter-evidence panel, and the citation-validated chunk evidence. Falls back to the v1 markdown if dimensional_scores is empty.

Phase D follow-up — `response_format=json_object` (`c1cfa026`, May 9)

A "feels brittle" report from the operator on the morning of May 9 traced to a silent failure: gpt-5.2 (via OpenRouter) was emitting v2-style PROSE, not v2 JSON. The Pydantic schema validation raised ValueError, the admin route caught it, and the fallback persisted dimensional_scores={} — invisible to the v2 UI in perpetuity.

Fix is one line: pin response_format={"type": "json_object"} on the ChatRequest. The model now reliably emits parseable JSON and the v2 cards finally render.

Lesson: silent try/except around schema validation for "best-effort fallback" hid a load-bearing regression for two days. The fix is the constraint, not the catch. Future v2 prompts should also use response_format explicitly.

Voice citation pipeline — three-helper cascade closed

A 2026-05-07 morning smoke test revealed every voice turn persisted with citations=NULL. The cause was a cascade across three helpers in _qs_finalize, each individually correct, all making the same assumption:

Step	What it did	Voice answer = no `[N]` markers →
`_extract_all_citations_from_response`	Finds inline `[N]` markers, builds Citation list	Returns `[]`
`_deduplicate_and_renumber_citations`	Walks `[N]` markers to remap them	Wipes whatever was there
`_qs_cache_response`	Caches the (empty) result	Future hits return empty cache

Voice answers contain no inline markers because the voice system prompt strips them — they're un-speakable. Chat answers do have markers, so the chat path was fine; voice was structurally broken.

Fix (three commits):

d130df74 — voice fallback in _qs_finalize: when channel=voice + no markers + chunks retrieved, derive 5 citations from the top retrieved chunks directly.
3cd5cc2f — skip dedup when re.search(r"\[\d+\]", response) is empty. The dedup helper was unconditionally wiping fallback citations.
11a51ab2 — drop the temporary diagnostic log line after root-cause confirmed.

Operator runbook addendum: always flush app.semantic_query_cache after any change to the citation pipeline. Stale cache entries from before the fix poison new traffic for hours otherwise.

The contract test for this is in memory feedback-voice-citation-pipeline.md. Verified end-to-end on pilot image zol-rag-app:11a51ab2: two consecutive voice turns each persisted with 5 citations including real (document_id, chunk_index) pairs.

Graduated medical-advice refusal (`013f7402`)

Pre-fix, the safety layer's refusal copy was the same for every medical-advice query: "Bel uw huisarts of bel 112." That works for an emergency; it's needlessly alarming for "I have a small headache." The refusal was rebuilt as a graduated escalation:

First touch: "Voor persoonlijk medisch advies kunt u terecht bij uw huisarts."
Out-of-hours: "Buiten de kantooruren is er de huisartsenwachtdienst op 1733."
Emergency: "Bij een dringend probleem belt u 112."

The escalation is selected by intent-classifier signal (general medical question → ladder 1; urgency cue in query → ladder 2 or 3). Cross-language: nl/en/fr/it all updated in parallel per the project rule.

Value Dashboard v1.0 — customer-facing cost-reduction analytics

The 23-task plan landed across May 7-9. Branch: feat/value-dashboard-v1, ~30 commits. Spec: docs/superpowers/specs/2026-05-06-value-dashboard-design.md. Plan: docs/superpowers/plans/2026-05-07-value-dashboard.md.

What ships

Per-tenant /value page — three headline tiles (calls handled / minutes deflected / after-hours coverage), period filter (cumulative / today / week / month / quarter / ytd), channel filter (all / voice / chat), daily-volume stacked area chart, cost-model attribution banner.
Platform-owner /platform/value page — cross-tenant rollup with churn warnings, per-tenant € saved, aggregate row.
Cost-model resolver — 4-tier fallback chain (customer override → country industry default → 30-day self-baseline → global fallback), 60s in-process cache. Belgian healthcare default: €0.62/min × 3.2 min from BFHW 2024.
Customer override API — POST /api/v1/admin/cost-model for hospital admins to set their own cost numbers.
Nightly aggregator — APScheduler at 02:30 UTC, three phases: backfill conversation outcome/is_after_hours/duration → per-channel daily upsert → _all channel rollup. Idempotent. 12-month historical backfill script.

Migration 065 — schema additions

Object	Purpose
`app.tenants.country_code` (new column)	Drives the country industry default lookup
`app.tenant_business_hours`	Per-tenant operational hours; drives "after-hours" classification
`app.tenant_cost_models`	Append-only history, `is_active` partial unique index
`app.daily_tenant_metrics`	Pre-aggregated rollup; the dashboard reads ONLY from this table
`app.conversations` (4 new columns)	`duration_seconds`, `outcome`, `is_after_hours`, `dominant_intent`

The dashboard's read path goes exclusively through daily_tenant_metrics. Any single-row aggregation cost is paid by the nightly cron, not by the user-facing page load. Sub-500ms p95 dashboard load.

Email digest — deferred

The original spec included a Resend-based monthly digest (v1.5). It was dropped from this release. Reasoning captured in the spec: the dashboard alone is enough proof-of-value for the pilot, and email brings deliverability + GDPR compliance work that isn't on the critical path. Re-evaluate post-pilot if engagement metrics indicate passive-dashboard usage is dropping.

What we learned

Three plan-level corrections were caught only at the integration boundary, not at the unit-test boundary:

Tenant.country_code was missing from the ORM — only the migration had it. Tests using Base.metadata.create_all never noticed. Pilot deploy with INSERT INTO tenants(country_code,...) would have failed. Caught during the resolver task.
include_citations: True was the default — but the affinity rerank was clobbering them. Caught during smoke testing with non-cached queries.
token.sub doesn't exist — the project uses token.user_id. Caught during the cost-model POST integration test.

Each of these is now permanently pinned in an integration test.

Value Framework — closing the wheelchair-conflation class of bugs

A May 9 morning smoke test exposed a regression that no automated test had caught: a caller asked about wheelchair availability at the entrance, and the agent answered with orthopedic-device reimbursement information. The v2 diagnostic correctly flagged it as a generation failure with citation evidence. The user pushed back on a hospital-specific prompt patch — the project is multi-tenant SaaS, and patching one tenant's corpus through prompts would just defer the failure to the next tenant.

The bug is one instance of a recurring class: cross-category contamination at retrieval+generation time. A query of intent type X retrieves chunks of mixed content categories (some genuinely X-typed, some lexically related but category-mismatched). The LLM, given all chunks in context, fuses them.

The framework

Five mechanisms, all data-agnostic, all live as of 79f517bd:

1. Six canonical content categories

A keyword-based classifier in app/services/value_framework/category_classifier.py assigns one of six labels to each retrieved chunk:

practical · clinical_info · regulatory · appointments · legal_admin · general

Multi-language keyword sets (nl/en/fr/it). Hospital-agnostic — no specific filenames, no specific document paths, just linguistic signals (parkeren / behandeling / terugbetaling / afspraak / etc.).

2. Intent-to-category affinity rerank (Fix A — the architectural keystone)

For each query intent class, an affinity matrix multiplies chunk scores by category-specific multipliers. Defaults:

Intent	practical	clinical_info	regulatory	appointments	legal_admin	general
navigation_or_practical_info	1.30	0.65	0.55	1.05	0.85	1.00
medical_information	0.75	1.25	1.05	0.95	0.85	1.00
appointment_scheduling	1.05	0.80	0.75	1.30	0.95	1.00
billing_or_insurance	0.85	0.85	1.30	0.95	1.10	1.00
...	...	...	...	...	...	...

Pilot evidence: a chunk scoring 0.85 (orthopedic, regulatory) and another at 0.65 (parking, practical) flip after the navigation rerank to 0.47 vs 0.85 — exactly the inversion needed.

3. Primary-category prompt constraint (Fix B)

After rerank, the dominant category of the top-K is computed and injected into the system prompt as one line:

De gevonden bronnen behoren hoofdzakelijk tot de categorie 'X'. Beantwoord de vraag uitsluitend vanuit deze categorie. Vermeng GEEN informatie uit andere categorieën, zelfs niet als ze toevallig hetzelfde sleutelwoord bevatten.

The LLM enforces category coherence at generation time. Hospital-agnostic.

4. Unit-mismatch admission (Fix C)

A linguistic detector (detect_unit_mismatch) flags when the user query specifies a unit (per minuut, per kg, per dag, per kWh) and no retrieved chunk contains that exact unit. The system prompt is then augmented with:

De gebruiker vraagt naar een tarief 'per minute', maar de bronnen vermelden enkel per kwh. Zeg dit expliciet en geef het tarief in de eenheid die WEL bekend is — substitueer NIET stilzwijgend.

Caught the silent substitution bug from this morning's call.

5. `category_mismatch_rate` telemetry (Fix E)

Migration 066 added app.category_mismatch_telemetry. Every voice/chat turn now writes one row capturing intent class, primary category after rerank, mismatch rate (fraction of top-K outside the preferred set), chunk counts, query preview.

The wheelchair regression test

backend/tests/integration/services/test_value_framework_wheelchair_regression.py — 8 tests across 3 tiers:

Tier 1 (mechanism, deterministic): orthopedics → REGULATORY, parking → PRACTICAL, affinity flips ordering, primary_category = PRACTICAL after rerank, unit-mismatch detected for per minuut vs per kWh, low mismatch rate after correct rerank.
Tier 2 (telemetry, testcontainer DB): end-to-end framework call writes one row to category_mismatch_telemetry.
Tier 3 (prompt-string contract): the framework guidance block contains the right primary-category instruction + unit-mismatch admission text.

Verified end-to-end on pilot image zol-rag-app:79f517bd: query "Zijn er rolstoelen ter beschikking aan de ingang?" → answer "Ja, in de inkomhal en op de bezoekersparking ... waarborg van één euro die u na gebruik terugkrijgt" — the correct answer, with mismatch_rate=0.000 and primary_category=practical recorded in telemetry.

Operations dashboard — retrieval health visualization

GET /api/v1/admin/ops/category-mismatch?days=N returns a daily rollup with avg + p95 mismatch rate, top off-category, top intent. The chart component (CategoryMismatchTrend.tsx) renders on /analytics/system (Operations) under the Costs tab — operators viewing cost metrics see retrieval health alongside.

When the avg line spikes for a tenant or an intent, the affinity map needs retuning. Currently the map is a Python module constant; the next iteration will move per-tenant overrides into app.intent_category_affinity (table not yet created).

Methodology shift — codified 2026-05-09

Going forward, every fix or improvement must be covered by a meaningful integration test. We no longer manually re-test what we have already discovered and addressed in code. The wheelchair regression alone added 76 framework unit tests, 8 wheelchair-specific assertions, and 2 operations endpoint integration tests.

Two embarrassing-but-instructive misses in the same session were caught only by pilot deploy and not by pytest:

The migration file (066_category_mismatch_telemetry.py) was on disk but never git added.
The entire app/services/value_framework/ package was on disk but never git added.

Local pytest passed because pytest reads from disk regardless of git state, and autouse table-creation fixtures bypassed the migration. Pilot's clean git pull + docker build + alembic upgrade head is the first environment that catches this.

Pre-flight from now on includes git status confirming zero untracked production files before deploy.

Numbers

Metric	Before	After
v2 diagnostic dashboard cards rendered	never (always v1 markdown fallback)	rendered on every investigation
Voice turn `citations` column	always NULL	populated with top retrieved chunks
Wheelchair-question correctness	wrong (orthopedic reimbursement answer)	correct (loan wheelchairs at entrance)
Per-turn telemetry rows	0	1 per RAG turn
Customer-facing cost-reduction surface	did not exist	`/value` + `/platform/value` live
Backend tests	~4,800	~4,900 (+100 framework + dashboard)
Pilot DB migration head	064	066

What's queued for next release

Per-tenant affinity override table — currently the map is module-level; tenants in fields with non-default content distributions need DB-backed overrides.
Email digest pipeline — Resend integration, monthly cron, opt-out, S4U platform-owner digest. Was v1.5; deferred.
Trend endpoint for the dashboard chart — DailyVolumeTrend currently receives an empty array (the data point exists in daily_tenant_metrics but no API surface yet).
Value Dashboard v2 polish — PDF export, language drill-down, custom date-range picker, business-hours admin UI.
Twilio Phase B — pilot DNS + Let's Encrypt SIPS + firewall rules. ADR-0050 has the runbook.
Voice smoke-test script automation — the 12-turn manual test shipped in this release should eventually run as an automated audio-loop evaluation.

v2 Diagnostic, Value Dashboard, Value Framework​

v2 Diagnostic — full redesign across four phases​

Phase A — Per-turn chunk evidence + v2 schema scaffold (46e95895)​

Phase B — v2 prompt + citation validator + runner (828c1341)​

Phase C — Adversarial counter-evidence pass (716bcd27)​

Phase D — Frontend rendering for v2 dashboard (e3bdcf87)​

Phase D follow-up — response_format=json_object (c1cfa026, May 9)​

Voice citation pipeline — three-helper cascade closed​

Graduated medical-advice refusal (013f7402)​

Value Dashboard v1.0 — customer-facing cost-reduction analytics​

What ships​

Migration 065 — schema additions​

Email digest — deferred​

What we learned​

Value Framework — closing the wheelchair-conflation class of bugs​

The framework​

1. Six canonical content categories​

2. Intent-to-category affinity rerank (Fix A — the architectural keystone)​

3. Primary-category prompt constraint (Fix B)​

4. Unit-mismatch admission (Fix C)​

5. category_mismatch_rate telemetry (Fix E)​

The wheelchair regression test​

Operations dashboard — retrieval health visualization​

Methodology shift — codified 2026-05-09​

Numbers​

What's queued for next release​