Skip to main content

Release Notes: May 4 – 9, 2026

v2 Diagnostic, Value Dashboard, Value Framework

~50 commits | 5 days | New customer-facing analytics surface | New hospital-agnostic retrieval-steering framework | Voice citation pipeline structurally repaired | Per-dimension diagnostic dashboard finally renders

This release is the first since the May 4 voice-pipeline cut. Five days of intensive work that closes three classes of bugs at the architectural level (no per-tenant patching, no prompt babysitting) and ships the customer-facing analytics surface the pilot has been waiting for.

The headline themes:

  1. Voice citation pipeline structurally repaired (May 7) — three downstream helpers all silently assumed inline [N] markers; voice answers strip them. Result: every voice turn was persisting citations=NULL and the v2 diagnostic had no chunk evidence to validate. Three commits, one cascade closed, generic fix.
  2. v2 Diagnostic redesigned end-to-end (May 7) — per-dimension scoring, citation validator, adversarial counter-evidence pass. The "feels brittle" report on May 9 was traced to schema-validation silent fallback; fixed in one line with response_format=json_object.
  3. Graduated medical-advice refusal (May 7) — the safety layer no longer says "call 112" for everything; it walks the helpdesk → GP → out-of-hours GP → 112 ladder.
  4. Value Dashboard v1.0 shipped (May 7-9) — 23-task plan, customer-facing /value page + /platform/value cross-tenant rollup. Belgian healthcare cost model with 4-tier fallback resolver. Migration 065. Email digest deferred to a future release.
  5. Value Framework — hospital-agnostic retrieval steering (May 9) — closes the wheelchair-conflation class of bug across every future tenant. Intent-to-category affinity rerank, primary-category prompt guard, unit-mismatch admission, per-turn category_mismatch_rate telemetry. New package app.services.value_framework + migration 066. 76 framework unit tests + an 8-test wheelchair regression that pins the morning-of-2026-05-09 pilot bug.
  6. Operations dashboard for retrieval health (May 9) — /api/v1/admin/ops/category-mismatch endpoint surfaces the new metric on the existing /analytics/system page.
  7. Methodology shift (May 9) — every fix lands with at least one integration test that pins the regression. We no longer manually re-test what's already coded and tested.

v2 Diagnostic — full redesign across four phases

The v1 diagnostic was a single LLM blob — one prompt, one verdict, one markdown bullet list. It was right most of the time, wrong on roughly one in five cases, and the failures were unaccountable: there was no way to see which dimension of the answer the LLM judged broken. By May 7 it was clear the pilot needed a richer, more accountable diagnostic.

Phase A — Per-turn chunk evidence + v2 schema scaffold (46e95895)

The v2 schema (app/schemas/diagnosis_v2.py) introduces:

  • A per-call dimension rollup: correctness / safety / memory / tool_use / latency, each with score: 0..1, verdict: pass/partial/fail, evidence: [...].
  • A per-turn breakdown: each turn carries the same five dimensions plus claims, citations_quoted, and counter_evidence slots.
  • A model_used field — load-bearing because OpenRouter routing makes self-reported model unreliable.

The Phase A loader path also added (document_id, chunk_index) pair lookup in _collect_chunk_evidence so the diagnostic can find the actual chunks the answer was grounded on.

Phase B — v2 prompt + citation validator + runner (828c1341)

run_v2_diagnosis orchestrates: build prompt → call LLM → tolerant JSON parse → validate against DiagnosisV2 Pydantic schema → run citation validator → return (DiagnosisV2, meta). The citation validator strips claims whose quotes don't actually appear in the cited chunks (case + whitespace tolerant matching) and downgrades dimension scores per invalid citation.

Phase C — Adversarial counter-evidence pass (716bcd27)

The first-pass v2 diagnosis is biased toward false negatives — the LLM tends to flag the answer as wrong when it had a slight cue of doubt. Phase C runs a second LLM call that tries to disprove each first-pass negative claim. If the disprove pass succeeds, the dimension score is boosted. This asymmetric bias correction closes the residual ~20% accuracy gap from first-pass alone. Per-call cost: ~$0.03.

Phase D — Frontend rendering for v2 dashboard (e3bdcf87)

The InvestigationViewV2.tsx component renders the per-dimension cards, per-turn drill-down, counter-evidence panel, and the citation-validated chunk evidence. Falls back to the v1 markdown if dimensional_scores is empty.

Phase D follow-up — response_format=json_object (c1cfa026, May 9)

A "feels brittle" report from the operator on the morning of May 9 traced to a silent failure: gpt-5.2 (via OpenRouter) was emitting v2-style PROSE, not v2 JSON. The Pydantic schema validation raised ValueError, the admin route caught it, and the fallback persisted dimensional_scores={} — invisible to the v2 UI in perpetuity.

Fix is one line: pin response_format={"type": "json_object"} on the ChatRequest. The model now reliably emits parseable JSON and the v2 cards finally render.

Lesson: silent try/except around schema validation for "best-effort fallback" hid a load-bearing regression for two days. The fix is the constraint, not the catch. Future v2 prompts should also use response_format explicitly.


Voice citation pipeline — three-helper cascade closed

A 2026-05-07 morning smoke test revealed every voice turn persisted with citations=NULL. The cause was a cascade across three helpers in _qs_finalize, each individually correct, all making the same assumption:

StepWhat it didVoice answer = no [N] markers →
_extract_all_citations_from_responseFinds inline [N] markers, builds Citation listReturns []
_deduplicate_and_renumber_citationsWalks [N] markers to remap themWipes whatever was there
_qs_cache_responseCaches the (empty) resultFuture hits return empty cache

Voice answers contain no inline markers because the voice system prompt strips them — they're un-speakable. Chat answers do have markers, so the chat path was fine; voice was structurally broken.

Fix (three commits):

  • d130df74 — voice fallback in _qs_finalize: when channel=voice + no markers + chunks retrieved, derive 5 citations from the top retrieved chunks directly.
  • 3cd5cc2f — skip dedup when re.search(r"\[\d+\]", response) is empty. The dedup helper was unconditionally wiping fallback citations.
  • 11a51ab2 — drop the temporary diagnostic log line after root-cause confirmed.

Operator runbook addendum: always flush app.semantic_query_cache after any change to the citation pipeline. Stale cache entries from before the fix poison new traffic for hours otherwise.

The contract test for this is in memory feedback-voice-citation-pipeline.md. Verified end-to-end on pilot image zol-rag-app:11a51ab2: two consecutive voice turns each persisted with 5 citations including real (document_id, chunk_index) pairs.


Graduated medical-advice refusal (013f7402)

Pre-fix, the safety layer's refusal copy was the same for every medical-advice query: "Bel uw huisarts of bel 112." That works for an emergency; it's needlessly alarming for "I have a small headache." The refusal was rebuilt as a graduated escalation:

  1. First touch: "Voor persoonlijk medisch advies kunt u terecht bij uw huisarts."
  2. Out-of-hours: "Buiten de kantooruren is er de huisartsenwachtdienst op 1733."
  3. Emergency: "Bij een dringend probleem belt u 112."

The escalation is selected by intent-classifier signal (general medical question → ladder 1; urgency cue in query → ladder 2 or 3). Cross-language: nl/en/fr/it all updated in parallel per the project rule.


Value Dashboard v1.0 — customer-facing cost-reduction analytics

The 23-task plan landed across May 7-9. Branch: feat/value-dashboard-v1, ~30 commits. Spec: docs/superpowers/specs/2026-05-06-value-dashboard-design.md. Plan: docs/superpowers/plans/2026-05-07-value-dashboard.md.

What ships

  • Per-tenant /value page — three headline tiles (calls handled / minutes deflected / after-hours coverage), period filter (cumulative / today / week / month / quarter / ytd), channel filter (all / voice / chat), daily-volume stacked area chart, cost-model attribution banner.
  • Platform-owner /platform/value page — cross-tenant rollup with churn warnings, per-tenant € saved, aggregate row.
  • Cost-model resolver — 4-tier fallback chain (customer override → country industry default → 30-day self-baseline → global fallback), 60s in-process cache. Belgian healthcare default: €0.62/min × 3.2 min from BFHW 2024.
  • Customer override APIPOST /api/v1/admin/cost-model for hospital admins to set their own cost numbers.
  • Nightly aggregator — APScheduler at 02:30 UTC, three phases: backfill conversation outcome/is_after_hours/duration → per-channel daily upsert → _all channel rollup. Idempotent. 12-month historical backfill script.

Migration 065 — schema additions

ObjectPurpose
app.tenants.country_code (new column)Drives the country industry default lookup
app.tenant_business_hoursPer-tenant operational hours; drives "after-hours" classification
app.tenant_cost_modelsAppend-only history, is_active partial unique index
app.daily_tenant_metricsPre-aggregated rollup; the dashboard reads ONLY from this table
app.conversations (4 new columns)duration_seconds, outcome, is_after_hours, dominant_intent

The dashboard's read path goes exclusively through daily_tenant_metrics. Any single-row aggregation cost is paid by the nightly cron, not by the user-facing page load. Sub-500ms p95 dashboard load.

Email digest — deferred

The original spec included a Resend-based monthly digest (v1.5). It was dropped from this release. Reasoning captured in the spec: the dashboard alone is enough proof-of-value for the pilot, and email brings deliverability + GDPR compliance work that isn't on the critical path. Re-evaluate post-pilot if engagement metrics indicate passive-dashboard usage is dropping.

What we learned

Three plan-level corrections were caught only at the integration boundary, not at the unit-test boundary:

  1. Tenant.country_code was missing from the ORM — only the migration had it. Tests using Base.metadata.create_all never noticed. Pilot deploy with INSERT INTO tenants(country_code,...) would have failed. Caught during the resolver task.
  2. include_citations: True was the default — but the affinity rerank was clobbering them. Caught during smoke testing with non-cached queries.
  3. token.sub doesn't exist — the project uses token.user_id. Caught during the cost-model POST integration test.

Each of these is now permanently pinned in an integration test.


Value Framework — closing the wheelchair-conflation class of bugs

A May 9 morning smoke test exposed a regression that no automated test had caught: a caller asked about wheelchair availability at the entrance, and the agent answered with orthopedic-device reimbursement information. The v2 diagnostic correctly flagged it as a generation failure with citation evidence. The user pushed back on a hospital-specific prompt patch — the project is multi-tenant SaaS, and patching one tenant's corpus through prompts would just defer the failure to the next tenant.

The bug is one instance of a recurring class: cross-category contamination at retrieval+generation time. A query of intent type X retrieves chunks of mixed content categories (some genuinely X-typed, some lexically related but category-mismatched). The LLM, given all chunks in context, fuses them.

The framework

Five mechanisms, all data-agnostic, all live as of 79f517bd:

1. Six canonical content categories

A keyword-based classifier in app/services/value_framework/category_classifier.py assigns one of six labels to each retrieved chunk:

practical · clinical_info · regulatory · appointments · legal_admin · general

Multi-language keyword sets (nl/en/fr/it). Hospital-agnostic — no specific filenames, no specific document paths, just linguistic signals (parkeren / behandeling / terugbetaling / afspraak / etc.).

2. Intent-to-category affinity rerank (Fix A — the architectural keystone)

For each query intent class, an affinity matrix multiplies chunk scores by category-specific multipliers. Defaults:

Intentpracticalclinical_inforegulatoryappointmentslegal_admingeneral
navigation_or_practical_info1.300.650.551.050.851.00
medical_information0.751.251.050.950.851.00
appointment_scheduling1.050.800.751.300.951.00
billing_or_insurance0.850.851.300.951.101.00
.....................

Pilot evidence: a chunk scoring 0.85 (orthopedic, regulatory) and another at 0.65 (parking, practical) flip after the navigation rerank to 0.47 vs 0.85 — exactly the inversion needed.

3. Primary-category prompt constraint (Fix B)

After rerank, the dominant category of the top-K is computed and injected into the system prompt as one line:

De gevonden bronnen behoren hoofdzakelijk tot de categorie 'X'. Beantwoord de vraag uitsluitend vanuit deze categorie. Vermeng GEEN informatie uit andere categorieën, zelfs niet als ze toevallig hetzelfde sleutelwoord bevatten.

The LLM enforces category coherence at generation time. Hospital-agnostic.

4. Unit-mismatch admission (Fix C)

A linguistic detector (detect_unit_mismatch) flags when the user query specifies a unit (per minuut, per kg, per dag, per kWh) and no retrieved chunk contains that exact unit. The system prompt is then augmented with:

De gebruiker vraagt naar een tarief 'per minute', maar de bronnen vermelden enkel per kwh. Zeg dit expliciet en geef het tarief in de eenheid die WEL bekend is — substitueer NIET stilzwijgend.

Caught the silent substitution bug from this morning's call.

5. category_mismatch_rate telemetry (Fix E)

Migration 066 added app.category_mismatch_telemetry. Every voice/chat turn now writes one row capturing intent class, primary category after rerank, mismatch rate (fraction of top-K outside the preferred set), chunk counts, query preview.

The wheelchair regression test

backend/tests/integration/services/test_value_framework_wheelchair_regression.py — 8 tests across 3 tiers:

  • Tier 1 (mechanism, deterministic): orthopedics → REGULATORY, parking → PRACTICAL, affinity flips ordering, primary_category = PRACTICAL after rerank, unit-mismatch detected for per minuut vs per kWh, low mismatch rate after correct rerank.
  • Tier 2 (telemetry, testcontainer DB): end-to-end framework call writes one row to category_mismatch_telemetry.
  • Tier 3 (prompt-string contract): the framework guidance block contains the right primary-category instruction + unit-mismatch admission text.

Verified end-to-end on pilot image zol-rag-app:79f517bd: query "Zijn er rolstoelen ter beschikking aan de ingang?" → answer "Ja, in de inkomhal en op de bezoekersparking ... waarborg van één euro die u na gebruik terugkrijgt" — the correct answer, with mismatch_rate=0.000 and primary_category=practical recorded in telemetry.


Operations dashboard — retrieval health visualization

GET /api/v1/admin/ops/category-mismatch?days=N returns a daily rollup with avg + p95 mismatch rate, top off-category, top intent. The chart component (CategoryMismatchTrend.tsx) renders on /analytics/system (Operations) under the Costs tab — operators viewing cost metrics see retrieval health alongside.

When the avg line spikes for a tenant or an intent, the affinity map needs retuning. Currently the map is a Python module constant; the next iteration will move per-tenant overrides into app.intent_category_affinity (table not yet created).


Methodology shift — codified 2026-05-09

Going forward, every fix or improvement must be covered by a meaningful integration test. We no longer manually re-test what we have already discovered and addressed in code. The wheelchair regression alone added 76 framework unit tests, 8 wheelchair-specific assertions, and 2 operations endpoint integration tests.

Two embarrassing-but-instructive misses in the same session were caught only by pilot deploy and not by pytest:

  • The migration file (066_category_mismatch_telemetry.py) was on disk but never git added.
  • The entire app/services/value_framework/ package was on disk but never git added.

Local pytest passed because pytest reads from disk regardless of git state, and autouse table-creation fixtures bypassed the migration. Pilot's clean git pull + docker build + alembic upgrade head is the first environment that catches this.

Pre-flight from now on includes git status confirming zero untracked production files before deploy.


Numbers

MetricBeforeAfter
v2 diagnostic dashboard cards renderednever (always v1 markdown fallback)rendered on every investigation
Voice turn citations columnalways NULLpopulated with top retrieved chunks
Wheelchair-question correctnesswrong (orthopedic reimbursement answer)correct (loan wheelchairs at entrance)
Per-turn telemetry rows01 per RAG turn
Customer-facing cost-reduction surfacedid not exist/value + /platform/value live
Backend tests~4,800~4,900 (+100 framework + dashboard)
Pilot DB migration head064066

What's queued for next release

  • Per-tenant affinity override table — currently the map is module-level; tenants in fields with non-default content distributions need DB-backed overrides.
  • Email digest pipeline — Resend integration, monthly cron, opt-out, S4U platform-owner digest. Was v1.5; deferred.
  • Trend endpoint for the dashboard chartDailyVolumeTrend currently receives an empty array (the data point exists in daily_tenant_metrics but no API surface yet).
  • Value Dashboard v2 polish — PDF export, language drill-down, custom date-range picker, business-hours admin UI.
  • Twilio Phase B — pilot DNS + Let's Encrypt SIPS + firewall rules. ADR-0050 has the runbook.
  • Voice smoke-test script automation — the 12-turn manual test shipped in this release should eventually run as an automated audio-loop evaluation.