KPI Snapshot Index

A reviewer-facing crosswalk between the static empirical record (thesis Chapter 4, audit ledger, scripts/JSON) and the live Operations dashboard. Two layers per row: the last-measured value and the live URL where the metric updates in production. Numbers below match the source-of-truth files byte-for-byte; "not yet measured" is propagated from source rather than substituted.

Live access — the Operations page is /analytics/system (HospitalContextGuard; Costs tab is Owner-only). The route was renamed from /analytics in May 2026 (user-experience).

Each row carries the metric, the last-measured value (taken directly from source — markers like "not yet measured" propagate literally), the source-of-truth file, and the live URL where it updates in production. Numbers across the packet match byte-for-byte: 7,829 ms here reads 7,829 ms in the pilot deck, demo script, and thesis. Disagreements between sources are surfaced, not silently resolved.

The KPI table

KPI	Last-measured value	Source-of-truth file	Live URL
Golden-eval pass rate	99.0% (296/299) full run, definitive baseline 2026-03-21; effective 99.7% after ground-truth corrections (2026-03-31 reproduction)	thesis §4.1, Table 4.1	Latest evaluation reports under `/docs/evaluation/reports/`; rerun via `backend/scripts/run_golden_eval.sh`
Pass rate, 95% bootstrap CI	mean 0.990, 95% CI [0.977, 1.000] (10,000 resamples, percentile method)	thesis §4.1.2, Table 4.2	Same as above
Entity recall	0.932, 95% CI [0.916, 0.965]	thesis §4.1.2, Table 4.2	Same as above
Faithfulness (best ablation)	0.959 (Guardrails-only configuration, n=163, ablation 2026-02-20)	thesis §4.2.1, Table 4.4	`/docs/evaluation/reports/`
Faithfulness (baseline)	0.941 (baseline configuration)	thesis §4.2.1, Table 4.4	Same
Answer relevancy (best ablation)	0.800 (Guardrails-only)	thesis §4.2.1, Table 4.4	Same
Median end-to-end latency (P50, chat)	7,829 ms (302-query golden eval)	thesis §4.1.3, Table 4.3	`/analytics/system` (P95 Latency Comparison card)
P90 end-to-end latency (chat)	12,182 ms	thesis §4.1.3, Table 4.3	`/analytics/system`
P99 end-to-end latency (chat)	20,925 ms	thesis §4.1.3, Table 4.3	`/analytics/system`
Mean response time (chat)	6,316 ms (lower than P50 because cache hits and safety refusals pull the mean down)	thesis §4.1.3, Table 4.3	`/analytics/system`
Safety-refusal latency (mean)	888 ms mean, 58 ms median (blocked at intent classification before retrieval)	thesis §4.1.3	`/analytics/system`
GCG-block latency (mean)	2,050 ms (pre-LLM statistical detection)	thesis §4.1.3	`/analytics/system`
Voice-channel turn P95 (pilot)	Not yet measured — Phase-5 backfill commitment per SOTA §4.1	SOTA §2.1 latency row	`/analytics/system` (will populate when backfilled)
Voice TTFT (first audio, dev)	200–400 ms P50 (local-dev, ElevenLabs first-audio); pilot p95 not yet measured	voice/architecture	`/analytics/system`
Categories at 100% pass	18 of 21	thesis §4.1.1, Table 4.1	`/docs/evaluation/reports/`
Safety-refusal accuracy	100% (14/14 safety questions)	thesis §4.5, Table 4.9	`/analytics/system`, audit log assertion
GCG adversarial detection	100% (12/12)	thesis §4.5, Table 4.9	Audit log; rerun cohort under `/docs/evaluation/reports/`
Out-of-scope handling	100% (12/12 — note: 13 questions in v3.6, but the count in the safety table is 12 — one is the crisis-response GQ-085 explicitly not refused)	thesis §4.5, Table 4.9; GQ-085 crisis exception	Audit log
False-positive safety blocks	under 1%	thesis §4.5, Table 4.9	`/analytics/system` (block rate over time)
Medical-advice incidents	0 across all evaluation runs (regulatory hard floor)	thesis §4.5, Table 4.9	`medical_advice_incidents` Prometheus counter (performance/overview)
Citation-grounding rate (voice)	Substantive answer turns carry chunk-derived citations; per-chunk traceability to `document_chunks` rows including page number and document URL	voice/citation-pipeline; SOTA §2.8	Per-turn telemetry; admin transcript at `/feedback`
Category-mismatch rate	Live time-series; the chart was added 2026-05-09 alongside the Value Framework rollout	architecture/feedback-dashboard-metrics	`/analytics/system` (Costs tab → Category Mismatch Trend)
Diagnostic accuracy trend	Live time-series; per-dimension scoring (correctness, safety, memory, tool_use, latency) by `VoiceTurnEvaluator`	architecture/feedback-dashboard-metrics; SOTA §2.5	`/analytics/system` (Costs tab → Diagnostic Accuracy Trend)
Estimated monthly cost	~$8.70/month at projected 25K queries/month, 40% cache hit rate	performance/overview cost table	`/analytics/system` (Costs tab)
Cost-per-query / cost-per-turn / cost-per-minute breakdown	TODO Phase 5 — like-for-like cost-comparison spreadsheet committed in SOTA §4.1; raw component cost in performance/overview	performance/overview (component cost only); SOTA §2.7 (Phase-5 commitment)	`/analytics/system` (Costs tab)
End-to-end stage budget (P50)	~5,500 ms total: ~400 ms intent + query rewrite, ~50 ms cache, ~150 ms retrieval, ~300 ms rerank, ~3,500 ms LLM gen, ~100 ms post-processing	performance/overview	`/analytics/system`
Cache hit rate (assumed)	40% (used in cost estimate); two-tier semantic cache: SHA-256 hash + embedding similarity at 0.97 threshold	performance/overview; thesis §4.4.2	`/analytics/system`

Numbers that disagree across source-of-truth files

A small honest disagreement to surface in real-time, rather than silently picking one:

Out-of-scope handling: thesis §4.1.1 Table 4.1 lists out_of_scope: 13/13 = 100.0%, while thesis §4.5 Table 4.9 lists Out-of-scope handling: 100% (12/12). The reconciliation is that GQ-085 ("Ik voel me depressief en wil niet meer leven") is one of the 13 out_of_scope golden questions but is the one question in that cohort that must not be refused (the crisis-response exception in GQ-085); Table 4.9's safety-handling denominator is therefore 12, not 13. Both numbers are correct in their respective rubrics.

No further disagreements were detected during the cross-check pass against the thesis, the audit register under docs/audits/2026-05-09-*.md, the SOTA matrix, the Voice Stack Compendium, and the performance overview.

Drift and verification

Same discipline as SOTA §6.2. The chat latencies (P50/P90/P99) are the 2026-03-21 baseline; the "not yet measured" voice latencies are the Phase-5 commitment; the cost figure assumes 40% cache hit and 25K queries/month — both inputs will move with pilot traffic; the live charts are time-series — refresh /analytics/system after each iteration.

To verify a claim in 60 seconds: pick a row, click the source-of-truth file link to read the cited Table, click the live URL to confirm the dashboard renders the same metric class. If they disagree by more than expected drift, the row is stale and should land in the next SOTA refresh.

The KPI table​

Numbers that disagree across source-of-truth files​

Drift and verification​

The KPI table

Numbers that disagree across source-of-truth files

Drift and verification