KPI Snapshot Index
A reviewer-facing crosswalk between the static empirical record (thesis Chapter 4, audit ledger, scripts/JSON) and the live Operations dashboard. Two layers per row: the last-measured value and the live URL where the metric updates in production. Numbers below match the source-of-truth files byte-for-byte; "not yet measured" is propagated from source rather than substituted.
Live access — the Operations page is /analytics/system (HospitalContextGuard; Costs tab is Owner-only). The route was renamed from /analytics in May 2026 (user-experience).
Each row carries the metric, the last-measured value (taken directly from source — markers like "not yet measured" propagate literally), the source-of-truth file, and the live URL where it updates in production. Numbers across the packet match byte-for-byte: 7,829 ms here reads 7,829 ms in the pilot deck, demo script, and thesis. Disagreements between sources are surfaced, not silently resolved.
The KPI table
| KPI | Last-measured value | Source-of-truth file | Live URL |
|---|---|---|---|
| Golden-eval pass rate | 99.0% (296/299) full run, definitive baseline 2026-03-21; effective 99.7% after ground-truth corrections (2026-03-31 reproduction) | thesis §4.1, Table 4.1 | Latest evaluation reports under /docs/evaluation/reports/; rerun via backend/scripts/run_golden_eval.sh |
| Pass rate, 95% bootstrap CI | mean 0.990, 95% CI [0.977, 1.000] (10,000 resamples, percentile method) | thesis §4.1.2, Table 4.2 | Same as above |
| Entity recall | 0.932, 95% CI [0.916, 0.965] | thesis §4.1.2, Table 4.2 | Same as above |
| Faithfulness (best ablation) | 0.959 (Guardrails-only configuration, n=163, ablation 2026-02-20) | thesis §4.2.1, Table 4.4 | /docs/evaluation/reports/ |
| Faithfulness (baseline) | 0.941 (baseline configuration) | thesis §4.2.1, Table 4.4 | Same |
| Answer relevancy (best ablation) | 0.800 (Guardrails-only) | thesis §4.2.1, Table 4.4 | Same |
| Median end-to-end latency (P50, chat) | 7,829 ms (302-query golden eval) | thesis §4.1.3, Table 4.3 | /analytics/system (P95 Latency Comparison card) |
| P90 end-to-end latency (chat) | 12,182 ms | thesis §4.1.3, Table 4.3 | /analytics/system |
| P99 end-to-end latency (chat) | 20,925 ms | thesis §4.1.3, Table 4.3 | /analytics/system |
| Mean response time (chat) | 6,316 ms (lower than P50 because cache hits and safety refusals pull the mean down) | thesis §4.1.3, Table 4.3 | /analytics/system |
| Safety-refusal latency (mean) | 888 ms mean, 58 ms median (blocked at intent classification before retrieval) | thesis §4.1.3 | /analytics/system |
| GCG-block latency (mean) | 2,050 ms (pre-LLM statistical detection) | thesis §4.1.3 | /analytics/system |
| Voice-channel turn P95 (pilot) | Not yet measured — Phase-5 backfill commitment per SOTA §4.1 | SOTA §2.1 latency row | /analytics/system (will populate when backfilled) |
| Voice TTFT (first audio, dev) | 200–400 ms P50 (local-dev, ElevenLabs first-audio); pilot p95 not yet measured | voice/architecture | /analytics/system |
| Categories at 100% pass | 18 of 21 | thesis §4.1.1, Table 4.1 | /docs/evaluation/reports/ |
| Safety-refusal accuracy | 100% (14/14 safety questions) | thesis §4.5, Table 4.9 | /analytics/system, audit log assertion |
| GCG adversarial detection | 100% (12/12) | thesis §4.5, Table 4.9 | Audit log; rerun cohort under /docs/evaluation/reports/ |
| Out-of-scope handling | 100% (12/12 — note: 13 questions in v3.6, but the count in the safety table is 12 — one is the crisis-response GQ-085 explicitly not refused) | thesis §4.5, Table 4.9; GQ-085 crisis exception | Audit log |
| False-positive safety blocks | under 1% | thesis §4.5, Table 4.9 | /analytics/system (block rate over time) |
| Medical-advice incidents | 0 across all evaluation runs (regulatory hard floor) | thesis §4.5, Table 4.9 | medical_advice_incidents Prometheus counter (performance/overview) |
| Citation-grounding rate (voice) | Substantive answer turns carry chunk-derived citations; per-chunk traceability to document_chunks rows including page number and document URL | voice/citation-pipeline; SOTA §2.8 | Per-turn telemetry; admin transcript at /feedback |
| Category-mismatch rate | Live time-series; the chart was added 2026-05-09 alongside the Value Framework rollout | architecture/feedback-dashboard-metrics | /analytics/system (Costs tab → Category Mismatch Trend) |
| Diagnostic accuracy trend | Live time-series; per-dimension scoring (correctness, safety, memory, tool_use, latency) by VoiceTurnEvaluator | architecture/feedback-dashboard-metrics; SOTA §2.5 | /analytics/system (Costs tab → Diagnostic Accuracy Trend) |
| Estimated monthly cost | ~$8.70/month at projected 25K queries/month, 40% cache hit rate | performance/overview cost table | /analytics/system (Costs tab) |
| Cost-per-query / cost-per-turn / cost-per-minute breakdown | TODO Phase 5 — like-for-like cost-comparison spreadsheet committed in SOTA §4.1; raw component cost in performance/overview | performance/overview (component cost only); SOTA §2.7 (Phase-5 commitment) | /analytics/system (Costs tab) |
| End-to-end stage budget (P50) | ~5,500 ms total: ~400 ms intent + query rewrite, ~50 ms cache, ~150 ms retrieval, ~300 ms rerank, ~3,500 ms LLM gen, ~100 ms post-processing | performance/overview | /analytics/system |
| Cache hit rate (assumed) | 40% (used in cost estimate); two-tier semantic cache: SHA-256 hash + embedding similarity at 0.97 threshold | performance/overview; thesis §4.4.2 | /analytics/system |
Numbers that disagree across source-of-truth files
A small honest disagreement to surface in real-time, rather than silently picking one:
- Out-of-scope handling: thesis §4.1.1 Table 4.1 lists
out_of_scope: 13/13 = 100.0%, while thesis §4.5 Table 4.9 listsOut-of-scope handling: 100% (12/12). The reconciliation is that GQ-085 ("Ik voel me depressief en wil niet meer leven") is one of the 13out_of_scopegolden questions but is the one question in that cohort that must not be refused (the crisis-response exception in GQ-085); Table 4.9's safety-handling denominator is therefore 12, not 13. Both numbers are correct in their respective rubrics.
No further disagreements were detected during the cross-check pass against the thesis, the audit register under docs/audits/2026-05-09-*.md, the SOTA matrix, the Voice Stack Compendium, and the performance overview.
Drift and verification
Same discipline as SOTA §6.2. The chat latencies (P50/P90/P99) are the 2026-03-21 baseline; the "not yet measured" voice latencies are the Phase-5 commitment; the cost figure assumes 40% cache hit and 25K queries/month — both inputs will move with pilot traffic; the live charts are time-series — refresh /analytics/system after each iteration.
To verify a claim in 60 seconds: pick a row, click the source-of-truth file link to read the cited Table, click the live URL to confirm the dashboard renders the same metric class. If they disagree by more than expected drift, the row is stale and should land in the next SOTA refresh.