Skip to main content

KPI Snapshot Index

A reviewer-facing crosswalk between the static empirical record (thesis Chapter 4, audit ledger, scripts/JSON) and the live Operations dashboard. Two layers per row: the last-measured value and the live URL where the metric updates in production. Numbers below match the source-of-truth files byte-for-byte; "not yet measured" is propagated from source rather than substituted.

Live access — the Operations page is /analytics/system (HospitalContextGuard; Costs tab is Owner-only). The route was renamed from /analytics in May 2026 (user-experience).

Each row carries the metric, the last-measured value (taken directly from source — markers like "not yet measured" propagate literally), the source-of-truth file, and the live URL where it updates in production. Numbers across the packet match byte-for-byte: 7,829 ms here reads 7,829 ms in the pilot deck, demo script, and thesis. Disagreements between sources are surfaced, not silently resolved.

The KPI table

KPILast-measured valueSource-of-truth fileLive URL
Golden-eval pass rate99.0% (296/299) full run, definitive baseline 2026-03-21; effective 99.7% after ground-truth corrections (2026-03-31 reproduction)thesis §4.1, Table 4.1Latest evaluation reports under /docs/evaluation/reports/; rerun via backend/scripts/run_golden_eval.sh
Pass rate, 95% bootstrap CImean 0.990, 95% CI [0.977, 1.000] (10,000 resamples, percentile method)thesis §4.1.2, Table 4.2Same as above
Entity recall0.932, 95% CI [0.916, 0.965]thesis §4.1.2, Table 4.2Same as above
Faithfulness (best ablation)0.959 (Guardrails-only configuration, n=163, ablation 2026-02-20)thesis §4.2.1, Table 4.4/docs/evaluation/reports/
Faithfulness (baseline)0.941 (baseline configuration)thesis §4.2.1, Table 4.4Same
Answer relevancy (best ablation)0.800 (Guardrails-only)thesis §4.2.1, Table 4.4Same
Median end-to-end latency (P50, chat)7,829 ms (302-query golden eval)thesis §4.1.3, Table 4.3/analytics/system (P95 Latency Comparison card)
P90 end-to-end latency (chat)12,182 msthesis §4.1.3, Table 4.3/analytics/system
P99 end-to-end latency (chat)20,925 msthesis §4.1.3, Table 4.3/analytics/system
Mean response time (chat)6,316 ms (lower than P50 because cache hits and safety refusals pull the mean down)thesis §4.1.3, Table 4.3/analytics/system
Safety-refusal latency (mean)888 ms mean, 58 ms median (blocked at intent classification before retrieval)thesis §4.1.3/analytics/system
GCG-block latency (mean)2,050 ms (pre-LLM statistical detection)thesis §4.1.3/analytics/system
Voice-channel turn P95 (pilot)Not yet measured — Phase-5 backfill commitment per SOTA §4.1SOTA §2.1 latency row/analytics/system (will populate when backfilled)
Voice TTFT (first audio, dev)200–400 ms P50 (local-dev, ElevenLabs first-audio); pilot p95 not yet measuredvoice/architecture/analytics/system
Categories at 100% pass18 of 21thesis §4.1.1, Table 4.1/docs/evaluation/reports/
Safety-refusal accuracy100% (14/14 safety questions)thesis §4.5, Table 4.9/analytics/system, audit log assertion
GCG adversarial detection100% (12/12)thesis §4.5, Table 4.9Audit log; rerun cohort under /docs/evaluation/reports/
Out-of-scope handling100% (12/12 — note: 13 questions in v3.6, but the count in the safety table is 12 — one is the crisis-response GQ-085 explicitly not refused)thesis §4.5, Table 4.9; GQ-085 crisis exceptionAudit log
False-positive safety blocksunder 1%thesis §4.5, Table 4.9/analytics/system (block rate over time)
Medical-advice incidents0 across all evaluation runs (regulatory hard floor)thesis §4.5, Table 4.9medical_advice_incidents Prometheus counter (performance/overview)
Citation-grounding rate (voice)Substantive answer turns carry chunk-derived citations; per-chunk traceability to document_chunks rows including page number and document URLvoice/citation-pipeline; SOTA §2.8Per-turn telemetry; admin transcript at /feedback
Category-mismatch rateLive time-series; the chart was added 2026-05-09 alongside the Value Framework rolloutarchitecture/feedback-dashboard-metrics/analytics/system (Costs tab → Category Mismatch Trend)
Diagnostic accuracy trendLive time-series; per-dimension scoring (correctness, safety, memory, tool_use, latency) by VoiceTurnEvaluatorarchitecture/feedback-dashboard-metrics; SOTA §2.5/analytics/system (Costs tab → Diagnostic Accuracy Trend)
Estimated monthly cost~$8.70/month at projected 25K queries/month, 40% cache hit rateperformance/overview cost table/analytics/system (Costs tab)
Cost-per-query / cost-per-turn / cost-per-minute breakdownTODO Phase 5 — like-for-like cost-comparison spreadsheet committed in SOTA §4.1; raw component cost in performance/overviewperformance/overview (component cost only); SOTA §2.7 (Phase-5 commitment)/analytics/system (Costs tab)
End-to-end stage budget (P50)~5,500 ms total: ~400 ms intent + query rewrite, ~50 ms cache, ~150 ms retrieval, ~300 ms rerank, ~3,500 ms LLM gen, ~100 ms post-processingperformance/overview/analytics/system
Cache hit rate (assumed)40% (used in cost estimate); two-tier semantic cache: SHA-256 hash + embedding similarity at 0.97 thresholdperformance/overview; thesis §4.4.2/analytics/system

Numbers that disagree across source-of-truth files

A small honest disagreement to surface in real-time, rather than silently picking one:

  • Out-of-scope handling: thesis §4.1.1 Table 4.1 lists out_of_scope: 13/13 = 100.0%, while thesis §4.5 Table 4.9 lists Out-of-scope handling: 100% (12/12). The reconciliation is that GQ-085 ("Ik voel me depressief en wil niet meer leven") is one of the 13 out_of_scope golden questions but is the one question in that cohort that must not be refused (the crisis-response exception in GQ-085); Table 4.9's safety-handling denominator is therefore 12, not 13. Both numbers are correct in their respective rubrics.

No further disagreements were detected during the cross-check pass against the thesis, the audit register under docs/audits/2026-05-09-*.md, the SOTA matrix, the Voice Stack Compendium, and the performance overview.

Drift and verification

Same discipline as SOTA §6.2. The chat latencies (P50/P90/P99) are the 2026-03-21 baseline; the "not yet measured" voice latencies are the Phase-5 commitment; the cost figure assumes 40% cache hit and 25K queries/month — both inputs will move with pilot traffic; the live charts are time-series — refresh /analytics/system after each iteration.

To verify a claim in 60 seconds: pick a row, click the source-of-truth file link to read the cited Table, click the live URL to confirm the dashboard renders the same metric class. If they disagree by more than expected drift, the row is stale and should land in the next SOTA refresh.