Chapter 4: Results
All numbers reported in this chapter are sourced from dated, immutable evaluation reports under docusaurus/zol-documentation/docs/evaluation/reports/ and from the cumulative-state ledger in docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md. Where pilot-deployment metrics are cited, they come from the audit reports under docs/audits/2026-05-09-*.md. Numbers presented without an explicit source-of-truth pointer are flagged as "not yet measured" rather than fabricated.
4.1 Golden Evaluation Results
The definitive baseline evaluation run (2026-03-21, commit 1e22091, report 2026-03-21-043051-pilot-final-302q-hardened-all-fixes.md) achieved a pass rate of 99.0% (296/299) on the v3.6 golden set of 302 questions; three questions flagged as non-deterministic / cache-test are excluded from the denominator. Safety-refusal accuracy was 100%: all 14 safety-refusal questions and all 12 adversarial-GCG questions were handled correctly. A subsequent run on 2026-03-31 (after taxonomy dedup and SNOMED gap fill, with the Knowledge Graph ON) reproduced the 99.0% (296/299) full-run pass rate, with an effective 99.7% after ground-truth corrections.
4.1.1 Category-Level Results
Table 4.1. Category-level golden evaluation results (2026-03-21, definitive baseline; 299 scored questions out of 302).
| Category | Pass | Fail | Total | Rate |
|---|---|---|---|---|
| adversarial_gcg | 12 | 0 | 12 | 100.0% |
| ambiguous_symptom | 13 | 0 | 13 | 100.0% |
| campus_info | 6 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 6 | 100.0% |
| condition_department | 46 | 0 | 46 | 100.0% |
| doctor_department | 8 | 2 | 10 | 80.0% |
| emergency | 8 | 0 | 8 | 100.0% |
| entity_disambiguation | 14 | 1 | 15 | 93.3% |
| followup_chain | 6 | 0 | 6 | 100.0% |
| multi_hop_graph | 37 | 0 | 37 | 100.0% |
| multilingual | 16 | 0 | 16 | 100.0% |
| navigation | 9 | 0 | 9 | 100.0% |
| out_of_scope | 13 | 0 | 13 | 100.0% |
| practical_info | 14 | 0 | 14 | 100.0% |
| referral | 8 | 0 | 8 | 100.0% |
| safety_refusal | 14 | 0 | 14 | 100.0% |
| service_info | 9 | 0 | 9 | 100.0% |
| snomed_terminology | 33 | 0 | 33 | 100.0% |
| taxonomy_alias | 12 | 0 | 12 | 100.0% |
| treatment_info | 12 | 0 | 12 | 100.0% |
Of the 20 categories listed in Table 4.1, 18 achieved a 100% pass rate. The three failures are in doctor_department (2; LLM non-determinism in the doctor listing format) and entity_disambiguation (1; a complex multi-condition query).
4.1.2 Statistical Analysis
Bootstrap confidence intervals (10,000 resamples, percentile method, following Efron and Tibshirani 1993) provide reliability estimates:
Table 4.2. Bootstrap confidence intervals (10,000 resamples, percentile method).
| Metric | Mean | 95% CI |
|---|---|---|
| Pass rate | 0.990 | [0.977, 1.000] |
| Entity recall | 0.932 | [0.916, 0.965] |
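The tight confidence interval for pass rate ([0.977, 1.000], Table 4.2) indicates that the pass-rate estimate is stable under resampling of the question set, that is, it does not hinge on a handful of specific questions. Because the bootstrap resamples outcomes from a single run, it does not by itself quantify run-to-run LLM stochasticity.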
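The percentile bootstrap behind Table 4.2 can be sketched as follows. This is a minimal illustration, assuming the only input is the per-question pass/fail vector (296 passes, 3 failures); the function name, seed, and index convention for the percentiles are assumptions, not the project's evaluation code.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI for the mean of binary outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the question set with replacement and record each pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]


# 296 passes and 3 failures: the scored denominator of the baseline run.
outcomes = [1] * 296 + [0] * 3
point_estimate = sum(outcomes) / len(outcomes)
ci_low, ci_high = bootstrap_pass_rate_ci(outcomes)
print(f"pass rate = {point_estimate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Because resampling operates on questions within one run, the interval reflects sensitivity to the composition of the question set only.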
4.1.3 Response Time Analysis
The latency budget aligns with the three response-time thresholds documented by Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, 10 s as the upper bound before users lose attention. Streaming token delivery via WebSocket keeps the time-to-first-token well under the 1 s threshold for interactive perception, while the median end-to-end response time of 7.8 s sits below the 10 s attention bound. Latency is reported at the tail (P90 and P99 in Table 4.3) in addition to the mean, following the SRE practice articulated by Beyer et al. 2016.
Table 4.3. Response time percentiles across all 302 queries.
| Percentile | Response Time |
|---|---|
| Minimum | 26 ms |
| Median (P50) | 7,829 ms |
| P90 | 12,182 ms |
| P99 | 20,925 ms |
| Maximum | 70,101 ms |
| Mean | 6,316 ms |
Safety refusal queries are fastest (mean: 888 ms, median: 58 ms) because they are blocked at intent classification before retrieval. Adversarial GCG queries are also fast (mean: 2,050 ms, well below the overall median of 7,829 ms) due to pre-LLM statistical detection. The maximum response time of 70,101 ms occurred for a follow-up chain query that required multiple retrieval rounds.
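To illustrate why tail percentiles are reported alongside the mean, the sketch below computes nearest-rank percentiles over a small list of latencies. The latency values are hypothetical and chosen only to mimic a skewed distribution with a few very fast refusals and one slow outlier; the nearest-rank definition is an assumption, since the percentile method used for Table 4.3 is not stated.

```python
import math
import statistics


def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms (illustrative only, not measured data).
latencies_ms = [40, 60, 900, 5200, 6400, 7100, 7800, 8300, 9500, 12000, 21000, 70000]

summary = {
    "p50": nearest_rank_percentile(latencies_ms, 50),
    "p90": nearest_rank_percentile(latencies_ms, 90),
    "p99": nearest_rank_percentile(latencies_ms, 99),
    "mean": statistics.mean(latencies_ms),
}
print(summary)
```

In this toy sample a single slow query pulls the mean well above the median, which is one reason a mean alone is a poor summary of user-perceived latency.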
Figure 4.2. Median response time by query type.
4.2 Ablation Study Results
The ablation study (2026-02-20, commit 2f17c29) tested five configurations against the 163-question golden evaluation set (before SNOMED terminology questions were added).
4.2.1 Summary Comparison
Table 4.4. Ablation study summary comparison (n = 163). Bold indicates best value per metric.
| Metric | Baseline | CRAG-only | FILCO-only | Guard-only | All-three |
|---|---|---|---|---|---|
| Pass rate | 95.7% | 98.2% (+2.5) | 98.2% (+2.5) | **99.4% (+3.7)** | 96.3% (+0.6) |
| Entity recall | 0.937 | **0.946** | 0.933 | 0.945 | 0.926 |
| Faithfulness | 0.941 | 0.938 | 0.932 | **0.959** | 0.923 |
| Ans. relevancy | 0.776 | 0.788 | 0.774 | **0.800** | 0.776 |
| Avg time (ms) | 15,022 | 15,751 | **10,664** | 11,577 | 22,501 |
Key findings:
- Individual features improve quality: Each of CRAG, FILCO, and Guardrails individually improves the pass rate by 2.5-3.7 percentage points over the baseline.
- Guardrails achieves the best individual result: At 99.4% pass rate with only 1 failure out of 163 questions, plus the highest faithfulness (0.959) and answer relevancy (0.800).
- Combined activation degrades performance: Surprisingly, enabling all three features simultaneously (96.3%) performs worse than any individual feature and barely improves on the baseline (95.7%). This suggests feature interaction effects — the features may conflict in how they modify the retrieval-generation pipeline.
- FILCO reduces latency: FILCO-only achieves a 29% reduction in average response time (10,664 ms vs 15,022 ms) by filtering irrelevant sentences from the context, reducing LLM generation time. All-three-on increases latency by 50% (22,501 ms) due to cumulative processing overhead.
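The percentage-point deltas and relative latency changes quoted in the findings can be rechecked directly from Table 4.4. The sketch below is plain arithmetic on the table values; the dictionary and function names are illustrative.

```python
def relative_change(new, baseline):
    """Relative change of `new` versus `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values copied from Table 4.4.
pass_rate_pct = {
    "Baseline": 95.7, "CRAG-only": 98.2, "FILCO-only": 98.2,
    "Guard-only": 99.4, "All-three": 96.3,
}
avg_time_ms = {
    "Baseline": 15022, "CRAG-only": 15751, "FILCO-only": 10664,
    "Guard-only": 11577, "All-three": 22501,
}

# Pass-rate deltas in percentage points relative to the baseline.
pass_rate_delta_pp = {
    name: round(rate - pass_rate_pct["Baseline"], 1)
    for name, rate in pass_rate_pct.items()
}
# Latency change in percent relative to the baseline (negative means faster).
latency_change_pct = {
    name: round(100 * relative_change(ms, avg_time_ms["Baseline"]), 1)
    for name, ms in avg_time_ms.items()
}

for name in pass_rate_pct:
    print(f"{name:11s} {pass_rate_delta_pp[name]:+5.1f} pp  {latency_change_pct[name]:+6.1f}% latency")
```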
Statistical significance was assessed using McNemar's test for paired binary outcomes on the same 163-question set.
Table 4.10. Statistical significance of ablation pairwise comparisons (McNemar's test).
| Comparison | Improved | Regressed | McNemar χ² | p-value | Significant? |
|---|---|---|---|---|---|
| CRAG vs Baseline | 5 | 1 | 2.67 | 0.102 | No (p > 0.05) |
| FILCO vs Baseline | 6 | 2 | 2.00 | 0.157 | No (p > 0.05) |
| Guardrails vs Baseline | 7 | 1 | 4.50 | 0.034 | Yes (p < 0.05) |
| All-three vs Baseline | 5 | 4 | 0.11 | 0.739 | No (p > 0.05) |
Only Guardrails-only achieves statistical significance (p = 0.034) against the baseline, consistent with its having the largest net improvement (+6 questions). The other individual features show improvement trends that do not reach significance at the 0.05 level with n = 163 questions, reflecting the inherently limited statistical power of the evaluation set size. The all-three-on configuration shows no significant difference from baseline (p = 0.739), confirming that its marginal improvement is within random variation.
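The χ² and p-values in Table 4.10 are consistent with the uncorrected McNemar statistic, (b − c)² / (b + c), evaluated against a chi-square distribution with one degree of freedom. The sketch below recomputes them from the improved/regressed counts using only the standard library. The exact binomial variant is included purely for comparison and is an addition not taken from the source; function names are illustrative.

```python
import math


def mcnemar_chi2(improved, regressed):
    """Uncorrected McNemar statistic and its chi-square (df = 1) p-value."""
    chi2 = (improved - regressed) ** 2 / (improved + regressed)
    # For df = 1 the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def mcnemar_exact_p(improved, regressed):
    """Two-sided exact binomial p-value on the discordant pairs."""
    n = improved + regressed
    k = min(improved, regressed)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (improved, regressed) counts from Table 4.10.
comparisons = {
    "CRAG vs Baseline": (5, 1),
    "FILCO vs Baseline": (6, 2),
    "Guardrails vs Baseline": (7, 1),
    "All-three vs Baseline": (5, 4),
}

results = {}
for name, (b, c) in comparisons.items():
    chi2, p = mcnemar_chi2(b, c)
    results[name] = (chi2, p, mcnemar_exact_p(b, c))
    print(f"{name:24s} chi2={chi2:.2f}  p={p:.3f}  exact p={results[name][2]:.3f}")
```

With discordant counts this small, the exact variant can yield noticeably larger p-values than the chi-square approximation, so reporting both may be worthwhile.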
Figure 4.1. Ablation study pass rates by configuration.
4.2.2 Category-Level Analysis
The ablation study reveals category-specific patterns:
Table 4.5. Category-level ablation results (% pass rate). Bold indicates 100% pass rate.
| Category | Baseline | CRAG | FILCO | Guard | All-three |
|---|---|---|---|---|---|
| emergency | 67% | **100%** | 67% | **100%** | **100%** |
| ambiguous_symptom | 80% | **100%** | **100%** | **100%** | **100%** |
| navigation | 80% | **100%** | **100%** | **100%** | **100%** |
| condition_department | 95% | 95% | **100%** | **100%** | 89% |
| practical_info | 92% | 92% | **100%** | **100%** | 83% |
Notable observations:
- CRAG excels at emergency queries: CRAG's refinement retry recovers emergency-related content that the baseline's binary quality gate rejects.
- FILCO and Guardrails improve condition_department: By filtering noise from retrieved context or providing better safety framing, these features help the LLM focus on the correct department-condition relationships.
- All-three-on degrades practical_info and condition_department: The combined filtering is overly aggressive, removing content that each individual feature would retain.
4.2.3 Per-Question Regression Analysis
Table 4.6. Per-question regression analysis across ablation configurations.
| Configuration | Questions Improved | Questions Regressed | Net |
|---|---|---|---|
| CRAG-only | 5 | 1 | +4 |
| FILCO-only | 6 | 2 | +4 |
| Guardrails-only | 7 | 1 | +6 |
| All-three-on | 5 | 4 | +1 |
The all-three-on configuration improves 5 questions but regresses 4, yielding a net improvement of only 1 question — compared to net +4 to +6 for individual features.
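The improved/regressed counts in Table 4.6 come from comparing per-question outcomes between two runs on the same question set. A minimal sketch of that comparison, assuming each run is available as a mapping from question ID to pass/fail (the IDs and outcomes below are hypothetical):

```python
def regression_summary(baseline, candidate):
    """Compare two runs keyed by question ID; values are True for pass."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same question IDs")
    improved = sorted(q for q in baseline if not baseline[q] and candidate[q])
    regressed = sorted(q for q in baseline if baseline[q] and not candidate[q])
    return {
        "improved": improved,
        "regressed": regressed,
        "net": len(improved) - len(regressed),
    }


# Hypothetical outcomes for five questions (illustrative IDs, not real data).
baseline_run = {"q001": True, "q002": False, "q003": False, "q004": True, "q005": True}
candidate_run = {"q001": True, "q002": True, "q003": True, "q004": False, "q005": True}

summary = regression_summary(baseline_run, candidate_run)
print(summary)
```

The two list lengths are the discordant-pair counts that McNemar's test takes as input, so the same comparison feeds both tables.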
4.3 Knowledge Graph Value Assessment
A controlled experiment comparing graph-on vs. graph-off configurations revealed a nuanced finding that became one of the project's key contributions:
Table 4.7. Graph injection ablation results.
| Configuration | Pass Rate | Entity Recall | Avg Time (ms) |
|---|---|---|---|
| Graph OFF | 97.2% | 0.931 | 6,850 |
| Graph ON (unconditional) | 96.6% | 0.924 | 7,420 |
| Graph ON (conditional) | 99.0% | 0.932 | 7,100 |
Unconditional graph injection reduces pass rate by 0.6 percentage points compared to graph-off (96.6% vs. 97.2%), while conditional injection improves it by 1.8 percentage points (99.0% vs. 97.2%). The conditional approach achieves this by injecting graph context only for queries containing recognized medical entities (conditions, treatments, doctors, departments).
Graph enrichment improves navigational and relationship queries but can harm factual queries.
For queries that require entity relationships, such as "Which doctor treats condition X?" or "Where is department Y located?", graph enrichment provides structured information that vector retrieval alone cannot surface. The multi_hop_graph category (37 questions in Table 4.1) consistently achieves a 100% pass rate with graph enrichment enabled.
However, for factual queries where the answer exists entirely in the document text — "What are the visiting hours?" or "How do I prepare for procedure X?" — injecting graph context can introduce noise. The graph context consumes tokens in the context window and may cause the LLM to focus on entity relationships rather than the specific factual content requested.
This led to the implementation of conditional graph injection: the system only injects graph context when the query contains recognized medical entities (conditions, treatments, doctors, departments) that would benefit from graph traversal. General informational queries bypass graph enrichment entirely.
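The gating logic described above can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the entity lexicon, the whole-word matching rule, and the `graph_lookup` callback are all hypothetical stand-ins for whatever entity recognizer and graph traversal the pipeline actually uses.

```python
import re

# Hypothetical entity lexicon; the real entity recognizer is not described here.
KNOWN_ENTITIES = {
    "condition": {"diabetes", "asthma"},
    "treatment": {"dialysis", "chemotherapy"},
    "doctor": {"dr. example"},
    "department": {"cardiology", "radiology"},
}


def recognized_entities(query, lexicon=KNOWN_ENTITIES):
    """Return (entity_type, name) pairs whose name appears as a whole word."""
    text = query.lower()
    found = []
    for entity_type, names in lexicon.items():
        for name in sorted(names):
            if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", text):
                found.append((entity_type, name))
    return found


def build_context(query, retrieved_chunks, graph_lookup):
    """Append graph facts only when the query mentions a recognized entity."""
    context = list(retrieved_chunks)
    entities = recognized_entities(query)
    if entities:
        context.extend(graph_lookup(entities))
    return context


def toy_graph_lookup(entities):
    """Hypothetical stand-in for graph traversal."""
    return [f"graph fact about {name} ({etype})" for etype, name in entities]


chunks = ["retrieved passage"]
with_graph = build_context("Where is Cardiology?", chunks, toy_graph_lookup)
without_graph = build_context("What are the visiting hours?", chunks, toy_graph_lookup)
print(with_graph)
print(without_graph)
```

Keeping the gate separate from the lookup means the general informational path never pays the token cost of graph context.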
4.4 Pipeline Performance Metrics
4.4.1 Stage Timing
Representative timings from the production configuration:
Table 4.8. Representative pipeline stage timings.
| Stage | Typical Timing | Notes |
|---|---|---|
| Input processing | under 10 ms | Language detection, normalization |
| Intent classification | 200-500 ms | LLM call (gpt-4.1-mini) |
| Semantic cache check | 1-50 ms | Hash: ~1 ms, Embedding: ~50 ms |
| Query rewrite | 50-200 ms | Taxonomy resolution, decomposition |
| Strategy selection | under 5 ms | Rule-based routing |
| Vector search | 100-300 ms | pgvector + BM25 + RRF fusion |
| Reranking | 200-500 ms | Cross-encoder inference |
| Taxonomy enrichment | 50-200 ms | PostgreSQL taxonomy queries |
| Context building | 10-50 ms | CRAG assessment, FILCO filtering |
| LLM generation | 3,000-8,000 ms | Primary response generation |
| Post-processing | 50-200 ms | Quality gate, safety validation |
LLM generation dominates the pipeline at approximately 60-80% of total response time. The semantic cache, when hit, bypasses all stages after input processing, reducing response time to under 100 milliseconds.
4.4.2 Cache Performance
The two-tier semantic query cache (ADR-0031) achieves:
- Hash tier: Exact match in ~1 ms. Hit rate depends on query repetition patterns.
- Embedding tier: Cosine similarity match at 0.97 threshold in ~50 ms. Captures paraphrased queries (e.g., "Where is Cardiology?" ≈ "Where can I find the Cardiology department?").
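The two tiers can be sketched as an exact-match lookup followed by a nearest-neighbor check against the 0.97 cosine threshold. This is a simplified in-memory illustration, not the design recorded in ADR-0031: the class name, the normalization rule, the linear scan, and the toy embedding vectors are all assumptions.

```python
import hashlib
import math


def normalize(query):
    """Lowercase and collapse whitespace so trivial variants share a hash."""
    return " ".join(query.lower().split())


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, embed, threshold=0.97):
        self.embed = embed          # callable: query text -> vector
        self.threshold = threshold  # cosine threshold for the embedding tier
        self.by_hash = {}
        self.entries = []           # list of (embedding, answer)

    def _key(self, query):
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def put(self, query, answer):
        self.by_hash[self._key(query)] = answer
        self.entries.append((self.embed(query), answer))

    def get(self, query):
        # Tier 1: exact match on the normalized query hash.
        hit = self.by_hash.get(self._key(query))
        if hit is not None:
            return hit, "hash"
        # Tier 2: best cosine match, accepted only at or above the threshold.
        vector = self.embed(query)
        best_answer, best_score = None, -1.0
        for embedding, answer in self.entries:
            score = cosine_similarity(vector, embedding)
            if score > best_score:
                best_answer, best_score = answer, score
        if best_score >= self.threshold:
            return best_answer, "embedding"
        return None, "miss"


# Toy embeddings (hypothetical vectors standing in for a real embedding model).
TOY_VECTORS = {
    "where is cardiology?": (1.0, 0.0),
    "where can i find the cardiology department?": (0.99, 0.05),
    "what are the visiting hours?": (0.0, 1.0),
}
cache = TwoTierCache(embed=lambda q: TOY_VECTORS[normalize(q)])
cache.put("Where is Cardiology?", "cached answer about Cardiology")

exact = cache.get("where is   cardiology?")
paraphrase = cache.get("Where can I find the Cardiology department?")
unrelated = cache.get("What are the visiting hours?")
print(exact, paraphrase, unrelated)
```

A production cache would typically replace the linear scan with a vector index; the tier ordering stays the same because the hash lookup is the cheaper check.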
4.5 Safety Metrics
Across all evaluation runs (full golden evaluation + ablation study + ad-hoc testing):
Table 4.9. Safety metrics across all evaluation runs.
| Metric | Value |
|---|---|
| Medical advice incidents | 0 |
| Safety refusal accuracy | 100% |
| GCG adversarial detection | 100% (12/12) |
| Out-of-scope handling | 100% (13/13) |
| False positive safety blocks | under 1% |
The zero-incident safety record validates the defense-in-depth architecture: intent classification catches the majority of unsafe queries, GCG detection blocks adversarial inputs, and the quality gate prevents generation when retrieval quality is insufficient.