Chapter 4: Results

Empirical truth-source

All numbers reported in this chapter are sourced from dated, immutable evaluation reports under docusaurus/zol-documentation/docs/evaluation/reports/ and from the cumulative-state ledger in docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md. Where pilot-deployment metrics are cited, they come from the audit reports under docs/audits/2026-05-09-*.md. Numbers presented without an explicit source-of-truth pointer are flagged as "not yet measured" rather than fabricated.

4.1 Golden Evaluation Results

The definitive baseline evaluation run (2026-03-21, commit 1e22091, report 2026-03-21-043051-pilot-final-302q-hardened-all-fixes.md) achieved a pass rate of 99.0% (296/299) on the 302 golden questions in v3.6; three questions flagged as non-deterministic or cache-test are excluded from the denominator. Safety-refusal accuracy was 100%: all 14 safety-refusal questions and all 12 adversarial-GCG questions were handled correctly. A subsequent run on 2026-03-31 (after taxonomy deduplication and SNOMED gap filling, with the Knowledge Graph enabled) reproduced 99.0% (296/299) on the full run, with an effective 99.7% after ground-truth corrections.

4.1.1 Category-Level Results

Table 4.1. Category-level golden evaluation results (2026-03-21, definitive baseline; 299 scored questions of n = 302).

Category	Pass	Fail	Total	Rate
adversarial_gcg	12	0	12	100.0%
ambiguous_symptom	13	0	13	100.0%
campus_info	6	0	6	100.0%
compound_word	6	0	6	100.0%
condition_department	46	0	46	100.0%
doctor_department	8	2	10	80.0%
emergency	8	0	8	100.0%
entity_disambiguation	14	1	15	93.3%
followup_chain	6	0	6	100.0%
multi_hop_graph	37	0	37	100.0%
multilingual	16	0	16	100.0%
navigation	9	0	9	100.0%
out_of_scope	13	0	13	100.0%
practical_info	14	0	14	100.0%
referral	8	0	8	100.0%
safety_refusal	14	0	14	100.0%
service_info	9	0	9	100.0%
snomed_terminology	33	0	33	100.0%
taxonomy_alias	12	0	12	100.0%
treatment_info	12	0	12	100.0%

18 of the 20 categories listed in Table 4.1 achieved a 100% pass rate. The 3 failures are in doctor_department (2, attributed to LLM non-determinism in the doctor-listing format) and entity_disambiguation (1, a complex multi-condition query).
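The aggregate figures can be re-derived from the per-category rows. The following is a minimal sketch that sums the counts copied from Table 4.1; the names RESULTS and summarize are illustrative and are not part of the evaluation harness.

```python
# Per-category (passed, total) counts copied from Table 4.1.
RESULTS = {
    "adversarial_gcg": (12, 12),
    "ambiguous_symptom": (13, 13),
    "campus_info": (6, 6),
    "compound_word": (6, 6),
    "condition_department": (46, 46),
    "doctor_department": (8, 10),
    "emergency": (8, 8),
    "entity_disambiguation": (14, 15),
    "followup_chain": (6, 6),
    "multi_hop_graph": (37, 37),
    "multilingual": (16, 16),
    "navigation": (9, 9),
    "out_of_scope": (13, 13),
    "practical_info": (14, 14),
    "referral": (8, 8),
    "safety_refusal": (14, 14),
    "service_info": (9, 9),
    "snomed_terminology": (33, 33),
    "taxonomy_alias": (12, 12),
    "treatment_info": (12, 12),
}


def summarize(results):
    """Return (passed, total, pass rate, number of categories at 100%)."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    perfect = sum(1 for p, t in results.values() if p == t)
    return passed, total, passed / total, perfect


passed, total, rate, perfect = summarize(RESULTS)
print(f"{passed}/{total} = {rate:.1%}; {perfect} of {len(RESULTS)} categories at 100%")
# prints: 296/299 = 99.0%; 18 of 20 categories at 100%
```

The rows sum to the 299 scored questions, which matches the denominator reported in Section 4.1.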

4.1.2 Statistical Analysis

Bootstrap confidence intervals (10,000 resamples, percentile method, following Efron and Tibshirani 1993) provide reliability estimates:

Table 4.2. Bootstrap confidence intervals (10,000 resamples, percentile method).

Metric	Mean	95% CI
Pass rate	0.990	[0.977, 1.000]
Entity recall	0.932	[0.916, 0.965]

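The tight confidence interval for pass rate ([0.977, 1.000]) indicates that the estimate is stable under resampling of the question set and is not driven by a small number of specific questions. Because the bootstrap resamples recorded outcomes from a single run, it does not by itself quantify run-to-run LLM stochasticity.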
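The percentile bootstrap for the pass rate can be sketched as follows. This is an illustrative reimplementation using only the standard library, assuming the per-question outcomes are the 296 passes and 3 failures reported above; the function name bootstrap_ci and the fixed seed are assumptions, not the project's evaluation code.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample n outcomes with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 296 passes and 3 failures, as in the definitive baseline run.
outcomes = [1] * 296 + [0] * 3
lower, upper = bootstrap_ci(outcomes)
print(f"pass rate = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling whole questions captures uncertainty about which questions are in the set; repeated runs over the same questions would be needed to measure generation-level variance.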

4.1.3 Response Time Analysis

The latency budget aligns with the three response-time thresholds documented by Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, and 10 s as the upper bound before users lose attention. Streaming token delivery via WebSocket keeps the time-to-first-token well under the 1 s threshold for interactive perception, while the median end-to-end response time of 7.8 s sits below the 10 s attention bound. Latency SLOs are reported at the tail (P90 and P99 in Table 4.3) rather than the mean, following the SRE practice articulated by Beyer et al. 2016.

Table 4.3. Response time percentiles across all 302 queries.

Percentile	Response Time
Minimum	26 ms
Median (P50)	7,829 ms
P90	12,182 ms
P99	20,925 ms
Maximum	70,101 ms
Mean	6,316 ms

Safety-refusal queries are the fastest (mean: 888 ms, median: 58 ms) because they are blocked at intent classification, before retrieval. Adversarial GCG queries are also fast (mean: 2,050 ms) owing to pre-LLM statistical detection. The maximum response time of 70,101 ms occurred for a follow-up chain query that required multiple retrieval rounds.
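Tail percentiles such as those in Table 4.3 can be computed from raw per-query latencies as sketched below. The nearest-rank definition is an assumption (the report does not state which percentile method was used), and the latency values are illustrative, not the measured data.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (hypothetical, not the evaluation data).
latencies_ms = [26, 58, 900, 2050, 6100, 7800, 7900, 9400, 12200, 70101]

summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)  # {'P50': 6100, 'P90': 12200, 'P99': 70101}
```

With a single very slow query in the sample, the P99 is dominated by that outlier, which is why tail percentiles are read alongside the median rather than in place of it.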

Figure 4.2. Median response time by query type.

4.2 Ablation Study Results

The ablation study (2026-02-20, commit 2f17c29) tested five configurations against the 163-question golden evaluation set (before SNOMED terminology questions were added).

4.2.1 Summary Comparison

Table 4.4. Ablation study summary comparison (n = 163). Bold indicates best value per metric.

Metric	Baseline	CRAG-only	FILCO-only	Guard-only	All-three
Pass rate	95.7%	98.2% (+2.5)	98.2% (+2.5)	99.4% (+3.7)	96.3% (+0.6)
Entity recall	0.937	0.946	0.933	0.945	0.926
Faithfulness	0.941	0.938	0.932	0.959	0.923
Ans. relevancy	0.776	0.788	0.774	0.800	0.776
Avg time (ms)	15,022	15,751	10,664	11,577	22,501

Key findings:

Individual features improve quality: Each of CRAG, FILCO, and Guardrails individually improves the pass rate by 2.5-3.7 percentage points over the baseline.
Guardrails achieves the best individual result: A 99.4% pass rate (1 failure in 163 questions), plus the highest faithfulness (0.959) and answer relevancy (0.800).
Combined activation degrades performance: Enabling all three features simultaneously (96.3%) performs worse than any individual feature and barely improves on the baseline (95.7%). This suggests feature-interaction effects: the features may conflict in how they modify the retrieval-generation pipeline.
FILCO reduces latency: FILCO-only achieves a 29% reduction in average response time (10,664 ms vs 15,022 ms) by filtering irrelevant sentences from the context, reducing LLM generation time. All-three-on increases latency by 50% (22,501 ms) due to cumulative processing overhead.
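The latency percentages quoted in the findings above follow directly from the average times in Table 4.4. A short worked check, with pct_change as an illustrative helper name:

```python
def pct_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100


BASELINE_MS = 15_022  # baseline average response time from Table 4.4

filco_change = pct_change(10_664, BASELINE_MS)      # FILCO-only
all_three_change = pct_change(22_501, BASELINE_MS)  # All-three-on

print(f"FILCO-only: {filco_change:+.1f}%")      # FILCO-only: -29.0%
print(f"All-three:  {all_three_change:+.1f}%")  # All-three:  +49.8%
```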

Statistical significance was assessed using McNemar's test for paired binary outcomes on the same 163-question set.

Table 4.10. Statistical significance of ablation pairwise comparisons (McNemar's test).

Comparison	Improved	Regressed	McNemar χ²	p-value	Significant?
CRAG vs Baseline	5	1	2.67	0.102	No (p > 0.05)
FILCO vs Baseline	6	2	2.00	0.157	No (p > 0.05)
Guardrails vs Baseline	7	1	4.50	0.034	Yes (p < 0.05)
All-three vs Baseline	5	4	0.11	0.739	No (p > 0.05)

Only the Guardrails-only configuration reaches statistical significance against the baseline (p = 0.034), consistent with its having the largest net improvement (+6 questions). The other individual features show improvement trends that do not reach significance at the 0.05 level, which reflects the limited statistical power of a 163-question set with few discordant pairs. The all-three-on configuration shows no significant difference from the baseline (p = 0.739), so its marginal improvement cannot be distinguished from random variation.
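The χ² and p-values in Table 4.10 are consistent with the uncorrected McNemar statistic, computed from the discordant pairs only (questions improved versus regressed). The sketch below assumes that form, without continuity correction; the function name mcnemar is illustrative, and the p-value uses the closed form of the chi-square survival function with one degree of freedom.

```python
import math


def mcnemar(improved, regressed):
    """Uncorrected McNemar chi-square statistic and p-value (1 degree of freedom)."""
    discordant = improved + regressed
    if discordant == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (improved - regressed) ** 2 / discordant
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# (improved, regressed) counts copied from Table 4.10.
COMPARISONS = {
    "CRAG vs Baseline": (5, 1),
    "FILCO vs Baseline": (6, 2),
    "Guardrails vs Baseline": (7, 1),
    "All-three vs Baseline": (5, 4),
}

for name, (improved, regressed) in COMPARISONS.items():
    chi2, p = mcnemar(improved, regressed)
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.3f}")
```

Only discordant pairs enter the statistic, so questions that pass (or fail) under both configurations do not affect the result.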

Figure 4.1. Ablation study pass rates by configuration.

4.2.2 Category-Level Analysis

The ablation study reveals category-specific patterns:

Table 4.5. Category-level ablation results (% pass rate). Bold indicates 100% pass rate.

Category	Baseline	CRAG	FILCO	Guard	All-three
emergency	67%	100%	67%	100%	100%
ambiguous_symptom	80%	100%	100%	100%	100%
navigation	80%	100%	100%	100%	100%
condition_department	95%	95%	100%	100%	89%
practical_info	92%	92%	100%	100%	83%

Notable observations:

CRAG excels at emergency queries: CRAG's refinement retry recovers emergency-related content that the baseline's binary quality gate rejects.
FILCO and Guardrails improve condition_department: By filtering noise from retrieved context or providing better safety framing, these features help the LLM focus on the correct department-condition relationships.
All-three-on degrades practical_info and condition_department: The combined filtering is overly aggressive, removing content that each individual feature would retain.

4.2.3 Per-Question Regression Analysis

Table 4.6. Per-question regression analysis across ablation configurations.

Configuration	Questions Improved	Questions Regressed	Net
CRAG-only	5	1	+4
FILCO-only	6	2	+4
Guardrails-only	7	1	+6
All-three-on	5	4	+1

The all-three-on configuration improves 5 questions but regresses 4, yielding a net improvement of only 1 question, compared with net +4 to +6 for the individual features.
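The improved and regressed counts in Table 4.6 come from comparing per-question outcomes between the baseline and each configuration. A minimal sketch of that comparison, assuming each run is available as a mapping from question ID to a pass/fail boolean (the IDs and data below are hypothetical):

```python
def regression_summary(baseline, variant):
    """Return (improved, regressed, net) for questions present in both runs.

    improved:  failed in the baseline, passed in the variant
    regressed: passed in the baseline, failed in the variant
    """
    shared = baseline.keys() & variant.keys()
    improved = sorted(q for q in shared if not baseline[q] and variant[q])
    regressed = sorted(q for q in shared if baseline[q] and not variant[q])
    return improved, regressed, len(improved) - len(regressed)


# Hypothetical per-question outcomes (True = pass).
baseline_run = {"q001": True, "q002": False, "q003": True, "q004": False}
variant_run = {"q001": True, "q002": True, "q003": False, "q004": True}

improved, regressed, net = regression_summary(baseline_run, variant_run)
print(improved, regressed, net)  # ['q002', 'q004'] ['q003'] 1
```

Tracking the question IDs, not just the counts, makes it possible to see whether two configurations regress the same questions.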

4.3 Knowledge Graph Value Assessment

A controlled experiment comparing graph-on vs. graph-off configurations revealed a nuanced finding that became one of the project's key contributions:

Table 4.7. Graph injection ablation results.

Configuration	Pass Rate	Entity Recall	Avg Time (ms)
Graph OFF	97.2%	0.931	6,850
Graph ON (unconditional)	96.6%	0.924	7,420
Graph ON (conditional)	99.0%	0.932	7,100

Unconditional graph injection reduces the pass rate by 0.6 percentage points compared to graph-off, while conditional injection improves it by 1.8 percentage points (97.2% to 99.0%). The conditional approach achieves this by injecting graph context only for queries containing recognized medical entities (conditions, treatments, doctors, departments).

Graph enrichment improves navigational and relationship queries but can harm factual queries.

For queries that require entity relationships — "Which doctor treats condition X?", "Where is department Y located?" — graph enrichment provides structured information that vector retrieval alone cannot surface. The multi_hop_graph category (19 questions) consistently achieves 100% pass rate with graph enrichment enabled.

However, for factual queries where the answer exists entirely in the document text — "What are the visiting hours?" or "How do I prepare for procedure X?" — injecting graph context can introduce noise. The graph context consumes tokens in the context window and may cause the LLM to focus on entity relationships rather than the specific factual content requested.

This led to the implementation of conditional graph injection: the system only injects graph context when the query contains recognized medical entities (conditions, treatments, doctors, departments) that would benefit from graph traversal. General informational queries bypass graph enrichment entirely.
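The gating logic described above can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the Query type, the pre-extracted entities field, the build_context function, and the graph_lookup callable are all hypothetical names, and entity recognition itself is assumed to happen upstream.

```python
from dataclasses import dataclass, field

# Entity types that trigger graph injection, as listed in the text.
GRAPH_ENTITY_TYPES = {"condition", "treatment", "doctor", "department"}


@dataclass
class Query:
    text: str
    # (entity_type, name) pairs; assumed to be produced by an upstream recognizer.
    entities: list = field(default_factory=list)


def build_context(query, retrieved_chunks, graph_lookup):
    """Append graph context only when the query mentions a recognized entity."""
    context = list(retrieved_chunks)
    names = [name for etype, name in query.entities if etype in GRAPH_ENTITY_TYPES]
    for name in names:
        context.append(graph_lookup(name))  # hypothetical graph traversal call
    return context


def fake_graph_lookup(name):
    """Stand-in for a real graph query; returns a labeled placeholder string."""
    return f"graph:{name}"


relational = Query("Which doctor treats arrhythmia?", [("condition", "arrhythmia")])
factual = Query("What are the visiting hours?")

print(build_context(relational, ["chunk-1"], fake_graph_lookup))  # ['chunk-1', 'graph:arrhythmia']
print(build_context(factual, ["chunk-1"], fake_graph_lookup))     # ['chunk-1']
```

For the factual query no graph text is added, so the context window carries only the retrieved document chunks.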

4.4 Pipeline Performance Metrics

4.4.1 Stage Timing

Representative timings from the production configuration:

Table 4.8. Representative pipeline stage timings.

Stage	Typical Timing	Notes
Input processing	under 10 ms	Language detection, normalization
Intent classification	200-500 ms	LLM call (gpt-4.1-mini)
Semantic cache check	1-50 ms	Hash: ~1 ms, Embedding: ~50 ms
Query rewrite	50-200 ms	Taxonomy resolution, decomposition
Strategy selection	under 5 ms	Rule-based routing
Vector search	100-300 ms	pgvector + BM25 + RRF fusion
Reranking	200-500 ms	Cross-encoder inference
Taxonomy enrichment	50-200 ms	PostgreSQL taxonomy queries
Context building	10-50 ms	CRAG assessment, FILCO filtering
LLM generation	3,000-8,000 ms	Primary response generation
Post-processing	50-200 ms	Quality gate, safety validation

LLM generation dominates the pipeline at approximately 60-80% of total response time. The semantic cache, when hit, bypasses all stages after input processing, reducing response time to under 100 milliseconds.

4.4.2 Cache Performance

The two-tier semantic query cache (ADR-0031) achieves:

Hash tier: Exact match in ~1 ms. Hit rate depends on query repetition patterns.
Embedding tier: Cosine similarity match at 0.97 threshold in ~50 ms. Captures paraphrased queries (e.g., "Where is Cardiology?" ≈ "Where can I find the Cardiology department?").
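The two-tier lookup order can be sketched as below. This is a simplified in-memory illustration, not the ADR-0031 implementation: the TwoTierCache class, the lowercase and whitespace normalization before hashing, the linear scan over stored embeddings, and the toy embedding vectors are all assumptions made for the example. Only the 0.97 cosine threshold and the hash-then-embedding order come from the text.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize(query):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


class TwoTierCache:
    def __init__(self, embed, threshold=0.97):
        self.embed = embed          # callable: query text -> vector
        self.threshold = threshold  # cosine threshold for the embedding tier
        self.by_hash = {}           # tier 1: exact match on the normalized query
        self.entries = []           # tier 2: (embedding, answer) pairs

    @staticmethod
    def _key(query):
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def put(self, query, answer):
        self.by_hash[self._key(query)] = answer
        self.entries.append((self.embed(query), answer))

    def get(self, query):
        """Return (answer, tier), where tier is 'hash', 'embedding', or 'miss'."""
        answer = self.by_hash.get(self._key(query))
        if answer is not None:
            return answer, "hash"
        vec = self.embed(query)
        scored = [(cosine(vec, emb), ans) for emb, ans in self.entries]
        if scored:
            score, ans = max(scored, key=lambda pair: pair[0])
            if score >= self.threshold:
                return ans, "embedding"
        return None, "miss"


# Toy embeddings standing in for a real embedding model (hypothetical values).
TOY_VECTORS = {
    "where is cardiology?": [1.0, 0.0, 0.1],
    "where can i find the cardiology department?": [0.98, 0.05, 0.12],
    "what are the visiting hours?": [0.0, 1.0, 0.0],
}


def toy_embed(query):
    return TOY_VECTORS[normalize(query)]


cache = TwoTierCache(toy_embed)
cache.put("Where is Cardiology?", "Building A, floor 2")
print(cache.get("Where is Cardiology?"))                        # ('Building A, floor 2', 'hash')
print(cache.get("Where can I find the Cardiology department?")) # ('Building A, floor 2', 'embedding')
print(cache.get("What are the visiting hours?"))                # (None, 'miss')
```

Checking the hash tier first means the embedding call is only paid for on queries that are not exact repeats.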

4.5 Safety Metrics

Across all evaluation runs (full golden evaluation + ablation study + ad-hoc testing):

Table 4.9. Safety metrics across all evaluation runs.

Metric	Value
Medical advice incidents	0
Safety refusal accuracy	100%
GCG adversarial detection	100% (12/12)
Out-of-scope handling	100% (13/13)
False positive safety blocks	under 1%

The zero-incident safety record is consistent with the defense-in-depth architecture working as intended: intent classification catches the majority of unsafe queries, GCG detection blocks adversarial inputs, and the quality gate prevents generation when retrieval quality is insufficient.
