Chapter 4: Results

Empirical truth-source

All numbers reported in this chapter are sourced from dated, immutable evaluation reports under docusaurus/zol-documentation/docs/evaluation/reports/ and from the cumulative-state ledger in docs/superpowers/plans/2026-05-09-pilot-review-readiness-plan.md. Where pilot-deployment metrics are cited, they come from the audit reports under docs/audits/2026-05-09-*.md. Numbers presented without an explicit source-of-truth pointer are flagged as "not yet measured" rather than fabricated.

4.1 Golden Evaluation Results

The definitive baseline evaluation run (2026-03-21, commit 1e22091, report 2026-03-21-043051-pilot-final-302q-hardened-all-fixes.md) achieved a pass rate of 99.0 % (296/299) across the 302 golden questions in v3.6 (three questions are flagged as non-deterministic / cache-test and are excluded from the denominator). Safety-refusal accuracy was 100 %: all 14 safety-refusal questions and all 12 adversarial-GCG questions were correctly handled. A subsequent run on 2026-03-31 (post taxonomy dedup + SNOMED gap fill + Knowledge Graph ON) reproduced 99.0 % (296/299) full-run, with an effective 99.7 % after ground-truth corrections.
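The denominator rule above (flagged questions are dropped before the rate is computed) can be sketched as follows. This is a minimal illustration, not the project's evaluation harness: the `Result` record and its `excluded` flag are hypothetical names, and the data is a synthetic stand-in with the same counts as the reported run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    question_id: str
    passed: bool
    excluded: bool = False  # hypothetical flag: non-deterministic / cache-test item


def pass_rate(results):
    """Return (passed, scored, rate), ignoring excluded questions."""
    scored = [r for r in results if not r.excluded]
    passed = sum(r.passed for r in scored)
    return passed, len(scored), passed / len(scored)


# Synthetic stand-in: 296 passes, 3 failures, 3 excluded questions (302 total).
run = (
    [Result(f"pass-{i}", True) for i in range(296)]
    + [Result(f"fail-{i}", False) for i in range(3)]
    + [Result(f"excluded-{i}", True, excluded=True) for i in range(3)]
)
passed, scored, rate = pass_rate(run)
print(f"{passed}/{scored} = {rate:.1%}")
```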

4.1.1 Category-Level Results

Table 4.1. Category-level golden evaluation results (2026-03-21, definitive baseline, n = 302; rows cover the 299 scored questions).

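| Category | Pass | Fail | Total | Rate |
|---|---|---|---|---|
| adversarial_gcg | 12 | 0 | 12 | 100.0% |
| ambiguous_symptom | 13 | 0 | 13 | 100.0% |
| campus_info | 6 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 6 | 100.0% |
| condition_department | 46 | 0 | 46 | 100.0% |
| doctor_department | 8 | 2 | 10 | 80.0% |
| emergency | 8 | 0 | 8 | 100.0% |
| entity_disambiguation | 14 | 1 | 15 | 93.3% |
| followup_chain | 6 | 0 | 6 | 100.0% |
| multi_hop_graph | 37 | 0 | 37 | 100.0% |
| multilingual | 16 | 0 | 16 | 100.0% |
| navigation | 9 | 0 | 9 | 100.0% |
| out_of_scope | 13 | 0 | 13 | 100.0% |
| practical_info | 14 | 0 | 14 | 100.0% |
| referral | 8 | 0 | 8 | 100.0% |
| safety_refusal | 14 | 0 | 14 | 100.0% |
| service_info | 9 | 0 | 9 | 100.0% |
| snomed_terminology | 33 | 0 | 33 | 100.0% |
| taxonomy_alias | 12 | 0 | 12 | 100.0% |
| treatment_info | 12 | 0 | 12 | 100.0% |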

18 of the 20 categories listed in Table 4.1 achieved a 100% pass rate. The 3 failures are in doctor_department (2, attributed to LLM non-determinism in the doctor-listing format) and entity_disambiguation (1, a complex multi-condition query).

4.1.2 Statistical Analysis

Bootstrap confidence intervals (10 000 resamples, percentile method, following Efron and Tibshirani 1993) provide reliability estimates:

Table 4.2. Bootstrap confidence intervals (10,000 resamples, percentile method).

| Metric | Mean | 95% CI |
|---|---|---|
| Pass rate | 0.990 | [0.977, 1.000] |
| Entity recall | 0.932 | [0.916, 0.965] |

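The tight confidence interval for pass rate ([0.977, 1.000]) indicates that the result is not driven by a small subset of questions. Because the bootstrap resamples questions from a single evaluation run, the interval quantifies sampling variability over the question set; run-to-run LLM stochasticity is better judged from repeated runs, such as the 2026-03-31 run that reproduced 296/299.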
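A percentile bootstrap of the pass rate can be sketched as below. This is an illustration under stated assumptions rather than the project's analysis script: the per-question outcomes are reconstructed from the reported 296/299 totals, the function name `bootstrap_ci` and the fixed seed are hypothetical, and only the Python standard library is used.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(outcomes)
    # Resample n questions with replacement and record each resample's pass rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Outcomes reconstructed from the reported totals: 296 passes, 3 failures.
outcomes = [1] * 296 + [0] * 3
lower, upper = bootstrap_ci(outcomes)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Because only three questions fail, many resamples contain no failures at all, which is why the upper bound of such an interval sits at 1.000.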

4.1.3 Response Time Analysis

The latency budget aligns with the three response-time thresholds documented by Nielsen 1993: 0.1 s for instantaneous feedback, 1 s for seamless flow, and 10 s as the upper bound before users lose attention. Streaming token delivery via WebSocket keeps the time-to-first-token well under the 1 s threshold for interactive perception, while the median end-to-end response time of 7.8 s sits below the 10 s attention bound. Following the SRE practice articulated by Beyer et al. 2016, latency is summarized at the tail (P90 and P99 in Table 4.3) rather than by the mean alone.

Table 4.3. Response time percentiles across all 302 queries.

| Percentile | Response Time |
|---|---|
| Minimum | 26 ms |
| Median (P50) | 7,829 ms |
| P90 | 12,182 ms |
| P99 | 20,925 ms |
| Maximum | 70,101 ms |
| Mean | 6,316 ms |

Safety refusal queries are fastest (mean: 888 ms, median: 58 ms) because they are blocked at intent classification before retrieval. Adversarial GCG queries are also fast (mean: 2,050 ms) because they are caught by pre-LLM statistical detection. The maximum response time of 70,101 ms occurred for a follow-up chain query that required multiple retrieval rounds.
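For readers who want to reproduce a percentile summary like Table 4.3 from raw timings, a minimal sketch follows. The nearest-rank definition is one common convention and is an assumption here (the source does not say which percentile method was used), and the latency values are hypothetical, not the measured data.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds (not the measured data).
latencies_ms = [26, 58, 900, 2050, 6100, 7800, 7900, 9500, 12200, 70100]

summary = {
    "p50": percentile(latencies_ms, 50),
    "p90": percentile(latencies_ms, 90),
    "p99": percentile(latencies_ms, 99),
    "mean": sum(latencies_ms) / len(latencies_ms),
}
for name, value in summary.items():
    print(f"{name}: {value:,.0f} ms")
```

In this toy sample a single slow query pulls the mean well above the median, which illustrates why tail percentiles are reported alongside it.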

Figure 4.2. Median response time by query type.

4.2 Ablation Study Results

The ablation study (2026-02-20, commit 2f17c29) tested five configurations against the 163-question golden evaluation set (before SNOMED terminology questions were added).

4.2.1 Summary Comparison

Table 4.4. Ablation study summary comparison (n = 163). Bold indicates best value per metric.

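| Metric | Baseline | CRAG-only | FILCO-only | Guard-only | All-three |
|---|---|---|---|---|---|
| Pass rate | 95.7% | 98.2% (+2.5) | 98.2% (+2.5) | **99.4% (+3.7)** | 96.3% (+0.6) |
| Entity recall | 0.937 | **0.946** | 0.933 | 0.945 | 0.926 |
| Faithfulness | 0.941 | 0.938 | 0.932 | **0.959** | 0.923 |
| Ans. relevancy | 0.776 | 0.788 | 0.774 | **0.800** | 0.776 |
| Avg time (ms) | 15,022 | 15,751 | **10,664** | 11,577 | 22,501 |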

Key findings:

  1. Individual features improve quality: Each of CRAG, FILCO, and Guardrails individually improves the pass rate by 2.5-3.7 percentage points over the baseline.
  2. Guardrails achieves the best individual result: Guardrails-only reaches a 99.4% pass rate (1 failure out of 163 questions) together with the highest faithfulness (0.959) and answer relevancy (0.800).
  3. Combined activation degrades performance: Enabling all three features simultaneously (96.3%) performs worse than any individual feature and improves on the baseline (95.7%) by only 0.6 percentage points. This suggests feature interaction effects: the features may conflict in how they modify the retrieval-generation pipeline.
  4. FILCO reduces latency: FILCO-only achieves a 29% reduction in average response time (10,664 ms vs 15,022 ms) by filtering irrelevant sentences from the context, reducing LLM generation time. All-three-on increases latency by 50% (22,501 ms) due to cumulative processing overhead.
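The latency percentages in finding 4 are relative changes against the baseline average. A short worked check using the averages from Table 4.4 (the helper name `relative_change` is illustrative):

```python
baseline_ms = 15_022
avg_time_ms = {"FILCO-only": 10_664, "All-three": 22_501}


def relative_change(value, reference):
    """Signed change of `value` relative to `reference` (0.10 means +10%)."""
    return (value - reference) / reference


changes = {
    name: relative_change(ms, baseline_ms) for name, ms in avg_time_ms.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

This gives roughly -29% for FILCO-only and roughly +50% for all-three-on, matching the figures quoted above.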

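Statistical significance was assessed using McNemar's test for paired binary outcomes on the same 163-question set. The reported χ² values match the uncorrected statistic (b - c)² / (b + c), where b and c are the numbers of improved and regressed questions.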

Table 4.10. Statistical significance of ablation pairwise comparisons (McNemar's test).

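| Comparison | Improved | Regressed | McNemar χ² | p-value | Significant? |
|---|---|---|---|---|---|
| CRAG vs Baseline | 5 | 1 | 2.67 | 0.102 | No (p > 0.05) |
| FILCO vs Baseline | 6 | 2 | 2.00 | 0.157 | No (p > 0.05) |
| Guardrails vs Baseline | 7 | 1 | 4.50 | 0.034 | Yes (p < 0.05) |
| All-three vs Baseline | 5 | 4 | 0.11 | 0.739 | No (p > 0.05) |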

Only the Guardrails-only configuration reaches statistical significance (p = 0.034) against the baseline, consistent with its having the largest net improvement (+6 questions). The other individual features show improvements that do not reach significance at the 0.05 level, reflecting the limited statistical power of a 163-question set in which only a handful of questions change outcome. The all-three-on configuration shows no significant difference from the baseline (p = 0.739), so its marginal improvement is indistinguishable from random variation.
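The values in Table 4.10 can be reproduced from the improved/regressed counts alone. The sketch below assumes the uncorrected McNemar statistic with one degree of freedom, which is what the tabulated χ² values correspond to; the function name `mcnemar` is illustrative, and the p-value uses the identity that the chi-square survival function with 1 df equals erfc(sqrt(χ²/2)).

```python
import math


def mcnemar(improved, regressed):
    """Uncorrected McNemar chi-square (1 df) and its two-sided p-value."""
    chi2 = (improved - regressed) ** 2 / (improved + regressed)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Discordant-pair counts (improved, regressed) taken from Table 4.10.
comparisons = {
    "CRAG vs Baseline": (5, 1),
    "FILCO vs Baseline": (6, 2),
    "Guardrails vs Baseline": (7, 1),
    "All-three vs Baseline": (5, 4),
}
mcnemar_results = {
    name: mcnemar(b, c) for name, (b, c) in comparisons.items()
}
for name, (chi2, p) in mcnemar_results.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: chi2={chi2:.2f}, p={p:.3f} ({verdict})")
```

With discordant counts this small, an exact binomial version of the test is a common alternative; it is not used here so that the output stays comparable with the table.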

Figure 4.1. Ablation study pass rates by configuration.

4.2.2 Category-Level Analysis

The ablation study reveals category-specific patterns:

Table 4.5. Category-level ablation results (% pass rate). Bold indicates 100% pass rate.

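| Category | Baseline | CRAG | FILCO | Guard | All-three |
|---|---|---|---|---|---|
| emergency | 67% | **100%** | 67% | **100%** | **100%** |
| ambiguous_symptom | 80% | **100%** | **100%** | **100%** | **100%** |
| navigation | 80% | **100%** | **100%** | **100%** | **100%** |
| condition_department | 95% | 95% | **100%** | **100%** | 89% |
| practical_info | 92% | 92% | **100%** | **100%** | 83% |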

Notable observations:

  • CRAG excels at emergency queries: CRAG's refinement retry recovers emergency-related content that the baseline's binary quality gate rejects.
  • FILCO and Guardrails improve condition_department: By filtering noise from retrieved context or providing better safety framing, these features help the LLM focus on the correct department-condition relationships.
  • All-three-on degrades practical_info and condition_department: The combined filtering is overly aggressive, removing content that each individual feature would retain.

4.2.3 Per-Question Regression Analysis

Table 4.6. Per-question regression analysis across ablation configurations.

| Configuration | Questions Improved | Questions Regressed | Net |
|---|---|---|---|
| CRAG-only | 5 | 1 | +4 |
| FILCO-only | 6 | 2 | +4 |
| Guardrails-only | 7 | 1 | +6 |
| All-three-on | 5 | 4 | +1 |

The all-three-on configuration improves 5 questions but regresses 4, yielding a net improvement of only 1 question, compared to net +4 to +6 for the individual features.
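The improved and regressed counts come from pairing each question's outcome in the baseline run with its outcome in the candidate run. A minimal sketch, assuming each run is available as a mapping from question ID to pass/fail (the function name, IDs, and outcomes below are hypothetical):

```python
def compare_runs(baseline, candidate):
    """Return (improved, regressed) question IDs for two pass/fail mappings."""
    shared = baseline.keys() & candidate.keys()
    improved = sorted(q for q in shared if not baseline[q] and candidate[q])
    regressed = sorted(q for q in shared if baseline[q] and not candidate[q])
    return improved, regressed


# Hypothetical outcomes for five questions.
baseline_run = {"q1": True, "q2": False, "q3": False, "q4": True, "q5": True}
candidate_run = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": True}

improved, regressed = compare_runs(baseline_run, candidate_run)
net = len(improved) - len(regressed)
print(f"improved={improved} regressed={regressed} net={net:+d}")
```

Questions that pass in both runs or fail in both runs contribute to neither list, which is why the pass-rate delta alone can hide regressions.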

4.3 Knowledge Graph Value Assessment

A controlled experiment comparing graph-on vs. graph-off configurations revealed a nuanced finding that became one of the project's key contributions:

Table 4.7. Graph injection ablation results.

| Configuration | Pass Rate | Entity Recall | Avg Time (ms) |
|---|---|---|---|
| Graph OFF | 97.2% | 0.931 | 6,850 |
| Graph ON (unconditional) | 96.6% | 0.924 | 7,420 |
| Graph ON (conditional) | 99.0% | 0.932 | 7,100 |

Unconditional graph injection reduces the pass rate by 0.6 percentage points compared to graph-off (96.6% vs 97.2%), while conditional injection improves it by 1.8 percentage points (99.0% vs 97.2%). The conditional approach achieves this by injecting graph context only for queries containing recognized medical entities (conditions, treatments, doctors, departments).

Graph enrichment improves navigational and relationship queries but can harm factual queries.

For queries that require entity relationships — "Which doctor treats condition X?", "Where is department Y located?" — graph enrichment provides structured information that vector retrieval alone cannot surface. The multi_hop_graph category (19 questions) consistently achieves 100% pass rate with graph enrichment enabled.

However, for factual queries where the answer exists entirely in the document text — "What are the visiting hours?" or "How do I prepare for procedure X?" — injecting graph context can introduce noise. The graph context consumes tokens in the context window and may cause the LLM to focus on entity relationships rather than the specific factual content requested.

This led to the implementation of conditional graph injection: the system only injects graph context when the query contains recognized medical entities (conditions, treatments, doctors, departments) that would benefit from graph traversal. General informational queries bypass graph enrichment entirely.
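The gating rule described above can be sketched as follows. This is an illustration under assumptions, not the system's implementation: the `Query` record, the entity-type labels, and the `fetch_graph_context` callback are hypothetical names, and entity recognition is assumed to have happened upstream.

```python
from dataclasses import dataclass, field

# Entity types that trigger graph injection, per the description above.
GRAPH_ENTITY_TYPES = {"condition", "treatment", "doctor", "department"}


@dataclass
class Query:
    text: str
    # Entity type -> recognized surface forms (hypothetical upstream output).
    entities: dict = field(default_factory=dict)


def should_inject_graph(query):
    """True when the query mentions at least one graph-relevant entity."""
    return any(query.entities.get(kind) for kind in GRAPH_ENTITY_TYPES)


def build_context(query, passages, fetch_graph_context):
    """Append graph context to retrieved passages only when the gate is open."""
    context = list(passages)
    if should_inject_graph(query):
        context.extend(fetch_graph_context(query))
    return context


def fake_graph_lookup(query):
    # Stand-in for a real graph traversal; the returned fact is made up.
    return ["GRAPH: Cardiology -> located_in -> Building A"]


relational = Query("Where is Cardiology located?", {"department": ["Cardiology"]})
factual = Query("What are the visiting hours?")

print(build_context(relational, ["passage 1"], fake_graph_lookup))
print(build_context(factual, ["passage 1"], fake_graph_lookup))
```

The factual query keeps its context window free of graph triples, which is the behavior the conditional configuration in Table 4.7 relies on.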

4.4 Pipeline Performance Metrics

4.4.1 Stage Timing

Representative timings from the production configuration:

Table 4.8. Representative pipeline stage timings.

| Stage | Typical Timing | Notes |
|---|---|---|
| Input processing | under 10 ms | Language detection, normalization |
| Intent classification | 200-500 ms | LLM call (gpt-4.1-mini) |
| Semantic cache check | 1-50 ms | Hash: ~1 ms, Embedding: ~50 ms |
| Query rewrite | 50-200 ms | Taxonomy resolution, decomposition |
| Strategy selection | under 5 ms | Rule-based routing |
| Vector search | 100-300 ms | pgvector + BM25 + RRF fusion |
| Reranking | 200-500 ms | Cross-encoder inference |
| Taxonomy enrichment | 50-200 ms | PostgreSQL taxonomy queries |
| Context building | 10-50 ms | CRAG assessment, FILCO filtering |
| LLM generation | 3,000-8,000 ms | Primary response generation |
| Post-processing | 50-200 ms | Quality gate, safety validation |

LLM generation dominates the pipeline at approximately 60-80% of total response time. The semantic cache, when hit, bypasses all stages after input processing, reducing response time to under 100 milliseconds.
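The 60-80% share can be checked against the ranges in Table 4.8. The sketch below makes two assumptions that are not stated in the source: stages run sequentially on a cache miss, and every non-LLM stage takes its upper-bound time ("under 10 ms" is counted as 10 ms). With lower-bound overheads the LLM share would be higher.

```python
# Upper-bound timings (ms) for the non-LLM stages, read from Table 4.8.
non_llm_upper_ms = {
    "input_processing": 10,
    "intent_classification": 500,
    "semantic_cache_check": 50,
    "query_rewrite": 200,
    "strategy_selection": 5,
    "vector_search": 300,
    "reranking": 500,
    "taxonomy_enrichment": 200,
    "context_building": 50,
    "post_processing": 200,
}
llm_range_ms = (3_000, 8_000)

overhead_ms = sum(non_llm_upper_ms.values())
# Share of total time spent in LLM generation at each end of its range.
llm_share = [llm / (llm + overhead_ms) for llm in llm_range_ms]

print(f"non-LLM overhead: {overhead_ms} ms")
print(f"LLM share: {llm_share[0]:.0%} to {llm_share[1]:.0%}")
```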

4.4.2 Cache Performance

The two-tier semantic query cache (ADR-0031) achieves:

  • Hash tier: Exact match in ~1 ms. Hit rate depends on query repetition patterns.
  • Embedding tier: Cosine similarity match at 0.97 threshold in ~50 ms. Captures paraphrased queries (e.g., "Where is Cardiology?" ≈ "Where can I find the Cardiology department?").
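The two-tier lookup can be sketched as below. This is a simplified illustration, not the ADR-0031 implementation: the class name, the normalization step, the linear scan over stored embeddings, the two-dimensional toy vectors, and the cached answer text are all assumptions made for the example. Only the 0.97 cosine threshold and the hash-then-embedding order come from the description above.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.exact = {}     # sha256(normalized query) -> answer
        self.semantic = []  # (embedding, answer) pairs, scanned linearly

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def put(self, query, embedding, answer):
        self.exact[self._key(query)] = answer
        self.semantic.append((embedding, answer))

    def get(self, query, embed):
        """Try the hash tier first; call `embed` only on a hash miss."""
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return hit, "hash"
        vector = embed(query)
        best = max(self.semantic, key=lambda e: cosine(vector, e[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1], "embedding"
        return None, "miss"


# Toy embeddings standing in for a real embedding model.
toy_vectors = {
    "Where can I find the Cardiology department?": [0.99, 0.12],
    "What are the visiting hours?": [0.05, 1.0],
}


def toy_embed(query):
    return toy_vectors[query]


cache = TwoTierCache()
cache.put("Where is Cardiology?", [1.0, 0.1], "Building A, floor 2")

print(cache.get("where is cardiology?", toy_embed))
print(cache.get("Where can I find the Cardiology department?", toy_embed))
print(cache.get("What are the visiting hours?", toy_embed))
```

Checking the hash tier first means the embedding call, the more expensive of the two steps, is skipped entirely for repeated queries.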

4.5 Safety Metrics

Across all evaluation runs (full golden evaluation + ablation study + ad-hoc testing):

Table 4.9. Safety metrics across all evaluation runs.

| Metric | Value |
|---|---|
| Medical advice incidents | 0 |
| Safety refusal accuracy | 100% |
| GCG adversarial detection | 100% (12/12) |
| Out-of-scope handling | 100% (12/12) |
| False positive safety blocks | under 1% |

The zero-incident safety record validates the defense-in-depth architecture: intent classification catches the majority of unsafe queries, GCG detection blocks adversarial inputs, and the quality gate prevents generation when retrieval quality is insufficient.