Evaluation Overview
This section documents the evaluation methodology, results, and analysis of the ZOL Intelligent Search system. It contains 302 golden questions across 21 categories (v3.6), 45+ automated evaluation reports, ablation studies, and safety testing results. This page provides a guided reading path through the material.
Reading Guide
The evaluation documentation is organized in three layers: methodology (how we evaluate), results (what we measured), and analysis (what it means). The table below suggests a reading order depending on your interest.
Quick Navigation
| Document | What It Covers | Read If You... |
|---|---|---|
| Latest Eval (2026-03-31) | 99.0% full run (99.7% effective with ground truth fixes) — taxonomy dedup, SNOMED gap fill, Knowledge Graph ON | Want the latest production eval results |
| Post-Gap-Fill (2026-03-30) | 97.7% — after LLM batch classification of 1,674 orphaned entities | Want the gap-fill impact data |
| Pre-Dedup Baseline (2026-03-29) | 98.7% — last eval before taxonomy dedup (12,997 bloated entities) | Want the pre-dedup baseline |
| RAG Mixin Split (2026-03-27) | 99.7% — RAG mixin refactoring, dedup, all fixes | Want the previous best result |
| Hardened Baseline (2026-03-21) | 99.0% — hardened pilot, 302 questions, GPT-4.1 evaluator | Want the previous hardened baseline |
| Research Conclusion | Final metrics, research journey, key findings | Want the headline results and executive summary |
| Golden Questions | 302 test questions (v3.6), scoring criteria, design methodology | Want to understand the evaluation benchmark |
| Composite Quality Gate | Multi-metric pass/fail criteria for LLM-as-judge evaluation | Want to understand why single-metric gating fails |
| Ablation Study | Feature impact analysis (CRAG, FILCO, Guardrails) | Want to see which pipeline features matter |
| A/B Experiment: Vector vs Hybrid | Controlled comparison of vector-only vs graph-augmented retrieval | Want evidence for/against the knowledge graph |
| Conditional Graph Injection | Final architecture: selective graph context inclusion | Want to understand the graph injection decision |
| Academic Critical Assessment | Honest gap analysis against state-of-the-art RAG systems | Want a critical perspective on limitations |
| Research Bibliography | 60+ academic references in APA 7th edition | Need to verify citations or find source papers |

| Date | Change | Pass Rate |
|---|---|---|
| 2026-03-19 | Pilot v2: full taxonomy extraction (2,213 entities) | 95.9% (257/268) |
| 2026-03-20 | SNOMED expansion + taxonomy TREATS + prompt fixes | 98.7% (295/299) |
| 2026-03-21 | Definitive baseline: hardened pilot, 302q v3.6 | 99.0% (296/299) |
| 2026-03-27 | RAG mixin split, dedup, all fixes | 99.7% (298/299) |
| 2026-03-29 | PDF corpus scaling incident | 98.7% (295/299) |
| 2026-03-31 | Taxonomy dedup + gap fill + intent fix + Graph ON | 99.0% (296/299), effective 99.7% |
Suggested Reading Paths
For jury members / assessors: Research Conclusion, then Golden Questions (sections 1-3), then Academic Critical Assessment.
For technical reviewers: Golden Questions (full), then Ablation Study, then A/B Experiment, then Conditional Graph Injection.
For safety auditors: Golden Questions (section on safety categories), then Production Configuration Validation, then Research Conclusion (safety section).
Evaluation Architecture
The evaluation framework follows an automated pipeline:
- Golden questions (302+ questions across 21 categories, v3.6) define the benchmark with expected entities and pass criteria
- Feedback-derived questions — real user questions from negative feedback can be added to the golden set via the "Add to golden questions" button in the AI Investigation panel
- Automated runner (run_evaluation.py) sends each question to the live RAG system and captures the response
- Entity recall scoring checks whether expected entities (doctors, departments, conditions, phone numbers) appear in the response using case-insensitive substring matching
- LLM-as-judge metrics (optional, via DeepEval) assess faithfulness, answer relevancy, context precision, and context recall using a GPT model as evaluator, following the RAGAS framework (Es et al. 2023)
- Report generation produces a structured Markdown report with per-question results, statistical analysis, and category breakdowns
Each evaluation run produces a timestamped report, listed in the Evaluation Reports section below. Labels identify the configuration or feature being tested (e.g., phase-c-snomed-alias-elimination, c901-refactoring-verification).
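The entity recall step in the pipeline above can be illustrated with a minimal sketch. This is not the code in run_evaluation.py: the function name, the handling of an empty expectation list, and the sample response are assumptions made for illustration only.

```python
def entity_recall(response: str, expected_entities: list[str]) -> float:
    """Fraction of expected entities found in the response.

    Matching is case-insensitive substring matching, as described above.
    """
    if not expected_entities:
        # Assumption: a question with no expected entities counts as full recall.
        return 1.0
    haystack = response.casefold()
    found = sum(1 for entity in expected_entities if entity.casefold() in haystack)
    return found / len(expected_entities)


# Made-up response and entities, not taken from the golden question set.
sample_response = "Dr. Example works in the Cardiologie department."
print(entity_recall(sample_response, ["dr. example", "cardiologie", "neurologie"]))
```

Substring matching is cheap and deterministic, but it cannot tell whether an entity appears in the right context, which is one reason a run may pair it with the optional LLM-as-judge metrics.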
Key Metrics
| Metric | Definition | Final Value |
|---|---|---|
| Pass rate | % of golden questions correctly answered (composite gate: faithfulness + entity recall + relevancy) | 99.0% (296/299)^1^ |
| Entity recall | Fraction of expected entities found in the response (case-insensitive substring) | 0.932 mean |
| Faithfulness | LLM-judge score: is the response grounded in retrieved context? (0-1) | 0.920 |
| Answer relevancy | LLM-judge score: does the response address the question? (0-1) | 0.944 |
| Safety refusal accuracy | % of unsafe queries (medical advice, medication dosage) correctly refused | 100.0% |
| Adversarial detection | % of GCG-style prompt injection attacks detected | 100.0% |
| Medical advice incidents | Number of responses providing medical advice | 0 |
| Avg response time | Mean end-to-end response latency | 6.3s |
^1^ Pilot evaluation (2026-03-13) used v3.3 (271 golden questions): 268 content + 3 cache-test. Latest evaluations (2026-03-20 onwards) use v3.6 (302 questions) with expanded coverage for SNOMED terminology and taxonomy routing. The definitive baseline (2026-03-21) achieves 99.0% pass rate (296/299) on the hardened pilot.
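The composite gate behind the pass rate can be sketched as follows. The threshold values and the rule that all three metrics must clear their threshold are assumptions for illustration; the actual criteria are defined in the Composite Quality Gate document.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the sketch runnable.
MIN_FAITHFULNESS = 0.7
MIN_ENTITY_RECALL = 0.5
MIN_ANSWER_RELEVANCY = 0.7


@dataclass(frozen=True)
class QuestionScores:
    faithfulness: float
    entity_recall: float
    answer_relevancy: float


def passes_gate(scores: QuestionScores) -> bool:
    """Assumed rule: a question passes only if every metric clears its threshold."""
    return (
        scores.faithfulness >= MIN_FAITHFULNESS
        and scores.entity_recall >= MIN_ENTITY_RECALL
        and scores.answer_relevancy >= MIN_ANSWER_RELEVANCY
    )


# A fluent, relevant answer that misses every expected entity still fails.
print(passes_gate(QuestionScores(faithfulness=0.95, entity_recall=0.0, answer_relevancy=0.9)))
```

A conjunction of this kind means a high score on one metric cannot compensate for a failure on another, which is the motivation for gating on several metrics rather than one.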
Model Version History
| Period | RAG Model | Eval Model | Notes |
|---|---|---|---|
| Feb 2026 | GPT-4.1 via OpenRouter | GPT-4.1 mini via OpenRouter | Initial development |
| Mar 2026 (early) | GPT-4.1 via OpenRouter | GPT-5.2 via OpenRouter | Evaluated GPT-5.2 as eval judge |
| Mar 2026 (mid) | o4-mini via OpenAI direct | GPT-4.1 via OpenAI direct | OpenRouter removed for reliability |
| Mar 2026 (definitive) | o4-mini via OpenAI direct | GPT-4.1 via OpenAI direct | Definitive baseline (99.0%) |
Methodology Documents
Golden Questions Evaluation Set
The Golden Questions document defines the 302-question benchmark (v3.6). It covers:
- Design methodology: question selection criteria, category taxonomy, difficulty stratification
- Scoring protocol: entity recall calculation, pass/fail thresholds, multi-entity weighted scoring
- Category coverage: 21 intent categories (doctor_lookup, condition_department, practical_info, safety_refusal, multilingual, cache_test, etc.) each exercised by 3+ questions
- Academic grounding: references to evaluation methodology literature (Voorhees 2002; RAGAS framework; DeepEval)
Academic Critical Assessment
The Academic Critical Assessment provides an honest evaluation against state-of-the-art RAG systems, identifying both strengths and gaps. It classifies the ZOL system as a "mature Advanced RAG with significant Modular RAG characteristics" and rates each architectural dimension on a 5-star scale.
Research Bibliography
The canonical, machine-checkable bibliography for the project is at /docs/references (rendered from docs/references.bib). The legacy Research Bibliography page predates the canonical source and is retained for thesis cross-references and discursive context only. New citations across the documentation should land in references.bib and deep-link via /docs/references#bibkey.
The Research Bibliography contains 60+ academic references organized thematically (RAG foundations, retrieval techniques, knowledge graphs, safety, evaluation methodology). All citations follow APA 7th edition format. Where an entry has a corresponding bibkey on the canonical page, the cross-reference is annotated.
Evaluation Reports
Analysis Reports
These are hand-written reports that analyze specific experiments, root causes, or architectural decisions.
| Report | Date | Key Finding |
|---|---|---|
| Pilot Golden Eval — Keycloak Migration | 2026-03-13 | 268/268 (100%) on pilot after Neo4j removal + Keycloak migration |
| Research Conclusion | 2026-02-23 | 100% pass rate, 0.989 faithfulness -- research question answered |
| Conditional Graph Injection | 2026-02-23 | Selective graph context achieves best-of-both-worlds |
| Phase C Analysis | 2026-02-23 | SNOMED synonyms: 17% latency reduction, no regression |
| Graph Value Assessment | 2026-02-23 | Graph ON hurts overall quality but provides critical rescue on entity lookups |
| Production Config Validation | 2026-02-22 | Minimum safety features required for production deployment |
| Ablation Study | 2026-02-20 | CRAG +0.6%, FILCO +1.1%, Guardrails neutral -- all enabled |
| Ablation Root Cause Analysis | 2026-02-21 | Per-question failure investigation across ablation configurations |
| A/B: Vector vs Hybrid | 2026-02-17 | Hybrid retrieval provides statistically significant improvement |
| Progress: v2.3 to v2.5 | 2026-02-17 | Pass rate 92.6% to 97.9% through root-cause fixes |
Automated Evaluation Reports
These are auto-generated reports from the evaluation runner. Each captures a snapshot of system performance at a specific point in time, tagged with a descriptive label. The reports follow a standard format: summary metrics, statistical analysis, per-category breakdown, and per-question results.
Some automated reports show 0.000 for faithfulness, answer relevancy, context precision, and context recall. This indicates that the LLM-as-judge evaluation modules were disabled for that run (using the --no-eval flag) to save cost and time. Entity recall, pass rate, and safety refusal accuracy are always computed. When all metrics are needed, the evaluation is run with the full judge enabled (as in the final Research Conclusion run).
Many automated reports show 0.000 for NDCG@5, MRR, Precision@5, and Recall@5. This occurs in runs where expected_source_urls had not yet been populated in the golden question set, so no retrieval quality comparison was possible.
Reports that do show these metrics typically display very low values (0.017--0.055) rather than the 0.3--0.7 range expected for a well-performing retrieval system. This was a measurement artifact, not a retrieval quality issue: the golden questions defined expected_source_urls at the coarse department-page level (e.g., /cardiologie), while the RAG system correctly retrieved specific sub-pages, doctor profiles, and PDF brochures under that department, and exact URL matching penalized these correct retrievals. The artifact was fixed by implementing URL prefix matching and graded relevance scoring (commit ad0fa06). End-to-end answer quality is better reflected by entity recall and pass rate, which are always computed.
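The difference between exact and prefix URL matching can be sketched as below. This is an illustration of the idea, not the change made in commit ad0fa06: the function names and the sub-page paths are hypothetical, and the graded relevance part of the fix is not covered.

```python
from typing import Callable
from urllib.parse import urlsplit


def _path(url: str) -> str:
    """Return the URL path without a trailing slash."""
    return urlsplit(url).path.rstrip("/")


def exact_match(retrieved: str, expected: str) -> bool:
    return _path(retrieved) == _path(expected)


def prefix_match(retrieved: str, expected: str) -> bool:
    """True when the retrieved URL is the expected page or any page beneath it."""
    r, e = _path(retrieved), _path(expected)
    return r == e or r.startswith(e + "/")


def recall_at_k(
    retrieved_urls: list[str],
    expected_urls: list[str],
    match: Callable[[str, str], bool],
    k: int = 5,
) -> float:
    """Share of expected URLs matched by at least one of the top-k retrieved URLs."""
    if not expected_urls:
        return 0.0
    top_k = retrieved_urls[:k]
    hits = sum(1 for e in expected_urls if any(match(r, e) for r in top_k))
    return hits / len(expected_urls)


# Hypothetical sub-pages under the /cardiologie department page.
retrieved = [
    "/cardiologie/artsen/example-doctor",
    "/cardiologie/brochures/example.pdf",
    "/neurologie",
]
print(recall_at_k(retrieved, ["/cardiologie"], exact_match))
print(recall_at_k(retrieved, ["/cardiologie"], prefix_match))
```

Appending "/" before the prefix test keeps a sibling path such as /cardiologie-extra from being counted as a page under /cardiologie.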
Reports are listed chronologically. Key milestones are highlighted.
| Report | Label | Pass Rate | Significance |
|---|---|---|---|
| 2026-02-17 14:31 | v25-post-fixes | 97.9% | Baseline after root-cause fixes |
| 2026-02-17 15:44 | v251-baseline-decomposition-off | -- | Query decomposition experiment |
| 2026-02-17 16:36 | v251-baseline-decomposition-off-fixed | -- | Decomposition fix verification |
| 2026-02-17 17:25 | v251-decomposition-on | -- | Decomposition enabled comparison |
| 2026-02-18 14:01 | bge-m3-enriched-baseline | -- | BGE-M3 embedding model baseline |
| 2026-02-19 08:25 | graph-quality-fixes-v27 | -- | Graph quality iteration |
| 2026-02-19 08:26 | graph-quality-fixes-v27 | -- | Graph quality (re-run) |
| 2026-02-19 10:23 | graph-quality-fixes-v27 | -- | Graph quality (final) |
| 2026-02-19 13:00 | ab-test-graph-value | -- | A/B graph value test data |
| 2026-02-19 13:15 | graph-on-post-fix | -- | Graph ON after fixes |
| 2026-02-20 03:34 | chatbot-ux-overhaul | -- | UX changes regression check |
| 2026-02-20 04:37 | bge-m3-docs-consolidation-baseline | -- | Consolidated docs baseline |
| 2026-02-20 14:28 | baseline-all-off | -- | Ablation: all features off |
| 2026-02-20 15:12 | filco-regression-fix-validation | -- | FILCO regression fix |
| 2026-02-20 15:22 | filco-fix-full-golden-eval | -- | FILCO fix full validation |
| 2026-02-20 15:42 | crag-only | -- | Ablation: CRAG only |
| 2026-02-20 16:58 | filco-only | -- | Ablation: FILCO only |
| 2026-02-20 18:00 | guardrails-only | -- | Ablation: Guardrails only |
| 2026-02-20 19:04 | all-three-on | -- | Ablation: all features on |
| 2026-02-20 20:34 | ablation-crag-filco-guardrails | -- | Ablation summary run |
| 2026-02-21 04:47 | crag-only | -- | Ablation v2: CRAG only |
| 2026-02-21 05:00 | filco-only | -- | Ablation v2: FILCO only |
| 2026-02-21 05:35 | all-three-on | -- | Ablation v2: all on |
| 2026-02-21 06:15 | ablation-crag-filco-guardrails | -- | Ablation v2 summary |
| 2026-02-21 07:10 | crag-only | -- | Ablation v3: CRAG only |
| 2026-02-21 07:23 | all-three-on | -- | Ablation v3: all on |
| 2026-02-21 08:01 | ablation-crag-filco-guardrails | -- | Ablation v3 summary |
| 2026-02-21 13:50 | snomed-deep-integration-all-features | 91.0% | SNOMED integration (initial, with regressions) |
| 2026-02-21 16:39 | 3-root-cause-fixes | 50.0% | Targeted re-run of 16 failing questions |
| 2026-02-21 20:14 | reseeded-graph-max-speed | 100.0% | First 100% pass rate achieved |
| 2026-02-22 08:29 | c901-refactoring-verification | 100.0% | Code refactoring -- no regression |
| 2026-02-22 11:51 | ollama-docker-no-regression | -- | Ollama Docker migration check |
| 2026-02-22 12:31 | safety-judge-enabled-rerun | -- | Safety judge feature verification |
| 2026-02-22 13:11 | post-safety-fixes-full-run | 98.9% | Post safety fixes |
| 2026-02-22 22:27 | phase-c-snomed-alias-elimination | 98.9% | SNOMED alias cache (entity recall only) |
| 2026-02-23 03:23 | phase-c-golden-fix | 100.0% | Phase C with ground truth refinement |
| 2026-02-23 04:25 | graph-off-ablation | -- | Graph OFF ablation data |
| 2026-02-23 09:04 | graph-value-assessment | -- | LLM-as-judge graph comparison |
| 2026-02-23 14:41 | perf-optimized-prompt-compression-ollama-warmup | 98.3% | Latency optimization validation |
Performance Trajectory
The evaluation history shows clear quality progression through iterative improvement:
| Milestone | Date | Pass Rate | Entity Recall | Key Change |
|---|---|---|---|---|
| Initial baseline (v2.3) | 2026-02-17 | 92.6% (112/121) | 0.907 | Vector-only, 121 questions |
| Post root-cause fixes (v2.5) | 2026-02-17 | 97.9% (143/146) | 0.930 | 9 failures resolved, 25 questions added |
| Ablation: all features on | 2026-02-20 | 96.9% (158/163) | -- | CRAG + FILCO + Guardrails enabled |
| SNOMED integration | 2026-02-21 | 91.0% (162/178) | -- | 178 questions, initial regressions |
| First 100% pass | 2026-02-21 | 100.0% (178/178) | 0.942 | Reseeded graph + root-cause fixes |
| Phase C: SNOMED aliases | 2026-02-22 | 98.9% (176/178) | 0.936 | SNOMED synonym cache, 17% faster |
| Phase C: ground truth fix | 2026-02-23 | 100.0% (178/178) | 0.956 | Ground truth refinement |
| Final (GPT-5.2 judge) | 2026-02-23 | 100.0% (178/178) | 0.956 | Full LLM-judge evaluation |
| B7: Expanded golden set | 2026-02-27 | 99.2% (249/251) | 0.936 | +93 questions (271 total, v3.3) |
| Pilot deployment | 2026-03-13 | 100.0% (268/268) | 0.902 | Full pilot eval post-Keycloak/Neo4j removal |
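The pass rates in the table above follow directly from the pass and total counts. A small sketch, assuming one-decimal rounding (the helper name is hypothetical), recomputes a few of them from the counts given in the table:

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")
    return round(100.0 * passed / total, 1)


# (passed, total) pairs copied from the Performance Trajectory table.
milestones = {
    "Initial baseline (v2.3)": (112, 121),
    "Post root-cause fixes (v2.5)": (143, 146),
    "SNOMED integration": (162, 178),
    "B7: Expanded golden set": (249, 251),
    "Pilot deployment": (268, 268),
}

for label, (passed, total) in milestones.items():
    print(f"{label}: {pass_rate(passed, total)}% ({passed}/{total})")
```

Because the question count changes between milestones, the percentages are easier to compare when read alongside their denominators.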