Evaluation Overview

This section documents the evaluation methodology, results, and analysis of the ZOL Intelligent Search system. It covers the golden question set (302 questions across 21 categories, v3.6), 45+ automated evaluation reports, ablation studies, and safety testing results. This page provides a guided reading path through the material.

Reading Guide

The evaluation documentation is organized in three layers: methodology (how we evaluate), results (what we measured), and analysis (what it means). The table below suggests a reading order depending on your interest.

Document	What It Covers	Read If You...
Latest Eval (2026-03-31)	99.0% full run (99.7% effective with ground truth fixes) — taxonomy dedup, SNOMED gap fill, Knowledge Graph ON	Want the latest production eval results
Post-Gap-Fill (2026-03-30)	97.7% — after LLM batch classification of 1,674 orphaned entities	Want the gap-fill impact data
Pre-Dedup Baseline (2026-03-29)	98.7% — last eval before taxonomy dedup (12,997 bloated entities)	Want the pre-dedup baseline
RAG Mixin Split (2026-03-27)	99.7% — RAG mixin refactoring, dedup, all fixes	Want the previous best result
Hardened Baseline (2026-03-21)	99.0% — hardened pilot, 302 questions, GPT-4.1 evaluator	Want the previous hardened baseline

Date	Key Change	Pass Rate
2026-03-19	Pilot v2: full taxonomy extraction (2,213 entities)	95.9% (257/268)
2026-03-20	SNOMED expansion + taxonomy TREATS + prompt fixes	98.7% (295/299)
2026-03-21	Definitive baseline: hardened pilot, 302q v3.6	99.0% (296/299)
2026-03-27	RAG mixin split, dedup, all fixes	99.7% (298/299)
2026-03-29	PDF corpus scaling incident	98.7% (295/299)
2026-03-31	Taxonomy dedup + gap fill + intent fix + Graph ON	99.0% (296/299), effective 99.7%

Document	What It Covers	Read If You...
Research Conclusion	Final metrics, research journey, key findings	Want the headline results and executive summary
Golden Questions	302 test questions (v3.6), scoring criteria, design methodology	Want to understand the evaluation benchmark
Composite Quality Gate	Multi-metric pass/fail criteria for LLM-as-judge evaluation	Want to understand why single-metric gating fails
Ablation Study	Feature impact analysis (CRAG, FILCO, Guardrails)	Want to see which pipeline features matter
A/B Experiment: Vector vs Hybrid	Controlled comparison of vector-only vs graph-augmented retrieval	Want evidence for/against the knowledge graph
Conditional Graph Injection	Final architecture: selective graph context inclusion	Want to understand the graph injection decision
Academic Critical Assessment	Honest gap analysis against state-of-the-art RAG systems	Want a critical perspective on limitations
Research Bibliography	60+ academic references in APA 7th edition	Need to verify citations or find source papers

Evaluation Architecture

The evaluation framework follows an automated pipeline:

Golden questions (302+ questions across 21 categories, v3.6) define the benchmark with expected entities and pass criteria
Feedback-derived questions — real user questions from negative feedback can be added to the golden set via the "Add to golden questions" button in the AI Investigation panel
Automated runner (run_evaluation.py) sends each question to the live RAG system and captures the response
Entity recall scoring checks whether expected entities (doctors, departments, conditions, phone numbers) appear in the response using case-insensitive substring matching
LLM-as-judge metrics (optional, via DeepEval) assess faithfulness, answer relevancy, context precision, and context recall using a GPT model as evaluator, following the RAGAS framework (Es et al. 2023)
Report generation produces a structured Markdown report with per-question results, statistical analysis, and category breakdowns
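The entity recall step above can be sketched as follows. This is a minimal illustration of case-insensitive substring matching, not the code in run_evaluation.py: the function name `entity_recall`, the convention of returning 1.0 when no entities are expected, and the sample doctor name and phone number are all assumptions made for the example.

```python
def entity_recall(response: str, expected_entities: list[str]) -> float:
    """Fraction of expected entities found in the response.

    Matching is a case-insensitive substring test, as described in the
    pipeline steps. Returning 1.0 for an empty expectation list is an
    assumption of this sketch.
    """
    if not expected_entities:
        return 1.0
    haystack = response.casefold()
    found = sum(1 for entity in expected_entities if entity.casefold() in haystack)
    return found / len(expected_entities)


# Hypothetical response and expectations, for illustration only.
response = "Dr. Janssens works in Cardiologie. Call 089 12 34 56 for an appointment."
expected = ["dr. janssens", "cardiologie", "neurologie"]
print(round(entity_recall(response, expected), 3))  # 2 of 3 entities found -> 0.667
```

Substring matching is deliberately lenient: it accepts an entity wherever it appears in the answer, so it measures presence rather than correctness of the surrounding statement. That is one reason the LLM-as-judge metrics are reported alongside it.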

Each evaluation run produces a timestamped report, listed in the Automated Evaluation Reports section below. Labels identify the configuration or feature being tested (e.g., phase-c-snomed-alias-elimination, c901-refactoring-verification).

Key Metrics

Metric	Definition	Final Value
Pass rate	% of golden questions correctly answered (composite gate: faithfulness + entity recall + relevancy)	99.0% (296/299)^1^
Entity recall	Fraction of expected entities found in the response (case-insensitive substring)	0.932 mean
Faithfulness	LLM-judge score: is the response grounded in retrieved context? (0-1)	0.920
Answer relevancy	LLM-judge score: does the response address the question? (0-1)	0.944
Safety refusal accuracy	% of unsafe queries (medical advice, medication dosage) correctly refused	100.0%
Adversarial detection	% of GCG-style prompt injection attacks detected	100.0%
Medical advice incidents	Number of responses providing medical advice	0
Avg response time	Mean end-to-end response latency	6.3s

^1^ The pilot evaluation (2026-03-13) used golden set v3.3 (271 questions: 268 content and 3 cache-test). Evaluations from 2026-03-20 onwards use v3.6 (302 questions), which expands coverage of SNOMED terminology and taxonomy routing. The definitive baseline (2026-03-21) achieved a 99.0% pass rate (296/299) on the hardened pilot.
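A composite gate of the kind named in the pass rate definition can be sketched as below. The three metric names come from the table above, but the threshold values, the `QuestionScores` type, and the function names are hypothetical; the actual criteria are defined in the Composite Quality Gate document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionScores:
    entity_recall: float
    faithfulness: float
    answer_relevancy: float


# Hypothetical minimums, chosen only to make the example concrete.
THRESHOLDS = {"entity_recall": 0.5, "faithfulness": 0.7, "answer_relevancy": 0.7}


def passes_gate(scores: QuestionScores) -> bool:
    """A question passes only if every metric meets its minimum."""
    return all(getattr(scores, name) >= minimum for name, minimum in THRESHOLDS.items())


def pass_rate(results: list[QuestionScores]) -> float:
    """Percentage of questions that pass the composite gate."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(passes_gate(r) for r in results) / len(results)


results = [
    QuestionScores(entity_recall=1.0, faithfulness=0.95, answer_relevancy=0.90),
    QuestionScores(entity_recall=1.0, faithfulness=0.40, answer_relevancy=0.90),  # ungrounded
    QuestionScores(entity_recall=0.8, faithfulness=0.90, answer_relevancy=0.85),
]
print(round(pass_rate(results), 1))   # 66.7
print(round(100 * 296 / 299, 1))      # 99.0, the reported definitive baseline
```

The second question has perfect entity recall yet fails on faithfulness, which is the kind of case a single-metric gate would let through.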

Model Version History

Period	RAG Model	Eval Model	Notes
Feb 2026	GPT-4.1 via OpenRouter	GPT-4.1 mini via OpenRouter	Initial development
Mar 2026 (early)	GPT-4.1 via OpenRouter	GPT-5.2 via OpenRouter	Evaluated GPT-5.2 as eval judge
Mar 2026 (mid)	o4-mini via OpenAI direct	GPT-4.1 via OpenAI direct	OpenRouter removed for reliability
Mar 2026 (definitive)	o4-mini via OpenAI direct	GPT-4.1 via OpenAI direct	Definitive baseline (99.0%)

Methodology Documents

Golden Questions Evaluation Set

The Golden Questions document defines the 302-question benchmark (v3.6). It covers:

Design methodology: question selection criteria, category taxonomy, difficulty stratification
Scoring protocol: entity recall calculation, pass/fail thresholds, multi-entity weighted scoring
Category coverage: 21 intent categories (doctor_lookup, condition_department, practical_info, safety_refusal, multilingual, cache_test, etc.) each exercised by 3+ questions
Academic grounding: references to evaluation methodology literature (Voorhees 2002; RAGAS framework; DeepEval)

Academic Critical Assessment

The Academic Critical Assessment provides an honest evaluation against state-of-the-art RAG systems, identifying both strengths and gaps. It classifies the ZOL system as a "mature Advanced RAG with significant Modular RAG characteristics" and rates each architectural dimension on a 5-star scale.

Research Bibliography

Canonical bibliography

The canonical, machine-checkable bibliography for the project is at /docs/references (rendered from docs/references.bib). The legacy Research Bibliography page predates the canonical source and is retained for thesis cross-references and discursive context only. New citations across the documentation should land in references.bib and deep-link via /docs/references#bibkey.

The Research Bibliography contains 60+ academic references organised thematically (RAG foundations, retrieval techniques, knowledge graphs, safety, evaluation methodology). All citations follow APA 7th edition format. Where an entry has a corresponding bibkey on the canonical page, the cross-reference is annotated.

Evaluation Reports

Analysis Reports

These are hand-written reports that analyze specific experiments, root causes, or architectural decisions.

Report	Date	Key Finding
Pilot Golden Eval — Keycloak Migration	2026-03-13	268/268 (100%) on pilot after Neo4j removal + Keycloak migration
Research Conclusion	2026-02-23	100% pass rate, 0.989 faithfulness -- research question answered
Conditional Graph Injection	2026-02-23	Selective graph context achieves best-of-both-worlds
Phase C Analysis	2026-02-23	SNOMED synonyms: 17% latency reduction, no regression
Graph Value Assessment	2026-02-23	Graph ON hurts overall quality but provides critical rescue on entity lookups
Production Config Validation	2026-02-22	Minimum safety features required for production deployment
Ablation Study	2026-02-20	CRAG +0.6%, FILCO +1.1%, Guardrails neutral -- all enabled
Ablation Root Cause Analysis	2026-02-21	Per-question failure investigation across ablation configurations
A/B: Vector vs Hybrid	2026-02-17	Hybrid retrieval provides statistically significant improvement
Progress: v2.3 to v2.5	2026-02-17	Pass rate 92.6% to 97.9% through root-cause fixes

Automated Evaluation Reports

These are auto-generated reports from the evaluation runner. Each captures a snapshot of system performance at a specific point in time, tagged with a descriptive label. The reports follow a standard format: summary metrics, statistical analysis, per-category breakdown, and per-question results.

About 0.000 metrics

Some automated reports show 0.000 for faithfulness, answer relevancy, context precision, and context recall. This indicates that the LLM-as-judge evaluation modules were disabled for that run (using the --no-eval flag) to save cost and time. Entity recall, pass rate, and safety refusal accuracy are always computed. When all metrics are needed, the evaluation is run with the full judge enabled (as in the final Research Conclusion run).
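When aggregating judge metrics across several reports, runs made with --no-eval should be excluded rather than averaged in as zeros. The sketch below shows one way to do that. The report dictionaries, their values, and the detection heuristic (all four judge metrics exactly 0.0) are assumptions for illustration; the actual report format may expose the flag differently.

```python
from statistics import mean
from typing import Optional

JUDGE_METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def judge_disabled(report: dict) -> bool:
    """Heuristic (assumption): all four judge metrics at 0.0 means a --no-eval run."""
    return all(report.get(name, 0.0) == 0.0 for name in JUDGE_METRICS)


def mean_judge_metric(reports: list[dict], name: str) -> Optional[float]:
    """Mean of one judge metric over runs where the judge actually ran."""
    values = [r[name] for r in reports if not judge_disabled(r)]
    return mean(values) if values else None


# Hypothetical report summaries, not taken from real runs.
full_run = {"pass_rate": 100.0, "faithfulness": 0.9, "answer_relevancy": 0.94,
            "context_precision": 0.8, "context_recall": 0.85}
fast_run = {"pass_rate": 98.9, "faithfulness": 0.0, "answer_relevancy": 0.0,
            "context_precision": 0.0, "context_recall": 0.0}

print(mean_judge_metric([full_run, fast_run], "faithfulness"))  # 0.9
print(mean([full_run["faithfulness"], fast_run["faithfulness"]]))  # naive mean, misleadingly low
```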

About 0.000 and near-zero NDCG@5, MRR, Precision@5, Recall@5

Many automated reports show 0.000 for NDCG@5, MRR, Precision@5, and Recall@5. This occurs in runs where expected_source_urls were not yet populated in the golden question set, so no retrieval quality comparison was possible.

Reports that do show these metrics typically display very low values (0.017--0.055) rather than the 0.3--0.7 range expected for a well-performing retrieval system. This was a measurement artifact, not a retrieval quality issue: the golden questions defined expected_source_urls at the coarse department-page level (e.g., /cardiologie), while the RAG system correctly retrieved specific sub-pages, doctor profiles, and PDF brochures under that department. Exact URL matching penalized these correct retrievals.

This was fixed by implementing URL prefix matching and graded relevance scoring (commit ad0fa06). End-to-end answer quality is better reflected by entity recall and pass rate, which are always computed.
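The effect of prefix matching on NDCG@5 can be illustrated with a small sketch. It is not the code from commit ad0fa06: the grading scheme (2 for an exact match, 1 for a page under an expected path, 0 otherwise), the helper names, and the example URLs are assumptions, and the ideal ranking is computed from the retrieved items only, which is a common simplification.

```python
import math
from urllib.parse import urlparse


def _path(url: str) -> str:
    """Normalised URL path without a trailing slash ('/' for the root)."""
    return urlparse(url).path.rstrip("/") or "/"


def relevance(retrieved_url: str, expected_urls: list[str]) -> int:
    """Hypothetical grades: 2 = exact match, 1 = page under an expected path, 0 = miss."""
    path = _path(retrieved_url)
    best = 0
    for expected in expected_urls:
        exp = _path(expected)
        if path == exp:
            return 2
        prefix = exp if exp.endswith("/") else exp + "/"
        if path.startswith(prefix):  # match on a path-segment boundary only
            best = 1
    return best


def dcg(grades: list[int]) -> float:
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(retrieved: list[str], expected_urls: list[str], k: int = 5) -> float:
    grades = [relevance(url, expected_urls) for url in retrieved[:k]]
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Hypothetical retrieval result for a question whose expected source is /cardiologie.
retrieved = [
    "https://example.org/cardiologie/artsen/dr-x",
    "https://example.org/neurologie",
    "https://example.org/cardiologie/brochure.pdf",
]
print(ndcg_at_k(retrieved, ["/cardiologie"]))
```

Under exact URL matching none of the three results would count as relevant, so the score would be 0. With prefix matching the two cardiology sub-pages are credited, while a sibling path such as /cardiologie-extra would still be rejected because the match is anchored on a path-segment boundary.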

Reports are listed chronologically. Key milestones are highlighted.

Report	Label	Pass Rate	Significance
2026-02-17 14:31	v25-post-fixes	97.9%	Baseline after root-cause fixes
2026-02-17 15:44	v251-baseline-decomposition-off	--	Query decomposition experiment
2026-02-17 16:36	v251-baseline-decomposition-off-fixed	--	Decomposition fix verification
2026-02-17 17:25	v251-decomposition-on	--	Decomposition enabled comparison
2026-02-18 14:01	bge-m3-enriched-baseline	--	BGE-M3 embedding model baseline
2026-02-19 08:25	graph-quality-fixes-v27	--	Graph quality iteration
2026-02-19 08:26	graph-quality-fixes-v27	--	Graph quality (re-run)
2026-02-19 10:23	graph-quality-fixes-v27	--	Graph quality (final)
2026-02-19 13:00	ab-test-graph-value	--	A/B graph value test data
2026-02-19 13:15	graph-on-post-fix	--	Graph ON after fixes
2026-02-20 03:34	chatbot-ux-overhaul	--	UX changes regression check
2026-02-20 04:37	bge-m3-docs-consolidation-baseline	--	Consolidated docs baseline
2026-02-20 14:28	baseline-all-off	--	Ablation: all features off
2026-02-20 15:12	filco-regression-fix-validation	--	FILCO regression fix
2026-02-20 15:22	filco-fix-full-golden-eval	--	FILCO fix full validation
2026-02-20 15:42	crag-only	--	Ablation: CRAG only
2026-02-20 16:58	filco-only	--	Ablation: FILCO only
2026-02-20 18:00	guardrails-only	--	Ablation: Guardrails only
2026-02-20 19:04	all-three-on	--	Ablation: all features on
2026-02-20 20:34	ablation-crag-filco-guardrails	--	Ablation summary run
2026-02-21 04:47	crag-only	--	Ablation v2: CRAG only
2026-02-21 05:00	filco-only	--	Ablation v2: FILCO only
2026-02-21 05:35	all-three-on	--	Ablation v2: all on
2026-02-21 06:15	ablation-crag-filco-guardrails	--	Ablation v2 summary
2026-02-21 07:10	crag-only	--	Ablation v3: CRAG only
2026-02-21 07:23	all-three-on	--	Ablation v3: all on
2026-02-21 08:01	ablation-crag-filco-guardrails	--	Ablation v3 summary
2026-02-21 13:50	snomed-deep-integration-all-features	91.0%	SNOMED integration (initial, with regressions)
2026-02-21 16:39	3-root-cause-fixes	50.0%	Targeted re-run of 16 failing questions
2026-02-21 20:14	reseeded-graph-max-speed	100.0%	First 100% pass rate achieved
2026-02-22 08:29	c901-refactoring-verification	100.0%	Code refactoring -- no regression
2026-02-22 11:51	ollama-docker-no-regression	--	Ollama Docker migration check
2026-02-22 12:31	safety-judge-enabled-rerun	--	Safety judge feature verification
2026-02-22 13:11	post-safety-fixes-full-run	98.9%	Post safety fixes
2026-02-22 22:27	phase-c-snomed-alias-elimination	98.9%	SNOMED alias cache (entity recall only)
2026-02-23 03:23	phase-c-golden-fix	100.0%	Phase C with ground truth refinement
2026-02-23 04:25	graph-off-ablation	--	Graph OFF ablation data
2026-02-23 09:04	graph-value-assessment	--	LLM-as-judge graph comparison
2026-02-23 14:41	perf-optimized-prompt-compression-ollama-warmup	98.3%	Latency optimization validation

Performance Trajectory

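Pass rate rose from 92.6% on the initial 121-question baseline to 100.0% on the 268-question pilot evaluation. Dips along the way, such as the initial SNOMED integration, were followed by root-cause fixes: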

Milestone	Date	Pass Rate	Entity Recall	Key Change
Initial baseline (v2.3)	2026-02-17	92.6% (112/121)	0.907	Vector-only, 121 questions
Post root-cause fixes (v2.5)	2026-02-17	97.9% (143/146)	0.930	9 failures resolved, 25 questions added
Ablation: all features on	2026-02-20	96.9% (158/163)	--	CRAG + FILCO + Guardrails enabled
SNOMED integration	2026-02-21	91.0% (162/178)	--	178 questions, initial regressions
First 100% pass	2026-02-21	100.0% (178/178)	0.942	Reseeded graph + root-cause fixes
Phase C: SNOMED aliases	2026-02-22	98.9% (176/178)	0.936	SNOMED synonym cache, 17% faster
Phase C: ground truth fix	2026-02-23	100.0% (178/178)	0.956	Ground truth refinement
Final (GPT-5.2 judge)	2026-02-23	100.0% (178/178)	0.956	Full LLM-judge evaluation
B7: Expanded golden set	2026-02-27	99.2% (249/251)	0.936	+93 questions (271 total, v3.3)
Pilot deployment	2026-03-13	100.0% (268/268)	0.902	Full pilot eval post-Keycloak/Neo4j removal
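The percentages in the table can be recomputed from the pass counts, which is a quick consistency check when adding a new milestone. The sketch below uses counts copied from the table above; the function name `pass_rate_pct` is an assumption.

```python
def pass_rate_pct(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal as in the table."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


# (milestone, passed, total) taken from the Performance Trajectory table.
milestones = [
    ("Initial baseline (v2.3)", 112, 121),
    ("Post root-cause fixes (v2.5)", 143, 146),
    ("Ablation: all features on", 158, 163),
    ("SNOMED integration", 162, 178),
    ("Phase C: SNOMED aliases", 176, 178),
    ("B7: Expanded golden set", 249, 251),
    ("Pilot deployment", 268, 268),
]

for label, passed, total in milestones:
    print(f"{label}: {pass_rate_pct(passed, total)}% ({passed}/{total})")
```

Because the denominator grows from 121 to 268 across these rows, differences between consecutive pass rates mix system changes with benchmark changes and are not like-for-like comparisons.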
