
Evaluation Overview

This section documents the evaluation methodology, results, and analysis of the ZOL Intelligent Search system. It contains 302 golden questions across 21 categories (v3.6), 45+ automated evaluation reports, ablation studies, and safety testing results. This page provides a guided reading path through the material.


Reading Guide

The evaluation documentation is organized in three layers: methodology (how we evaluate), results (what we measured), and analysis (what it means). The table below suggests a reading order depending on your interest.

Quick Navigation

| Document | What It Covers | Read If You... |
|---|---|---|
| Latest Eval (2026-03-31) | 99.0% full run (99.7% effective with ground truth fixes): taxonomy dedup, SNOMED gap fill, Knowledge Graph ON | Want the latest production eval results |
| Post-Gap-Fill (2026-03-30) | 97.7%, after LLM batch classification of 1,674 orphaned entities | Want the gap-fill impact data |
| Pre-Dedup Baseline (2026-03-29) | 98.7%, last eval before taxonomy dedup (12,997 bloated entities) | Want the pre-dedup baseline |
| RAG Mixin Split (2026-03-27) | 99.7%: RAG mixin refactoring, dedup, all fixes | Want the previous best result |
| Hardened Baseline (2026-03-21) | 99.0%: hardened pilot, 302 questions, GPT-4.1 evaluator | Want the previous hardened baseline |
| Research Conclusion | Final metrics, research journey, key findings | Want the headline results and executive summary |
| Golden Questions | 302 test questions (v3.6), scoring criteria, design methodology | Want to understand the evaluation benchmark |
| Composite Quality Gate | Multi-metric pass/fail criteria for LLM-as-judge evaluation | Want to understand why single-metric gating fails |
| Ablation Study | Feature impact analysis (CRAG, FILCO, Guardrails) | Want to see which pipeline features matter |
| A/B Experiment: Vector vs Hybrid | Controlled comparison of vector-only vs graph-augmented retrieval | Want evidence for/against the knowledge graph |
| Conditional Graph Injection | Final architecture: selective graph context inclusion | Want to understand the graph injection decision |
| Academic Critical Assessment | Honest gap analysis against state-of-the-art RAG systems | Want a critical perspective on limitations |
| Research Bibliography | 60+ academic references in APA 7th edition | Need to verify citations or find source papers |

| Date | Change | Pass Rate |
|---|---|---|
| 2026-03-19 | Pilot v2: full taxonomy extraction (2,213 entities) | 95.9% (257/268) |
| 2026-03-20 | SNOMED expansion + taxonomy TREATS + prompt fixes | 98.7% (295/299) |
| 2026-03-21 | Definitive baseline: hardened pilot, 302q v3.6 | 99.0% (296/299) |
| 2026-03-27 | RAG mixin split, dedup, all fixes | 99.7% (298/299) |
| 2026-03-29 | PDF corpus scaling incident | 98.7% (295/299) |
| 2026-03-31 | Taxonomy dedup + gap fill + intent fix + Graph ON | 99.0% (296/299), effective 99.7% |

Suggested Reading Paths

For jury members / assessors: Research Conclusion, then Golden Questions (sections 1-3), then Academic Critical Assessment.

For technical reviewers: Golden Questions (full), then Ablation Study, then A/B Experiment, then Conditional Graph Injection.

For safety auditors: Golden Questions (section on safety categories), then Production Configuration Validation, then Research Conclusion (safety section).


Evaluation Architecture

The evaluation framework follows an automated pipeline:

  1. Golden questions (302+ questions across 21 categories, v3.6) define the benchmark with expected entities and pass criteria
  2. Feedback-derived questions: real user questions from negative feedback can be added to the golden set via the "Add to golden questions" button in the AI Investigation panel
  3. Automated runner (run_evaluation.py) sends each question to the live RAG system and captures the response
  4. Entity recall scoring checks whether expected entities (doctors, departments, conditions, phone numbers) appear in the response using case-insensitive substring matching
  5. LLM-as-judge metrics (optional, via DeepEval) assess faithfulness, answer relevancy, context precision, and context recall using a GPT model as evaluator, following the RAGAS framework (Es et al. 2023)
  6. Report generation produces a structured Markdown report with per-question results, statistical analysis, and category breakdowns

Each evaluation run produces a timestamped report, listed in the Evaluation Reports section below. Labels identify the configuration or feature being tested (e.g., phase-c-snomed-alias-elimination, c901-refactoring-verification).
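The entity recall step can be illustrated with a minimal sketch. It assumes recall is simply the share of expected entities that occur as case-insensitive substrings of the response; the function name, the handling of an empty expectation list, and the sample response (including the doctor name and phone number) are made up for illustration and are not taken from run_evaluation.py.

```python
def entity_recall(expected_entities, response):
    """Fraction of expected entities found in the response.

    Matching is a case-insensitive substring test, as described in step 4.
    """
    if not expected_entities:
        # Assumption: a question with no expected entities counts as full recall.
        return 1.0
    haystack = response.casefold()
    hits = sum(1 for entity in expected_entities if entity.casefold() in haystack)
    return hits / len(expected_entities)


# Illustrative response text; names and numbers are invented.
response = "Dr. Peeters works in the Cardiology department. Call 000 00 00 00."
recall = entity_recall(["dr. peeters", "cardiology", "neurology"], response)
print(f"entity recall: {recall:.3f}")
```

Substring matching is cheap and deterministic, but it cannot tell a genuine mention from an incidental one, which is one reason the LLM-as-judge metrics are reported alongside it.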


Key Metrics

| Metric | Definition | Final Value |
|---|---|---|
| Pass rate | % of golden questions correctly answered (composite gate: faithfulness + entity recall + relevancy) | 99.0% (296/299)^1^ |
| Entity recall | Fraction of expected entities found in the response (case-insensitive substring) | 0.932 mean |
| Faithfulness | LLM-judge score: is the response grounded in retrieved context? (0-1) | 0.920 |
| Answer relevancy | LLM-judge score: does the response address the question? (0-1) | 0.944 |
| Safety refusal accuracy | % of unsafe queries (medical advice, medication dosage) correctly refused | 100.0% |
| Adversarial detection | % of GCG-style prompt injection attacks detected | 100.0% |
| Medical advice incidents | Number of responses providing medical advice | 0 |
| Avg response time | Mean end-to-end response latency | 6.3 s |

^1^ Pilot evaluation (2026-03-13) used v3.3 (271 golden questions): 268 content + 3 cache-test. Latest evaluations (2026-03-20 onwards) use v3.6 (302 questions) with expanded coverage for SNOMED terminology and taxonomy routing. The definitive baseline (2026-03-21) achieves 99.0% pass rate (296/299) on the hardened pilot.
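To make the composite gate concrete, the sketch below shows how a per-question pass/fail decision could combine the three metrics and roll up into a pass rate. The threshold values, the `QuestionResult` type, and the function names are hypothetical; the actual criteria are defined in the Composite Quality Gate document.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MIN_FAITHFULNESS = 0.7
MIN_ENTITY_RECALL = 0.5
MIN_RELEVANCY = 0.7


@dataclass
class QuestionResult:
    faithfulness: float
    entity_recall: float
    answer_relevancy: float


def passes_gate(result):
    """A question passes only if every metric clears its own threshold."""
    return (
        result.faithfulness >= MIN_FAITHFULNESS
        and result.entity_recall >= MIN_ENTITY_RECALL
        and result.answer_relevancy >= MIN_RELEVANCY
    )


def pass_rate(results):
    """Percentage of questions that pass the composite gate."""
    if not results:
        return 0.0
    return 100.0 * sum(passes_gate(r) for r in results) / len(results)


results = [
    QuestionResult(0.95, 1.00, 0.90),  # passes all three checks
    QuestionResult(0.95, 0.25, 0.90),  # fails on entity recall
    QuestionResult(0.40, 1.00, 0.90),  # fails on faithfulness
    QuestionResult(0.80, 0.75, 0.85),  # passes all three checks
]
print(f"pass rate: {pass_rate(results):.1f}%")
```

Requiring every metric to clear its threshold means a response with perfect entity recall but poor grounding still fails, which is the motivation for gating on more than one metric.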

Model Version History

| Period | RAG Model | Eval Model | Notes |
|---|---|---|---|
| Feb 2026 | GPT-4.1 via OpenRouter | GPT-4.1 mini via OpenRouter | Initial development |
| Mar 2026 (early) | GPT-4.1 via OpenRouter | GPT-5.2 via OpenRouter | Evaluated GPT-5.2 as eval judge |
| Mar 2026 (mid) | o4-mini via OpenAI direct | GPT-4.1 via OpenAI direct | OpenRouter removed for reliability |
| Mar 2026 (definitive) | o4-mini via OpenAI direct | GPT-4.1 via OpenAI direct | Definitive baseline (99.0%) |

Methodology Documents

Golden Questions Evaluation Set

The Golden Questions document defines the 302-question benchmark (v3.6). It covers:

  • Design methodology: question selection criteria, category taxonomy, difficulty stratification
  • Scoring protocol: entity recall calculation, pass/fail thresholds, multi-entity weighted scoring
  • Category coverage: 21 intent categories (doctor_lookup, condition_department, practical_info, safety_refusal, multilingual, cache_test, etc.) each exercised by 3+ questions
  • Academic grounding: references to evaluation methodology literature (Voorhees 2002; RAGAS framework; DeepEval)

Academic Critical Assessment

The Academic Critical Assessment provides an honest evaluation against state-of-the-art RAG systems, identifying both strengths and gaps. It classifies the ZOL system as a "mature Advanced RAG with significant Modular RAG characteristics" and rates each architectural dimension on a 5-star scale.

Research Bibliography

Canonical bibliography

The canonical, machine-checkable bibliography for the project is at /docs/references (rendered from docs/references.bib). The legacy Research Bibliography page predates the canonical source and is retained for thesis cross-references and discursive context only. New citations across the documentation should land in references.bib and deep-link via /docs/references#bibkey.

The Research Bibliography contains 60+ academic references organised thematically (RAG foundations, retrieval techniques, knowledge graphs, safety, evaluation methodology). All citations follow APA 7th edition format. Where an entry has a corresponding bibkey on the canonical page, the cross-reference is annotated.


Evaluation Reports

Analysis Reports

These are hand-written reports that analyze specific experiments, root causes, or architectural decisions.

| Report | Date | Key Finding |
|---|---|---|
| Pilot Golden Eval: Keycloak Migration | 2026-03-13 | 268/268 (100%) on pilot after Neo4j removal + Keycloak migration |
| Research Conclusion | 2026-02-23 | 100% pass rate, 0.989 faithfulness -- research question answered |
| Conditional Graph Injection | 2026-02-23 | Selective graph context achieves best-of-both-worlds |
| Phase C Analysis | 2026-02-23 | SNOMED synonyms: 17% latency reduction, no regression |
| Graph Value Assessment | 2026-02-23 | Graph ON hurts overall quality but provides critical rescue on entity lookups |
| Production Config Validation | 2026-02-22 | Minimum safety features required for production deployment |
| Ablation Study | 2026-02-20 | CRAG +0.6%, FILCO +1.1%, Guardrails neutral -- all enabled |
| Ablation Root Cause Analysis | 2026-02-21 | Per-question failure investigation across ablation configurations |
| A/B: Vector vs Hybrid | 2026-02-17 | Hybrid retrieval provides statistically significant improvement |
| Progress: v2.3 to v2.5 | 2026-02-17 | Pass rate 92.6% to 97.9% through root-cause fixes |

Automated Evaluation Reports

These are auto-generated reports from the evaluation runner. Each captures a snapshot of system performance at a specific point in time, tagged with a descriptive label. The reports follow a standard format: summary metrics, statistical analysis, per-category breakdown, and per-question results.

About 0.000 metrics

Some automated reports show 0.000 for faithfulness, answer relevancy, context precision, and context recall. This indicates that the LLM-as-judge evaluation modules were disabled for that run (using the --no-eval flag) to save cost and time. Entity recall, pass rate, and safety refusal accuracy are always computed. When all metrics are needed, the evaluation is run with the full judge enabled (as in the final Research Conclusion run).

About 0.000 and near-zero NDCG@5, MRR, Precision@5, Recall@5

Many automated reports show 0.000 for NDCG@5, MRR, Precision@5, and Recall@5. This occurs in runs where expected_source_urls were not yet populated in the golden question set, so no retrieval quality comparison was possible. Reports that do show these metrics typically display very low values (0.017--0.055) rather than the 0.3--0.7 range expected for a well-performing retrieval system. This was a measurement artifact, not a retrieval quality issue: the golden questions defined expected_source_urls at the coarse department-page level (e.g., /cardiologie), while the RAG system correctly retrieved specific sub-pages, doctor profiles, and PDF brochures under that department. Exact URL matching penalized these correct retrievals. This was fixed by implementing URL prefix matching and graded relevance scoring (commit ad0fa06). End-to-end answer quality is better reflected by entity recall and pass rate, which are always computed.
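The effect of prefix matching can be sketched as follows. This is an illustration of the idea only, not the code from commit ad0fa06: the grade values (2 for an exact match, 1 for a page under an expected prefix), the function names, and the example sub-page URLs are assumptions, and the ideal ranking is computed by reordering the retrieved list rather than from a full relevance judgement set.

```python
import math


def graded_relevance(retrieved_url, expected_prefixes):
    """Grade a retrieved URL against coarse expected URLs.

    Hypothetical grades: 2 = exact match, 1 = page under an expected
    prefix (e.g. a doctor profile under /cardiologie), 0 = unrelated.
    """
    url = retrieved_url.rstrip("/")
    best = 0
    for prefix in expected_prefixes:
        base = prefix.rstrip("/")
        if url == base:
            best = max(best, 2)
        elif url.startswith(base + "/"):
            # Require a path separator so /cardiologie-x does not match /cardiologie.
            best = max(best, 1)
    return best


def dcg(grades):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg_at_k(retrieved_urls, expected_prefixes, k=5):
    """NDCG@k, normalised by the best ordering of the retrieved grades."""
    grades = [graded_relevance(u, expected_prefixes) for u in retrieved_urls[:k]]
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Made-up retrieval result for a question whose expected source is /cardiologie.
retrieved = ["/cardiologie/artsen/example-doctor", "/neurologie", "/cardiologie"]
print(f"NDCG@5: {ndcg_at_k(retrieved, ['/cardiologie']):.3f}")
```

Under exact URL matching only the third result in this example would count as relevant; with prefix matching the doctor profile under the department page also receives credit.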

Reports are listed chronologically. Key milestones are highlighted.

| Report | Label | Pass Rate | Significance |
|---|---|---|---|
| 2026-02-17 14:31 | v25-post-fixes | 97.9% | Baseline after root-cause fixes |
| 2026-02-17 15:44 | v251-baseline-decomposition-off | -- | Query decomposition experiment |
| 2026-02-17 16:36 | v251-baseline-decomposition-off-fixed | -- | Decomposition fix verification |
| 2026-02-17 17:25 | v251-decomposition-on | -- | Decomposition enabled comparison |
| 2026-02-18 14:01 | bge-m3-enriched-baseline | -- | BGE-M3 embedding model baseline |
| 2026-02-19 08:25 | graph-quality-fixes-v27 | -- | Graph quality iteration |
| 2026-02-19 08:26 | graph-quality-fixes-v27 | -- | Graph quality (re-run) |
| 2026-02-19 10:23 | graph-quality-fixes-v27 | -- | Graph quality (final) |
| 2026-02-19 13:00 | ab-test-graph-value | -- | A/B graph value test data |
| 2026-02-19 13:15 | graph-on-post-fix | -- | Graph ON after fixes |
| 2026-02-20 03:34 | chatbot-ux-overhaul | -- | UX changes regression check |
| 2026-02-20 04:37 | bge-m3-docs-consolidation-baseline | -- | Consolidated docs baseline |
| 2026-02-20 14:28 | baseline-all-off | -- | Ablation: all features off |
| 2026-02-20 15:12 | filco-regression-fix-validation | -- | FILCO regression fix |
| 2026-02-20 15:22 | filco-fix-full-golden-eval | -- | FILCO fix full validation |
| 2026-02-20 15:42 | crag-only | -- | Ablation: CRAG only |
| 2026-02-20 16:58 | filco-only | -- | Ablation: FILCO only |
| 2026-02-20 18:00 | guardrails-only | -- | Ablation: Guardrails only |
| 2026-02-20 19:04 | all-three-on | -- | Ablation: all features on |
| 2026-02-20 20:34 | ablation-crag-filco-guardrails | -- | Ablation summary run |
| 2026-02-21 04:47 | crag-only | -- | Ablation v2: CRAG only |
| 2026-02-21 05:00 | filco-only | -- | Ablation v2: FILCO only |
| 2026-02-21 05:35 | all-three-on | -- | Ablation v2: all on |
| 2026-02-21 06:15 | ablation-crag-filco-guardrails | -- | Ablation v2 summary |
| 2026-02-21 07:10 | crag-only | -- | Ablation v3: CRAG only |
| 2026-02-21 07:23 | all-three-on | -- | Ablation v3: all on |
| 2026-02-21 08:01 | ablation-crag-filco-guardrails | -- | Ablation v3 summary |
| 2026-02-21 13:50 | snomed-deep-integration-all-features | 91.0% | SNOMED integration (initial, with regressions) |
| 2026-02-21 16:39 | 3-root-cause-fixes | 50.0% | Targeted re-run of 16 failing questions |
| 2026-02-21 20:14 | reseeded-graph-max-speed | 100.0% | First 100% pass rate achieved |
| 2026-02-22 08:29 | c901-refactoring-verification | 100.0% | Code refactoring -- no regression |
| 2026-02-22 11:51 | ollama-docker-no-regression | -- | Ollama Docker migration check |
| 2026-02-22 12:31 | safety-judge-enabled-rerun | -- | Safety judge feature verification |
| 2026-02-22 13:11 | post-safety-fixes-full-run | 98.9% | Post safety fixes |
| 2026-02-22 22:27 | phase-c-snomed-alias-elimination | 98.9% | SNOMED alias cache (entity recall only) |
| 2026-02-23 03:23 | phase-c-golden-fix | 100.0% | Phase C with ground truth refinement |
| 2026-02-23 04:25 | graph-off-ablation | -- | Graph OFF ablation data |
| 2026-02-23 09:04 | graph-value-assessment | -- | LLM-as-judge graph comparison |
| 2026-02-23 14:41 | perf-optimized-prompt-compression-ollama-warmup | 98.3% | Latency optimization validation |

Performance Trajectory

The evaluation history shows clear quality progression through iterative improvement:

| Milestone | Date | Pass Rate | Entity Recall | Key Change |
|---|---|---|---|---|
| Initial baseline (v2.3) | 2026-02-17 | 92.6% (112/121) | 0.907 | Vector-only, 121 questions |
| Post root-cause fixes (v2.5) | 2026-02-17 | 97.9% (143/146) | 0.930 | 9 failures resolved, 25 questions added |
| Ablation: all features on | 2026-02-20 | 96.9% (158/163) | -- | CRAG + FILCO + Guardrails enabled |
| SNOMED integration | 2026-02-21 | 91.0% (162/178) | -- | 178 questions, initial regressions |
| First 100% pass | 2026-02-21 | 100.0% (178/178) | 0.942 | Reseeded graph + root-cause fixes |
| Phase C: SNOMED aliases | 2026-02-22 | 98.9% (176/178) | 0.936 | SNOMED synonym cache, 17% faster |
| Phase C: ground truth fix | 2026-02-23 | 100.0% (178/178) | 0.956 | Ground truth refinement |
| Final (GPT-5.2 judge) | 2026-02-23 | 100.0% (178/178) | 0.956 | Full LLM-judge evaluation |
| B7: Expanded golden set | 2026-02-27 | 99.2% (249/251) | 0.936 | +93 questions (271 total, v3.3) |
| Pilot deployment | 2026-03-13 | 100.0% (268/268) | 0.902 | Full pilot eval post-Keycloak/Neo4j removal |
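The pass-rate percentages in this history follow directly from the passed/total counts. The short sketch below recomputes a few of them as a sanity check; the helper name and the choice of one-decimal rounding are assumptions made for illustration.

```python
def pass_rate_pct(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


# (milestone, passed, total) taken from the performance trajectory.
milestones = [
    ("Initial baseline (v2.3)", 112, 121),
    ("Post root-cause fixes (v2.5)", 143, 146),
    ("SNOMED integration", 162, 178),
    ("B7: Expanded golden set", 249, 251),
]

for name, passed, total in milestones:
    print(f"{name}: {pass_rate_pct(passed, total)}% ({passed}/{total})")
```

Because the denominator changes as the golden set grows, comparing raw percentages across milestones is only meaningful alongside the counts.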