Research Conclusion — ZOL Intelligent Search

Date: 2026-02-23
Project: Ziekenhuis Oost-Limburg (ZOL) Intelligent Search Function
Context: PXL AI Technology Architect graduation project

Executive Summary

This document concludes the research phase of the ZOL Intelligent Search project. Over the course of 330 commits, 44 Architecture Decision Records (ADRs), 43 evaluation reports, and 4,161+ unit tests, we built and validated a production-ready intelligent search system for ZOL's hospital website.

The core research question, "Can RAG with a knowledge graph replace keyword search for a hospital website while maintaining medical safety?", is answered with a yes, on the condition that the knowledge graph is injected selectively rather than on every query (see Key Research Findings).

Final Metrics (2026-02-23, GPT-5.2 Judge)

Metric	Value	Target	Status
Pass rate	100.0% (178/178)	≥ 95%	Exceeded
Faithfulness	0.989	≥ 0.90	Exceeded
Answer relevancy	0.950	≥ 0.85	Exceeded
Entity recall	0.956	≥ 0.90	Exceeded
Safety refusal accuracy	100.0%	100.0%	Met
Medical advice incidents	0	0	Met
Avg response time	10.0s	< 15s	Met

Research Journey

Phase 1: Foundation (Feb 1–10)

Objective: Build baseline RAG pipeline with semantic search.

Ingested ~1,000 ZOL brochures and web pages via custom crawlers
Implemented pgvector embeddings (BAAI/bge-m3, 1024d) with hybrid BM25 search
Built Neo4j knowledge graph with entity extraction: doctors ↔ departments ↔ conditions ↔ treatments ↔ campuses
Established evaluation framework: 178 golden questions across 20 categories
Implemented multi-layer safety: content filtering, prompt injection detection, LLM safety judge

Key decisions:

ADR-0001: Tiktoken + markdown chunking
ADR-0002: No-mocking test policy (testcontainers)
ADR-0014: Taxonomy-driven graph with LLM validation
ADR-0029: Neo4j service naming (GraphitiService → Neo4jService)

Phase 2: Graph Quality (Feb 8–14)

Objective: Fix extraction errors and establish data quality.

Nine iterations of graph quality fixes (v1–v9) addressing:

Cross-product bugs: departments linked to ALL campuses instead of correct ones
Garbage entity names: body parts and job titles parsed as doctor names
Self-referential relationships and hub concept inflation
Entity type collisions (treatment vs. service vs. examination)

Result: Database doctor score 91/100. 109 departments, 120 conditions, 528 doctors searchable.
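
Two of the error classes listed above (self-referential relationships and cross-product links) lend themselves to simple structural checks. The following is a minimal sketch, assuming the graph can be exported as (source, relation, target) triples; the function names and the relation label used in the usage note are hypothetical and not taken from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target); assumed export format


def find_self_loops(edges: Iterable[Edge]) -> List[Edge]:
    """Return edges whose source and target are the same entity."""
    return [edge for edge in edges if edge[0] == edge[2]]


def find_cross_product_suspects(
    edges: Iterable[Edge], relation: str, all_targets: Set[str]
) -> List[str]:
    """Return sources linked via `relation` to every known target.

    Being linked to all targets is only a heuristic signal: an entity
    may legitimately be present everywhere, so hits need manual review.
    """
    linked: Dict[str, Set[str]] = defaultdict(set)
    for source, rel, target in edges:
        if rel == relation:
            linked[source].add(target)
    if len(all_targets) < 2:
        return []  # with fewer than two targets, "all" carries no signal
    return sorted(s for s, targets in linked.items() if targets >= all_targets)
```

Called with a hypothetical relation such as "LOCATED_AT" and the set of campus names, the second function lists departments attached to every campus, which is the pattern the cross-product bug produced.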

Key innovations:

Frozen taxonomy architecture with scraper-driven entity registry
SNOMED CT integration for medical terminology normalization
Comprehensive domain knowledge maps (DEPT_CONDITION_MAP, SPECIALTY_DEPARTMENT_MAP, etc.)
LLM entity validation with cross-page caching

Phase 3: Retrieval Quality (Feb 14–21)

Objective: Optimize answer quality through retrieval pipeline improvements.

Implemented and evaluated through systematic ablation studies:

Feature	Impact on Pass Rate	Decision
CRAG (web search fallback)	+0.6%	Enabled
FILCO (context filtering)	+1.1%	Enabled
Safety guardrails	No regression	Enabled
Query decomposition	-0.6%	Disabled
Embedding upgrade (bge-m3)	Baseline established	Production ready

Key evaluation runs:

Ablation study: 8 configurations tested (baseline, each feature solo, all combinations)
Root cause analysis: Fixed FILCO regression, optimized CRAG thresholds
SNOMED integration: 15 medical terminology questions added, all passing

Phase 4: Graph Optimization (Feb 22–23)

Objective: Determine optimal knowledge graph utilization strategy.

Graph Value Assessment (90 questions, GPT-4.1 judge, randomized A/B):

Finding	Detail
Graph hurts overall quality	-0.067 average (4.609 vs 4.676)
Judge prefers Graph OFF	61% vs 23%
Graph provides critical rescue	GQ-088: +2.8 (4.8 vs 2.0)
Graph helps entity lookups	doctor_department: +0.10
Graph hurts content queries	ambiguous_symptom: -0.32
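
A randomized A/B comparison of this kind usually hides which variant is which from the judge, so that position bias cannot favor one configuration. The sketch below shows one way to do that; the judge callable, its "A"/"B"/"tie" return values, and the function name are assumptions for illustration, not the project's evaluation code.

```python
import random
from typing import Callable

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A", "B" or "tie"
Judge = Callable[[str, str, str], str]


def blind_compare(
    question: str,
    graph_on_answer: str,
    graph_off_answer: str,
    judge: Judge,
    rng: random.Random,
) -> str:
    """Show both answers in random order and map the verdict back."""
    on_is_a = rng.random() < 0.5
    if on_is_a:
        answer_a, answer_b = graph_on_answer, graph_off_answer
    else:
        answer_a, answer_b = graph_off_answer, graph_on_answer

    verdict = judge(question, answer_a, answer_b)
    if verdict == "tie":
        return "tie"
    # The judge only sees positions; translate the position back to a variant.
    picked_a = verdict == "A"
    return "graph_on" if picked_a == on_is_a else "graph_off"
```

Counting the returned labels over all questions gives preference shares of the kind reported in the table.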

Solution: Conditional Graph Injection, a gate that injects graph context only when intent or retrieval signals indicate it will help.

Final validation: 100% pass rate with premium judge (GPT-5.2), 0 regressions.

Architecture

User Query (Dutch natural language)
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Safety Layer (prompt injection detection)       │
│  → Blocks adversarial/harmful queries            │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Intent Classification (gpt-4.1-nano)            │
│  → 10 intent categories + entity extraction      │
│  → Retrieval strategy selection                  │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Parallel Retrieval                              │
│  ┌──────────────┐  ┌─────────────────────────┐  │
│  │ Vector Search │  │ Knowledge Graph Search  │  │
│  │ (pgvector +   │  │ (PostgreSQL taxonomy    │  │
│  │  BM25 hybrid) │  │  tables + SQL queries)  │  │
│  └──────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  ★ Conditional Graph Injection Gate              │
│  → Intent-based: doctor/dept → always inject     │
│  → Sparse results: < 3 vectors → inject          │
│  → Low similarity: < 0.65 → inject (rescue)      │
│  → Default: suppress graph (avoid dilution)      │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Context Assembly                                │
│  → FILCO filtering (remove irrelevant chunks)    │
│  → Token budget management                       │
│  → Citation generation                           │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  CRAG (Corrective RAG)                           │
│  → Evaluates retrieval quality                   │
│  → Web search fallback if insufficient           │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  LLM Generation (gpt-4.1-mini)                    │
│  → Grounded answer with citations                │
│  → Safety guardrails in system prompt            │
│  → Dutch medical disclaimer                      │
└─────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Post-Generation Safety                          │
│  → LLM safety judge (medical advice detection)   │
│  → Content policy enforcement                    │
│  → Citation verification                         │
└─────────────────────────────────────────────────┘

Key Research Findings

1. Knowledge Graphs Are Conditionally Valuable

The most significant finding is that knowledge graphs in RAG systems are not universally beneficial. They help for structural queries (entity lookups, relationship traversals) but hurt for content-rich queries (symptoms, treatments, conditions) where document chunks already contain comprehensive information.

Implication: Always-on graph injection is an anti-pattern. A gate based on intent classification and retrieval quality signals keeps the graph's benefit for entity lookups and rescue cases while avoiding dilution of content-rich answers.
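
The gate rules shown in the Architecture diagram can be sketched as a small decision function. The thresholds (fewer than 3 vector results, similarity below 0.65) come from that diagram; the intent labels, the choice of the best similarity score as the "low similarity" signal, and all names are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Sequence

MIN_VECTOR_HITS = 3        # "Sparse results: < 3 vectors" in the diagram
MIN_TOP_SIMILARITY = 0.65  # "Low similarity: < 0.65" in the diagram
# Hypothetical intent labels; the source only names "doctor/dept".
GRAPH_INTENTS = frozenset({"doctor_lookup", "department_lookup"})


@dataclass(frozen=True)
class GateDecision:
    inject: bool
    reason: str


def should_inject_graph(intent: str, similarities: Sequence[float]) -> GateDecision:
    """Decide whether graph context is added to the prompt."""
    if intent in GRAPH_INTENTS:
        return GateDecision(True, "intent")
    if len(similarities) < MIN_VECTOR_HITS:
        return GateDecision(True, "sparse_results")
    # Assumption: the best vector hit is the similarity that is thresholded.
    if max(similarities) < MIN_TOP_SIMILARITY:
        return GateDecision(True, "low_similarity")
    return GateDecision(False, "default_suppress")
```

Returning the reason alongside the decision makes it possible to log which rule fired, which would help when tuning the thresholds against real user signals.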

2. Eval Model Quality Matters

Using GPT-4.1-mini as the evaluation judge produced 5 false negatives (97.2% pass rate). GPT-5.2 correctly scored all 178 questions as passing (100.0%). The premium model is worth the cost for final validation runs.

Judge Model	Pass Rate	False Negatives	Cost per Run
gpt-4.1-mini	97.2%	5	~$0.36
gpt-5.2	100.0%	0	~$14.24
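
The figures in this table can be cross-checked with a few lines of arithmetic. This is a worked example only: it assumes each false negative is one of the 178 golden questions being marked as failed, and the cost values are the approximate ones from the table.

```python
TOTAL_QUESTIONS = 178


def pass_rate(false_negatives: int, total: int = TOTAL_QUESTIONS) -> float:
    """Pass rate in percent when `false_negatives` questions are marked failed."""
    return 100 * (total - false_negatives) / total


mini_rate = round(pass_rate(5), 1)       # 97.2
premium_rate = round(pass_rate(0), 1)    # 100.0
cost_ratio = round(14.24 / 0.36, 1)      # 39.6, premium judge vs. mini judge

print(mini_rate, premium_rate, cost_ratio)
```

At roughly 40 times the cost per run, the premium judge fits final validation better than every development iteration.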

3. Medical Safety Is Achievable at Zero Incidents

Multi-layer safety (content filtering + prompt injection detection + LLM safety judge + system prompt guardrails) achieves 100% refusal accuracy across 21 safety and adversarial test cases, including:

Direct medical advice requests
Prompt injection attempts (GCG, role hijacking, system prompt extraction)
Edge cases (urgent scenarios, dosage questions)
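
The layering idea can be illustrated as an ordered chain of checks in which the first layer that objects stops the request. This is a minimal sketch of the pre-generation part only; the regular expressions are placeholders invented for this example and are far simpler than a real filter, and the LLM safety judge is omitted because it needs a model call.

```python
import re
from typing import Callable, Optional, Sequence

# A check returns a refusal reason, or None when the query may proceed.
Check = Callable[[str], Optional[str]]

# Placeholder patterns for illustration only (not the project's rules).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]
DOSAGE_PATTERN = re.compile(r"\b(dosering|dosis)\b", re.IGNORECASE)


def detect_injection(query: str) -> Optional[str]:
    if any(p.search(query) for p in INJECTION_PATTERNS):
        return "prompt_injection"
    return None


def detect_medical_advice(query: str) -> Optional[str]:
    return "medical_advice" if DOSAGE_PATTERN.search(query) else None


def first_refusal(query: str, checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and return the first refusal reason, if any."""
    for check in checks:
        reason = check(query)
        if reason is not None:
            return reason
    return None


CHECKS: Sequence[Check] = (detect_injection, detect_medical_advice)
```

Because each layer is an independent callable, a layer can be added or replaced without touching the others, and every refusal carries the name of the layer that triggered it.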

4. Dutch Medical NLP Is Tractable

Despite the limited Dutch medical NLP ecosystem, robust entity resolution for Dutch medical queries is achieved by combining:

SNOMED CT synonyms (400+ medical terms)
Handcrafted taxonomy (222 aliases)
LLM entity validation
Fuzzy matching fallbacks
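
An exact alias lookup with a fuzzy fallback could look roughly like the sketch below. The alias table is a tiny hypothetical stand-in for the SNOMED CT synonyms and handcrafted aliases, and the use of difflib with a 0.8 cutoff is an assumption; the source does not say which fuzzy matcher the project uses.

```python
import difflib
from typing import Dict, Optional

# Hypothetical alias table: lowercase alias -> canonical entity name.
ALIASES: Dict[str, str] = {
    "spoed": "Spoedgevallendienst",
    "spoedgevallendienst": "Spoedgevallendienst",
    "kinderpsychiatrie": "Kinderpsychiatrie",
    "cardiologie": "Cardiologie",
    "hartziekten": "Cardiologie",
}


def resolve_entity(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Resolve a user mention to a canonical entity, or None if unknown."""
    key = mention.strip().lower()
    # 1. Exact alias hit (cheap and unambiguous).
    if key in ALIASES:
        return ALIASES[key]
    # 2. Fuzzy fallback for typos and inflected forms.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

A relatively high cutoff keeps the fallback conservative, so an unknown term returns None instead of being mapped to the nearest but wrong entity.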

5. Hybrid Search Outperforms Pure Vector

Hybrid search that blends BM25 keyword scores (weight 0.3) with vector similarity consistently improves retrieval over pure vector search for:

Exact medical terms (drug names, procedure codes)
Proper nouns (doctor names, department names)
Compound Dutch words (spoedgevallendienst, kinderpsychiatrie)
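
One common way to apply such a weight is a linear blend of normalized scores. The sketch below assumes min-max normalization per result list and a vector weight of 1 - 0.3 = 0.7; only the BM25 weight of 0.3 comes from the source, and the actual fusion method may differ.

```python
from typing import Dict, List, Tuple

BM25_WEIGHT = 0.3  # from the text; the complementary vector weight is assumed


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float], bm25_scores: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Blend both score sets and return documents, best first."""
    vec, bm25 = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc: (1 - BM25_WEIGHT) * vec.get(doc, 0.0) + BM25_WEIGHT * bm25.get(doc, 0.0)
        for doc in set(vec) | set(bm25)
    }
    # Sort by score descending, then by id so that ties are deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

With this blend, a chunk containing the exact term (for example a doctor's surname) can overtake a chunk that is only slightly closer in embedding space.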

What Remains (Engineering, Not Research)

The following items are production engineering tasks, not research questions:

Item	Type	Priority
Load testing and performance optimization	Engineering	High
Monitoring, alerting, and observability	Operations	High
A/B testing framework for production	Engineering	Medium
Threshold tuning with real user signals	Optimization	Medium
UI/UX refinement (chat interface)	Design	Medium
Embedding model upgrade (bge-m3 deployed)	Incremental	Low
Agentic RAG for multi-step reasoning	Future research	Deferred

Evaluation Coverage

Golden Questions: 178 across 20 categories

Category	Count	Description
condition_department	19	"Where do I go for condition X?"
multi_hop_graph	19	Multi-step entity traversals
snomed_terminology	15	SNOMED medical terms in Dutch
adversarial_gcg	12	Prompt injection attacks
out_of_scope	12	Non-ZOL and irrelevant questions
practical_info	12	Visiting hours, parking, phone
safety_refusal	9	Medical advice, dosage requests
service_info	9	"Does ZOL offer service X?"
entity_disambiguation	8	Ambiguous entity references
multilingual	8	French, English, German queries
treatment_info	8	Treatment procedures and details
campus_info	6	Campus locations and services
compound_word	6	Dutch compound word handling
doctor_department	6	Doctor ↔ department lookups
followup_chain	6	Multi-turn conversation chains
ambiguous_symptom	5	Vague symptom descriptions
navigation	5	Wayfinding and practical navigation
emergency	3	Emergency and urgent care
referral	3	Referral process questions

Evaluation Methodology

Entity recall: Rule-based metric — expected entities present in answer
Faithfulness: LLM judge — answer supported by retrieved context
Answer relevancy: LLM judge — answer addresses the question
Context precision/recall: LLM judge — retrieved context quality
Safety refusal: Deterministic — blocked queries return refusal message
A/B assessment: Randomized LLM-as-judge comparison (90 questions)
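
The rule-based entity recall metric can be sketched as the share of expected entities that appear in the answer. Case-insensitive substring matching is an assumption made here; the source only states that expected entities must be present in the answer, and the example entity names in the usage note are illustrative.

```python
from typing import Iterable


def entity_recall(expected: Iterable[str], answer: str) -> float:
    """Fraction of expected entities found in the answer (0.0 to 1.0)."""
    wanted = [entity for entity in expected if entity.strip()]
    if not wanted:
        return 1.0  # nothing was expected, so nothing can be missing
    text = answer.casefold()
    found = sum(1 for entity in wanted if entity.casefold() in text)
    return found / len(wanted)
```

For example, an answer that mentions "cardiologie" but not the expected campus name would score 0.5 against two expected entities.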

Historical Pass Rates

Date	Configuration	Pass Rate	Judge
Feb 17	v25 baseline	93.3%	gpt-4.1-mini
Feb 20	All features on	97.8%	gpt-4.1-mini
Feb 21	SNOMED integration	97.8%	gpt-4.1-mini
Feb 22	Post-safety fixes	98.9%	o4-mini
Feb 22	Phase C (SNOMED aliases)	98.9%	o4-mini
Feb 23	Conditional graph injection	100.0%	gpt-5.2

Conclusion

The ZOL Intelligent Search system is research-complete and production-ready for the core use case: replacing keyword search with an intelligent, safety-aware, grounded search function for the ZOL hospital website.

The key architectural insight — conditional graph injection — resolves the tension between knowledge graph utility and answer quality dilution, achieving a system that:

Answers structural queries (doctors, departments, campuses) with graph-sourced authority
Answers content queries (conditions, treatments, symptoms) with rich document-sourced detail
Rescues failed queries where vector search returns low-quality results
Refuses unsafe queries with 100% accuracy across all tested attack vectors
Handles Dutch medical terminology via SNOMED CT integration and fuzzy matching

The remaining work is engineering (production hardening, monitoring, A/B testing) — not research.

Addendum (March 2026)

Since the research conclusion was written, the system has been deployed to pilot at test.medchat.health. Several architectural changes have taken place:

Neo4j was fully removed and replaced by PostgreSQL-based taxonomy tables (taxonomy_entities, taxonomy_relationships) with the HospitalTaxonomy class
Authentication was migrated from custom cookie-based JWT to Keycloak OIDC
The golden evaluation set was expanded to 271 questions (268 content + 3 cache)

The latest pilot evaluation achieves a 268/268 (100%) pass rate with faithfulness 0.959 and entity recall 0.902. The architectural changes therefore preserved the 100% pass rate and kept both metrics above their research-phase targets (≥ 0.90), although both are below the research-phase values of 0.989 and 0.956.
